Updated on March 25, 2024

How to Use Puppeteer Stealth For Advanced Scraping?

Ever wondered how industries seamlessly gather data from the digital ocean that is the internet? How do businesses and developers navigate the complexities of today’s web to extract valuable insights efficiently? The answer lies in the art of web scraping.

But here’s the challenge: as the web advances, so do the defenses against data extraction. How can one overcome the barriers set by anti-scraping measures? Enter Puppeteer Stealth, a tool built to extract data while evading those defenses. In this article, we’ll explore how Puppeteer Stealth works and how to configure it for advanced scraping tasks.

What is Puppeteer Stealth?

Puppeteer Stealth, also known as puppeteer-extra-plugin-stealth, is a plugin for Puppeteer Extra, a wrapper around Puppeteer that adds plugin support. It employs various techniques to hide properties that would otherwise mark your scraping activity as automated bot behavior. The goal is to make it harder for websites to detect your scraper and to ensure a smoother, undetected data extraction process.

How does Puppeteer Stealth work?

Here’s a breakdown of the key mechanisms of Puppeteer Stealth:

Browser fingerprint modification

Puppeteer Stealth adjusts the fingerprints your browser leaves behind as it browses. Think of a fingerprint as a digital ID that websites use to tell visitors apart: a set of properties that automated browsers expose differently from regular ones. The Stealth plugin patches these properties, giving your browser a digital disguise that removes the telltale features websites rely on to spot bots. As a result, when Puppeteer Stealth talks to a website, it does so with a modified identity and is less likely to get flagged as a bot.
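One concrete signal the plugin masks is navigator.webdriver, which plain headless Chrome reports as true. Here is a simplified sketch of the kind of property override involved, illustrated on a plain object since the real patch runs inside the browser page:

```javascript
// Simplified illustration of a fingerprint override. In a real page the
// stealth plugin applies patches like this to the browser's own navigator.
const fakeNavigator = { webdriver: true }; // what a site would see from plain headless Chrome

// Redefine the property so lookups no longer reveal automation
Object.defineProperty(fakeNavigator, 'webdriver', { get: () => undefined });

console.log(fakeNavigator.webdriver); // undefined — the automation flag is hidden
```

The actual plugin applies dozens of such evasions, covering properties like plugins, languages, and WebGL vendor strings.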

Human-like behavior mimicry

Puppeteer Stealth doesn’t just automate tasks; it acts like a real person on the web. The plugin mimics how a human interacts with a webpage, down to small details like mouse movements and click patterns. This isn’t just data collection; it’s collection done in a way that looks like a genuine person engaging with the site. By blending with normal user behavior, Puppeteer Stealth reduces the risk of triggering anti-scraping measures.
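A small sketch of what behavior mimicry can look like in practice: instead of typing at a fixed, machine-like rate, draw each keystroke delay from a range. The helper names and bounds below are illustrative, not part of the plugin’s API:

```javascript
// Pick a per-keystroke delay in milliseconds from a human-plausible range
function humanDelay(minMs = 40, maxMs = 120) {
  return minMs + Math.floor(Math.random() * (maxMs - minMs));
}

// Type text into a field one character at a time with varying delays.
// `page` is assumed to be a Puppeteer page object.
async function typeLikeHuman(page, selector, text) {
  for (const char of text) {
    await page.type(selector, char, { delay: humanDelay() });
  }
}
```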

Installing Puppeteer Stealth

Let’s discuss the step-by-step process of setting up Puppeteer Stealth for advanced web scraping.

Step 1: Installing Puppeteer and Puppeteer Stealth

Make sure Node.js is installed on your machine, and then run the following commands in your terminal:


npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth

Step 2: Configuring Puppeteer with Stealth Plugin

Integrate Puppeteer with the Stealth plugin in your script. Here’s a code snippet that demonstrates how to do it:


const puppeteer = require('puppeteer-extra');
const stealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(stealthPlugin());
puppeteer.launch().then(async browser => {
  const page = await browser.newPage();
  // Scraping logic here
  await browser.close();
});

This code sets up Puppeteer with the Stealth plugin, enhancing its capabilities for discreet scraping.

Step 3: Configuration options for advanced scraping

Puppeteer Stealth inherits Puppeteer’s configuration options, which you can combine with the plugin to tailor your scraping setup. A few key ones:

  • puppeteer.use(stealthPlugin()) - Enable the Stealth plugin with all of its evasions.
  • headless: false (passed to puppeteer.launch) - Run the browser in non-headless mode for a more human-like appearance.
  • ignoreHTTPSErrors: true (passed to puppeteer.launch) - Ignore HTTPS-related errors for a smoother scraping experience.
  • page.setUserAgent('your-user-agent-string') - Customize the User-Agent string; call it with different values for dynamic User-Agent rotation.

Setting up Puppeteer Stealth for advanced scraping

Web scraping in Puppeteer Stealth goes beyond basic scenarios. Let’s first explore a basic example and then dive into advanced use cases.

Basic web scraping 

Basic web scraping involves using Puppeteer Stealth to navigate to a website, interact with elements, and extract information.



const puppeteer = require('puppeteer-extra');
const stealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(stealthPlugin());

puppeteer.launch().then(async browser => {
  const page = await browser.newPage();

  // Navigating to the target website
  await page.goto('https://example.com');

  // Extracting page title
  const pageTitle = await page.title();
  console.log('Page Title:', pageTitle);

  // Clicking on a button
  await page.click('button#exampleButton');

  // Extracting data from the clicked element
  const buttonText = await page.$eval('button#exampleButton', button => button.textContent);
  console.log('Button Text:', buttonText);

  // Additional scraping

  await browser.close();
});

In this code:

  • Puppeteer is configured with the Puppeteer Stealth plugin.
  • A new browser instance is launched and a new page is created.
  • The page navigates to https://example.com.
  • The title of the page is extracted and logged.
  • The script clicks on a button with the ID exampleButton.
  • It then extracts and logs the text content of the clicked button.
  • Now, you can insert your custom scraping logic based on the structure of the target website.
  • Finally, the browser is closed.

Now, let’s explore Puppeteer Stealth in advanced scenarios:

Handling AJAX requests


// ... (Previous code)

// Waiting for AJAX requests to complete
await page.waitForSelector('.ajax-loaded-element');

// Extracting data from the loaded content
const ajaxData = await page.$eval('.ajax-loaded-element', data => data.textContent);
console.log('AJAX-Loaded Data:', ajaxData);

// ... (Continuing with the scraping logic)

await browser.close();

  • page.waitForSelector('.ajax-loaded-element') - Puppeteer Stealth ensures that the script waits for the appearance of an element with the class ajax-loaded-element before proceeding. This waiting period allows time for AJAX requests to complete, ensuring that the dynamically loaded content is ready for extraction.
  • page.$eval('.ajax-loaded-element', data => data.textContent) - Once the AJAX requests are complete, Puppeteer Stealth enables the extraction of data from the loaded element. The $eval method is used to retrieve the text content of the element.
  • console.log('AJAX-Loaded Data:', ajaxData) - It logs the extracted data to the console for verification and analysis.
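Waiting on a selector works when the AJAX result renders into the DOM. An alternative sketch is to wait for the underlying network response itself; the helper name and URL fragment below are illustrative assumptions:

```javascript
// Wait for a successful response whose URL contains the given fragment,
// then parse its JSON body. `page` is assumed to be a Puppeteer page.
async function waitForApiData(page, urlFragment) {
  const response = await page.waitForResponse(
    res => res.url().includes(urlFragment) && res.status() === 200
  );
  return response.json();
}

// Usage (hypothetical endpoint):
// const data = await waitForApiData(page, '/api/items');
```

This is useful when the response feeds a client-side store rather than a single element you can select.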

Interacting with forms

Web forms often act as gateways to access specific content or features on a website. In scraping scenarios, automating form interactions is crucial for navigating through protected areas or initiating search queries.



// ... (Previous code)

// Filling and submitting a form
await page.type('input#username', 'your-username');
await page.type('input#password', 'your-password');
await page.click('button#submit-button');

// ... (Continuing with the scraping logic)

await browser.close();

  • await page.type('input#username', 'your-username') - Puppeteer Stealth facilitates the filling of the input field for the username with the provided value.
  • await page.type('input#password', 'your-password') - Puppeteer Stealth simulates the typing of the provided password into the password input field.
  • await page.click('button#submit-button') - Puppeteer Stealth enables the script to simulate a click on the submit button. This triggers the form submission process.
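Submitting a form usually triggers a navigation, and clicking without waiting for it can make the script race ahead of the new page. A common Puppeteer pattern, sketched here as a small helper, is to await the click and the navigation together:

```javascript
// Click a submit button and wait for the resulting navigation to settle.
// `page` is assumed to be a Puppeteer page object.
async function submitAndWait(page, buttonSelector) {
  await Promise.all([
    page.waitForNavigation({ waitUntil: 'domcontentloaded' }),
    page.click(buttonSelector),
  ]);
}

// Usage:
// await submitAndWait(page, 'button#submit-button');
```

Starting the waitForNavigation promise before the click avoids missing a navigation that begins immediately.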

Handling navigation events

The provided code snippet showcases the use of Puppeteer Stealth to handle navigation events during web scraping.


// ... (Previous code)

// Listening for navigation events
page.on('framenavigated', frame => {
  console.log('Page Navigated:', frame.url());
  // Additional logic after each navigation event
});

// ... (Continuing with the scraping logic)

await browser.close();

  • page.on('framenavigated', frame => { ... }) - Puppeteer emits a framenavigated event whenever the page (or one of its frames) navigates, and the script sets up a listener to capture it.
  • console.log('Page Navigated:', frame.url()) - After each navigation event, the script logs the new URL to the console. This information is valuable for understanding the structure of the website and for deciding on further scraping actions.

Dynamic user-agent rotation and device emulation

Below is a complete code example that demonstrates the implementation of Puppeteer Stealth for advanced scraping.


const puppeteer = require('puppeteer-extra');
const stealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(stealthPlugin());

puppeteer.launch().then(async browser => {
  const page = await browser.newPage();

  // Enabling dynamic User-Agent rotation
  await page.evaluateOnNewDocument(() => {
    Object.defineProperty(navigator, 'userAgent', {
      get() {
        return 'your-dynamic-user-agent';
      },
    });
  });

  // Emulating an iPhone X
  await page.emulate(puppeteer.devices['iPhone X']);

  // Scraping logic here
  await page.goto('https://example.com');
  const data = await page.evaluate(() => {
    return document.body.innerText;
  });

  console.log('Scraped Data:', data);

  await browser.close();
});

This script includes dynamic User-Agent rotation and device emulation, showcasing how to enhance stealth capabilities in your scraping activities. The page.goto and page.evaluate functions are placeholders for your specific scraping logic, allowing you to adapt the code to your unique requirements.

Troubleshooting common errors

In this section, we’ll troubleshoot common errors in advanced scraping with Puppeteer Stealth to ensure a seamless and undetected scraping experience.

Detection by anti-scraping mechanisms

Error Description: Your scraping activities face detection by anti-scraping mechanisms, leading to restrictions or blocks.

Cause: Insufficient stealth measures, such as a static User-Agent or predictable browsing patterns.

Solution: Rotate the User-Agent between requests. The Stealth plugin masks many automation signals, but it does not rotate User-Agents for you, so set one explicitly per page:


await page.setUserAgent('your-user-agent-string');

Changing the User-Agent between sessions presents websites with a changing fingerprint. Since sites often track browsing behavior, this rotation, combined with the plugin’s other evasions, reduces the risk of detection.
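A minimal sketch of rotation is to cycle through a pool of User-Agent strings with page.setUserAgent between requests. The strings below are truncated placeholders, not real values:

```javascript
// Illustrative pool of User-Agent strings (placeholders)
const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) your-ua-1',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) your-ua-2',
];

let uaIndex = 0;

// Return the next User-Agent in round-robin order
function nextUserAgent() {
  const ua = userAgents[uaIndex % userAgents.length];
  uaIndex += 1;
  return ua;
}

// Usage before each navigation:
// await page.setUserAgent(nextUserAgent());
```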

Form interactions not successful

Error Description: Form interactions in your script are not successfully filling or submitting the form fields.

Cause: Lack of optimization in the form interaction code, leading to automated patterns.

Solution: Optimize form interaction code by adding slight delays between keystrokes.


await page.type('input#username', 'your-username', { delay: 50 });
await page.type('input#password', 'your-password', { delay: 50 });

Websites often employ anti-bot measures that can detect automated form filling. Adding slight delays between keystrokes makes the interaction more human-like, reducing the chances of being flagged as a bot.

IP Blocking or CAPTCHA challenges

Error Description: Encountering IP blocking or CAPTCHA challenges, hindering the scraping process.

Cause: Unmasked IP addresses or failure to handle CAPTCHA prompts.

Solution: Implement proxy rotation to avoid IP blocking and use CAPTCHA solving services if needed.


const puppeteer = require('puppeteer-extra');
const stealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(stealthPlugin());

// Route each session through a proxy; change the address per session to rotate IPs
puppeteer.launch({ args: ['--proxy-server=http://your-proxy-address:port'] }).then(async browser => {
  // Scraping logic here
  await browser.close();
});

Websites may block IP addresses associated with scraping activities. Proxy rotation helps to avoid IP restrictions, and CAPTCHA solving services assist in handling challenges that may arise during scraping.

Page load failures

Error Description: Experiencing failures in loading pages, leading to incomplete scraping.

Cause: Inadequate wait time for page loading or slow network conditions.

Solution: Adjust the wait time for page loading and implement retries in case of failures.


const page = await browser.newPage();
await page.goto('https://example.com', { waitUntil: 'domcontentloaded', timeout: 60000 });

Slow-loading pages or network issues can lead to page load failures. Increasing the timeout and configuring the waitUntil option ensures that the script allows sufficient time for the page to load successfully.
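The snippet above adjusts the page-load timeout; the retry side can be sketched as a generic wrapper around any async step. The helper name and attempt count are illustrative:

```javascript
// Retry an async operation a few times before giving up,
// rethrowing the last error if every attempt fails
async function withRetries(fn, attempts = 3) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

// Usage with a Puppeteer navigation:
// await withRetries(() => page.goto('https://example.com', { waitUntil: 'domcontentloaded' }));
```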

Element selection issues

Error Description: Facing challenges in selecting and interacting with specific elements on a webpage.

Cause: Weak or ambiguous selectors, or attempting to interact with elements before they are present.

Solution: Use more robust selectors and wait for elements to be present before interacting with them.


await page.waitForSelector('div#targetElement', { timeout: 5000 });
const targetElement = await page.$('div#targetElement');

Selectors that are too generic or not waiting for elements to be present can result in element selection issues. Using specific and robust selectors, along with waiting for elements using waitForSelector, ensures reliable interaction with targeted elements, reducing selection issues.

Conclusion

In the evolving landscape of web scraping, Puppeteer Stealth emerges as a tool for extracting data with precision and discretion. Browser fingerprint modification and human-like behavior mimicry set it apart in the realm of advanced scraping. Throughout the article, we discussed how Puppeteer Stealth goes beyond regular Puppeteer, enhancing stealth measures for advanced scraping tasks.

Related Articles

Guide to Puppeteer Extra: Best Plugins For Scraping Ranked

Proxy in Puppeteer: 3 Effective Setup Methods Explained

Puppeteer Scraping: Get Started in 10 Minutes