In web development and data analysis, extracting data from websites has become crucial for many applications. Web scraping can be a powerful tool for gathering data, automating tasks, and performing various analyses. One notable framework that simplifies web scraping in a headless browser environment is Puppeteer, and when coupled with Node.js, it becomes one of the best frameworks for scraping. This guide walks you through the basics of web scraping with Puppeteer and Node.js.
The sections below should get you started in no more than 10 minutes. If you are curious, read on to learn about advanced scraping tasks like scraping iframes or getting HTML and transforming it to PDF.
Before we dive into the world of Puppeteer and Node.js web scraping, you need to check off the following prerequisites:
Ensure that you have Node.js installed on your system. If you don’t have it installed, you can download the latest version from the official Node.js website. Follow the instructions for your operating system.
Create a new project directory where you’ll be working with Puppeteer. Open a terminal, navigate to your chosen directory, and run the following commands:
# Initialize a new Node.js project
npm init -y
# Install Puppeteer as a project dependency
npm install puppeteer
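To confirm that the setup works, you can run a minimal script along these lines (the file name check.js is just an example); it launches a headless browser, prints the bundled browser version, and exits:
// check.js - a quick sanity check that Puppeteer launches correctly
const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch({ headless: "new" });
  // browser.version() returns the version string of the bundled browser
  console.log("Puppeteer is ready:", await browser.version());
  await browser.close();
})();
Run it with node check.js; if a browser version is printed, you are ready to start scraping.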
How does scraping in Puppeteer work?
Now that we have our environment set up, let’s delve into the core concepts of scraping with Puppeteer.
Scraping text by Selector
When scraping text by selector in Puppeteer, we can leverage the page.$eval() method, which is particularly useful when you want to target a specific element, such as a paragraph, heading, or any other HTML tag.
Below is an example that uses the page.$eval() method to select an element and extract its text content. The extracted text will then be saved to a JSON file.
const puppeteer = require("puppeteer");
const fs = require("fs");

(async () => {
  const browser = await puppeteer.launch({ headless: "new" });
  const page = await browser.newPage();
  await page.goto("https://www.webshare.io/");

  // Use a CSS selector to get the text content of an element
  const elementText = await page.$eval("h1", (element) => element.textContent);

  // Create a JSON object to store the scraped data
  const scrapedData = {
    pageTitle: elementText,
    timestamp: new Date().toISOString(),
    // Add more key-value pairs as needed
  };

  // Convert the JSON object to a string
  const jsonData = JSON.stringify(scrapedData, null, 2);

  // Save the JSON data to a file
  fs.writeFileSync("scraped_data.json", jsonData);
  console.log("Data saved to scraped_data.json");

  await browser.close();
})();
Here’s the generated output:
Scraping text by XPath
Here’s an example that uses the page.$x() method to select an element by XPath and extract its text content:
const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch({ headless: "new" });
  const page = await browser.newPage();

  // Navigate to a webpage
  await page.goto('https://www.webshare.io/');

  // Select an element by XPath
  const [element] = await page.$x('//p[contains(@class, "body_new")]');

  // Check if the element is found
  if (element) {
    // Extract text content
    const textContent = await page.evaluate(element => element.textContent, element);

    // Save extracted data in JSON
    // Note: the ./data directory must exist before running this script
    const jsonData = JSON.stringify({ textContent });
    fs.writeFileSync('./data/extractedData_xpath.json', jsonData, 'utf-8');
    console.log('Extracted Text (XPath):', textContent);
  } else {
    console.error('Element not found.');
  }

  await browser.close();
})();
Here’s the generated output:
Scraping text by Class
The example below uses the page.$() method with a class selector to select an element and extract its text content.
const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch({ headless: "new" });
  const page = await browser.newPage();

  // Navigate to a webpage
  await page.goto('https://www.webshare.io/');

  // Select an element by class
  const element = await page.$('.icon-feature-content');

  // Extract text content
  const textContent = await page.evaluate(element => element.textContent, element);

  // Save extracted data in JSON
  const jsonData = JSON.stringify({ textContent });
  fs.writeFileSync('./data/extractedData_class.json', jsonData, 'utf-8');
  console.log('Extracted Text (Class):', textContent);

  await browser.close();
})();
Here’s the generated output:
Scraping a single page
Now that we’ve covered the basics of selecting elements by selector, XPath and class, let’s put this knowledge into practice by scraping a single page. In the following example, we'll navigate to Webshare, and use page.content() to retrieve the entire HTML content of the page. The extracted HTML is then saved to a JSON file named 'extractedData_singlePage.json'.
const puppeteer = require("puppeteer");
const fs = require("fs");

(async () => {
  const browser = await puppeteer.launch({ headless: "new" });
  const page = await browser.newPage();

  // Navigate to a webpage
  await page.goto("https://www.webshare.io/");

  // Extract the entire HTML content of the page
  const pageHTML = await page.content();

  // Save the HTML content in JSON
  const jsonData = JSON.stringify({ pageHTML });
  fs.writeFileSync("./data/extractedData_singlePage.json", jsonData, "utf-8");

  console.log("HTML Content of the Page:", pageHTML.length, "characters");
  console.log("Data saved in ./data/extractedData_singlePage.json");

  await browser.close();
})();
You can run the code as shown below:
The output looks like this:
Scraping multiple pages
The following example demonstrates how we can extract data from multiple pages of a website.
The script launches a headless browser, navigates to each URL, extracts the page title using page.title(), and stores the results in a JSON file. The loop iterates through the specified pages, logging the scraped titles and saving the data to 'extractedData_multiplePages.json'.
const puppeteer = require('puppeteer');
const fs = require('fs');

async function scrapeWebShare() {
  const browser = await puppeteer.launch({ headless: "new" });
  const page = await browser.newPage();

  // Array to store scraped data
  const scrapedData = [];

  // Define the URLs of the target pages
  const pageUrls = [
    'https://www.webshare.io/proxy-server',
    'https://www.webshare.io/static-residential-proxy',
    'https://www.webshare.io/residential-proxy',
  ];

  // Loop through the specified pages
  for (const [index, pageUrl] of pageUrls.entries()) {
    // Navigate to the current page
    await page.goto(pageUrl);

    // Extract the title of the page
    const pageTitle = await page.title();

    // Add the title to the scraped data array
    scrapedData.push({ page: index + 1, title: pageTitle });
    console.log(`Scraped title from Page ${index + 1} (${pageUrl}):`, pageTitle);
  }

  // Save the scraped data to a JSON file
  const jsonData = JSON.stringify(scrapedData);
  fs.writeFileSync('./data/extractedData_multiplePages.json', jsonData, 'utf-8');
  console.log('Scraped data saved in ./data/extractedData_multiplePages.json');

  // Close the browser
  await browser.close();
}

// Call the scraping function
scrapeWebShare();
Run the code to see the output below:
Scraping all pages of a website
Scraping all pages of a website, much like tools such as Screaming Frog or, in some ways, ScraperAPI, involves a comprehensive approach to navigating the entire site and extracting relevant information. In Puppeteer, we can achieve this by recursively visiting pages, discovering links and accumulating data.
Below is an example that scrapes all pages from Webshare, extracts their URLs and titles, and saves the information to a JSON file.
const puppeteer = require('puppeteer');
const fs = require('fs');

async function scrapeAllPages(url, maxDepth = 3) {
  const browser = await puppeteer.launch({ headless: "new" });
  const page = await browser.newPage();
  const visitedUrls = new Set();
  const scrapedData = [];

  async function scrapePage(currentUrl, depth) {
    if (depth > maxDepth || visitedUrls.has(currentUrl)) {
      return;
    }
    visitedUrls.add(currentUrl);

    try {
      // Navigation timeout of 60 seconds (adjust as needed)
      await page.goto(currentUrl, { waitUntil: 'domcontentloaded', timeout: 60000 });

      const pageTitle = await page.title();
      scrapedData.push({ url: currentUrl, title: pageTitle });

      // Collect every link on the page
      const linkedUrls = await page.$$eval('a[href]', links =>
        links.map(link => link.getAttribute('href'))
      );

      // Note: this follows every discovered link, including external domains;
      // add a hostname check here if you want to stay within a single site
      for (const linkedUrl of linkedUrls) {
        const absoluteUrl = new URL(linkedUrl, currentUrl).href;
        await scrapePage(absoluteUrl, depth + 1);
      }
    } catch (error) {
      console.error(`Error navigating to ${currentUrl}:`, error.message);
    }
  }

  await scrapePage(url, 0);

  const jsonData = JSON.stringify(scrapedData);
  fs.writeFileSync('./data/extractedData_allPages.json', jsonData, 'utf-8');
  console.log('Scraped data saved in ./data/extractedData_allPages.json');

  await browser.close();
}

// Specify the starting URL and maximum depth
scrapeAllPages('https://www.webshare.io/', 3);
Here’s how your output will look:
Advanced scraping task examples
As you delve into more advanced scraping tasks with Puppeteer, you may encounter scenarios like scraping content within iframes and transforming HTML to PDF. The following sections illustrate how to tackle these tasks efficiently.
iFrame scraping
To scrape content within an iframe using Puppeteer, you need to switch to the iframe context and then perform operations within it. The script below navigates to a page, identifies the Trustpilot iframe by its title attribute, switches to the iframe context, and extracts the HTML content within the iframe's body. The extracted content is then stored in 'extractedData_iframe.json'.
const puppeteer = require('puppeteer');
const fs = require('fs');

async function scrapeTrustpilotReviews(url) {
  const browser = await puppeteer.launch({ headless: "new" });
  const page = await browser.newPage();
  await page.goto(url);

  // Identify the iframe using a selector or any other appropriate method
  const iframeHandle = await page.$('iframe[title="Customer reviews powered by Trustpilot"]');
  const iframeContent = await iframeHandle.contentFrame();

  // Extract the content within the Trustpilot iframe
  const trustpilotContent = await iframeContent.$eval('body', element =>
    element.innerHTML.trim()
  );

  // Store the extracted content in JSON format
  const jsonData = JSON.stringify({ trustpilotContent });
  fs.writeFileSync('./data/extractedData_iframe.json', jsonData, 'utf-8');

  await browser.close();
}

// Specify the URL containing the Trustpilot iframe
scrapeTrustpilotReviews('https://www.webshare.io/');
Here’s how the output will look:
Get HTML and transform to PDF
Puppeteer provides the capability to capture HTML content and transform it into a PDF file. The example code below illustrates how to fetch HTML from a page, save it to a JSON file, and then convert it into a PDF.
const puppeteer = require('puppeteer');
const fs = require('fs');

async function saveHTMLandPDF(url) {
  const browser = await puppeteer.launch({ headless: "new" });
  const page = await browser.newPage();
  await page.goto(url);

  // Extract HTML content
  const htmlContent = await page.content();

  // Store HTML content in JSON format
  const jsonData = JSON.stringify({ htmlContent });
  fs.writeFileSync('./data/extractedData_html.json', jsonData, 'utf-8');

  // Convert HTML to PDF
  await page.pdf({ path: './data/extractedData_pdf.pdf', format: 'A4' });

  await browser.close();
}

// Specify the URL to fetch HTML and transform to PDF
saveHTMLandPDF('https://www.webshare.io/');
Here’s how the JSON will look:
Below is the generated PDF file:
Dealing with anti-scraping measures
Web scraping often encounters anti-scraping measures implemented by websites to prevent automated access. To overcome these challenges, several strategies can be employed, including the use of proxies, the Puppeteer Extra library, and modification of user agents.
Proxies
Proxies play a crucial role in mitigating the risk of being blocked during scraping activities by masking the origin IP address. The following code demonstrates how to integrate proxies into a Puppeteer script:
// Launch the browser through a proxy (replace with your proxy details)
const browser = await puppeteer.launch({
  headless: "new",
  args: ['--proxy-server=http://your-proxy-url.com'],
});
const page = await browser.newPage();

// If the proxy requires credentials, authenticate before navigating
await page.authenticate({ username: 'your-username', password: 'your-password' });
By passing the --proxy-server launch argument, the script routes Puppeteer’s traffic through the specified proxy, allowing for more discreet and distributed web scraping.
Puppeteer Extra
Puppeteer Extra is an extension library for Puppeteer that equips it with stealth capabilities to bypass anti-bot measures. The script below shows the integration of Puppeteer Extra:
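A minimal sketch of that integration could look like the following; it assumes the puppeteer-extra and puppeteer-extra-plugin-stealth packages are installed alongside Puppeteer:
// Install first: npm install puppeteer-extra puppeteer-extra-plugin-stealth
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

// Register the stealth plugin so every launched browser applies its evasions
puppeteer.use(StealthPlugin());

(async () => {
  const browser = await puppeteer.launch({ headless: "new" });
  const page = await browser.newPage();
  await page.goto('https://www.webshare.io/');
  console.log('Page title:', await page.title());
  await browser.close();
})();
Because puppeteer-extra wraps the regular Puppeteer API, the rest of your scraping code stays the same.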
User agents
Modifying the user agent allows a script to emulate different browsers, potentially bypassing user-agent-based checks. The following code depicts how to set a custom user agent in Puppeteer:
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36');
By setting a custom user agent, the script can mimic different browsers, making it harder for websites to identify and block automated scraping activities. This strategy adds an extra layer of disguise to the scraping process. For further customization, you can also set cookies in the script using the page.setCookie method; some websites use cookies to track user sessions and behavior.
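As a brief illustration, a cookie can be set before navigation with page.setCookie; the cookie name, value and domain below are placeholders you would replace with real values:
// Set a placeholder cookie before navigating to the page
await page.setCookie({
  name: 'session_id',
  value: 'your-session-value',
  domain: 'www.webshare.io',
});
await page.goto('https://www.webshare.io/');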
Error handling tips
Efficient error handling is crucial in web scraping to gracefully manage unexpected issues and ensure the robustness of your script. Below are some key error-handling tips to enhance the reliability of your Puppeteer web scraping code.
Page load failures
Handle page load failures by catching navigation errors, logging the issue, and taking appropriate actions.
try {
  await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 60000 });
} catch (error) {
  console.error(`Error navigating to ${url}:`, error.message);
  // Take appropriate action (e.g., skip or retry)
}
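If you decide to retry, a small helper along these lines can wrap the navigation; the helper name and the three-attempt limit are arbitrary choices for illustration:
// Hypothetical helper: retry navigation a few times before giving up
async function gotoWithRetry(page, url, attempts = 3) {
  for (let i = 1; i <= attempts; i++) {
    try {
      await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 60000 });
      return true;
    } catch (error) {
      console.error(`Attempt ${i} failed for ${url}:`, error.message);
    }
  }
  return false;
}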
Timeouts
Adjust timeouts based on the expected loading times of pages or elements. This helps prevent premature script termination due to default timeout settings.
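For example, you can raise Puppeteer’s defaults once per page instead of passing a timeout to every call; the 120-second value here is only an illustration:
// Apply longer defaults to all subsequent navigations and waits on this page
page.setDefaultNavigationTimeout(120000); // affects page.goto and similar calls
page.setDefaultTimeout(120000);           // affects waitForSelector and similar calls

// Or override the timeout for a single call
await page.waitForSelector('.icon-feature-content', { timeout: 90000 });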
In this article, we covered basic and intermediate web scraping techniques using Puppeteer and Node.js, strategies for dealing with anti-scraping measures, error handling tips, and advanced scraping tasks such as working with iframes and transforming HTML to PDF. Equipped with this knowledge, developers can navigate web scraping challenges efficiently and extract valuable data from diverse sources.