Updated on
March 25, 2024

Pyppeteer Tutorial: Installation & Code Examples

Pyppeteer is a Python library for web browser automation. It is the unofficial Python port of Puppeteer, a well-known library in the JavaScript community. With Pyppeteer, you can control web browsers, automate tasks, and scrape data from websites using Python. This tutorial will guide you through the installation process and provide some basic code examples. If you are learning or working on web automation or scraping, Pyppeteer is an essential tool to know. Let's get started.

How is Pyppeteer different from Puppeteer?

At their core, Pyppeteer and Puppeteer both provide a high-level API to control headless Chrome or Chromium browsers. Pyppeteer is a Python adaptation of Puppeteer, which is designed for JavaScript.

Although both libraries aim to control browsers, they have distinctions in language syntax and the handling of asynchronous tasks. Specifically, Pyppeteer utilizes Python's asyncio, whereas Puppeteer employs JavaScript's Promises. This difference influences their method calls and overall workflow.

Additionally, while both engage with the Chrome DevTools Protocol, their underlying architecture has subtle variations. The way they manage events, sessions, and browser contexts can differ, potentially affecting performance or behavior in certain situations.

For usability, developers familiar with the Python ecosystem might find Pyppeteer more intuitive, because it aligns well with Python's conventions. On the other hand, Puppeteer, deeply rooted in the JavaScript ecosystem, provides a great experience for those familiar with Node.js and related tools.

How is Pyppeteer different from Selenium?

Selenium is a widely recognized tool for automating web browsers for a range of tasks from testing web applications to web scraping. When comparing Pyppeteer and Selenium, there are notable differences.

While Selenium interacts with multiple browsers like Firefox, Chrome, and Edge, Pyppeteer is designed specifically for the Chrome or Chromium browser. Additionally, Pyppeteer communicates directly with the Chrome DevTools Protocol, offering finer control over browser sessions, which can sometimes result in faster performance. Thus, Selenium's approach is broader, providing a more general browser automation framework, whereas Pyppeteer offers a more Chrome-centric experience.

Prerequisites

Before starting with Pyppeteer, it's important to have the necessary tools and setups in place. The primary prerequisite for Pyppeteer is having Python version 3.6 or newer installed. If you haven't installed it yet, it can be downloaded from python.org.

Furthermore, while Pyppeteer naturally works in headless mode without a graphical user interface, installing Chrome or Chromium can help in debugging and visualization.

Finally, for efficient and anonymous scraping, proxies are essential. When it comes to proxies for Puppeteer or Pyppeteer, Webshare stands out as a reliable option. We offer 10 premium proxies for free, which can be especially beneficial for initial testing and understanding the process of web scraping with Pyppeteer. 

How to install Pyppeteer?

Installing Pyppeteer on your machine is a straightforward process. You can use pip, Python's standard package manager, to install it. To install Pyppeteer, run the command given below:


pip install pyppeteer

When you install Pyppeteer using pip, it does not immediately download Chromium. Instead, the first time you run a Pyppeteer script, it will download a recent version of Chromium. If you want to avoid this behavior during the initial run of a Pyppeteer script, you can pre-download Chromium using the pyppeteer-install command.

You can create any Python file, such as index.py, to write the scripts shown below.

Setting up a basic browser session

To start, let’s learn how to launch a browser session and open a web page. 


import asyncio
from pyppeteer import launch


async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://www.python.org')
    await browser.close()


asyncio.get_event_loop().run_until_complete(main())

In the code above, we use the Pyppeteer library to perform asynchronous browser automation. First, we launch a new browser instance and open a fresh page. We then navigate to Python's official website. After the actions are completed, the browser session is closed. The last line is essential for running the asynchronous function, which ensures that the commands inside the main() function are executed.
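The event-loop call on the last line is the pre-Python-3.10 style; on newer Python versions, asyncio.run is the idiomatic entry point and works with Pyppeteer as well. Here is a browser-free sketch of the same async pattern, using only the standard library (the coroutine body is a stand-in for the browser work):

```python
import asyncio


async def main():
    # Stand-in for the browser work: any sequence of awaits runs the same way.
    await asyncio.sleep(0)
    return 'page visited'


# Equivalent to asyncio.get_event_loop().run_until_complete(main())
result = asyncio.run(main())
print(result)
```

In a real Pyppeteer script, the launch/newPage/goto/close calls from the example above would simply replace the body of main().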

Selecting Elements (Xpath, Selector, Text methods)

In web scraping and test automation, selecting specific elements on a page is a foundational step. Pyppeteer offers multiple methods to get elements from a webpage, very similar to element selection in Puppeteer:

  1. Using XPath
  2. Using CSS selectors
  3. Using Text inside elements

To show how these methods work, we will use the Donate button on the python.org website.

XPath is a powerful querying language for selecting nodes from an XML-like document, such as HTML. The syntax to use it is given below. 


elements = await page.xpath('//tag_name[@attribute="value"]')

To select the "Donate" button on python.org using XPath, you can use the following code line.


donate_button_xpath = await page.xpath('//a[contains(text(), "Donate")]')

CSS selectors are patterns used to select elements based on their attributes, such as class or ID. They can be used with the following syntax.


element = await page.querySelector('selector_here')

To select the "Donate" button on python.org using its CSS class, use the following code. 


donate_button_css = await page.querySelector('.donate-button')

To select elements by their text content in Pyppeteer, XPath is commonly employed via the page.xpath method. Here is the syntax for it.


elements = await page.xpath('//tag[text()="desired_text_here"]')

To select the "Donate" button on python.org by its text, you can use the following code. 


donate_button_text = await page.xpath('//a[text()="Donate"]')
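The text-based expressions can be built programmatically. The helpers below are hypothetical (not part of Pyppeteer's API) and simply construct the selector strings used above; note that quotes inside the text argument are not escaped:

```python
def text_xpath(tag: str, text: str) -> str:
    # XPath matching <tag> elements whose text is exactly `text`
    return f'//{tag}[text()="{text}"]'


def contains_text_xpath(tag: str, text: str) -> str:
    # XPath matching <tag> elements whose text merely contains `text`
    return f'//{tag}[contains(text(), "{text}")]'


# The two expressions used for the Donate button:
exact = text_xpath('a', 'Donate')
fuzzy = contains_text_xpath('a', 'Donate')
```

Either string can then be passed to page.xpath(); the contains() variant is more forgiving of surrounding whitespace or extra words.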

Waiting for the page to load

When automating browser tasks, you need to ensure web pages fully load before proceeding, so that all elements are accessible. After clicking a link, such as the "Donate" button on python.org, you would wait for the subsequent page to load. Because the click and the resulting navigation can race each other, it is safest to start waiting before clicking, for example with asyncio.gather, as in the following snippet.


donate_button = await page.querySelector('.donate-button')
# Start waiting for the navigation before clicking, so the event is not missed
await asyncio.gather(
    page.waitForNavigation(),
    donate_button.click(),
)

Using Click

One of the most common actions in web navigation is clicking elements. You can do it effortlessly with Pyppeteer which is very similar to Click in Puppeteer. For example, if you wish to click the "Donate" button on the python.org website, you would identify the button and then use the click method.


donate_button = await page.querySelector('.donate-button')
await donate_button.click()

Taking Screenshots

There are numerous scenarios where capturing a website's current appearance can be invaluable. To obtain a snapshot of the python.org website, you'd go to the site and then capture the screenshot as shown below. 


await page.goto('https://www.python.org')
await page.screenshot({'path': 'python_org.png'})

After running this code, you will find a python_org.png image in your working directory.
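page.screenshot accepts more options than just path. The dictionary below lists a few commonly used keys from Puppeteer's screenshot API, with illustrative values; it would be passed as await page.screenshot(screenshot_options):

```python
# Commonly used screenshot options (keys mirror Puppeteer's page.screenshot API)
screenshot_options = {
    'path': 'python_org.png',  # file to save the image to
    'fullPage': True,          # capture the whole scrollable page, not just the viewport
    'type': 'png',             # image format: 'png' or 'jpeg'
}
```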

Handling PDF files 

Pyppeteer also grants the capability to transform web pages into PDF files. For example, converting the python.org website into a PDF document follows a similar flow to taking screenshots. 


await page.goto('https://www.python.org')
await page.pdf({'path': 'python_org.pdf'})

Using waitUntil

Different websites load content at varied paces. Sometimes it's important to wait until certain elements or the entire content is loaded. Pyppeteer provides the waitUntil option to manage such scenarios. Here is an example code of how to use waitUntil.


await page.goto('https://www.python.org', {'waitUntil': 'networkidle0'})
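waitUntil accepts several lifecycle events, mirroring Puppeteer's navigation options. The dictionary below summarizes them (descriptions paraphrased from Puppeteer's documented behavior):

```python
# waitUntil values and when each one considers navigation finished
wait_until_events = {
    'load': 'the load event has fired',
    'domcontentloaded': 'the DOMContentLoaded event has fired',
    'networkidle0': 'no network connections for at least 500 ms',
    'networkidle2': 'no more than 2 network connections for at least 500 ms',
}
```

networkidle0 is the strictest choice and a common default for scraping pages that load content with JavaScript.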

User Agent setup

Manipulating the User Agent string can sometimes be necessary, either for testing purposes or to mimic a particular browsing environment. To do that with Pyppeteer, you can use a code snippet similar to the one below. 


user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
await page.setUserAgent(user_agent)
await page.goto('https://www.python.org')

If you want to learn more about user agents in Puppeteer read this article: User Agents in Puppeteer.

Scraping with Pyppeteer

Pyppeteer is a great option for web scraping. However, it faces the same challenges as every web scraping framework, such as IP blocks, rate limits, and anti-bot checks. One of the best ways to avoid these challenges is to use proxies.

Here's how you can set up a proxy in Pyppeteer. Replace the placeholders with your proxy's address and port.


browser = await launch(args=['--proxy-server=http://<proxy_address>:<port>'])

If your proxy requires authentication, Pyppeteer has a way to do that too. Before navigating to a page, provide the credentials using the following code line, replacing the placeholders with your username and password.


await page.authenticate({'username': '<username>', 'password': '<password>'})
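Putting the two pieces together, building the proxy settings from variables keeps the details in one place. The host, port, and credentials below are placeholders you would replace with your own proxy's values:

```python
# Placeholder proxy details -- substitute your own
proxy_host = 'proxy.example.com'
proxy_port = 8080

# The flag passed to Chromium at launch time
proxy_arg = f'--proxy-server=http://{proxy_host}:{proxy_port}'

# Credentials for page.authenticate(), if your proxy requires them
proxy_credentials = {'username': 'your_username', 'password': 'your_password'}
```

You would then call launch(args=[proxy_arg]) and, before navigating, await page.authenticate(proxy_credentials).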

Tips for a higher success rate

These are some tips you can employ to achieve a high success rate in web scraping with Pyppeteer. 

  • Regularly rotate User Agents and headers to mimic different browsing scenarios.
  • Introduce delays between your requests. It appears more "human" and reduces the chance of overloading the server or triggering anti-bot mechanisms.
  • Use Python's asyncio to manage multiple scraping tasks concurrently, improving efficiency without overloading the target server.
  • Disable images, CSS, or JavaScript when they aren't necessary for your scraping objectives.
  • Proper handling and rotation of cookies in Puppeteer or Pyppeteer can help maintain session persistence and appear more authentic to websites.
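The first two tips can be sketched with the standard library alone. The user agent strings below are just examples, and the delay bounds are arbitrary choices:

```python
import random

# A small pool of example user agent strings to rotate through
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
]


def pick_user_agent() -> str:
    # Choose a user agent at random for the next page
    return random.choice(USER_AGENTS)


def polite_delay(low: float = 1.0, high: float = 3.0) -> float:
    # A randomized pause length in seconds, to be awaited between requests,
    # e.g. await asyncio.sleep(polite_delay())
    return random.uniform(low, high)
```

Before each page.goto call, you would apply await page.setUserAgent(pick_user_agent()) and sleep for polite_delay() seconds.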

Conclusion

In conclusion, Pyppeteer is a robust and efficient tool for automating browser tasks and scraping web data. Compared to Puppeteer and Selenium, it offers a more "Pythonic" approach to web automation. In this tutorial, we covered the installation and fundamental operations of Pyppeteer, and touched on best practices and strategies for successful web scraping, including the use of proxies. Pyppeteer stands out as a valuable library for developers who favor a Python-based approach, streamlining workflows and extending the limits of web automation.

Related Articles

Using Puppeteer on AWS Lambda for Scraping

Guide to Puppeteer Extra: Best Plugins For Scraping Ranked

How to Use Puppeteer Stealth For Advanced Scraping?