Updated on June 26, 2024

Downloading Files in Puppeteer: 5 Methods Explained

Downloading files in Puppeteer is a crucial feature for web automation tasks. Whether you are performing UI testing with Puppeteer, downloading PDFs, or handling file downloads in general, Puppeteer provides versatile methods to achieve these tasks. 

In this article, we'll cover five methods for downloading files with Puppeteer, from grabbing a single file to scraping all file download links into a file or database, auto-compressing downloads, and saving files to cloud storage. Let's dive in and explore these methods to unlock the full potential of Puppeteer's file download functionality.

Downloading a file from a web page using Puppeteer

Downloading a single file from a webpage can be achieved by using the page.click() method to simulate a click on the download link and then handling the resulting file download. Selectors are crucial in this process because they let you precisely target the download link on the page; for example, the attribute selector a[href$=".pdf"] matches links whose URL ends in .pdf.

Here’s an example of how to download a single file from a webpage using Puppeteer:


const puppeteer = require('puppeteer');
const fs = require('fs');
const path = require('path');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Set the download path
  const downloadPath = path.resolve('./downloads');
  fs.mkdirSync(downloadPath, { recursive: true });

  // Enable request interception
  await page.setRequestInterception(true);

  page.on('request', request => {
    // Continue all requests
    request.continue();
  });

  page.on('response', async response => {
    const requestUrl = response.url();

    // Check if the response is a file download (adjust the condition as needed)
    if (response.request().resourceType() === 'document' && requestUrl.endsWith('.pdf')) { // Adjust for your file type
      const fileName = path.basename(new URL(requestUrl).pathname); // WHATWG URL replaces the deprecated url.parse()
      const filePath = path.resolve(downloadPath, fileName);

      // Save the response buffer to a file
      const buffer = await response.buffer();
      fs.writeFileSync(filePath, buffer);
      console.log(`File downloaded to: ${filePath}`);
    }
  });

  // Navigate to the page containing the file
  await page.goto('https://example.com');

  // Simulate a click on the download link or button
  await page.click('a.download-link'); // Adjust the selector to match the download link on your target page

  // Wait for some time to ensure the download completes
  // (page.waitForTimeout was removed in recent Puppeteer versions, so use a plain timer)
  await new Promise(resolve => setTimeout(resolve, 5000)); // Adjust the time as necessary

  await browser.close();
})();

This script sets up a download directory to store the downloaded file. To handle network requests and responses, request interception is enabled. During this phase, all requests are allowed to continue without interference, but when a response matching the file type (e.g., .pdf) is detected, the response is handled by saving the file locally. The script navigates to the target page using the page.goto() method. To initiate the download, the page.click() method simulates a click on the download link identified by the CSS selector a.download-link.

Finally, the script waits for a specified period to ensure the download completes before closing the browser. This approach ensures that the file is downloaded correctly and saved to the specified directory.
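
A fixed delay is fragile: a slow network can outlast it, while a fast one wastes time. A sturdier alternative is to poll the download directory until the expected file appears. Here's a minimal sketch, reusing the fs and path imports from the script above; the file name, 250 ms interval, and 30-second timeout are illustrative values:


// Poll the download directory until the file exists or the timeout elapses
async function waitForDownload(downloadPath, fileName, timeoutMs = 30000) {
  const filePath = path.resolve(downloadPath, fileName);
  const start = Date.now();
  while (Date.now() - start < timeoutMs) {
    // Chrome writes in-progress downloads with a .crdownload suffix, so wait for the final name
    if (fs.existsSync(filePath) && !fs.existsSync(`${filePath}.crdownload`)) {
      return filePath;
    }
    await new Promise(resolve => setTimeout(resolve, 250)); // Check every 250 ms
  }
  throw new Error(`Timed out waiting for download: ${fileName}`);
}

Calling await waitForDownload(downloadPath, 'report.pdf') in place of the fixed five-second wait makes the script both faster and more reliable.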

Batch file download from a page using Puppeteer

Downloading multiple files from a web page is a common requirement, especially when dealing with large datasets, batch processing, or scraping websites that offer multiple downloadable resources. Puppeteer can be effectively utilized to automate this process by iterating through a list of download links and handling each download sequentially.

Why download a batch of files?

Batch processing is crucial for automating tasks where multiple files need to be retrieved in a single operation. This can be beneficial for scenarios such as:

  • Collecting data from research databases
  • Downloading images or documents from online galleries or repositories
  • Automating the retrieval of reports or logs from web applications

Below is example code demonstrating how to download a batch of files using Puppeteer. Make sure to install the axios package first using this command:


npm install axios


const puppeteer = require('puppeteer');
const path = require('path');
const fs = require('fs');
const axios = require('axios');

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();

    // Set the download path
    const downloadPath = path.resolve('./batch-downloads');
    fs.mkdirSync(downloadPath, { recursive: true });

    // Navigate to the page containing the files
    await page.goto('https://example.com/download-page');

    // Gather all download links
    const links = await page.$$eval('a.download-link', anchors => anchors.map(anchor => anchor.href));

    // Iterate through each link and download the file
    for (const link of links) {
        try {
            const response = await axios.get(link, { responseType: 'stream' });
            const fileName = path.basename(new URL(link).pathname); // Strip any query string from the URL
            const filePath = path.resolve(downloadPath, fileName);
            const writer = fs.createWriteStream(filePath);

            response.data.pipe(writer);

            // Wait for the download to finish
            await new Promise((resolve, reject) => {
                writer.on('finish', resolve);
                writer.on('error', reject);
            });

            console.log(`Downloaded: ${fileName}`);
        } catch (error) {
            console.error(`Failed to download from link: ${link}`, error);
        }
    }

    console.log('Batch download completed.');

    await browser.close();
})();

In the above script, Puppeteer is launched and a new page instance is opened. A download directory, batch-downloads, is created to store the downloaded files. The script navigates to the target web page using page.goto() and extracts all a.download-link anchors using page.$$eval(), mapping their href attributes to an array. For each link, axios downloads the file, streaming the response into the batch-downloads directory. The script waits for each download stream to complete before proceeding to the next link, logging a message for each successful download and an error for any failed attempt.
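
If download order doesn't matter, the sequential loop can also be replaced with concurrent requests. Here's a minimal sketch using Promise.allSettled, assuming the same links array and downloadPath as above, so one failed download doesn't abort the rest:


// Start all downloads at once and collect the outcome of each
const results = await Promise.allSettled(links.map(async link => {
    const response = await axios.get(link, { responseType: 'stream' });
    const fileName = path.basename(new URL(link).pathname);
    const writer = fs.createWriteStream(path.resolve(downloadPath, fileName));
    response.data.pipe(writer);
    await new Promise((resolve, reject) => {
        writer.on('finish', resolve);
        writer.on('error', reject);
    });
    return fileName;
}));

results.forEach((result, i) => {
    if (result.status === 'fulfilled') console.log(`Downloaded: ${result.value}`);
    else console.error(`Failed to download from link: ${links[i]}`, result.reason);
});

For very large batches, cap the number of simultaneous requests (for example, with a small worker pool) to avoid overwhelming the target server.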

Scraping all file download links using Puppeteer

In many scenarios, it’s essential to scrape all file download links from a webpage for further processing or archiving. This method involves extracting download URLs and saving them into a file or a database for easy access and retrieval. Puppeteer provides a robust way to achieve this through its powerful page manipulation capabilities.

Why scrape download links?

Scraping download links is useful in various contexts, such as:

  • Data Collection: Researchers can efficiently collect large datasets for analysis by scraping download links, enabling automated and systematic data acquisition.
  • Automation of Bulk Downloads: Scraping download links is essential for preparing download lists for batch processing, making it easier to manage and automate the download of multiple files simultaneously.
  • Saving Storage: When storage space is limited, you may want to save just the download links without downloading the files themselves, or perform the downloads later on another local machine, avoiding cloud processing and its costs.

Here’s an example code demonstrating how to scrape all file download links from a webpage and save them into a JSON file:


const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();

    // Navigate to the page containing the files
    await page.goto('https://example.com/download-page');

    // Gather all download links
    const links = await page.$$eval('a.download-link', anchors => anchors.map(anchor => anchor.href));

    // Save links to a JSON file
    fs.writeFileSync('download-links.json', JSON.stringify(links, null, 2));

    console.log('Download links saved successfully.');

    await browser.close();
})();

  • Navigating to the Target Page: The page.goto() method is used to navigate to the webpage where the download links are located. 
  • Gathering Download Links: The page.$$eval() method is used to select all download links on the page and map their href attributes to an array. 
  • Saving Links to a JSON File: The extracted links are saved into a JSON file using fs.writeFileSync().
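
Once the links are saved, they can be fed back into any of the download workflows in this article. Here's a minimal sketch that reads download-links.json and reuses the axios streaming approach from the batch example; it assumes every saved link points to a directly fetchable file:


const fs = require('fs');
const path = require('path');
const axios = require('axios');

(async () => {
    // Load the links scraped earlier
    const links = JSON.parse(fs.readFileSync('download-links.json', 'utf8'));

    const downloadPath = path.resolve('./downloads');
    fs.mkdirSync(downloadPath, { recursive: true });

    for (const link of links) {
        const fileName = path.basename(new URL(link).pathname);
        const writer = fs.createWriteStream(path.resolve(downloadPath, fileName));
        const response = await axios.get(link, { responseType: 'stream' });
        response.data.pipe(writer);
        await new Promise((resolve, reject) => {
            writer.on('finish', resolve);
            writer.on('error', reject);
        });
        console.log(`Downloaded: ${fileName}`);
    }
})();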

Automatically compressing files after downloading

When dealing with multiple file downloads, it can be beneficial to compress these files into a single archive. This not only saves storage space but also simplifies file management. Puppeteer can be combined with Node.js modules to automate the process of downloading files and compressing them into a zip archive.

Here’s an example code for downloading files using Puppeteer and compressing them into a zip archive. Make sure to install archiver first using this command:


npm install archiver


const puppeteer = require('puppeteer');
const archiver = require('archiver');
const fs = require('fs');
const path = require('path');

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();

    // Set up the download path
    const downloadPath = path.resolve('./temp-downloads');
    fs.mkdirSync(downloadPath, { recursive: true });

    // Enable download behavior and specify the download path
    const client = await page.createCDPSession(); // page.target().createCDPSession() is deprecated in newer Puppeteer
    await client.send('Page.setDownloadBehavior', { behavior: 'allow', downloadPath: downloadPath });

    // Navigate to the page containing the files
    await page.goto('https://example.com/download-page');

    // Gather all download links
    const links = await page.$$eval('a.download-link', anchors => anchors.map(anchor => anchor.href));

    // Download each file
    for (const link of links) {
        const fileName = path.basename(new URL(link).pathname); // Strip any query string from the URL
        const filePath = path.resolve(downloadPath, fileName);
        try {
            const response = await page.goto(link);
            fs.writeFileSync(filePath, await response.buffer());
            console.log(`Downloaded: ${fileName}`);
        } catch (error) {
            console.error(`Failed to download from link: ${link}`, error);
        }
    }

    // Create a zip archive (make sure the output directory exists first)
    const zipDir = path.resolve('./downloads');
    fs.mkdirSync(zipDir, { recursive: true });
    const zipPath = path.resolve(zipDir, 'files.zip');
    const output = fs.createWriteStream(zipPath);
    const archive = archiver('zip', {
        zlib: { level: 9 } // Maximum compression
    });

    output.on('close', () => {
        console.log(`Archive created: ${zipPath} (${archive.pointer()} total bytes)`);
    });

    archive.on('error', err => {
        throw err;
    });

    archive.pipe(output);

    // Append files to the archive
    fs.readdirSync(downloadPath).forEach(file => {
        const filePath = path.resolve(downloadPath, file);
        archive.file(filePath, { name: file });
    });

    // Finalize the archive
    await archive.finalize();

    // Clean up the temporary download directory
    fs.rmSync(downloadPath, { recursive: true, force: true });

    await browser.close();
})();

  • Setting Up the Download Path: A temporary directory is created to store the downloaded files.
  • Enabling Download Behavior: A Chrome DevTools Protocol (CDP) session is created with createCDPSession(), and download behavior is enabled so that downloaded files are directed to the specified download path.
  • Navigating to the Target Page: The page.goto() method is used to navigate to the webpage where the files are located.
  • Gathering Download Links: The page.$$eval() method is used to select all download links on the page and map their href attributes to an array.
  • Downloading Each File: The script loops through each link, downloads the file, and saves it to the temporary directory.
  • Creating a Zip Archive: The script creates a zip archive using archiver, adds each downloaded file to the archive, and finalizes the archive.
  • Cleaning Up: The temporary download directory is removed after the archive is created.
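
As a side note, archiver can append an entire directory in a single call, which replaces the readdirSync loop above:


// Append everything in downloadPath to the archive root (false = no prefix inside the zip)
archive.directory(downloadPath, false);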

Downloading and saving files to cloud

Cloud storage solutions offer a flexible and scalable way to store files, which is essential in real-world applications. Integrating Puppeteer with cloud services enables automated downloading and saving of files directly to the cloud, improving accessibility and data management. This method combines Puppeteer for file extraction with a cloud storage SDK (such as AWS S3 or Google Cloud Storage) for uploading the files.

Here’s an example code for downloading files using Puppeteer and uploading them to AWS S3. First install aws-sdk using this command:


npm install aws-sdk


const puppeteer = require('puppeteer');
const AWS = require('aws-sdk');
const fs = require('fs');
const path = require('path');

// Configure AWS S3
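// Credentials are hardcoded here for clarity; in practice, prefer environment variables or IAM roles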
const s3 = new AWS.S3({
    accessKeyId: 'your-access-key-id',
    secretAccessKey: 'your-secret-access-key',
    region: 'your-region'
});

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();

    // Set up the download path
    const downloadPath = path.resolve('./temp-downloads');
    fs.mkdirSync(downloadPath, { recursive: true });

    // Enable download behavior and specify the download path
    const client = await page.createCDPSession(); // page.target().createCDPSession() is deprecated in newer Puppeteer
    await client.send('Page.setDownloadBehavior', { behavior: 'allow', downloadPath: downloadPath });

    // Navigate to the page containing the files
    await page.goto('https://example.com/download-page');

    // Gather all download links
    const links = await page.$$eval('a.download-link', anchors => anchors.map(anchor => anchor.href));

    // Download and upload each file
    for (const link of links) {
        const fileName = path.basename(new URL(link).pathname); // Strip any query string from the URL
        const filePath = path.resolve(downloadPath, fileName);

        try {
            // Download the file
            const response = await page.goto(link);
            fs.writeFileSync(filePath, await response.buffer());
            console.log(`Downloaded: ${fileName}`);

            // Read the file
            const fileContent = fs.readFileSync(filePath);

            // Upload to S3
            const params = {
                Bucket: 'your-bucket-name',
                Key: fileName,
                Body: fileContent
            };

            await s3.upload(params).promise();
            console.log(`File uploaded successfully. ${fileName}`);

        } catch (error) {
            console.error(`Failed to process link: ${link}`, error);
        }
    }

    // Clean up the temporary download directory
    fs.rmSync(downloadPath, { recursive: true, force: true });

    await browser.close();
})();

  • Configure AWS S3: The script starts by configuring AWS S3 with access credentials.
  • Launching Puppeteer and Opening a Page: Puppeteer is launched, and a new page instance is opened.
  • Setting Up the Download Path: A temporary directory temp-downloads is created to store the downloaded files.
  • Enabling Download Behavior: Puppeteer is configured to handle file downloads and save them to the specified directory using the createCDPSession method.
  • Navigating to the Target Page: The page.goto() method is used to navigate to the webpage where the files are located.
  • Gathering Download Links: The page.$$eval() method is used to select all download links a.download-link on the page and map their href attributes to an array.
  • Downloading and Uploading Each File: The script loops through each link, downloads the file, and saves it to the temporary directory. It then reads the file and uploads it to AWS S3.
  • Cleaning Up: The temporary download directory is removed after the files are uploaded to the cloud.
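
The same pattern works with other providers. For Google Cloud Storage, here's a minimal upload sketch using the official @google-cloud/storage package; the bucket name and service-account key path are placeholders:


const { Storage } = require('@google-cloud/storage');

// keyFilename points to a service-account JSON key; both values below are placeholders
const storage = new Storage({ keyFilename: 'path/to/service-account.json' });

async function uploadToGCS(filePath, fileName) {
    // upload() streams the local file into the bucket under the given destination name
    await storage.bucket('your-bucket-name').upload(filePath, { destination: fileName });
    console.log(`File uploaded successfully. ${fileName}`);
}

Swapping the s3.upload() call for uploadToGCS() in the loop above is the only change needed.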

Conclusion

Downloading files using Puppeteer is a powerful and flexible way to automate web interactions and streamline the acquisition of resources from webpages. We covered five key methods: downloading a single file, downloading a batch of files, scraping all file download links into a file or database, downloading and auto-compressing files, and saving files to cloud storage. Each method offers a unique approach tailored to different needs, whether you are handling a single download or managing multiple files.
