Updated on April 2, 2025

Web Scraping on AWS Free Tier with Scrapy

Web scraping allows developers to extract valuable data from websites automatically, making it a powerful tool for data analysis, market research, and automation. However, running a scraper on personal machines can be inefficient and unreliable, especially for long-running tasks. This is where AWS Free Tier comes in – it provides a cost-effective way to deploy and manage web scraping projects using EC2 for execution and S3 for data storage.

In this guide, we’ll explore how to set up a web scraping project on AWS Free Tier and understand the key components involved. We’ll discuss how EC2 instances can be used to run a Scrapy-based web scraper and how an S3 bucket can store scraped data efficiently. Additionally, we’ll cover strategies to automate the scraping process and ways to optimize cloud scraping costs to stay within AWS Free Tier limits.

Setting up an AWS Free Tier account

Before deploying a web scraper on AWS, you need to create an AWS Free Tier account. Follow these steps to set up your account:

Step 1: Sign up for AWS

  1. Open the AWS Sign-up page in your browser.
  2. Enter your email address and create an AWS account name.
  3. Click Verify Email Address and enter the verification code sent to your email.

Step 2: Set up the Root user password

  1. Create a secure password and confirm it.
  2. Click Continue to move to the next step.

Step 3: Select account type

  1. Choose Personal for individual use.
  2. Provide personal details such as full name, contact number, country, and address.
  3. Click Continue to proceed.

Step 4: Enter billing information

  1. Add your payment details (credit or debit card).
  2. Click Verify and Continue to proceed.

Step 5: Verify your identity

  1. Choose a verification method – either through a mobile number or email.
  2. Enter the verification code you receive and confirm.
  3. Click Continue to proceed.

Step 6: Choose a support plan

  1. Select Basic Support (Free) to avoid extra charges.
  2. Click Complete Sign up to finalize your setup.

Step 7: AWS account confirmation

Once the setup is complete, AWS will confirm your account activation. You can now sign in to the AWS Management Console and start using Free Tier services.

Setting up an EC2 instance for scraping

An EC2 instance is a virtual machine on AWS that allows you to run applications remotely. For web scraping, we’ll use an EC2 Free Tier instance to execute Scrapy and extract data.

Steps to launch an EC2 instance

Step 1 - Go to EC2 Dashboard: 

  • Sign in to your AWS account
  • Navigate to the AWS Management Console and click on Services in the top-left menu.
  • From the list of services, select EC2 to open the EC2 Dashboard.

Step 2 - Start Creating an Instance:

  • Click Launch Instance to begin the setup process.
  • You’ll be redirected to the configuration page, where you’ll need to specify the instance details.

Step 3 - Choose an Amazon Machine Image (AMI): 

  • Select the operating system for your instance.
  • AWS provides multiple options - if you’re using Python-based scraping, Ubuntu 22.04 LTS is a recommended choice.

Step 4 - Select an Instance Type: 

  • Choose an instance type based on your project’s needs.
  • For Free Tier eligibility, select t2.micro, which includes 1 vCPU and 1 GB RAM.
  • Avoid selecting higher-tier instances, as they may incur additional costs.

Step 5 - Configure Security and Key Pair:

  • Create a new key pair by clicking on Create new key pair.
  • Choose .pem format and download the key securely—this will be required to connect via SSH.
  • Keep the default security settings unless you need specific network configurations.

Step 6 - Configure Storage and Network Settings:

  • AWS Free Tier provides 30 GB of EBS storage, which is sufficient for small projects.
  • Keep network settings at default unless specific changes are required.
  • Ensure Auto-assign Public IP is enabled so you can access the instance remotely.

Step 7 - Launch the Instance:

  • Double-check all configurations to ensure they fall within Free Tier limits.
  • Click Launch Instance to create your server.

Step 8 - Connect to Your EC2 Instance: 

Once the instance is running, connect via SSH using:

ssh -i your-key.pem ubuntu@your-ec2-public-ip
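If SSH refuses the key because its permissions are too open, restrict them first (OpenSSH rejects private keys that other users can read):

chmod 400 your-key.pem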

Installing Python and Scrapy on EC2

Once connected to your instance, install the necessary tools:

# Update packages
sudo apt update && sudo apt upgrade -y

# Install Python and essential dependencies for Scrapy
sudo apt install -y python3 python3-pip python3-dev libxml2-dev libxslt1-dev libssl-dev

# Install Scrapy
pip3 install scrapy
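To confirm the installation, print the installed version; if the scrapy command is not found, pip's user install directory (~/.local/bin) may need to be added to your PATH:

scrapy version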

Writing a simple Scrapy spider on EC2

Scrapy is a powerful Python framework for web scraping that allows you to send HTTP requests, parse HTML, and extract structured data efficiently. It is designed to handle large-scale scraping with built-in features like request queuing, middleware, and export pipelines. Now that our EC2 instance is set up with Python and Scrapy, let’s create a basic Scrapy project to extract data from a website.

Step 1: Creating a Scrapy project

Once connected to your EC2 instance via SSH, navigate to your desired directory and create a new Scrapy project:

scrapy startproject aws_scraper
cd aws_scraper

This command creates a directory structure like this:

aws_scraper/
    scrapy.cfg
    aws_scraper/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

The key folders and files are:

  • spiders/ - Where the web scraping logic is written.
  • items.py - Defines the structure of the extracted data.
  • pipelines.py - Handles post-processing (e.g., saving data to S3).
  • settings.py - Configures Scrapy settings like delays and user agents (see the example below).
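As an illustration, a polite baseline configuration in settings.py might look like this (the values and user agent string are examples, not requirements):

# settings.py
ROBOTSTXT_OBEY = True        # Respect robots.txt rules
DOWNLOAD_DELAY = 1           # Wait one second between requests
USER_AGENT = "aws_scraper (+https://example.com/contact)"  # Identify your crawler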

Step 2: Writing a Scrapy spider

Now, let’s create a simple spider to scrape quotes from http://quotes.toscrape.com, a test website designed for scraping practice. Create the spider file inside the project’s spiders/ directory:

cd aws_scraper/spiders
nano quotes_spider.py

Paste the following Scrapy spider code:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        # Follow pagination links
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

This spider starts at http://quotes.toscrape.com, as defined in start_urls, and extracts quote text, author names, and associated tags using CSS selectors. It then follows pagination links to navigate through multiple pages, ensuring comprehensive data collection. The extracted information is yielded as a Python dictionary, which Scrapy processes automatically.

Step 3: Running the Scrapy spider

Run the spider and display the output in the terminal with the following command:

scrapy crawl quotes

To save the output to a JSON file instead, run:

scrapy crawl quotes -o quotes.json

The scraped items will look like this:

[
    {'text': '"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']},
    {'text': '"It is our choices, Harry, that show what we truly are, far more than our abilities."', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']},
...
]

Step 4: Storing scraped data in S3

Instead of saving data locally, we can store it in an AWS S3 bucket by following these steps:

1. Configure IAM Permissions for S3: Before proceeding, ensure your EC2 instance has an IAM role with necessary permissions to upload files to S3. Follow these steps:

  • Go to the AWS IAM Console and create a new IAM Role for EC2.
  • Attach a custom policy that grants only the required permission (s3:PutObject) instead of AmazonS3FullAccess, which provides broader access than necessary. An example policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::your-bucket-name/*"
    }
  ]
}
  • Attach the role to your EC2 instance under EC2 > Instances > Actions > Security > Modify IAM Role.
  • Select the newly created IAM role and save the changes.

2. Install Boto3 (the AWS SDK for Python): Install the boto3 library, which lets Python code interact with AWS services:

pip3 install boto3
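As an optional sanity check (assuming the IAM role from step 1 is attached to the instance), confirm that boto3 can resolve credentials; sts:GetCallerIdentity needs no extra permissions:

python3 -c "import boto3; print(boto3.client('sts').get_caller_identity()['Arn'])"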

Modify pipelines.py to save data to S3:

import boto3
import json

class SaveToS3Pipeline:
    def __init__(self):
        self.s3 = boto3.client("s3", region_name="us-west-2")  # Set the correct region
        self.bucket_name = "your-s3-bucket-name"

    def process_item(self, item, spider):
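        # Note: objects are keyed by author, so multiple quotes from the same
        # author will overwrite one another; add a unique suffix if that matters.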
        file_name = f"quotes/{item['author'].replace(' ', '_')}.json"
        self.s3.put_object(
            Bucket=self.bucket_name,
            Key=file_name,
            Body=json.dumps(item, indent=4),
            ContentType="application/json"
        )
        return item

Enable the pipeline in settings.py:

ITEM_PIPELINES = {
    'aws_scraper.pipelines.SaveToS3Pipeline': 300,
}

Now, when you run the scraper, data will be saved to S3 instead of your EC2 instance.

Alternative Approach: Instead of a custom pipeline, you can also use Scrapy’s built-in S3 support:

scrapy crawl quotes -o s3://your-bucket-name/quotes.json

Requirements:

  • Scrapy’s built-in S3 feed export relies on botocore, which is already installed as a dependency of boto3 (or install it directly with pip3 install botocore).
  • Configure AWS credentials via environment variables or ~/.aws/credentials; on EC2, botocore can also pick up the instance’s IAM role. The same destination can be set in settings.py, as shown below.
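For reference, the same destination can be configured in settings.py through Scrapy’s FEEDS setting (the bucket name is a placeholder; the overwrite option assumes Scrapy 2.4 or newer):

# settings.py: equivalent of the -o s3://... command-line option
FEEDS = {
    "s3://your-bucket-name/quotes.json": {
        "format": "json",
        "overwrite": True,
    },
}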

Automating the scraping process on AWS

Manually running the Scrapy spider every time you need new data isn’t efficient. Instead, we can automate the scraping process by scheduling it to run at specific intervals. AWS provides multiple ways to do this, but we’ll focus on three common approaches:

1. Using cron jobs on EC2

A cron job is a built-in Linux utility that allows scheduled execution of tasks. If the scraper is running on an EC2 instance, a cron job can be set up to execute the Scrapy spider at predefined intervals (e.g., every hour or daily). While this method is simple to implement, the EC2 instance must remain active, which may lead to unnecessary costs if not managed properly.
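For example, to run the spider every day at 2 AM and append its log to a file, you could add an entry with crontab -e (the paths are illustrative and should point to wherever Scrapy and your project actually live):

0 2 * * * cd /home/ubuntu/aws_scraper && /home/ubuntu/.local/bin/scrapy crawl quotes >> /home/ubuntu/scrape.log 2>&1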

2. Using AWS Lambda with EventBridge

AWS Lambda provides a serverless way to automate scraping tasks. However, Scrapy isn’t natively compatible with Lambda due to binary dependencies like Twisted. To make it work, Scrapy and its dependencies must be packaged using a Lambda layer or a container image.

Example solution:

  • Package Scrapy using AWS SAM or Docker, then deploy it as a Lambda function.
  • Use EventBridge to schedule the function execution at desired intervals.

This approach eliminates the need for a constantly running EC2 instance, reducing infrastructure costs. However, since AWS Lambda has execution time limits (15 minutes per run), it is best suited for lightweight scraping tasks rather than large-scale data extraction.
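As a rough sketch, once the Lambda function exists, an EventBridge rule can trigger it on a schedule from the AWS CLI (the rule name and function ARN below are placeholders):

# Run the scraping function once per day
aws events put-rule --name daily-scrape --schedule-expression "rate(1 day)"
aws events put-targets --rule daily-scrape \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-west-2:123456789012:function:scrapy-crawler"

You would also need to grant EventBridge permission to invoke the function (aws lambda add-permission) before the schedule takes effect.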

3. EC2 scheduling with AWS Systems Manager

For cases where the scraper must run on EC2 but without keeping the instance active all the time, AWS Systems Manager can be used to start and stop the EC2 instance at specific times. This ensures that the instance is only running when necessary, optimizing resource usage while maintaining flexibility for more complex scraping tasks.
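A minimal sketch of this approach, assuming the AWS CLI is configured and using a placeholder instance ID, runs the AWS-managed automation documents that stop and start an instance:

# Stop the scraper instance once the job is done
aws ssm start-automation-execution \
  --document-name "AWS-StopEC2Instance" \
  --parameters "InstanceId=i-0123456789abcdef0"

# Start it again shortly before the next scheduled run
aws ssm start-automation-execution \
  --document-name "AWS-StartEC2Instance" \
  --parameters "InstanceId=i-0123456789abcdef0"

These executions can themselves be scheduled (for example with EventBridge) so the instance only runs during the scraping window.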

Optimizing cloud scraping costs

Efficient cost management is essential when running scrapers on AWS Free Tier, as exceeding limits can lead to unexpected charges. Here are some strategies to minimize cloud scraping costs:

Keeping EC2 usage minimal

Since EC2 instances contribute significantly to AWS costs, it’s best to use them only when necessary. This can be achieved by:

  • Running the scraper only at scheduled intervals instead of keeping the instance always active.
  • Shutting down or terminating the EC2 instance after the scraping task is complete (see the CLI example below).
  • Using AWS Systems Manager to automate instance start and stop times.
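For manual control, the same thing can be done from the AWS CLI (the instance ID is a placeholder); a stopped instance stops accruing compute hours, although its EBS volume still counts toward the 30 GB storage allowance:

aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 start-instances --instance-ids i-0123456789abcdef0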

Using Spot instances vs. On-Demand instances

AWS Spot Instances are significantly cheaper than On-Demand Instances, reducing costs by up to 90%. However, they aren’t included in the Free Tier. Spot Instances are ideal for non-Free-Tier workloads that can tolerate interruptions.

Best practices:

  • Use Spot Instances only if the workload is flexible and interruptions won’t affect data integrity.
  • Implement automatic retries in case a Spot Instance is terminated unexpectedly.

Wrapping up: Web scraping on AWS Free Tier

Deploying a web scraper on AWS Free Tier provides a cost-effective and scalable solution for data extraction. By using EC2 for execution, S3 for storage, and automation tools like cron jobs or AWS Systems Manager, you can streamline scraping workflows while minimizing manual intervention. To optimize costs within the Free Tier, consider stopping EC2 instances after execution rather than keeping them running continuously. For workloads outside the Free Tier, Spot Instances can reduce costs by up to 90%, though they come with the risk of termination. If using AWS Lambda, containerizing Scrapy ensures compatibility with its execution environment. With the right strategies in place, AWS enables efficient web scraping while keeping infrastructure costs under control.
