Web scraping allows developers to extract valuable data from websites automatically, making it a powerful tool for data analysis, market research, and automation. However, running a scraper on personal machines can be inefficient and unreliable, especially for long-running tasks. This is where AWS Free Tier comes in – it provides a cost-effective way to deploy and manage web scraping projects using EC2 for execution and S3 for data storage.
In this guide, we’ll explore how to set up a web scraping project on AWS Free Tier and understand the key components involved. We’ll discuss how EC2 instances can be used to run a Scrapy-based web scraper and how an S3 bucket can store scraped data efficiently. Additionally, we’ll cover strategies to automate the scraping process and ways to optimize cloud scraping costs to stay within AWS Free Tier limits.
Before deploying a web scraper on AWS, you need an AWS Free Tier account. Sign up on the AWS website with your email address, provide your contact and payment details, verify your identity, and choose the Basic (free) support plan.
Once the setup is complete, AWS will confirm your account activation. You can now sign in to the AWS Management Console and start using Free Tier services.
An EC2 instance is a virtual machine on AWS that allows you to run applications remotely. For web scraping, we’ll use an EC2 Free Tier instance to execute Scrapy and extract data.
Step 1 - Go to EC2 Dashboard: Sign in to the AWS Management Console and open the EC2 service.
Step 2 - Start Creating an Instance: From the EC2 dashboard, click "Launch instance".
Step 3 - Choose an Amazon Machine Image (AMI): Select a Free Tier-eligible Ubuntu Server AMI.
Step 4 - Select an Instance Type: Choose a Free Tier-eligible type such as t2.micro.
Step 5 - Configure Security and Key Pair: Create or select a key pair for SSH access and a security group that allows inbound SSH (port 22) from your IP address.
Step 6 - Configure Storage and Network Settings: The defaults are sufficient for this project; the Free Tier includes up to 30 GB of EBS storage.
Step 7 - Launch the Instance: Review the configuration, launch, and wait until the instance state shows "Running".
Step 8 - Connect to Your EC2 Instance:
Once the instance is running, connect via SSH using:
ssh -i your-key.pem ubuntu@your-ec2-public-ip
Once connected to your instance, install the necessary tools:
# Update packages
sudo apt update && sudo apt upgrade -y
# Install Python and essential dependencies for Scrapy
sudo apt install -y python3 python3-pip python3-dev libxml2-dev libxslt1-dev libssl-dev
# Install Scrapy
pip3 install scrapy
Scrapy is a powerful Python framework for web scraping that allows you to send HTTP requests, parse HTML, and extract structured data efficiently. It is designed to handle large-scale scraping with built-in features like request queuing, middleware, and export pipelines. Now that our EC2 instance is set up with Python and Scrapy, let’s create a basic Scrapy project to extract data from a website.
Once connected to your EC2 instance via SSH, navigate to your desired directory and create a new Scrapy project:
scrapy startproject aws_scraper
cd aws_scraper
This command creates a directory structure like this:
aws_scraper/
    aws_scraper/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
    scrapy.cfg
The key folders and files are:
scrapy.cfg: the project's deployment configuration file.
aws_scraper/: the project's Python module, where your code lives.
items.py: optional definitions of the item structures you scrape.
middlewares.py: hooks for customizing how requests and responses are processed.
pipelines.py: item pipelines for post-processing and storing scraped data (we'll use this for S3 later).
settings.py: project-wide configuration, such as enabled pipelines.
spiders/: the directory that holds your spider classes.
Now, let's create a simple spider to scrape quotes from http://quotes.toscrape.com, a test website designed for scraping practice. Run the commands below to create the spider file:
cd aws_scraper/spiders
nano quotes_spider.py
Paste the following Scrapy spider code:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        # Follow pagination links
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
This spider starts at http://quotes.toscrape.com, as defined in start_urls, and extracts quote text, author names, and associated tags using CSS selectors. It then follows pagination links to navigate through multiple pages, ensuring comprehensive data collection. The extracted information is yielded as a Python dictionary, which Scrapy processes automatically.
Run the spider and display the output in the terminal using the below command:
scrapy crawl quotes
Or save the output to a JSON file with:
scrapy crawl quotes -o quotes.json
You should see output like this:
[
{'text': '"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']},
{'text': '"It is our choices, Harry, that show what we truly are, far more than our abilities."', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']},
...
]
Instead of saving data locally, we can store it in an AWS S3 bucket by following the below steps:
1. Configure IAM Permissions for S3: Before proceeding, ensure your EC2 instance has an IAM role with permission to upload files to S3. In the IAM console, create a role for the EC2 service, attach a policy like the one below, and then attach the role to your instance:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::your-bucket-name/*"
        }
    ]
}
2. Install Boto3 for AWS SDK: Install the boto3 library, which allows Python to interact with AWS services:
pip3 install boto3
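With boto3 installed and the IAM role attached, you can optionally confirm that the instance can write to your bucket before wiring up the pipeline. The snippet below is a minimal sketch; the bucket name, region, and key are placeholders to replace with your own:
import boto3

# Quick check that this instance can upload to the bucket (names are placeholders)
s3 = boto3.client("s3", region_name="us-west-2")
s3.put_object(
    Bucket="your-s3-bucket-name",
    Key="test/connection_check.txt",
    Body=b"EC2 instance can write to S3",
)
print("Upload succeeded")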
Modify pipelines.py to save data to S3:
import boto3
import json

class SaveToS3Pipeline:
    def __init__(self):
        self.s3 = boto3.client("s3", region_name="us-west-2")  # Set the correct region
        self.bucket_name = "your-s3-bucket-name"

    def process_item(self, item, spider):
        file_name = f"quotes/{item['author'].replace(' ', '_')}.json"
        self.s3.put_object(
            Bucket=self.bucket_name,
            Key=file_name,
            Body=json.dumps(item, indent=4),
            ContentType="application/json"
        )
        return item
Enable the pipeline in settings.py:
ITEM_PIPELINES = {
    'aws_scraper.pipelines.SaveToS3Pipeline': 300,
}
Now, when you run the scraper, data will be saved to S3 instead of your EC2 instance.
Alternative Approach: Instead of a custom pipeline, you can also use Scrapy’s built-in S3 support:
scrapy crawl quotes -o s3://your-bucket-name/quotes.json
Requirements: Scrapy's S3 feed storage backend relies on the botocore library, so install it alongside Scrapy:
pip3 install botocore
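You can also configure the same S3 export in settings.py instead of passing -o on the command line. The snippet below is a minimal sketch; the bucket name is a placeholder, and credentials are picked up from the instance's IAM role (or from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY settings):
# settings.py - built-in feed export straight to S3 (bucket name is a placeholder)
FEEDS = {
    "s3://your-bucket-name/quotes.json": {
        "format": "json",
    },
}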
Manually running the Scrapy spider every time you need new data isn’t efficient. Instead, we can automate the scraping process by scheduling it to run at specific intervals. AWS provides multiple ways to do this, but we’ll focus on three common approaches:
A cron job is a built-in Linux utility that allows scheduled execution of tasks. If the scraper is running on an EC2 instance, a cron job can be set up to execute the Scrapy spider at predefined intervals (e.g., every hour or daily). While this method is simple to implement, the EC2 instance must remain active, which may lead to unnecessary costs if not managed properly.
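For example, a crontab entry like the one below (a sketch; the paths are assumptions based on the project layout above and may need adjusting) runs the spider every day at 2 AM and appends its output to a log file:
# Open the crontab editor with: crontab -e
# Run the quotes spider daily at 2 AM and log the output
0 2 * * * cd /home/ubuntu/aws_scraper && $HOME/.local/bin/scrapy crawl quotes >> /home/ubuntu/scrapy_cron.log 2>&1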
AWS Lambda provides a serverless way to automate scraping tasks. However, Scrapy isn’t natively compatible with Lambda due to binary dependencies like Twisted. To make it work, Scrapy and its dependencies must be packaged using a Lambda layer or a container image.
One example solution is to package Scrapy and its dependencies into a container image, deploy that image as the Lambda function, and have the handler kick off the spider (see the sketch below).
This approach eliminates the need for a constantly running EC2 instance, reducing infrastructure costs. However, since AWS Lambda has execution time limits (15 minutes per run), it is best suited for lightweight scraping tasks rather than large-scale data extraction.
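As an illustration, a container-image Lambda could use a handler along these lines, running the spider with Scrapy's CrawlerProcess and writing results straight to S3 via the FEEDS setting. This is a sketch under the assumptions above (the bucket name is a placeholder), not a drop-in deployment:
# lambda_handler.py - sketch of running the spider inside a Lambda container image
from scrapy.crawler import CrawlerProcess
from aws_scraper.spiders.quotes_spider import QuotesSpider

def handler(event, context):
    process = CrawlerProcess(settings={
        # Write results directly to S3 (bucket name is a placeholder)
        "FEEDS": {"s3://your-bucket-name/quotes.json": {"format": "json"}},
    })
    process.crawl(QuotesSpider)
    process.start()  # Blocks until the crawl finishes
    return {"status": "crawl finished"}
Note that Twisted's reactor cannot be restarted within the same process, so warm Lambda invocations may need extra handling, such as running the crawl in a subprocess.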
For cases where the scraper must run on EC2 but without keeping the instance active all the time, AWS Systems Manager can be used to start and stop the EC2 instance at specific times. This ensures that the instance is only running when necessary, optimizing resource usage while maintaining flexibility for more complex scraping tasks.
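If you prefer to script this rather than configure it in the Systems Manager console, a small scheduled Lambda (or any environment with AWS credentials) can start and stop the instance with boto3. A minimal sketch, where the instance ID and region are placeholders:
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # Set your region
INSTANCE_ID = "i-0123456789abcdef0"  # Placeholder instance ID

def start_scraper_instance():
    # Start the instance shortly before the scheduled crawl
    ec2.start_instances(InstanceIds=[INSTANCE_ID])

def stop_scraper_instance():
    # Stop it again once the crawl has finished
    ec2.stop_instances(InstanceIds=[INSTANCE_ID])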
Efficient cost management is essential when running scrapers on AWS Free Tier, as exceeding limits can lead to unexpected charges. Here are some strategies to minimize cloud scraping costs:
Since EC2 instances contribute significantly to AWS costs, it's best to run them only when necessary: stop the instance as soon as a scrape finishes and start it again just before the next scheduled run, whether manually, with AWS Systems Manager as described above, or with a small wrapper script like the one sketched below.
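One common pattern is a wrapper script that the cron job calls: it runs the spider and then powers the machine off, which stops an EBS-backed instance by default. A sketch, with paths that are assumptions to adjust for your setup:
#!/bin/bash
# run_and_stop.sh - run the spider, then shut the instance down when finished
cd /home/ubuntu/aws_scraper
$HOME/.local/bin/scrapy crawl quotes
# On EBS-backed instances, an OS-level shutdown stops the instance by default
sudo shutdown -h now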
AWS Spot Instances are significantly cheaper than On-Demand Instances, reducing costs by up to 90%. However, they aren’t included in the Free Tier. Spot Instances are ideal for non-Free-Tier workloads that can tolerate interruptions.
Best practices: design the crawl to tolerate interruptions, for example by writing each item to S3 as soon as it is scraped (as the pipeline above already does) and by enabling Scrapy's JOBDIR setting so an interrupted crawl can resume where it left off instead of starting over.
Deploying a web scraper on AWS Free Tier provides a cost-effective and scalable solution for data extraction. By using EC2 for execution, S3 for storage, and automation tools like cron jobs or AWS Systems Manager, you can streamline scraping workflows while minimizing manual intervention. To optimize costs within the Free Tier, consider stopping EC2 instances after execution rather than keeping them running continuously. For workloads outside the Free Tier, Spot Instances can reduce costs by up to 90%, though they come with the risk of termination. If using AWS Lambda, containerizing Scrapy ensures compatibility with its execution environment. With the right strategies in place, AWS enables efficient web scraping while keeping infrastructure costs under control.