What We’ll Build
We’ll create a spider that scrapes quotes from quotes.toscrape.com, a website designed for practicing web scraping. Our spider will:

- Start from the homepage
- Extract quotes, authors, and tags from each page
- Follow pagination links automatically
- Export all data to JSON
Prerequisites
Make sure you have Scrapling installed (pip install scrapling).

Step 1: Basic Spider Structure
Let’s start with the absolute minimum spider, quotes_spider.py. It needs just three things:
- name: A unique identifier for your spider. Required.
- start_urls: The list of URLs where the spider begins crawling. Required.
- parse(): The default callback method that processes responses. Must be an async generator.
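A sketch of that minimal spider is below. Because the exact base-class import path varies between Scrapling versions, the import is commented out and a plain class stands in for it; treat that as an assumption and check your installed version's docs.

```python
# In real code you would subclass Scrapling's spider base class; the import
# path below is an assumption, so it is left commented out.
# from scrapling.spider import Spider

class QuotesSpider:  # real code: class QuotesSpider(Spider)
    name = "quotes"                               # unique spider identifier
    start_urls = ["https://quotes.toscrape.com"]  # where crawling begins

    async def parse(self, response):
        # parse() must be an async generator: yield dict items (or new requests)
        yield {"url": response.url}
```

Running this spider produces exactly one item per start URL, which is what you'll see in the output below.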
Running Your Spider
Run the spider. You should see:

- Log messages showing the spider starting and finishing
- The scraped item: [{'url': 'https://quotes.toscrape.com'}]
- Statistics about the crawl
Step 2: Extracting Data
Now let’s extract actual quote data. First, inspect the HTML structure of a quote, then update the parse() method in quotes_spider.py.
Understanding Selectors
- response.css('div.quote'): Finds all <div class="quote"> elements
- ::text: A CSS pseudo-element that extracts text content
- .get(): Returns the first match (or None)
- .getall(): Returns all matches as a list
Step 3: Following Pagination
The website has multiple pages. Let’s make the spider follow the “Next” button (quotes_spider.py):
- response.follow(): Creates a new request from a relative or absolute URL
- callback=self.parse: Tells the spider to process the response with parse()
- The spider automatically handles relative URLs (like /page/2/)
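Combining extraction with response.follow(), the spider can walk every page. In this sketch the li.next a selector matches the site's "Next →" button markup, and ::attr(href) is assumed to be supported alongside ::text (it is the usual companion pseudo-element for grabbing attributes); the base class is again left as a comment.

```python
# from scrapling.spider import Spider  # import path is version-dependent

class QuotesSpider:  # real code: subclass Scrapling's spider base class
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    async def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("a.tag::text").getall(),
            }
        # The "Next" button lives in <li class="next"><a href="/page/2/">.
        # ::attr(href) pulls the link target; follow() resolves the relative URL.
        next_href = response.css("li.next a::attr(href)").get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)
```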
Step 4: Extracting Author Details
Each author name is a link to their detail page. Let’s follow those links and extract more information (quotes_spider.py):
- meta={'quote_data': quote_data}: Passes data between callbacks
- response.request.meta['quote_data']: Retrieves the passed data
- Multiple callbacks: parse() for quote pages, parse_author() for author pages
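The meta hand-off between the two callbacks looks roughly like this. The author-page selectors (span.author-born-date, div.author-description) match the site's markup; the small.author + a selector for the "(about)" link is an assumption based on the quote HTML, so verify it in your browser's inspector.

```python
# from scrapling.spider import Spider  # import path is version-dependent

class QuotesSpider:  # real code: subclass Scrapling's spider base class
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    async def parse(self, response):
        for quote in response.css("div.quote"):
            quote_data = {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
            # The "(about)" link sits right after the author's name
            author_href = quote.css("small.author + a::attr(href)").get()
            if author_href:
                # meta carries quote_data along to the next callback
                yield response.follow(
                    author_href,
                    callback=self.parse_author,
                    meta={"quote_data": quote_data},
                )

    async def parse_author(self, response):
        quote_data = response.request.meta["quote_data"]  # retrieve passed data
        quote_data["born"] = response.css("span.author-born-date::text").get()
        quote_data["bio"] = response.css("div.author-description::text").get()
        yield quote_data
```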
Step 5: Adding Configuration
Let’s add some spider configuration to control crawling behavior (quotes_spider.py).
Step 6: Data Processing & Export
Add data cleaning and export functionality (quotes_spider.py).
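The cleaning-and-export step needs nothing beyond plain Python. The helper names below (clean_item, export_json) are illustrative, not part of Scrapling's API: one normalizes a scraped item, the other writes the whole batch to JSON.

```python
import json

def clean_item(item):
    # Strip whitespace and the curly quotation marks the site wraps text in,
    # and de-duplicate tags while keeping a stable (sorted) order.
    text = (item.get("text") or "").strip(' \t\n“”"')
    return {**item, "text": text, "tags": sorted(set(item.get("tags", [])))}

def export_json(items, path):
    # ensure_ascii=False keeps accented author names readable in the file
    with open(path, "w", encoding="utf-8") as f:
        json.dump([clean_item(i) for i in items], f, ensure_ascii=False, indent=2)
```

In a real spider you would call export_json once the crawl finishes, with whatever list of items your callbacks accumulated.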
Step 7: Adding Pause & Resume
For long-running crawls, enable checkpointing (quotes_spider.py).
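Scrapling's checkpointing options are version-specific, so rather than guess at setting names, here is the underlying idea in plain Python: persist which URLs are already done, skip them on restart. The Checkpoint class is our own illustration, not Scrapling API.

```python
import json, os

class Checkpoint:
    """Persist crawl progress so an interrupted run can resume."""

    def __init__(self, path):
        self.path = path
        self.done = set()
        if os.path.exists(path):  # resuming: load previously completed URLs
            with open(path, encoding="utf-8") as f:
                self.done = set(json.load(f))

    def seen(self, url):
        return url in self.done

    def mark(self, url):
        # Record completion and flush to disk immediately, so a crash
        # between pages loses at most the page in flight.
        self.done.add(url)
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(sorted(self.done), f)
```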
Step 8: Using Different Fetchers
By default, spiders use HTTP requests. For JavaScript-heavy sites, use browser sessions (advanced_spider.py).
Complete Example
Here’s our final, production-ready spider (quotes_spider.py).
Testing Your Spider
Before running on the full site, test with a single URL (test_spider.py).
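One network-free way to test a callback is to drive it by hand: build a minimal stand-in response, run the async generator, and assert on what it yields. The run_callback helper and FakeResponse class below are our own test scaffolding, not Scrapling API.

```python
import asyncio

async def run_callback(callback, response):
    """Collect everything an async-generator callback yields."""
    return [item async for item in callback(response)]

class FakeResponse:
    """Minimal stand-in for a real response object."""
    def __init__(self, url):
        self.url = url

# Any parse() that only touches response.url can be exercised like this:
async def parse(response):
    yield {"url": response.url}

items = asyncio.run(run_callback(parse, FakeResponse("https://quotes.toscrape.com")))
```

To test selector logic the same way, give FakeResponse a css() method that returns canned selections for the selectors your callback uses.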
Next Steps
You now have a solid foundation for building web scrapers with Scrapling! Here’s what to explore next:

- Using multiple session types - Combine HTTP and browser sessions
- Proxy rotation - Rotate proxies and handle blocking
- Advanced features - Streaming, error handling, and more
- Real-world examples - Production-ready spider examples