## Prerequisites

- You’ve completed or read the Fetchers basics page to understand the different fetcher types and when to use each one.
- You’ve completed or read the Main classes page to understand the Selector and Response classes.
## Data Flow

Here’s what happens step by step when you run a spider:

1. The Spider produces the first batch of `Request` objects. By default, it creates one request for each URL in `start_urls`, but you can override `start_requests()` for custom logic.
2. The Scheduler receives requests and places them in a priority queue, creating fingerprints for deduplication. Higher-priority requests are dequeued first.
3. The Crawler Engine asks the Scheduler to dequeue the next request, respecting concurrency limits (global and per-domain) and download delays. Once received, it passes the request to the Session Manager, which routes it to the correct session based on the request’s `sid` (session ID).
4. The session fetches the page and returns a Response object to the Crawler Engine. The engine records statistics and checks for blocked responses. If the response is blocked, the engine retries the request up to `max_blocked_retries` times. The blocking detection and retry logic can be customized.
5. The Crawler Engine passes the Response to the request’s callback. The callback either yields a dictionary (treated as a scraped item) or a follow-up request (sent to the Scheduler for queuing).
6. The cycle repeats from step 2 until the Scheduler is empty and no tasks are active, or the spider is paused.
7. If `crawldir` is set, the Crawler Engine periodically saves a checkpoint (pending requests + seen URLs set) to disk. On graceful shutdown (Ctrl+C), a final checkpoint is saved. The next time the spider runs with the same `crawldir`, it resumes from where it left off, skipping `start_requests()` and restoring the scheduler state.
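The cycle above can be sketched as a simplified, synchronous toy loop. All of the types and helpers below are hypothetical stand-ins for illustration, not Scrapling’s actual classes (a real crawl is asynchronous and goes through the Session Manager):

```python
from collections import deque


def make_request(url, callback):
    # Toy stand-in for a Request object.
    return {"url": url, "callback": callback}


def fetch(request):
    # Pretend-fetch: a real session would perform HTTP here.
    return {"url": request["url"], "status": 200}


def parse(response):
    # A callback yields scraped items (dicts) and/or follow-up requests.
    yield {"page": response["url"]}
    if response["url"].endswith("/a"):
        yield make_request("https://example.com/b", parse)


def crawl(start_urls):
    # Step 1: one request per start URL.
    queue = deque(make_request(u, parse) for u in start_urls)
    seen = {r["url"] for r in queue}  # deduplication set
    items = []
    while queue:  # steps 2-6: repeat until the queue drains
        request = queue.popleft()           # scheduler dequeues
        response = fetch(request)           # session fetches
        for result in request["callback"](response):
            if "callback" in result:        # follow-up request
                if result["url"] not in seen:
                    seen.add(result["url"])
                    queue.append(result)
            else:                           # scraped item
                items.append(result)
    return items
```

The `"callback" in result` check is the toy equivalent of the engine distinguishing yielded requests from yielded items.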
## Components

### Spider

The central class you interact with. You subclass `Spider`, define your `start_urls` and `parse()` method, and optionally configure sessions and override lifecycle hooks.
spider.py:36-48
| Attribute | Type | Default | Description |
|---|---|---|---|
| name | str | Required | Unique identifier for the spider |
| start_urls | list[str] | [] | List of URLs to start crawling from |
| allowed_domains | set[str] | set() | Restrict crawling to these domains |
| concurrent_requests | int | 4 | Maximum concurrent requests |
| concurrent_requests_per_domain | int | 0 | Per-domain concurrency limit (0 = disabled) |
| download_delay | float | 0.0 | Seconds to wait before each request |
| max_blocked_retries | int | 3 | Retry attempts for blocked requests |
| Attribute | Default | Description |
|---|---|---|
| fp_include_kwargs | False | Include request kwargs in fingerprint |
| fp_keep_fragments | False | Keep URL fragments (#section) in fingerprint |
| fp_include_headers | False | Include headers in fingerprint |
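The effect of these flags can be sketched with a simplified fingerprint function. This is a toy using `hashlib`, not Scrapling’s actual algorithm; it only illustrates how each `fp_*` option adds or keeps an input to the hash (the real components are the URL, HTTP method, body, and session ID, as described under Scheduler below):

```python
import hashlib
from urllib.parse import urldefrag


def fingerprint(url, method="GET", body=b"", sid="default",
                keep_fragments=False, headers=None, kwargs=None):
    """Toy request fingerprint: hash of method, URL, session ID, and body."""
    if not keep_fragments:
        url, _ = urldefrag(url)  # drop '#section' by default (fp_keep_fragments)
    h = hashlib.sha256()
    for part in (method, url, sid):
        h.update(part.encode())
    h.update(body)
    # fp_include_headers / fp_include_kwargs would mix these in as well:
    for extra in (headers or {}, kwargs or {}):
        for key in sorted(extra):  # sorted so ordering never changes the hash
            h.update(f"{key}={extra[key]}".encode())
    return h.hexdigest()
```

With the defaults, two URLs that differ only in their fragment produce the same fingerprint and the second request is deduplicated; setting `fp_keep_fragments` makes them distinct.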
### Crawler Engine

The engine orchestrates the entire crawl. It manages the main loop, enforces concurrency limits, dispatches requests through the Session Manager, and processes results from callbacks. You don’t interact with it directly; the `Spider.start()` and `Spider.stream()` methods handle it for you.
Key responsibilities:
- Manages the crawl lifecycle
- Enforces global and per-domain concurrency limits
- Handles download delays
- Detects and retries blocked requests
- Manages checkpoint saves/restores
- Collects statistics and logs
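The blocked-response retry behavior in that list can be sketched as a small helper. This is a toy, not the engine’s actual code; `fetch` and `is_blocked` are hypothetical callables standing in for the session fetch and the customizable blocking-detection hook:

```python
def fetch_with_blocked_retries(fetch, is_blocked, request, max_blocked_retries=3):
    """Re-fetch a request that came back blocked, up to max_blocked_retries times.

    Returns the final response and the number of retries performed.
    """
    response = fetch(request)
    retries = 0
    while is_blocked(response) and retries < max_blocked_retries:
        retries += 1
        response = fetch(request)  # a real engine could also rotate sessions here
    return response, retries
```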
### Scheduler

A priority queue with built-in URL deduplication. Requests are fingerprinted based on their URL, HTTP method, body, and session ID.

scheduler.py:30-45

It also exposes snapshot() and restore() methods for the checkpoint system, allowing the crawl state to be saved and resumed.
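The scheduler’s two core behaviors, priority ordering and fingerprint-based deduplication, can be illustrated with a toy class built on `heapq`. This is a sketch, not Scrapling’s scheduler; here the fingerprint is simplified to just the URL:

```python
import heapq


class ToyScheduler:
    """Priority queue with fingerprint-based deduplication (illustrative only)."""

    def __init__(self):
        self._heap = []
        self._seen = set()      # fingerprints of every request ever enqueued
        self._counter = 0       # tie-breaker: equal priorities stay FIFO

    def enqueue(self, request, priority=0):
        fp = request["url"]     # toy fingerprint; the real one also hashes
        if fp in self._seen:    # method, body, and session ID
            return False        # duplicate: silently dropped
        self._seen.add(fp)
        self._counter += 1
        # Negate priority so higher-priority requests pop first.
        heapq.heappush(self._heap, (-priority, self._counter, request))
        return True

    def dequeue(self):
        return heapq.heappop(self._heap)[2] if self._heap else None
```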
### Session Manager

Manages one or more named session instances, each corresponding to one of the fetcher session types (see the Fetchers basics page). When a request comes in, the Session Manager routes it to the correct session based on the request’s `sid` field. Sessions can be started when the spider starts (default) or lazily (started on first use).
session.py:22-41
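The routing-by-`sid` and lazy-start ideas can be sketched as follows. This is a toy manager, not Scrapling’s implementation; the session objects are represented by plain strings:

```python
class ToySessionManager:
    """Routes requests to named sessions by sid; sessions may start lazily."""

    def __init__(self, factories, lazy=False):
        # factories: name -> zero-argument callable that creates a session.
        self._factories = factories
        # Eager mode starts every session up front; lazy mode starts none.
        self._sessions = {} if lazy else {name: f() for name, f in factories.items()}

    def get(self, sid="default"):
        if sid not in self._sessions:       # lazy start on first use
            self._sessions[sid] = self._factories[sid]()
        return self._sessions[sid]
```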
### Checkpoint System

An optional system that, if enabled, saves the crawler’s state (pending requests + seen URL fingerprints) to a pickle file on disk.

checkpoint.py:42-61
## Output

Scraped items are collected in an `ItemList` (a list subclass with to_json() and to_jsonl() export methods). Crawl statistics are tracked in a CrawlStats dataclass.
result.py:10-38
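A list subclass with such export methods might look like the following. The method names come from the docs above, but the signatures and behavior here are assumptions, not Scrapling’s actual `ItemList`:

```python
import json


class ItemList(list):
    """Toy list subclass with JSON export helpers (names from the docs;
    signatures assumed for illustration)."""

    def to_json(self, **kwargs):
        # One JSON array containing every scraped item.
        return json.dumps(list(self), **kwargs)

    def to_jsonl(self):
        # JSON Lines: one JSON object per line, handy for streaming pipelines.
        return "\n".join(json.dumps(item) for item in self)
```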
## Comparison with Scrapy

If you’re coming from Scrapy, here’s how Scrapling’s spider system maps:

| Concept | Scrapy | Scrapling |
|---|---|---|
| Spider definition | scrapy.Spider subclass | scrapling.spiders.Spider subclass |
| Initial requests | start_requests() | async start_requests() |
| Callbacks | def parse(self, response) | async def parse(self, response) |
| Following links | response.follow(url) | response.follow(url) |
| Item output | yield dict or yield Item | yield dict |
| Request scheduling | Scheduler + Dupefilter | Scheduler with built-in deduplication |
| Downloading | Downloader + Middlewares | Session Manager with multi-session support |
| Item processing | Item Pipelines | on_scraped_item() hook |
| Blocked detection | Through custom middlewares | Built-in is_blocked() + retry_blocked_request() hooks |
| Concurrency | CONCURRENT_REQUESTS setting | concurrent_requests class attribute |
| Domain filtering | allowed_domains | allowed_domains |
| Pause/Resume | JOBDIR setting | crawldir constructor argument |
| Export | Feed exports | result.items.to_json() / to_jsonl() or custom through hooks |
| Running | scrapy crawl spider_name | MySpider().start() |
| Streaming | N/A | async for item in spider.stream() |
| Multi-session | N/A | Multiple sessions with different types per spider |