Prerequisites
- You’ve read the Getting started page and know how to create and run a basic spider.
Quick Start
Enable checkpointing by passing a crawldir to your spider:
Press Ctrl+C during the crawl to pause. Run the same code again to resume from where it stopped.
How It Works
The checkpoint system saves the spider’s state to disk at regular intervals and when you pause the crawl. The state includes:
- Pending requests — All requests that haven’t been processed yet (in the scheduler’s priority queue)
- Seen URLs — A set of request fingerprints to prevent duplicate crawling
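As a rough sketch of how that state might be represented (the actual on-disk format is not shown in this page, and the fingerprint scheme below is an assumption for illustration), a request fingerprint can be a stable hash over the method and URL, and the whole state a picklable dict:

```python
import hashlib
import pickle

def request_fingerprint(method: str, url: str) -> str:
    # Hypothetical fingerprint: a stable hash of method + URL, used to
    # skip requests that were already seen in a previous run.
    return hashlib.sha1(f"{method} {url}".encode()).hexdigest()

# The persisted state uses only picklable built-ins: the pending
# (priority, request) pairs plus the set of fingerprints already seen.
state = {
    "pending": [(0, ("GET", "https://example.com/"))],
    "seen": {request_fingerprint("GET", "https://example.com/")},
}
blob = pickle.dumps(state)  # the kind of payload written to checkpoint.pkl
```

Keeping the state to plain built-ins is what makes it safe to pickle and reload across runs.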
Checkpoint Lifecycle
1. Initial crawl: Spider starts normally, generating requests from start_urls
2. Periodic saves: Every 5 minutes (configurable), the checkpoint is saved to crawldir/checkpoint.pkl
3. Graceful pause: Press Ctrl+C once. The spider:
   - Stops accepting new requests from the scheduler
   - Waits for all in-flight requests to complete
   - Saves a final checkpoint
   - Exits cleanly
4. Force stop: Press Ctrl+C again to stop immediately without waiting
5. Resume: Run the spider again with the same crawldir. It:
   - Detects the existing checkpoint file
   - Restores the pending requests and seen URLs
   - Skips start_requests() (since we already have requests queued)
   - Continues crawling from where it left off
6. Completion: When the crawl finishes normally (scheduler empty, no active tasks), checkpoint files are deleted automatically
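The two-stage Ctrl+C behavior can be sketched as a small state machine (an illustration of the pattern, not the engine's actual code):

```python
class StopHandler:
    """Tracks Ctrl+C presses: the first requests a graceful pause,
    the second forces an immediate stop."""

    def __init__(self) -> None:
        self.pause_requested = False
        self.force_stop = False

    def on_sigint(self) -> None:
        if not self.pause_requested:
            # First press: stop pulling from the scheduler, let
            # in-flight requests finish, save a final checkpoint.
            self.pause_requested = True
        else:
            # Second press: abandon in-flight work and exit now.
            self.force_stop = True
```

In a real engine this handler would be registered for SIGINT (for example via `loop.add_signal_handler`), so repeated presses escalate from pause to forced stop.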
Checkpoint Configuration
Setting the Save Interval
By default, checkpoints are saved every 5 minutes (300 seconds). You can customize this; the interval parameter is in seconds.
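A periodic save loop might look like the following sketch (`periodic_save` is a hypothetical name, not necessarily the framework's API):

```python
import asyncio
from typing import Callable

async def periodic_save(save: Callable[[], None], interval: float = 300.0) -> None:
    # Call save() every `interval` seconds until the task is cancelled.
    while True:
        await asyncio.sleep(interval)
        save()
```

The engine would run this as a background task alongside the crawl and cancel it on shutdown.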
Checkpoint Storage
Checkpoints are stored in the directory specified by crawldir:
Checkpoint Implementation
The checkpoint system is implemented in checkpoint.py:
checkpoint.py:15-21
Atomic Saves
Checkpoint writes are atomic to prevent corruption if the process is killed during a save: checkpoint.py:42-61
The save routine writes to a temporary file (.tmp) first, then atomically renames it to the final checkpoint file. This ensures the checkpoint on disk is always in a valid state.
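The write-then-rename technique can be reproduced with the standard library alone; this is a sketch of the pattern, not the actual contents of checkpoint.py:

```python
import os
import pickle
import tempfile

def save_checkpoint(state: object, path: str) -> None:
    # Write to a temp file in the same directory, fsync it, then
    # atomically replace the target. A crash mid-save leaves either
    # the old checkpoint or the new one -- never a half-written file.
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic rename on the same filesystem
    except BaseException:
        os.unlink(tmp)
        raise
```

Creating the temp file in the same directory as the target matters: `os.replace` is only atomic when both paths are on the same filesystem.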
Loading Checkpoints
checkpoint.py:63-81
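Loading mirrors saving. A defensive version that falls back to a fresh start when the file is missing or corrupt might look like this (a sketch of the behavior described on this page, not the file's actual code):

```python
import pickle

def load_checkpoint(path: str):
    # Return the saved state, or None so the caller starts a fresh crawl.
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except (FileNotFoundError, pickle.UnpicklingError, EOFError):
        return None
```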
Engine Integration
The crawler engine manages the checkpoint lifecycle:
Checking for Checkpoints
engine.py:222-233
Restoring from Checkpoint
engine.py:202-220
request.py:154-163
Periodic Checkpoint Saves
engine.py:191-200
engine.py:273-274
Graceful Pause Handling
engine.py:250-267
Scheduler State Management
The scheduler implements snapshot() and restore() for checkpointing:
scheduler.py:60-64
scheduler.py:66-80
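A self-contained sketch of what snapshot() and restore() might do (the real scheduler.py is not reproduced here; this illustrates the idea of dumping the queue and seen-set to plain picklable structures):

```python
import heapq

class Scheduler:
    """Minimal checkpointable priority scheduler (assumed API)."""

    def __init__(self) -> None:
        self._heap = []      # (priority, request) pairs
        self._seen = set()   # request fingerprints

    def push(self, priority: int, request, fingerprint: str) -> bool:
        if fingerprint in self._seen:
            return False     # duplicate: skip
        self._seen.add(fingerprint)
        heapq.heappush(self._heap, (priority, request))
        return True

    def pop(self):
        return heapq.heappop(self._heap)[1]

    def snapshot(self) -> dict:
        # Plain built-ins only, so the state pickles cleanly.
        return {"pending": list(self._heap), "seen": set(self._seen)}

    def restore(self, state: dict) -> None:
        self._heap = list(state["pending"])
        heapq.heapify(self._heap)
        self._seen = set(state["seen"])
```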
Detecting Resume in Your Spider
The on_start() hook receives a resuming flag so you can perform different initialization logic:
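For example, assuming the hook is an async method on your spider class (a sketch; your framework's base class and method signature may differ):

```python
class MySpider:
    async def on_start(self, resuming: bool) -> None:
        if resuming:
            # Checkpoint restored: the queue is already populated and
            # start_requests() was skipped.
            self.mode = "resume"
        else:
            # Fresh crawl: do one-time setup here.
            self.mode = "fresh"
```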
Best Practices
Use Unique crawldir per Spider
Adjust Interval Based on Crawl Size
Handle Checkpoint Failures Gracefully
If a checkpoint load fails (corrupted file, version incompatibility), the spider starts fresh. To handle this explicitly:
Clean Up After Completion
Checkpoints are automatically deleted when a crawl completes successfully: engine.py:297-301
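A minimal sketch of that cleanup step (assuming the checkpoint lives at crawldir/checkpoint.pkl, as described above):

```python
import os

def cleanup_checkpoint(crawldir: str) -> None:
    # Delete the checkpoint once the crawl finishes normally; a missing
    # file is fine (nothing was saved yet).
    path = os.path.join(crawldir, "checkpoint.pkl")
    try:
        os.unlink(path)
    except FileNotFoundError:
        pass
```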
Troubleshooting
Checkpoint Not Loading
Symptoms: Spider starts from scratch even though a checkpoint exists
Possible causes:
- Wrong crawldir path
- Corrupted checkpoint file
- Pickle version mismatch (Python version changed)
- Spider code changed significantly (callback names changed)
Checkpoint Too Large
Symptoms: Slow checkpoint saves, large disk usage
Causes: Very large crawl with millions of URLs
Solutions:
- Increase the save interval to reduce I/O
- Use allowed_domains to limit scope
- Increase download_delay to crawl slower and accumulate fewer pending requests
Memory Issues After Resume
Symptoms: High memory usage after resuming
Cause: Large number of pending requests loaded into memory
Solution: The scheduler uses an asyncio.PriorityQueue, which is memory-efficient, but if you have millions of pending requests, consider splitting your crawl into smaller jobs with different start_urls.
Example: Long-Running Crawler with Checkpointing
Press Ctrl+C during the crawl to pause. Run the same script again to resume from where it stopped.