How It Works
Automatic for production crawls

Checkpointing is automatically enabled when running production crawls (no `--limit` flag). Test crawls (with `--limit`) do not use checkpoints since they are short-running.

What Gets Saved
Scrapy’s JOBDIR feature saves:

- Pending requests - all URLs waiting to be crawled
- Duplicates filter - URLs already visited (prevents re-crawling)
- Spider state - any custom state stored in the `spider.state` dict
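Conceptually, the `spider.state` part of a checkpoint is just a dict pickled to disk on shutdown and read back on resume. A minimal stdlib sketch (the helper names and file layout are illustrative, not this tool's API):

```python
# Illustrative sketch of spider.state persistence; not the tool's API.
import pickle
import tempfile
from pathlib import Path

def save_state(jobdir: Path, state: dict) -> None:
    """Persist the spider's custom state dict into the checkpoint dir."""
    jobdir.mkdir(parents=True, exist_ok=True)
    (jobdir / "spider.state").write_bytes(pickle.dumps(state))

def load_state(jobdir: Path) -> dict:
    """Restore the state dict on resume; empty dict if no checkpoint."""
    path = jobdir / "spider.state"
    return pickle.loads(path.read_bytes()) if path.exists() else {}
```

On resume, anything the spider previously put in `spider.state` (counters, cursors, etc.) is available again as an ordinary dict.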
Usage
Production Crawl with Checkpoint
Test Crawl (No Checkpoint)
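As a sketch, assuming a hypothetical `mycrawler` command (only the `--limit` flag comes from this guide), the two modes differ like this:

```shell
# Production crawl: no --limit, checkpointing enabled automatically
mycrawler crawl example.com

# Test crawl: --limit present, no checkpoint is written
mycrawler crawl example.com --limit 50
```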
Checkpoint Storage
Checkpoints are stored in your DATA_DIR.

Automatic Cleanup
Automatic cleanup on successful completion:
- When spider completes successfully (no Ctrl+C), checkpoint directory is automatically deleted
- Saves disk space
- Only failed/interrupted crawls keep checkpoints
Complete Example
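Assuming a hypothetical `mycrawler` command, a full interrupt-and-resume cycle looks like:

```shell
# 1. Start a production crawl (no --limit, checkpoint enabled)
mycrawler crawl example.com

# 2. Interrupt with Ctrl+C -> non-zero exit, checkpoint is kept

# 3. Re-run the same command; Scrapy's JOBDIR state resumes the crawl
mycrawler crawl example.com

# 4. Crawl completes with exit code 0 -> checkpoint directory is deleted
```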
Limitations
Notable limitations:
- Cookie expiration: If you wait too long to resume (days/weeks), cookies may expire and requests may fail. Resume within a reasonable timeframe (hours/days, not weeks).
- Multiple runs: Each spider should have only one checkpoint at a time. Don’t run the same spider concurrently while a checkpoint exists.
- Proxy type changes: If you change `--proxy-type` when resuming, the checkpoint is automatically cleared (see below).
Proxy Type Changes (Expert-in-the-Loop)
Example scenario: a crawl started with a cheaper proxy type gets blocked and is interrupted; you resume with `--proxy-type` set to residential, and the checkpoint is cleared so previously blocked URLs are retried with the new proxy type.
Why checkpoint is cleared:
- Ensures blocked URLs are retried with new proxy type
- Prevents Scrapy’s dupefilter from skipping already-seen failed URLs
- Simpler and safer than complex retry logic
- User explicitly chose expensive residential proxy, accepts comprehensive re-crawl
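A hypothetical sketch of such a guard (the function name and marker file are invented for illustration; the real implementation may differ):

```python
# Hypothetical guard: discard the checkpoint when the proxy type changes,
# so the dupefilter state goes with it and blocked URLs are retried.
import shutil
import tempfile
from pathlib import Path

def maybe_clear_checkpoint(jobdir: Path, requested_proxy_type: str) -> bool:
    """Clear the checkpoint if it was written under a different proxy type."""
    marker = jobdir / "proxy_type.txt"   # illustrative marker file
    if (jobdir.is_dir() and marker.exists()
            and marker.read_text().strip() != requested_proxy_type):
        shutil.rmtree(jobdir)            # pending queue + dupefilter removed
        return True
    return False
```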
When Checkpoints Are Useful
Useful for:

- ✅ Long-running crawls (hours/days) - resume if interrupted
- ✅ Unstable connections - resume after network failures
- ✅ System maintenance - pause before server restart, resume after
- ✅ Resource management - pause during high-load periods, resume later

Not needed for:

- Short test crawls run with `--limit`
Technical Details
Built on Scrapy’s JOBDIR:

- Uses Scrapy’s native pause/resume feature (not a custom implementation)
- Checkpoint files are pickle-serialized Scrapy objects
- Atomic writes prevent checkpoint corruption
- Compatible with all Scrapy spiders

Per-spider isolation:

- Each spider gets its own checkpoint directory
- Prevents conflicts between spiders
- Clean separation of state

Cleanup rule:

- Exit code 0 (success) → cleanup checkpoint
- Exit code != 0 (error/Ctrl+C) → keep checkpoint for resume
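The exit-code rule can be sketched in shell (the `$DATA_DIR/checkpoints/<spider>` layout is assumed here for illustration):

```shell
# Illustrative sketch of the cleanup rule; directory layout is assumed.
DATA_DIR="$(mktemp -d)"
JOBDIR="$DATA_DIR/checkpoints/myspider"
mkdir -p "$JOBDIR"

exit_code=0                      # pretend the crawl finished cleanly
if [ "$exit_code" -eq 0 ]; then
    rm -rf "$JOBDIR"             # success: checkpoint no longer needed
fi                               # non-zero exit: keep it for resume
```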
Troubleshooting
Checkpoint Not Resuming
Check whether the checkpoint exists. If it is missing, either:

- it was cleaned up (successful completion), or
- it was never created (test mode with `--limit`)
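A quick way to check, assuming an illustrative checkpoint path (substitute your real DATA_DIR and spider name):

```shell
# Placeholder path; substitute your real DATA_DIR and spider name.
JOBDIR="${DATA_DIR:-./data}/checkpoints/myspider"
if [ -d "$JOBDIR" ]; then
    echo "checkpoint present: $JOBDIR"
else
    echo "no checkpoint (cleaned up, or never created in test mode)"
fi
```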
Start Fresh (Discard Checkpoint)
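A sketch, again with an illustrative path:

```shell
# Discard an existing checkpoint to force a fresh crawl (path illustrative)
rm -rf "${DATA_DIR:-./data}/checkpoints/myspider"
```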
Checkpoint from Old Spider Version

Solution: discard the checkpoint and start fresh (see Start Fresh above).

Checkpoint Files Too Large

Check the checkpoint's size. Likely causes and remedies:

- Many pending URLs (normal for large crawls)
- Consider crawling in smaller batches
- Or use incremental crawling (DeltaFetch)
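Checking the size (path illustrative; size is dominated by the pending-request queue):

```shell
JOBDIR="${DATA_DIR:-./data}/checkpoints/myspider"   # illustrative path
mkdir -p "$JOBDIR"   # demo only: ensure the path exists
du -sh "$JOBDIR"     # size grows with the number of pending URLs
```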
Related Guides

- Incremental Crawling - skip unchanged pages on subsequent crawls
- Queue Processing - batch process multiple websites
- Proxy Escalation - smart proxy usage with cost control