ScrapAI automatically enables checkpoint support for production crawls, allowing you to pause long-running crawls and resume them later without losing progress.

How It Works

  1. Automatic for production crawls - Checkpointing is enabled automatically for production crawls (no --limit flag)
  2. Press Ctrl+C to pause - A checkpoint is saved automatically when the crawl is interrupted
  3. Run the same command to resume - ScrapAI detects the checkpoint and resumes from where you left off
  4. Automatic cleanup on success - The checkpoint is deleted automatically on successful completion
Test crawls (with --limit) do not use checkpoints since they’re short-running.

What Gets Saved

Scrapy’s JOBDIR feature saves:
  1. Pending requests - All URLs waiting to be crawled
  2. Duplicates filter - URLs already visited (prevents re-crawling)
  3. Spider state - Any custom state stored in spider.state dict
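Conceptually, the spider state is a plain dict that gets pickled into the checkpoint directory on shutdown and reloaded on resume. A stdlib sketch of that round trip (the helper functions are illustrative, not ScrapAI's API; the spider.state file name matches the checkpoint listing shown later):

```python
import pickle
from pathlib import Path

def save_state(checkpoint_dir: Path, state: dict) -> None:
    """Persist spider state on shutdown (illustrative helper)."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    (checkpoint_dir / "spider.state").write_bytes(pickle.dumps(state))

def load_state(checkpoint_dir: Path) -> dict:
    """Reload spider state on resume; empty dict if no checkpoint exists."""
    path = checkpoint_dir / "spider.state"
    return pickle.loads(path.read_bytes()) if path.exists() else {}

ckpt = Path("./data/myproject/myspider/checkpoint")
save_state(ckpt, {"items_scraped": 150, "last_page": 3})
print(load_state(ckpt))  # {'items_scraped': 150, 'last_page': 3}
```

Because the state is pickled, anything you put in spider.state must be picklable.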

Usage

Production Crawl with Checkpoint

# Start production crawl (checkpoint auto-enabled)
./scrapai crawl myspider --project myproject
Console output:
💾 Checkpoint enabled: ./data/myproject/myspider/checkpoint
Press Ctrl+C to pause, run same command to resume
Pause the crawl:
# Press Ctrl+C
^C
Resume later:
# Run same command
./scrapai crawl myspider --project myproject
# Automatically detects checkpoint and resumes

Test Crawl (No Checkpoint)

# Test mode - no checkpoint needed (short run)
./scrapai crawl myspider --project myproject --limit 10
Console output:
🧪 Test mode: Saving to database (limit: 10 items)

Checkpoint Storage

Checkpoints are stored in your DATA_DIR:
DATA_DIR/<project>/<spider>/checkpoint/
Example directory structure:
./data/myproject/myspider/
├── analysis/        # Phase 1-3 files
├── crawls/          # Production outputs
├── exports/         # Database exports
└── checkpoint/      # Checkpoint state (auto-cleaned on success)
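The path scheme above is mechanical, so it is easy to compute when scripting around ScrapAI. A small sketch (checkpoint_dir is a hypothetical helper, not part of the CLI):

```python
from pathlib import Path

def checkpoint_dir(data_dir: str, project: str, spider: str) -> Path:
    """Build the per-spider checkpoint path: DATA_DIR/<project>/<spider>/checkpoint."""
    return Path(data_dir) / project / spider / "checkpoint"

print(checkpoint_dir("./data", "myproject", "myspider"))
# data/myproject/myspider/checkpoint
```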

Automatic Cleanup

Automatic cleanup on successful completion:
  • When spider completes successfully (no Ctrl+C), checkpoint directory is automatically deleted
  • Saves disk space
  • Only failed/interrupted crawls keep checkpoints
Manual cleanup:
# If you want to discard a checkpoint and start fresh
rm -rf ./data/myproject/myspider/checkpoint/

Complete Example

1. Start production crawl

./scrapai crawl techcrunch --project news
Output:
💾 Checkpoint enabled: ./data/news/techcrunch/checkpoint
Press Ctrl+C to pause, run same command to resume

Crawling: https://techcrunch.com/
Scraped 50 items...
Scraped 100 items...
2. Pause crawl (Ctrl+C)

^C
Output:
Received interrupt signal, shutting down...
Checkpoint saved: 150 items scraped, 237 URLs pending
Run same command to resume from checkpoint
3. Check checkpoint exists

ls -la ./data/news/techcrunch/checkpoint/
Output:
drwxr-xr-x  5 user  staff   160 Feb 24 10:30 .
drwxr-xr-x  7 user  staff   224 Feb 24 10:15 ..
-rw-r--r--  1 user  staff  4096 Feb 24 10:30 requests.queue
-rw-r--r--  1 user  staff  8192 Feb 24 10:30 dupefilter.db
-rw-r--r--  1 user  staff   512 Feb 24 10:30 spider.state
4. Resume crawl

./scrapai crawl techcrunch --project news
Output:
♻️  Resuming from checkpoint: 150 items already scraped, 237 URLs pending

Continuing crawl...
Scraped 160 items...
Scraped 200 items...
Crawl completed successfully!

🧹 Checkpoint cleaned up (crawl completed)

Limitations

Request callbacks must be spider methods (Scrapy limitation):
# ✅ Works (spider method)
Request(url, callback=self.parse_article)

# ❌ Won't work (external function)
Request(url, callback=some_external_function)
ScrapAI spiders are already compatible: our database spiders use spider methods (self.parse), so checkpoints work out of the box.
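The reason for this limitation is request serialization: when the pending queue is written to disk, the callback is stored by name and resolved against the spider on resume, so a function that is not a spider method cannot be restored. A simplified stdlib illustration of that lookup (DummySpider and both helpers are invented for this sketch; Scrapy's internals differ in detail):

```python
def serialize_callback(spider, callback):
    """Store a callback by name, mirroring how disk queues persist requests."""
    name = getattr(callback, "__name__", None)
    if name is None or getattr(spider, name, None) != callback:
        raise ValueError("callback must be a method of the spider")
    return name

def restore_callback(spider, name):
    """Resolve the stored name back to a bound method on resume."""
    return getattr(spider, name)

class DummySpider:
    def parse_article(self, response):
        return {"url": response}

def some_external_function(response):
    return response

spider = DummySpider()
name = serialize_callback(spider, spider.parse_article)   # "parse_article"
assert restore_callback(spider, name)("x") == {"url": "x"}

try:
    serialize_callback(spider, some_external_function)    # not a spider method
except ValueError as e:
    print(e)  # callback must be a method of the spider
```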
Other limitations:
  • Cookie expiration: If you wait too long to resume (days/weeks), cookies may expire and requests may fail. Resume within a reasonable timeframe (hours/days, not weeks).
  • Multiple runs: Each spider should have only one checkpoint at a time. Don’t run the same spider concurrently while a checkpoint exists.
  • Proxy type changes: If you change --proxy-type when resuming, the checkpoint is automatically cleared (see below).

Proxy Type Changes (Expert-in-the-Loop)

If you change --proxy-type when resuming, the checkpoint is automatically cleared and crawl starts fresh.
Example scenario:
1. Start crawl with auto mode

./scrapai crawl myspider --project proj
Uses datacenter proxies (auto mode default)
2. Datacenter fails, get expert prompt

⚠️  EXPERT-IN-THE-LOOP: Datacenter proxy failed
🏠 To use residential proxy, run:
  ./scrapai crawl myspider --project proj --proxy-type residential
Press Ctrl+C to pause
3. Resume with residential proxy

./scrapai crawl myspider --project proj --proxy-type residential
Output:
⚠️  Proxy type changed: auto → residential
🗑️  Clearing checkpoint to ensure all URLs retried with residential proxy
♻️  Starting fresh crawl
Why checkpoint is cleared:
  • Ensures blocked URLs are retried with new proxy type
  • Prevents Scrapy’s dupefilter from skipping already-seen failed URLs
  • Simpler and safer than complex retry logic
  • User explicitly chose expensive residential proxy, accepts comprehensive re-crawl
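One way such a guard can work (a sketch with invented file and function names, not ScrapAI's actual implementation): record the proxy type in a marker file inside the checkpoint directory and wipe the checkpoint when it changes.

```python
import shutil
from pathlib import Path

def ensure_proxy_type(checkpoint_dir: Path, proxy_type: str) -> bool:
    """Clear the checkpoint if it was written with a different proxy type.

    Returns True if an existing checkpoint was cleared."""
    marker = checkpoint_dir / "proxy_type.txt"
    cleared = False
    if marker.exists() and marker.read_text().strip() != proxy_type:
        shutil.rmtree(checkpoint_dir)  # retry every URL with the new proxy
        cleared = True
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    marker.write_text(proxy_type)
    return cleared

ckpt = Path("./data/myproject/myspider/checkpoint")
ensure_proxy_type(ckpt, "auto")                # first run: no checkpoint yet
print(ensure_proxy_type(ckpt, "residential"))  # True: proxy type changed
```

Wiping the whole directory also resets the dupefilter, which is exactly what guarantees blocked URLs are retried.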

When Checkpoints Are Useful

  • Long-running crawls (hours/days) - Resume if interrupted
  • Unstable connections - Resume after network failures
  • System maintenance - Pause before server restart, resume after
  • Resource management - Pause during high-load periods, resume later

Technical Details

Built on Scrapy’s JOBDIR:
  • Uses Scrapy’s native pause/resume feature (not custom implementation)
  • Checkpoint files are pickle-serialized Scrapy objects
  • Atomic writes prevent checkpoint corruption
  • Compatible with all Scrapy spiders
Directory per spider:
  • Each spider gets its own checkpoint directory
  • Prevents conflicts between spiders
  • Clean separation of state
Smart cleanup:
  • Exit code 0 (success) → cleanup checkpoint
  • Exit code != 0 (error/Ctrl+C) → keep checkpoint for resume
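The cleanup rule above can be sketched in a few lines (hypothetical helper; ScrapAI's internals may differ):

```python
import shutil
from pathlib import Path

def cleanup_checkpoint(exit_code: int, checkpoint_dir: Path) -> None:
    """Delete the checkpoint only when the crawl exited successfully."""
    if exit_code == 0 and checkpoint_dir.exists():
        shutil.rmtree(checkpoint_dir)  # success: saved state no longer needed
    # non-zero exit (error or Ctrl+C): keep checkpoint for resume

ckpt = Path("./data/myproject/myspider/checkpoint")
ckpt.mkdir(parents=True, exist_ok=True)
cleanup_checkpoint(1, ckpt)   # interrupted: checkpoint kept
print(ckpt.exists())          # True
cleanup_checkpoint(0, ckpt)   # success: checkpoint removed
print(ckpt.exists())          # False
```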

Troubleshooting

Checkpoint Not Resuming

1. Check if checkpoint exists

ls -la ./data/myproject/myspider/checkpoint/
If directory doesn’t exist:
  • Checkpoint was cleaned up (successful completion)
  • Or never created (test mode with --limit)
2. Verify same command

Must use exact same command to resume:
# Original
./scrapai crawl myspider --project proj --proxy-type datacenter

# Resume (same command)
./scrapai crawl myspider --project proj --proxy-type datacenter
3. Check for proxy type change

Changing --proxy-type clears checkpoint automatically

Start Fresh (Discard Checkpoint)

# Delete checkpoint directory
rm -rf ./data/myproject/myspider/checkpoint/

# Run crawl again
./scrapai crawl myspider --project myproject

Checkpoint from Old Spider Version

If you have significantly updated your spider's rules or selectors, the old checkpoint may be incompatible with the new code.
Solution:
# Delete checkpoint and start fresh
rm -rf ./data/myproject/myspider/checkpoint/
./scrapai crawl myspider --project myproject

Checkpoint Files Too Large

Check size:
du -sh ./data/myproject/myspider/checkpoint/
A large checkpoint usually just means many pending URLs, which is normal for large crawls. To keep it smaller:
  • Crawl in smaller batches
  • Use incremental crawling (DeltaFetch)

Related

  • Incremental Crawling - Skip unchanged pages on subsequent crawls
  • Queue Processing - Batch process multiple websites
  • Proxy Escalation - Smart proxy usage with cost control
