The `CrawlerEngine` class is the core component that orchestrates the entire crawling process. It manages request scheduling, concurrency, rate limiting, checkpoint/resume functionality, and item collection.
This class is typically used internally by the Spider framework. You usually don’t instantiate it directly.
Class Definition
Constructor
The spider instance to run.
Session manager containing configured sessions.
Directory for checkpoint files. If None, checkpointing is disabled.
Seconds between periodic checkpoint saves (default 5 minutes). Set to 0 to disable periodic checkpoints.
Attributes
Reference to the spider being executed.
Session manager for handling requests.
Request scheduler with duplicate filtering.
Current crawl statistics.
Whether the crawl was paused (vs. completed normally).
Methods
crawl
Returns: `CrawlStats` object with detailed crawl metrics
Process flow:
- Check for an existing checkpoint and restore if found
- Call `spider.on_start(resuming=bool)`
- Generate initial requests from `spider.start_requests()` (if not resuming)
- Process requests concurrently with rate limiting
- Handle responses through callbacks
- Save periodic checkpoints (if enabled)
- Call `spider.on_close()` on completion
- Clean up checkpoint files on successful completion
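The overall flow can be sketched with a simplified, self-contained asyncio loop. All names here are illustrative stand-ins, not the engine's real internals:

```python
import asyncio

async def crawl_sketch(start_requests, callback, concurrency=2):
    """Simplified crawl loop: schedule, fetch, handle via callback."""
    queue = asyncio.Queue()
    seen = set()          # duplicate filter (request fingerprints)
    items = []

    for req in start_requests:
        if req not in seen:
            seen.add(req)
            await queue.put(req)

    async def worker():
        while True:
            req = await queue.get()
            try:
                response = f"<response for {req}>"  # stand-in for a real fetch
                for result in callback(response):
                    if isinstance(result, str) and result.startswith("http"):
                        if result not in seen:      # newly discovered request
                            seen.add(result)
                            await queue.put(result)
                    else:
                        items.append(result)        # scraped item
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(concurrency)]
    await queue.join()                  # wait until every request is processed
    for w in workers:
        w.cancel()
    return items

def parse(response):
    yield {"page": response}

items = asyncio.run(crawl_sketch(["http://a.example", "http://b.example"], parse))
```

The real engine adds rate limiting, sessions, checkpointing, and error handling around this same schedule/fetch/callback skeleton.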
request_pause
- First call: Requests graceful pause (waits for active tasks to complete)
- Second call: Forces immediate stop (cancels active tasks)
This method is called automatically when the user presses Ctrl+C.
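The two-phase behavior can be illustrated with a minimal state machine (illustrative only, not the engine's implementation):

```python
class PauseController:
    """Two-phase stop: first call pauses gracefully, second forces a stop."""

    def __init__(self):
        self.pause_requested = False
        self.force_stop = False

    def request_pause(self):
        if not self.pause_requested:
            self.pause_requested = True   # phase 1: let active tasks finish
        else:
            self.force_stop = True        # phase 2: cancel active tasks

ctrl = PauseController()
ctrl.request_pause()   # first Ctrl+C: graceful pause
ctrl.request_pause()   # second Ctrl+C: forced stop
```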
items
Returns: `ItemList` containing all scraped items
Internal Methods
_process_request
- Rate limiting (global and per-domain)
- Download delay
- Session fetching
- Blocked request detection and retry
- Callback execution
- Item processing
- Error handling
_save_checkpoint
- Pending requests in the scheduler
- Seen request fingerprints
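Request fingerprints are typically a stable hash over a request's identifying fields. A common scheme, shown here as an assumption rather than the engine's exact algorithm, looks like:

```python
import hashlib

def request_fingerprint(method: str, url: str, body: bytes = b"") -> str:
    """Stable hash identifying a request for duplicate filtering."""
    h = hashlib.sha1()
    for part in (method.upper().encode(), url.encode(), body):
        h.update(part)
        h.update(b"\x00")   # field separator so ("a","bc") != ("ab","c")
    return h.hexdigest()

fp1 = request_fingerprint("GET", "https://example.com/page")
fp2 = request_fingerprint("GET", "https://example.com/page")
fp3 = request_fingerprint("POST", "https://example.com/page")
```

Because the fingerprint set is small and serializable, it can be persisted in a checkpoint and restored on resume.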
_restore_from_checkpoint
_is_domain_allowed
Checks a request's URL against `spider.allowed_domains`.
Returns: `True` if allowed (or if `allowed_domains` is empty)
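The engine's exact matching rules aren't spelled out here, but a typical implementation of this check, including the empty-list and subdomain cases, is:

```python
from urllib.parse import urlsplit

def is_domain_allowed(url: str, allowed_domains: list[str]) -> bool:
    """Empty allow-list permits everything; otherwise match domain or subdomain."""
    if not allowed_domains:
        return True
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)
```

Note the `"." + d` suffix check: it admits `sub.example.com` for `example.com` while rejecting look-alikes such as `notexample.com`.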
_normalize_request
Normalizes a request before scheduling, defaulting a missing `sid` to the default session ID.
Async Iteration
The engine supports async iteration for streaming items; this is the mechanism behind `Spider.stream()`.
Usage Examples
Direct Engine Usage (Advanced)
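Since the engine is normally driven by the Spider framework, direct use is mainly for advanced integration. A pseudocode sketch, assuming a `Spider` subclass and a configured session manager (import paths and the session-manager argument name are assumptions):

```python
# Pseudocode sketch - argument names other than crawldir/interval are illustrative.
engine = CrawlerEngine(
    spider=MySpider(),
    sessions=session_manager,   # configured SessionManager
    crawldir="./checkpoints",   # enables checkpoint/resume
    interval=300,               # periodic checkpoint every 5 minutes
)
stats = await engine.crawl()    # returns CrawlStats
print(stats.items_scraped, stats.elapsed_seconds)
```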
Streaming Items
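Streaming consumes items as they are scraped instead of waiting for `crawl()` to finish. A pseudocode sketch, assuming the engine's async-iteration support yields items directly:

```python
# Pseudocode sketch - this is what Spider.stream() builds on.
async for item in engine:
    process(item)               # handle each item as soon as it is scraped
```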
Monitoring Progress
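Progress can be observed through the engine's `stats` attribute while a crawl runs. A pseudocode sketch using only fields documented under Performance Metrics below:

```python
# Pseudocode sketch - poll engine.stats concurrently with the crawl.
async def report(engine, every=10):
    while True:
        await asyncio.sleep(every)
        s = engine.stats
        print(f"{s.requests_count} requests, "
              f"{s.items_scraped} items, "
              f"{s.requests_per_second:.1f} req/s")
```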
Concurrency Control
The engine manages concurrency at two levels:
Global Concurrency
A global `CapacityLimiter` limits the total number of active requests.
Per-Domain Concurrency
A per-domain `CapacityLimiter` prevents overwhelming specific servers.
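The per-domain mechanism can be sketched with stdlib asyncio semaphores standing in for the engine's `CapacityLimiter` (the lazily created limiter-per-domain mapping below is illustrative):

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit

DOMAIN_LIMIT = 2
domain_limiters = defaultdict(lambda: asyncio.Semaphore(DOMAIN_LIMIT))
active = defaultdict(int)   # currently in-flight requests per domain
peak = defaultdict(int)     # high-water mark, to demonstrate the cap

async def fetch(url):
    domain = urlsplit(url).hostname
    async with domain_limiters[domain]:      # at most DOMAIN_LIMIT per domain
        active[domain] += 1
        peak[domain] = max(peak[domain], active[domain])
        await asyncio.sleep(0.01)            # simulated download
        active[domain] -= 1

async def main():
    urls = [f"https://example.com/p{i}" for i in range(6)]
    await asyncio.gather(*(fetch(u) for u in urls))

asyncio.run(main())
```

Even though six requests are launched concurrently, at most two are ever in flight against `example.com` at once.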
Download Delay
Checkpoint System
When `crawldir` is provided, the engine automatically saves checkpoints:
Checkpoint Timing
- Periodic saves: every `interval` seconds (default 300)
- Graceful pause: when `request_pause()` is called
- SIGINT handler: automatic on Ctrl+C
Checkpoint Contents
Checkpoints store:
- Pending requests: all requests still in the scheduler queue
- Seen fingerprints: the set of request fingerprints, to avoid re-fetching
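A checkpoint holding exactly these two pieces of state can be sketched with plain JSON; the on-disk format here is an assumption, and the real engine's format may differ:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def save_checkpoint(path: Path, pending: list[str], seen: set[str]) -> None:
    """Persist pending requests and seen fingerprints."""
    path.write_text(json.dumps({"pending": pending, "seen": sorted(seen)}))

def restore_checkpoint(path: Path) -> tuple[list[str], set[str]]:
    """Load a checkpoint back into scheduler state."""
    data = json.loads(path.read_text())
    return data["pending"], set(data["seen"])

with TemporaryDirectory() as d:
    ckpt = Path(d) / "checkpoint.json"
    save_checkpoint(ckpt, ["https://example.com/next"], {"fp1", "fp2"})
    pending, seen = restore_checkpoint(ckpt)
```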
Resume Behavior
On resume, the engine:
- Skips `spider.start_requests()`
- Restores pending requests to the scheduler
- Continues from where it left off
- Calls `spider.on_start(resuming=True)`
Error Handling
The engine handles errors at multiple levels:
Request Errors
Callback Errors
Blocked Request Handling
Performance Metrics
The engine tracks comprehensive statistics in `CrawlStats`:
- requests_count: Total requests made
- failed_requests_count: Failed requests
- blocked_requests_count: Detected blocked requests
- offsite_requests_count: Filtered offsite requests
- items_scraped: Items yielded and accepted
- items_dropped: Items dropped by `on_scraped_item`
- response_bytes: Total bytes downloaded
- domains_response_bytes: Per-domain bandwidth
- sessions_requests_count: Requests per session
- response_status_count: Status code distribution
- elapsed_seconds: Total crawl duration
- requests_per_second: Throughput rate
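For illustration, the derived throughput figure relates to the raw counters above as shown here, using a small stand-in object rather than the real `CrawlStats` class:

```python
from dataclasses import dataclass

@dataclass
class StatsSnapshot:
    """Illustrative stand-in exposing two of the documented fields."""
    requests_count: int
    elapsed_seconds: float

    @property
    def requests_per_second(self) -> float:
        if self.elapsed_seconds == 0:
            return 0.0
        return self.requests_count / self.elapsed_seconds

snap = StatsSnapshot(requests_count=1200, elapsed_seconds=60.0)
```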
See Also
- Spider - Spider class documentation
- Request - Request object details
- SessionManager - Session management
- CrawlResult - Result object structure