Only enable Cloudflare bypass when the site explicitly requires it. Always test WITHOUT --cloudflare first.
Detection Indicators
Your site needs Cloudflare bypass if you see:
“Checking your browser” or “Just a moment” messages
403/503 HTTP errors with Cloudflare branding
Challenge pages before content loads
Display Requirements
Cloudflare bypass requires a visible browser (not headless). Cloudflare detects and blocks headless browsers.
Platform support:
Windows: Uses native display automatically ✓
macOS: Uses native display automatically ✓
Linux desktop: Uses native display automatically ✓
Linux servers (VPS without GUI): Auto-detects missing display and uses Xvfb (virtual display) ✓
Installing Xvfb on Linux servers:

```shell
sudo apt-get install xvfb
```
The crawler automatically detects your environment and uses Xvfb when no display is available on Linux.
Inspector Usage
1. Start with default HTTP (fast). It works for most sites:

   ```shell
   ./scrapai inspect https://example.com --project proj
   ```

2. Try browser mode if the content is JS-rendered:

   ```shell
   ./scrapai inspect https://example.com --project proj --browser
   ```

3. Use Cloudflare bypass only when blocked, i.e. when you see challenge pages or 403/503 errors:

   ```shell
   ./scrapai inspect https://example.com --project proj --cloudflare
   ```

This escalation keeps resource usage low: stay on fast HTTP until the site forces you up a level.
Strategies
Hybrid Mode (Recommended)
The browser verifies Cloudflare once every 10 minutes; all other requests use fast HTTP with the cached cookies. This is 20-100x faster than browser-only mode.
Do NOT set CONCURRENT_REQUESTS; hybrid mode uses Scrapy's default of 16 for optimal performance.
```json
{
  "settings": {
    "CLOUDFLARE_ENABLED": true,
    "CLOUDFLARE_STRATEGY": "hybrid",
    "CLOUDFLARE_COOKIE_REFRESH_THRESHOLD": 600,
    "CF_MAX_RETRIES": 5,
    "CF_RETRY_INTERVAL": 1,
    "CF_POST_DELAY": 5
  }
}
```
How it works:
Browser verifies Cloudflare once and caches cookies
Subsequent requests use fast HTTP with cached cookies
Auto-refreshes cookies every 10 minutes
Falls back to browser if cookies become invalid
Browser-Only Mode (Legacy)
Only use this if hybrid mode fails. It launches the browser for every request and is much slower. Requires CONCURRENT_REQUESTS: 1 to prevent browser conflicts.
```json
{
  "settings": {
    "CLOUDFLARE_ENABLED": true,
    "CLOUDFLARE_STRATEGY": "browser_only",
    "CONCURRENT_REQUESTS": 1
  }
}
```
Settings Reference
| Setting | Default | Description |
|---|---|---|
| CLOUDFLARE_ENABLED | false | Enable CF bypass |
| CLOUDFLARE_STRATEGY | "hybrid" | "hybrid" or "browser_only" |
| CLOUDFLARE_COOKIE_REFRESH_THRESHOLD | 600 | Seconds before cookie refresh |
| CF_MAX_RETRIES | 5 | Max verification attempts |
| CF_RETRY_INTERVAL | 1 | Seconds between retries |
| CF_POST_DELAY | 5 | Seconds after successful verification |
| CF_WAIT_SELECTOR | — | CSS selector to wait for before extracting |
| CF_WAIT_TIMEOUT | 10 | Max seconds to wait for selector |
| CF_PAGE_TIMEOUT | 120000 | Page navigation timeout (ms) |
| CONCURRENT_REQUESTS | — | Must be 1 for browser-only mode |
Complete Spider Example
```json
{
  "name": "mysite",
  "allowed_domains": ["example.com"],
  "start_urls": ["https://www.example.com/articles"],
  "rules": [
    {
      "allow": ["/article/[^/]+$"],
      "callback": "parse_article",
      "follow": false,
      "priority": 100
    },
    {
      "allow": ["/articles/"],
      "callback": null,
      "follow": true,
      "priority": 50
    }
  ],
  "settings": {
    "CLOUDFLARE_ENABLED": true,
    "CLOUDFLARE_STRATEGY": "hybrid",
    "CLOUDFLARE_COOKIE_REFRESH_THRESHOLD": 600,
    "CF_MAX_RETRIES": 5,
    "CF_RETRY_INTERVAL": 1,
    "CF_POST_DELAY": 5,
    "CF_WAIT_SELECTOR": "h1.title-med-1",
    "DOWNLOAD_DELAY": 2
  }
}
```
Timeouts & Hang Prevention
Each browser operation is capped at 300 seconds (5 minutes). If an operation exceeds this limit, the crawl fails with a TimeoutError instead of hanging forever. This protects against:
Browser subprocess hangs
Network stalls
Infinite CF challenge loops
Cross-thread asyncio deadlocks
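The behavior described above can be approximated with a simple asyncio timeout wrapper. This is an illustrative sketch, not the crawler's implementation; `run_browser_op` is a hypothetical name:

```python
import asyncio

BROWSER_OP_TIMEOUT = 300  # seconds; matches the documented per-operation cap

async def run_browser_op(coro, timeout: float = BROWSER_OP_TIMEOUT):
    """Wrap a browser operation so a hang surfaces as TimeoutError
    instead of blocking the crawl forever."""
    return await asyncio.wait_for(coro, timeout=timeout)
```

`asyncio.wait_for` cancels the underlying coroutine and raises `TimeoutError` when the deadline passes, which is what turns a silent hang into a visible crawl failure.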
Typical operation times:
CF verification: 10-60 seconds
Page load: 5-30 seconds
Cookie refresh: 10-30 seconds
If you consistently hit the 300s timeout, investigate:
Network connectivity issues
Site blocking your IP/region
Browser/Chrome subprocess problems
System resource constraints (CPU/memory)
Troubleshooting
Crawl Hangs at “Getting/refreshing CF cookies”
Symptoms: Browser opens but never navigates. Logs show “Getting/refreshing CF cookies” but no progress.
Possible causes:
Asyncio event loop mismatch (fixed in latest version)
Browser subprocess issues - Chrome/nodriver incompatible with thread-based event loop
Display/X11 issues on Linux servers
Network/firewall blocking browser traffic
Solutions:
1. Update to the latest version, which includes the timeout fix.
2. Verify the browser actually opens (a headless launch may be failing silently).
3. Check the display on Linux servers and verify Xvfb is installed: `sudo apt-get install xvfb`
4. Test with the inspector first: `./scrapai inspect https://example.com --project proj --cloudflare`
5. Check system resources: verify CPU, memory, and disk space availability.
Works on One Machine But Not Another
Environmental factors affecting browser subprocesses:
Python/asyncio version differences
Display environment (X11 vs Wayland vs headless)
Chrome/Chromium version and availability
System resources and timing (race conditions)
Network conditions (DNS, latency, firewalls)
Security software interfering with browser
Debugging steps:
1. Test the inspector on both machines: `./scrapai inspect https://example.com --project proj --cloudflare`
2. Check the Chrome/Chromium installation on each machine.
3. Verify the display (Linux): `echo $DISPLAY` (should show `:99` with Xvfb).
4. Review the logs for specific error messages.
5. Try the other strategy: switch between hybrid and browser_only.
Diagnosing via Logs
Hybrid mode indicators:
```
Cached N cookies (cf_clearance: ...)
```

Cookies are working properly.

Browser-only mode indicators:

```
Cloudflare verified successfully
Opened persistent browser
Closed browser
```

This is the normal browser lifecycle.
Title Contamination
If extracted titles show wrong text (e.g., “Related Articles” instead of actual title), set CF_WAIT_SELECTOR to the main title element.
```json
{
  "settings": {
    "CF_WAIT_SELECTOR": "h1.article-title"
  }
}
```
This captures HTML before related content loads, preventing contamination.
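The wait-then-capture behavior can be sketched as a polling loop. This is a hypothetical illustration of what CF_WAIT_SELECTOR/CF_WAIT_TIMEOUT do; `page`, `query`, and `html` are made-up names, not a real API:

```python
import time

def wait_for_selector(page, selector: str, timeout: float = 10.0,
                      interval: float = 0.25) -> str:
    """Poll until the selector matches, then snapshot the HTML.

    `timeout` plays the role of CF_WAIT_TIMEOUT; capturing as soon as
    the title element exists avoids late-loading related-content widgets.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if page.query(selector):
            return page.html()  # capture before related content loads
        time.sleep(interval)
    raise TimeoutError(f"selector {selector!r} not found within {timeout}s")
```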
- Proxy Escalation: combine with smart proxy usage.
- Checkpoint Resume: pause and resume long crawls.