Only enable Cloudflare bypass when the site explicitly requires it. Always test WITHOUT --cloudflare first.
Detection Indicators
Your site needs Cloudflare bypass if you see:
“Checking your browser” or “Just a moment” messages
403/503 HTTP errors with Cloudflare branding
Challenge pages before content loads
Display Requirements
Cloudflare bypass requires a visible browser (not headless). Cloudflare detects and blocks headless browsers.
Platform support:
Windows: Uses native display automatically ✓
macOS: Uses native display automatically ✓
Linux desktop: Uses native display automatically ✓
Linux servers (VPS without GUI): Auto-detects missing display and uses Xvfb (virtual display) ✓
Installing Xvfb on Linux servers:

```shell
sudo apt-get install xvfb
```
The crawler automatically detects your environment and uses Xvfb when no display is available on Linux.
Inspector Usage
1. Start with default HTTP (fast). It works for most sites:

   ```shell
   ./scrapai inspect https://example.com --project proj
   ```

2. Try browser mode if the content is JS-rendered:

   ```shell
   ./scrapai inspect https://example.com --project proj --browser
   ```

3. Use Cloudflare bypass only when blocked, i.e. when you see challenge pages or 403/503 errors:

   ```shell
   ./scrapai inspect https://example.com --project proj --cloudflare
   ```

This escalation keeps resource usage low: stay on fast HTTP until the site forces you up a level.
Strategies
Hybrid Mode (Recommended)
The browser verifies Cloudflare once every 10 minutes; all other requests use fast HTTP with the cached cookies. This is 20-100x faster than browser-only mode.
Do NOT set CONCURRENT_REQUESTS; hybrid mode uses Scrapy's default of 16 for optimal performance.
```json
{
  "settings": {
    "CLOUDFLARE_ENABLED": true,
    "CLOUDFLARE_STRATEGY": "hybrid",
    "CLOUDFLARE_COOKIE_REFRESH_THRESHOLD": 600,
    "CF_MAX_RETRIES": 5,
    "CF_RETRY_INTERVAL": 1,
    "CF_POST_DELAY": 5
  }
}
```
How it works:
Browser verifies Cloudflare once and caches cookies
Subsequent requests use fast HTTP with cached cookies
Auto-refreshes cookies every 10 minutes
Falls back to browser if cookies become invalid
Browser-Only Mode (Legacy)
Only use this if hybrid mode fails. It launches the browser for every request and is much slower. Requires CONCURRENT_REQUESTS: 1 to prevent browser conflicts.
```json
{
  "settings": {
    "CLOUDFLARE_ENABLED": true,
    "CLOUDFLARE_STRATEGY": "browser_only",
    "CONCURRENT_REQUESTS": 1
  }
}
```
Settings Reference
| Setting | Default | Description |
|---|---|---|
| CLOUDFLARE_ENABLED | false | Enable CF bypass |
| CLOUDFLARE_STRATEGY | "hybrid" | "hybrid" or "browser_only" |
| CLOUDFLARE_COOKIE_REFRESH_THRESHOLD | 600 | Seconds before cookie refresh |
| CF_MAX_RETRIES | 5 | Max verification attempts |
| CF_RETRY_INTERVAL | 1 | Seconds between retries |
| CF_POST_DELAY | 5 | Seconds after successful verification |
| CF_WAIT_SELECTOR | — | CSS selector to wait for before extracting |
| CF_WAIT_TIMEOUT | 10 | Max seconds to wait for selector |
| CF_PAGE_TIMEOUT | 120000 | Page navigation timeout (ms) |
| CONCURRENT_REQUESTS | — | Must be 1 for browser-only mode |
Complete Spider Example
```json
{
  "name": "mysite",
  "allowed_domains": ["example.com"],
  "start_urls": ["https://www.example.com/articles"],
  "rules": [
    {
      "allow": ["/article/[^/]+$"],
      "callback": "parse_article",
      "follow": false,
      "priority": 100
    },
    {
      "allow": ["/articles/"],
      "callback": null,
      "follow": true,
      "priority": 50
    }
  ],
  "settings": {
    "CLOUDFLARE_ENABLED": true,
    "CLOUDFLARE_STRATEGY": "hybrid",
    "CLOUDFLARE_COOKIE_REFRESH_THRESHOLD": 600,
    "CF_MAX_RETRIES": 5,
    "CF_RETRY_INTERVAL": 1,
    "CF_POST_DELAY": 5,
    "CF_WAIT_SELECTOR": "h1.title-med-1",
    "DOWNLOAD_DELAY": 2
  }
}
```
Timeouts & Hang Prevention
Each browser operation is capped at 300 seconds (5 minutes). If an operation exceeds this limit, the crawl fails with a TimeoutError instead of hanging forever. This protects against:
Browser subprocess hangs
Network stalls
Infinite CF challenge loops
Cross-thread asyncio deadlocks
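The behavior described above can be approximated with a simple asyncio timeout wrapper. This is an illustrative sketch, not the crawler's implementation; `run_browser_op` is a hypothetical name:

```python
import asyncio

BROWSER_OP_TIMEOUT = 300  # seconds; matches the documented per-operation cap

async def run_browser_op(coro, timeout: float = BROWSER_OP_TIMEOUT):
    """Wrap a browser operation so a hang surfaces as TimeoutError
    instead of blocking the crawl forever."""
    return await asyncio.wait_for(coro, timeout=timeout)
```

`asyncio.wait_for` cancels the underlying coroutine and raises `TimeoutError` when the deadline passes, which is what turns a silent hang into a visible crawl failure.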
Typical operation times:
CF verification: 10-60 seconds
Page load: 5-30 seconds
Cookie refresh: 10-30 seconds
If you consistently hit the 300s timeout, investigate:
Network connectivity issues
Site blocking your IP/region
Browser/Chrome subprocess problems
System resource constraints (CPU/memory)
Troubleshooting
Crawl Hangs at “Getting/refreshing CF cookies”
Symptoms: Browser opens but never navigates. Logs show “Getting/refreshing CF cookies” but no progress.
Possible causes:
Asyncio event loop mismatch (fixed in latest version)
Browser subprocess issues - Chrome/nodriver incompatible with thread-based event loop
Display/X11 issues on Linux servers
Network/firewall blocking browser traffic
Solutions:
1. Update to the latest version, which includes the timeout fix.
2. Verify the browser actually opens (a headless launch may be failing silently).
3. Check the display on Linux servers and verify Xvfb is installed: `sudo apt-get install xvfb`
4. Test with the inspector first: `./scrapai inspect https://example.com --project proj --cloudflare`
5. Check system resources: verify CPU, memory, and disk space availability.
Works on One Machine But Not Another
Environmental factors affecting browser subprocesses:
Python/asyncio version differences
Display environment (X11 vs Wayland vs headless)
Chrome/Chromium version and availability
System resources and timing (race conditions)
Network conditions (DNS, latency, firewalls)
Security software interfering with browser
Debugging steps:
1. Test the inspector on both machines: `./scrapai inspect https://example.com --project proj --cloudflare`
2. Check the Chrome/Chromium installation on each machine.
3. Verify the display (Linux): `echo $DISPLAY` (should show `:99` with Xvfb).
4. Review the logs for specific error messages.
5. Try the other strategy: switch between hybrid and browser_only.
Diagnosing via Logs
Hybrid mode indicators:
```
Cached N cookies (cf_clearance: ...)
```

Cookies are working properly.

Browser-only mode indicators:

```
Cloudflare verified successfully
Opened persistent browser
Closed browser
```

This is the normal browser lifecycle.
Title Contamination
If extracted titles show wrong text (e.g., “Related Articles” instead of actual title), set CF_WAIT_SELECTOR to the main title element.
```json
{
  "settings": {
    "CF_WAIT_SELECTOR": "h1.article-title"
  }
}
```
This captures HTML before related content loads, preventing contamination.
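The wait-then-capture behavior can be sketched as a polling loop. This is a hypothetical illustration of what CF_WAIT_SELECTOR/CF_WAIT_TIMEOUT do; `page`, `query`, and `html` are made-up names, not a real API:

```python
import time

def wait_for_selector(page, selector: str, timeout: float = 10.0,
                      interval: float = 0.25) -> str:
    """Poll until the selector matches, then snapshot the HTML.

    `timeout` plays the role of CF_WAIT_TIMEOUT; capturing as soon as
    the title element exists avoids late-loading related-content widgets.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if page.query(selector):
            return page.html()  # capture before related content loads
        time.sleep(interval)
    raise TimeoutError(f"selector {selector!r} not found within {timeout}s")
```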
- Proxy Escalation: combine with smart proxy usage.
- Checkpoint Resume: pause and resume long crawls.