How It Works
Auto Mode Strategy (default):
Smart cost control:
- Direct connections are faster and free
- Proxies only used when necessary
- Datacenter proxies preferred (cheaper)
- Residential proxies require explicit user approval
- Learns per-domain blocking patterns
- Reduces proxy bandwidth costs by 80-90%
- No surprise costs - expensive proxies need human approval
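The escalation order described above can be sketched as a small decision function. This is an illustrative sketch only, not the middleware's actual code: the `Tier` enum, the `next_tier` function, and the `residential_approved` flag are hypothetical names introduced here.

```python
from enum import Enum
from typing import Optional


class Tier(Enum):
    """Connection tiers, ordered from cheapest to most expensive (hypothetical names)."""
    DIRECT = "direct"
    DATACENTER = "datacenter"
    RESIDENTIAL = "residential"


def next_tier(current: Tier, residential_approved: bool) -> Optional[Tier]:
    """Return the tier to try after `current` was blocked, or None to stop escalating."""
    if current is Tier.DIRECT:
        # Direct connections are free, so the first escalation is the cheap proxy tier.
        return Tier.DATACENTER
    if current is Tier.DATACENTER and residential_approved:
        # Residential proxies are only reached when a human has approved the cost.
        return Tier.RESIDENTIAL
    # Either approval is missing or there is no tier left to try.
    return None
```

Keeping the approval check inside the escalation step means an unapproved crawl stops at the datacenter tier instead of silently moving on to the expensive one.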
Setup
Add proxy credentials to .env:
Usage
Auto Mode (Default) - Expert-in-the-Loop
- Default Behavior
- Explicit Auto
Explicit Proxy Modes
All modes follow the same smart strategy:
- ✅ Start with direct connections (fast, free)
- ✅ Only use proxy when blocked (403/429 errors)
- ✅ Learn which domains need proxies
- ✅ Use proxy proactively for blocked domains
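The per-domain learning in the list above could look roughly like the following. It is a plain-Python sketch under stated assumptions, not the SmartProxyMiddleware source: `DomainBlockTracker` and its methods are hypothetical names, and a single 403 or 429 is assumed to be enough to mark a domain as blocked.

```python
from urllib.parse import urlparse

# Status codes treated as "blocked", as listed in the strategy above.
BLOCK_STATUSES = {403, 429}


class DomainBlockTracker:
    """Remembers which domains have blocked direct requests (hypothetical helper)."""

    def __init__(self):
        self.blocked_domains = set()

    @staticmethod
    def _domain(url):
        # Use the network location (host, plus port if present) as the domain key.
        return urlparse(url).netloc.lower()

    def should_use_proxy(self, url):
        """True when the domain is already known to block direct connections."""
        return self._domain(url) in self.blocked_domains

    def record_response(self, url, status, used_proxy):
        """Record a response; return True when the request should be retried via proxy."""
        if status in BLOCK_STATUSES and not used_proxy:
            self.blocked_domains.add(self._domain(url))
            return True
        return False
```

A downloader middleware would typically call `should_use_proxy` before sending a request and `record_response` after receiving one, so that later requests to the same domain go through the proxy from the start.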
The --proxy-type flag controls escalation behavior and cost limits.
Configuration
The middleware is enabled by default in settings.py with priority 350.
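For reference, a downloader middleware is registered in Scrapy through the `DOWNLOADER_MIDDLEWARES` setting. The fragment below is a sketch of what such an entry generally looks like. The package name `myproject` is a placeholder, not the project's real module path.

```python
# settings.py (sketch; "myproject" is a placeholder package name)
DOWNLOADER_MIDDLEWARES = {
    # Lower numbers run process_request earlier. At 350 this middleware runs
    # before Scrapy's built-in HttpProxyMiddleware (750), so a proxy set in
    # request.meta is already in place when the built-in middleware sees it.
    "myproject.middlewares.SmartProxyMiddleware": 350,
}
```

Setting the value to `None` instead of `350` is the standard Scrapy way to disable a middleware.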
Statistics Tracking
SmartProxyMiddleware tracks detailed usage statistics:
- Direct requests - Connections without proxy
- Proxy requests - Connections using proxy
- Blocked retries - Requests that hit 403/429 and retried with proxy
- Blocked domains - Domains that consistently need proxies
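The four counters above map naturally onto a small record type. The following is a minimal sketch, assuming one counter per statistic. `ProxyStats` and `proxy_share` are hypothetical names, and the real middleware may store these values differently (for example in Scrapy's stats collector).

```python
from dataclasses import dataclass, field


@dataclass
class ProxyStats:
    """Usage counters mirroring the statistics listed above (hypothetical structure)."""
    direct_requests: int = 0   # connections made without a proxy
    proxy_requests: int = 0    # connections routed through a proxy
    blocked_retries: int = 0   # requests that hit 403/429 and were retried via proxy
    blocked_domains: set = field(default_factory=set)  # domains that need proxies

    def proxy_share(self) -> float:
        """Fraction of all requests that went through a proxy (0.0 when idle)."""
        total = self.direct_requests + self.proxy_requests
        return self.proxy_requests / total if total else 0.0
```

A low `proxy_share` at spider close suggests that most traffic is still going out over free direct connections.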
Implementation Details
Middleware Logic (from source: middlewares.py):
Proxy URL Format:
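Most HTTP proxy providers accept credentials embedded in the proxy URL, in the conventional form `scheme://username:password@host:port`. The sketch below assembles such a URL. It assumes that conventional form, and the exact format SmartProxyMiddleware expects may differ. The function name, parameters, and example host are all hypothetical.

```python
from urllib.parse import quote


def build_proxy_url(host, port, username, password, scheme="http"):
    """Build a proxy URL with percent-encoded credentials (hypothetical helper)."""
    # Percent-encode credentials so characters such as "@" or ":" in a
    # password cannot be mistaken for URL delimiters.
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"{scheme}://{user}:{secret}@{host}:{port}"
```

For example, `build_proxy_url("proxy.example.com", 8080, "user", "p@ss")` returns `http://user:p%40ss@proxy.example.com:8080`.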
- File: middlewares.py:16-277
- Class: SmartProxyMiddleware
- Priority: 350 (configured in settings.py)
- Type: Scrapy downloader middleware
Proxy Providers
SmartProxyMiddleware works with any HTTP proxy provider:
Datacenter Proxies (Recommended)
Best for most use cases:
Decodo
⭐ Recommended - Residential and datacenter proxies, good value
Bright Data
Industry leader
Oxylabs
High quality
IPRoyal
Budget friendly
Residential Proxies
For sites that block datacenter IPs:
Troubleshooting
Proxy Not Being Used
Still Getting Blocked with Proxy
Proxy Costs Too High
SmartProxyMiddleware should already minimize costs by using direct connections first.
- Check which domains are marked as blocked (in stats at spider close)
- Verify those domains actually need proxies
- Consider whether the site has changed and the domain can be unblocked
- Some sites may require proxies for all requests - this is expected
When to Mention to Users
Recommend proxy setup when:
- User asks about proxies or rate limiting
- Spider is getting blocked (403/429 errors in logs)
- User needs to scrape at scale (1000s of pages)
- User mentions a proxy provider (Bright Data, Oxylabs, Smartproxy, etc.)
- Crawls are failing with “Access Denied” or “Too Many Requests”
Related Guides
Cloudflare Bypass
Handle Cloudflare-protected sites
Checkpoint Resume
Pause and resume long crawls