Performance Optimization
Scrapling provides multiple ways to optimize scraping performance, from resource blocking to concurrent requests and page pooling.

Resource Blocking

Block unnecessary resources to speed up page loads:

Disable Resources

Block fonts, images, media, and other non-essential resources:
from scrapling import StealthyFetcher

response = StealthyFetcher.fetch(
    'https://example.com',
    disable_resources=True  # Significant speed boost
)
Blocked resource types:
  • font - Web fonts
  • image - Images and icons
  • media - Videos and audio
  • beacon - Analytics beacons
  • object - Embedded objects
  • imageset - Responsive images
  • texttrack - Video subtitles
  • websocket - WebSocket connections
  • csp_report - CSP reports
  • stylesheet - CSS files
Source: scrapling/engines/constants.py:2-13

Block Specific Domains

Block analytics and tracking domains:
response = StealthyFetcher.fetch(
    'https://example.com',
    blocked_domains={
        'google-analytics.com',
        'facebook.com',
        'doubleclick.net',
        'googletagmanager.com',
        'hotjar.com',
    }
)
Subdomains are automatically matched: blocking example.com also blocks sub.example.com.
Source: scrapling/engines/_browsers/_stealth.py:199
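Under the hood this amounts to a hostname suffix check; a minimal sketch of the rule (illustrative only, not Scrapling's actual implementation):
from urllib.parse import urlparse

def is_blocked(url: str, blocked_domains: set) -> bool:
    # A host is blocked when it equals a blocked domain
    # or ends with '.' + that domain (i.e., it is a subdomain of it)
    host = urlparse(url).hostname or ''
    return any(host == d or host.endswith('.' + d) for d in blocked_domains)

assert is_blocked('https://sub.example.com/page', {'example.com'})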

Concurrency Settings

Concurrent Requests

Control how many requests run simultaneously:
from scrapling import Spider

class FastSpider(Spider):
    name = 'fast_spider'
    concurrent_requests = 16  # Process 16 requests at once (default: 4)
    
    start_urls = ['https://example.com']
    
    async def parse(self, response):
        yield {'title': response.css('title::text').get()}
Source: scrapling/spiders/spider.py:76

Per-Domain Concurrency

Limit concurrent requests per domain to avoid overwhelming servers:
class PoliteSpider(Spider):
    name = 'polite_spider'
    concurrent_requests = 16
    concurrent_requests_per_domain = 2  # Max 2 simultaneous requests per domain
    
    start_urls = ['https://example.com']
    
    async def parse(self, response):
        yield {'data': response.css('.content::text').get()}
Source: scrapling/spiders/spider.py:77

Download Delay

Add delay between requests to the same domain:
class ThrottledSpider(Spider):
    name = 'throttled_spider'
    download_delay = 2.0  # Wait 2 seconds between requests to same domain
    
    start_urls = ['https://example.com']
    
    async def parse(self, response):
        yield {'content': response.text}
Source: scrapling/spiders/spider.py:78

Page Pooling

Reuse browser pages instead of creating new ones:

Session Page Pooling

from scrapling import StealthySession

# Create session with page pool
with StealthySession(max_pages=5) as session:
    # Pages are reused from the pool
    for url in urls:
        response = session.fetch(url)
        # Process response
Configuration: scrapling/engines/_browsers/_validators.py:62
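Scrapling also ships async session classes; a sketch assuming AsyncStealthySession mirrors the synchronous API, letting concurrent fetches share one pooled browser:
import asyncio

from scrapling import AsyncStealthySession

async def crawl(urls):
    async with AsyncStealthySession(max_pages=5) as session:
        # Each coroutine borrows a page from the pool and returns it when done
        return await asyncio.gather(*(session.fetch(url) for url in urls))

responses = asyncio.run(crawl(['https://example.com']))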

Pool Statistics

Monitor page pool usage:
with StealthySession(max_pages=10) as session:
    response = session.fetch('https://example.com')
    
    stats = session.get_pool_stats()
    print(f"Total pages: {stats['total_pages']}")
    print(f"Busy pages: {stats['busy_pages']}")
    print(f"Max pages: {stats['max_pages']}")
Source: scrapling/engines/_browsers/_base.py:125-131

Network Optimization

Skip Network Idle

Don’t wait for network to be completely idle:
response = StealthyFetcher.fetch(
    'https://example.com',
    network_idle=False  # Don't wait for network idle (default)
)
Only enable network_idle=True when you need to ensure all asynchronous requests have completed.
Source: scrapling/engines/_browsers/_stealth.py:52
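For pages that load their data through background requests, the slower but safer setting looks like this:
response = StealthyFetcher.fetch(
    'https://example.com',
    network_idle=True  # Wait for the network to go quiet before returning
)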

Skip DOM Loading

For static content, skip waiting for JavaScript:
response = StealthyFetcher.fetch(
    'https://example.com',
    load_dom=False  # Don't wait for DOMContentLoaded
)
Disabling load_dom may result in incomplete content for JavaScript-heavy sites. Default is True.
Source: scrapling/engines/_browsers/_stealth.py:67
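One safe pattern is to try the fast path first and fall back to the defaults when expected content is missing; a sketch (assumes an empty selection is falsy):
url = 'https://example.com'
response = StealthyFetcher.fetch(url, load_dom=False)  # Fast path
if not response.css('.content'):
    # The content likely depends on JavaScript; refetch with defaults
    response = StealthyFetcher.fetch(url)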

Reduce Wait Time

Minimize or remove post-load wait:
response = StealthyFetcher.fetch(
    'https://example.com',
    wait=0  # No wait after page loads (default)
)
Source: scrapling/engines/_browsers/_stealth.py:54

Timeout Optimization

Set appropriate timeouts:
response = StealthyFetcher.fetch(
    'https://example.com',
    timeout=15000  # 15 seconds (default: 30 seconds)
)
Lower timeouts fail faster on slow sites; higher timeouts give complex pages more time to load.
Source: scrapling/engines/_browsers/_stealth.py:53
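A common pattern is to try a short timeout first and retry once with a longer one; a sketch (a broad except is used because the raised exception type depends on the browser backend):
response = None
for timeout in (10000, 30000):  # Fast attempt, then a generous fallback
    try:
        response = StealthyFetcher.fetch('https://example.com', timeout=timeout)
        break
    except Exception:
        continue  # Timed out or failed; retry with the longer limit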

Headless Mode

Always use headless mode in production:
response = StealthyFetcher.fetch(
    'https://example.com',
    headless=True  # Faster than headful (default)
)
Headful mode is 20-30% slower and should only be used for debugging.
Source: scrapling/engines/_browsers/_stealth.py:46

Browser Flags

Scrapling uses optimized browser flags by default:
DEFAULT_ARGS = (
    '--no-pings',
    '--no-first-run',
    '--disable-infobars',
    '--disable-breakpad',
    '--no-service-autorun',
    '--homepage=about:blank',
    '--password-store=basic',
    '--disable-hang-monitor',
    '--no-default-browser-check',
    '--disable-session-crashed-bubble',
    '--disable-search-engine-choice-screen',
)
Source: scrapling/engines/constants.py:24-37

Add custom flags for additional optimization:
response = StealthyFetcher.fetch(
    'https://example.com',
    extra_flags=[
        '--disable-gpu',     # Skip GPU initialization; safe for most scraping
        '--single-process',  # Can be unstable in Chromium; test before production use
    ]
)

Complete Optimization Example

Maximum Speed Configuration

from scrapling import Spider, StealthySession

class UltraFastSpider(Spider):
    name = 'ultra_fast'
    start_urls = ['https://example.com']
    
    # Concurrency
    concurrent_requests = 32
    concurrent_requests_per_domain = 4
    download_delay = 0.5
    
    def configure_sessions(self, manager):
        manager.add('fast', StealthySession(
            # Page pooling
            max_pages=10,
            
            # Resource blocking
            disable_resources=True,
            blocked_domains={
                'google-analytics.com',
                'googletagmanager.com',
                'facebook.com',
            },
            
            # Network optimization
            network_idle=False,
            load_dom=True,
            wait=0,
            timeout=15000,
            
            # Headless mode
            headless=True,
        ))
    
    async def parse(self, response):
        yield {'title': response.css('title::text').get()}
        
        # Follow links
        for link in response.css('a::attr(href)').getall()[:10]:
            yield response.follow(link, callback=self.parse_item)
    
    async def parse_item(self, response):
        yield {
            'url': response.url,
            'content': response.css('.content::text').get(),
        }

if __name__ == '__main__':
    spider = UltraFastSpider()
    result = spider.start(use_uvloop=True)  # Use faster event loop
    print(f"Scraped {len(result.items)} items")

Balanced Configuration

For sites that need more stability:
class BalancedSpider(Spider):
    name = 'balanced'
    start_urls = ['https://example.com']
    
    concurrent_requests = 8
    concurrent_requests_per_domain = 2
    download_delay = 1.0
    
    def configure_sessions(self, manager):
        manager.add('balanced', StealthySession(
            max_pages=5,
            disable_resources=True,
            network_idle=False,
            load_dom=True,
            wait=500,  # Small wait for JS
            timeout=30000,
            headless=True,
        ))
    
    async def parse(self, response):
        yield {'data': response.css('.data::text').get()}

Async Best Practices

Use uvloop

Use uvloop for faster async performance:
spider = MySpider()
result = spider.start(use_uvloop=True)  # 2-4x faster event loop
Source: scrapling/spiders/spider.py:264-281

Batch Processing

Process items in batches:
from scrapling import Spider

class BatchSpider(Spider):
    name = 'batch_processor'
    start_urls = ['https://example.com']
    
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.batch = []
        self.batch_size = 100
    
    async def on_scraped_item(self, item):
        self.batch.append(item)
        
        if len(self.batch) >= self.batch_size:
            await self._save_batch()
            self.batch = []
        
        return item
    
    async def _save_batch(self):
        # Batch insert; `db` stands in for your async database client
        # (e.g., a motor collection or an asyncpg wrapper)
        await db.insert_many(self.batch)
    
    async def on_close(self):
        # Save remaining items
        if self.batch:
            await self._save_batch()
    
    async def parse(self, response):
        yield {'data': response.css('.item::text').get()}

Memory Optimization

Parser Storage

For huge HTML documents, use disk-based storage:
from scrapling import Fetcher

Fetcher.configure(
    huge_tree=True,  # Use libxml2 HUGE_TREE option
    storage='disk'   # Store parsed tree on disk instead of memory
)

response = Fetcher.fetch('https://huge-page.com')
Source: scrapling/engines/toolbelt/custom.py:140-154

Adaptive Parsing

Adaptive mode reduces memory for simple selections:
Fetcher.configure(adaptive=True)

response = Fetcher.fetch('https://example.com')
# Uses minimal DOM representation for CSS/XPath queries
Source: scrapling/engines/toolbelt/custom.py:141

Benchmarking

Track performance metrics:
from scrapling import Spider
import time

class BenchmarkSpider(Spider):
    name = 'benchmark'
    start_urls = ['https://example.com']
    
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_time = None
    
    async def on_start(self, resuming=False):
        self.start_time = time.time()
        self.logger.info("Starting benchmark...")
    
    async def on_close(self):
        duration = time.time() - self.start_time
        
        stats = self.stats
        self.logger.info(f"Duration: {duration:.2f}s")
        self.logger.info(f"Requests: {stats.request_count}")
        self.logger.info(f"Items: {stats.item_count}")
        self.logger.info(
            f"Throughput: {stats.request_count / duration:.2f} req/s"
        )
        self.logger.info(
            f"Items/s: {stats.item_count / duration:.2f}"
        )
    
    async def parse(self, response):
        yield {'data': response.css('title::text').get()}

Performance Comparison

Configuration    Requests/sec    Use Case
Ultra Fast       30-50           Simple HTML, no anti-bot
Balanced         10-20           JavaScript sites, moderate protection
Stealth          3-8             Heavy anti-bot, Cloudflare
Conservative     1-3             Rate-limited APIs, respectful scraping

Best Practices

Measure performance before making changes:
import time

start = time.time()
response = StealthyFetcher.fetch('https://example.com')
duration = time.time() - start

print(f"Fetch took {duration:.2f}s")
Always block unnecessary resources:
# Fast
response = StealthyFetcher.fetch(
    url,
    disable_resources=True,
    blocked_domains={'analytics.com'}
)

# Slow
response = StealthyFetcher.fetch(url)
Start conservative, increase gradually:
concurrent_requests = 4   # Start
concurrent_requests = 8   # Test
concurrent_requests = 16  # Optimize
Monitor error rates as you increase concurrency.
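One way to do that is to track the failure ratio at each concurrency level you test; a sketch using the session API shown above (assumes response.status carries the HTTP status code):
urls = ['https://example.com'] * 20
failures = 0

with StealthySession(max_pages=10) as session:
    for url in urls:
        try:
            response = session.fetch(url)
            if response.status >= 400:  # Server pushing back (429, 503, ...)
                failures += 1
        except Exception:
            failures += 1  # Count timeouts and network errors too

print(f"Error rate: {failures / len(urls):.1%}")  # Back off if this climbs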
Use sessions; they reuse connections and browser instances:
# Slow - creates new browser each time
for url in urls:
    response = StealthyFetcher.fetch(url)

# Fast - reuses browser
with StealthySession() as session:
    for url in urls:
        response = session.fetch(url)
