Scrapling provides multiple ways to optimize scraping performance, from resource blocking to concurrent requests and page pooling.
Resource Blocking
Block unnecessary resources to speed up page loads:
Disable Resources
Block fonts, images, media, and other non-essential resources:
from scrapling import StealthyFetcher
response = StealthyFetcher.fetch(
    'https://example.com',
    disable_resources=True,  # Significant speed boost
)
Blocked resource types:
font - Web fonts
image - Images and icons
media - Videos and audio
beacon - Analytics beacons
object - Embedded objects
imageset - Responsive images
texttrack - Video subtitles
websocket - WebSocket connections
csp_report - CSP reports
stylesheet - CSS files
Source: scrapling/engines/constants.py:2-13
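To make the effect of the list above concrete, here is a minimal sketch of a type-based filter. It is not Scrapling's implementation: the names `BLOCKED_RESOURCE_TYPES` and `should_block` are hypothetical, and the set simply copies the resource types listed in this section.

```python
# Hypothetical constant: copies the resource types listed in this section.
BLOCKED_RESOURCE_TYPES = frozenset({
    "font", "image", "media", "beacon", "object",
    "imageset", "texttrack", "websocket", "csp_report", "stylesheet",
})


def should_block(resource_type: str) -> bool:
    """Return True when a request of this type would be dropped (illustrative only)."""
    return resource_type.lower() in BLOCKED_RESOURCE_TYPES


# Documents and scripts are not in the list, so they still load.
print(should_block("image"), should_block("document"), should_block("script"))
```

Note that `stylesheet` is in the list, so pages whose content depends on CSS-driven layout or visibility may behave differently with this option enabled.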
Block Specific Domains
Block analytics and tracking domains:
response = StealthyFetcher.fetch(
    'https://example.com',
    blocked_domains={
        'google-analytics.com',
        'facebook.com',
        'doubleclick.net',
        'googletagmanager.com',
        'hotjar.com',
    },
)
Subdomains are automatically matched: blocking example.com also blocks sub.example.com.
Source: scrapling/engines/_browsers/_stealth.py:199
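The subdomain rule can be illustrated with a small suffix check. This is a sketch under the assumption that matching is done on the request's hostname; the helper name `is_blocked` is hypothetical and is not part of Scrapling's API.

```python
from urllib.parse import urlsplit


def is_blocked(url: str, blocked_domains: set) -> bool:
    """Illustrative check: exact host match or any subdomain of a blocked domain."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocked_domains)


blocked = {"example.com"}
print(is_blocked("https://sub.example.com/pixel.gif", blocked))  # subdomain of a blocked domain
print(is_blocked("https://notexample.com/", blocked))            # different registered domain
```

Matching on `"." + domain` rather than on a bare suffix keeps `notexample.com` from being caught by a rule for `example.com`.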
Concurrency Settings
Concurrent Requests
Control how many requests run simultaneously:
from scrapling import Spider
class FastSpider(Spider):
    name = 'fast_spider'
    concurrent_requests = 16  # Process 16 requests at once (default: 4)
    start_urls = ['https://example.com']

    async def parse(self, response):
        yield {'title': response.css('title::text').get()}
Source: scrapling/spiders/spider.py:76
Per-Domain Concurrency
Limit concurrent requests per domain to avoid overwhelming servers:
class PoliteSpider(Spider):
    name = 'polite_spider'
    concurrent_requests = 16
    concurrent_requests_per_domain = 2  # Max 2 simultaneous requests per domain
    start_urls = ['https://example.com']

    async def parse(self, response):
        yield {'data': response.css('.content::text').get()}
Source: scrapling/spiders/spider.py:77
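How a global limit and a per-domain limit interact can be sketched with two layers of `asyncio.Semaphore`. This is only an illustration of the idea, not how the Spider scheduler is written; `ConcurrencyLimiter`, `fake_fetch`, and `demo` are hypothetical names, and the fetch is simulated with a short sleep.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit


class ConcurrencyLimiter:
    """Hypothetical helper: one global semaphore plus one semaphore per domain."""

    def __init__(self, total: int, per_domain: int):
        self._global = asyncio.Semaphore(total)
        self._per_domain = defaultdict(lambda: asyncio.Semaphore(per_domain))
        self.active = defaultdict(int)  # requests currently in flight, per domain
        self.peak = defaultdict(int)    # highest in-flight count seen, per domain

    async def run(self, url, task):
        domain = urlsplit(url).hostname or ""
        # Take the per-domain slot first so waiters for a busy domain
        # do not hold global slots that other domains could use.
        async with self._per_domain[domain], self._global:
            self.active[domain] += 1
            self.peak[domain] = max(self.peak[domain], self.active[domain])
            try:
                return await task(url)
            finally:
                self.active[domain] -= 1


async def fake_fetch(url):
    await asyncio.sleep(0.01)  # stand-in for a real network request
    return url


async def demo():
    limiter = ConcurrencyLimiter(total=16, per_domain=2)
    urls = [f"https://example.com/{i}" for i in range(10)]
    await asyncio.gather(*(limiter.run(u, fake_fetch) for u in urls))
    return limiter.peak["example.com"]


if __name__ == "__main__":
    print("Peak concurrent requests to example.com:", asyncio.run(demo()))
```

Even though 16 global slots are available, the single domain never has more than two requests in flight at once.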
Download Delay
Add delay between requests to the same domain:
class ThrottledSpider(Spider):
    name = 'throttled_spider'
    download_delay = 2.0  # Wait 2 seconds between requests to same domain
    start_urls = ['https://example.com']

    async def parse(self, response):
        yield {'content': response.text}
Source: scrapling/spiders/spider.py:78
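A per-domain delay can be pictured as tracking, for each domain, the earliest time the next request may start. The sketch below assumes that model; `DomainThrottle` and `demo` are hypothetical names and the class is not taken from Scrapling.

```python
import asyncio
import time
from urllib.parse import urlsplit


class DomainThrottle:
    """Hypothetical helper: spaces out request start times per domain."""

    def __init__(self, delay: float):
        self.delay = delay
        self._next_allowed = {}  # domain -> earliest monotonic time for the next request
        self._locks = {}         # domain -> lock serialising waiters for that domain

    async def wait(self, url: str) -> None:
        domain = urlsplit(url).hostname or ""
        lock = self._locks.setdefault(domain, asyncio.Lock())
        async with lock:
            now = time.monotonic()
            ready_at = self._next_allowed.get(domain, now)
            if ready_at > now:
                await asyncio.sleep(ready_at - now)
            # The next request to this domain may start `delay` seconds after this one.
            self._next_allowed[domain] = max(now, ready_at) + self.delay


async def demo(delay: float = 0.05) -> float:
    throttle = DomainThrottle(delay)
    start = time.monotonic()
    for _ in range(3):
        await throttle.wait("https://example.com/page")
    return time.monotonic() - start


if __name__ == "__main__":
    print(f"Three throttled requests took {asyncio.run(demo()):.2f}s")
```

The first request passes immediately; each later request to the same domain waits out the remainder of the delay, while requests to other domains are unaffected.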
Page Pooling
Reuse browser pages instead of creating new ones:
Session Page Pooling
from scrapling import StealthySession
# Create session with page pool
with StealthySession(max_pages=5) as session:
    # Pages are reused from the pool
    for url in urls:
        response = session.fetch(url)
        # Process response
Configuration: scrapling/engines/_browsers/_validators.py:62
Pool Statistics
Monitor page pool usage:
with StealthySession(max_pages=10) as session:
    response = session.fetch('https://example.com')
    stats = session.get_pool_stats()
    print(f"Total pages: {stats['total_pages']}")
    print(f"Busy pages: {stats['busy_pages']}")
    print(f"Max pages: {stats['max_pages']}")
Source: scrapling/engines/_browsers/_base.py:125-131
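One way to act on these numbers is to turn them into a utilization ratio. The helper below is a hypothetical sketch that assumes only the three dictionary keys shown above (`total_pages`, `busy_pages`, `max_pages`); it works on a plain dict and does not call Scrapling.

```python
def pool_utilization(stats: dict) -> float:
    """Fraction of the pool that is busy, from a stats dict shaped like the one above."""
    max_pages = stats["max_pages"]
    return stats["busy_pages"] / max_pages if max_pages else 0.0


# Example with made-up numbers in the same shape as the pool statistics.
sample = {"total_pages": 6, "busy_pages": 5, "max_pages": 10}
print(f"Pool utilization: {pool_utilization(sample):.0%}")
```

A ratio that stays near 1.0 suggests the pool is the bottleneck and `max_pages` may be worth raising; a ratio that stays low suggests the pool is larger than needed.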
Network Optimization
Skip Network Idle
Don’t wait for network to be completely idle:
response = StealthyFetcher.fetch(
    'https://example.com',
    network_idle=False,  # Don't wait for network idle (default)
)
Only enable network_idle=True when you need to ensure all async requests complete.
Source: scrapling/engines/_browsers/_stealth.py:52
Skip DOM Loading
For static content, skip waiting for JavaScript:
response = StealthyFetcher.fetch(
    'https://example.com',
    load_dom=False,  # Don't wait for DOMContentLoaded
)
Disabling load_dom may result in incomplete content for JavaScript-heavy sites. Default is True.
Source: scrapling/engines/_browsers/_stealth.py:67
Reduce Wait Time
Minimize or remove post-load wait:
response = StealthyFetcher.fetch(
    'https://example.com',
    wait=0,  # No wait after page loads (default)
)
Source: scrapling/engines/_browsers/_stealth.py:54
Timeout Optimization
Set appropriate timeouts:
response = StealthyFetcher.fetch(
    'https://example.com',
    timeout=15000,  # Milliseconds: 15 seconds (default: 30 seconds)
)
Lower timeouts fail faster on slow sites; higher timeouts give complex pages more time to load.
Source: scrapling/engines/_browsers/_stealth.py:53
Headless Mode
Always use headless mode in production:
response = StealthyFetcher.fetch(
    'https://example.com',
    headless=True,  # Faster than headful (default)
)
Headful mode is 20-30% slower and should only be used for debugging.
Source: scrapling/engines/_browsers/_stealth.py:46
Browser Flags
Scrapling uses optimized browser flags by default:
DEFAULT_ARGS = (
    '--no-pings',
    '--no-first-run',
    '--disable-infobars',
    '--disable-breakpad',
    '--no-service-autorun',
    '--homepage=about:blank',
    '--password-store=basic',
    '--disable-hang-monitor',
    '--no-default-browser-check',
    '--disable-session-crashed-bubble',
    '--disable-search-engine-choice-screen',
)
Source: scrapling/engines/constants.py:24-37
Add custom flags for additional optimization:
response = StealthyFetcher.fetch(
    'https://example.com',
    extra_flags=[
        '--disable-gpu',
        '--single-process',
    ],
)
Complete Optimization Example
Maximum Speed Configuration
from scrapling import Spider, StealthySession
class UltraFastSpider(Spider):
    name = 'ultra_fast'
    start_urls = ['https://example.com']
    # Concurrency
    concurrent_requests = 32
    concurrent_requests_per_domain = 4
    download_delay = 0.5

    def configure_sessions(self, manager):
        manager.add('fast', StealthySession(
            # Page pooling
            max_pages=10,
            # Resource blocking
            disable_resources=True,
            blocked_domains={
                'google-analytics.com',
                'googletagmanager.com',
                'facebook.com',
            },
            # Network optimization
            network_idle=False,
            load_dom=True,
            wait=0,
            timeout=15000,
            # Headless mode
            headless=True,
        ))

    async def parse(self, response):
        yield {'title': response.css('title::text').get()}
        # Follow links
        for link in response.css('a::attr(href)').getall()[:10]:
            yield response.follow(link, callback=self.parse_item)

    async def parse_item(self, response):
        yield {
            'url': response.url,
            'content': response.css('.content::text').get(),
        }

if __name__ == '__main__':
    spider = UltraFastSpider()
    result = spider.start(use_uvloop=True)  # Use faster event loop
    print(f"Scraped {len(result.items)} items")
Balanced Configuration
For sites that need more stability:
class BalancedSpider(Spider):
    name = 'balanced'
    start_urls = ['https://example.com']
    concurrent_requests = 8
    concurrent_requests_per_domain = 2
    download_delay = 1.0

    def configure_sessions(self, manager):
        manager.add('balanced', StealthySession(
            max_pages=5,
            disable_resources=True,
            network_idle=False,
            load_dom=True,
            wait=500,  # Small wait for JS
            timeout=30000,
            headless=True,
        ))

    async def parse(self, response):
        yield {'data': response.css('.data::text').get()}
Async Best Practices
Use uvloop
Use uvloop for faster async performance:
spider = MySpider()
result = spider.start(use_uvloop=True)  # 2-4x faster event loop
Source: scrapling/spiders/spider.py:264-281
Batch Processing
Process items in batches:
class BatchSpider(Spider):
    name = 'batch_processor'
    start_urls = ['https://example.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.batch = []
        self.batch_size = 100

    async def on_scraped_item(self, item):
        self.batch.append(item)
        if len(self.batch) >= self.batch_size:
            await self._save_batch()
            self.batch = []
        return item

    async def _save_batch(self):
        # Batch save to database.
        # `db` is a placeholder for your own async database client.
        await db.insert_many(self.batch)

    async def on_close(self):
        # Save remaining items
        if self.batch:
            await self._save_batch()
            self.batch = []

    async def parse(self, response):
        yield {'data': response.css('.item::text').get()}
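The buffering logic can be tried on its own, without a spider or a database. The sketch below is a standalone illustration; `BatchBuffer` is a hypothetical class, and the sink is a plain list standing in for a bulk insert.

```python
class BatchBuffer:
    """Hypothetical helper: collects items and hands them to `sink` in fixed-size batches."""

    def __init__(self, batch_size, sink):
        self.batch_size = batch_size
        self.sink = sink  # callable that receives one list of items per batch
        self._items = []

    def add(self, item):
        self._items.append(item)
        if len(self._items) >= self.batch_size:
            self.flush()

    def flush(self):
        # Send whatever is buffered, including a final partial batch.
        if self._items:
            self.sink(list(self._items))
            self._items.clear()


saved = []  # stand-in for a database: each entry is one "bulk insert"
buffer = BatchBuffer(batch_size=3, sink=saved.append)
for i in range(7):
    buffer.add(i)
buffer.flush()  # equivalent of the on_close step
print(saved)  # [[0, 1, 2], [3, 4, 5], [6]]
```

The final `flush()` matters: without it, the last partial batch would be lost when the run ends.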
Memory Optimization
Parser Storage
For huge HTML documents, use disk-based storage:
from scrapling import Fetcher
Fetcher.configure(
    huge_tree=True,  # Use libxml2 HUGE_TREE option
    storage='disk',  # Store parsed tree on disk instead of memory
)
response = Fetcher.get('https://huge-page.com')
Source: scrapling/engines/toolbelt/custom.py:140-154
Adaptive Parsing
Adaptive mode reduces memory for simple selections:
Fetcher.configure(adaptive=True)
response = Fetcher.get('https://example.com')
# Uses minimal DOM representation for CSS/XPath queries
Source: scrapling/engines/toolbelt/custom.py:141
Benchmarking
Track performance metrics:
from scrapling import Spider
import time
class BenchmarkSpider(Spider):
    name = 'benchmark'
    start_urls = ['https://example.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_time = None

    async def on_start(self, resuming=False):
        self.start_time = time.time()
        self.logger.info("Starting benchmark...")

    async def on_close(self):
        duration = time.time() - self.start_time
        stats = self.stats
        self.logger.info(f"Duration: {duration:.2f}s")
        self.logger.info(f"Requests: {stats.request_count}")
        self.logger.info(f"Items: {stats.item_count}")
        self.logger.info(
            f"Throughput: {stats.request_count / duration:.2f} req/s"
        )
        self.logger.info(
            f"Items/s: {stats.item_count / duration:.2f}"
        )

    async def parse(self, response):
        yield {'data': response.css('title::text').get()}
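The throughput arithmetic in `on_close` divides by the run duration, which fails if the duration is zero. A small guard avoids that. The function names below are hypothetical, and the counts are passed in as plain numbers rather than read from a spider.

```python
def throughput(count: int, duration: float) -> float:
    """Events per second, returning 0.0 instead of dividing by a non-positive duration."""
    return count / duration if duration > 0 else 0.0


def summarize(request_count: int, item_count: int, duration: float) -> dict:
    """Collect the same metrics the benchmark spider logs, as a dictionary."""
    return {
        "duration_s": duration,
        "requests_per_s": throughput(request_count, duration),
        "items_per_s": throughput(item_count, duration),
    }


# Made-up numbers for illustration: 120 requests and 90 items in 60 seconds.
print(summarize(request_count=120, item_count=90, duration=60.0))
```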
| Configuration | Requests/sec | Use Case |
|---|---|---|
| Ultra Fast | 30-50 | Simple HTML, no anti-bot |
| Balanced | 10-20 | JavaScript sites, moderate protection |
| Stealth | 3-8 | Heavy anti-bot, Cloudflare |
| Conservative | 1-3 | Rate-limited APIs, respectful scraping |
Best Practices
Profile Before Optimizing
Measure performance before making changes:
import time
start = time.time()
response = StealthyFetcher.fetch('https://example.com')
duration = time.time() - start
print(f"Fetch took {duration:.2f}s")
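When several steps need timing, a small context manager keeps the measurements in one place. This is a sketch using only the standard library; `timed` is a hypothetical helper, and `time.sleep` stands in for the fetch so the example runs without a browser. It uses `time.perf_counter`, a monotonic clock intended for measuring short intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the elapsed wall-clock time of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("simulated_fetch", timings):
    time.sleep(0.01)  # stand-in for StealthyFetcher.fetch(...)

for label, seconds in timings.items():
    print(f"{label} took {seconds:.3f}s")
```

Running the same block with and without an option such as `disable_resources=True` gives a before and after figure for that change.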
Always block unnecessary resources:
# Fast
response = StealthyFetcher.fetch(
    url,
    disable_resources=True,
    blocked_domains={'analytics.com'},
)
# Slow
response = StealthyFetcher.fetch(url)
Start conservative, increase gradually:
concurrent_requests = 4   # Start
concurrent_requests = 8   # Test
concurrent_requests = 16  # Optimize
Monitor error rates as you increase concurrency.
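The ramp-up can be expressed as a simple rule: raise concurrency while the error rate stays low, and back off when it climbs. The function below is a hypothetical sketch; the 5% threshold and the ceiling of 32 are illustrative assumptions, not values from Scrapling.

```python
def next_concurrency(current: int, error_rate: float,
                     max_error_rate: float = 0.05, ceiling: int = 32) -> int:
    """Suggest the next concurrent_requests value from the last run's error rate.

    The threshold and ceiling are illustrative assumptions; tune them per site.
    """
    if error_rate > max_error_rate:
        return max(1, current // 2)   # too many errors: halve, but never below 1
    return min(ceiling, current * 2)  # healthy run: double, up to the ceiling


print(next_concurrency(4, error_rate=0.01))   # healthy run at 4
print(next_concurrency(16, error_rate=0.20))  # error-heavy run at 16
```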
Use Sessions for Multiple Requests
Sessions reuse connections and browser instances:
# Slow - creates new browser each time
for url in urls:
    response = StealthyFetcher.fetch(url)
# Fast - reuses browser
with StealthySession() as session:
    for url in urls:
        response = session.fetch(url)