Overview
Scrapling is built on a modular, layered architecture that separates concerns between fetching, parsing, and data extraction. The framework is designed around three core pillars:
Fetchers: Handle HTTP requests and browser automation
Parser: Process and navigate HTML/XML documents
Sessions: Manage persistent connections and state
Core Components
Fetcher Layer
The fetcher layer provides a unified interface for making web requests, abstracting away the differences between various HTTP clients and browser automation tools.
from scrapling import Fetcher, StealthyFetcher, DynamicFetcher
# Simple HTTP requests (Fetcher exposes HTTP-verb methods such as get and post)
response = Fetcher.get('https://example.com')
# Browser-based with JavaScript execution
response = DynamicFetcher.fetch('https://example.com')
# Stealth mode with anti-detection
response = StealthyFetcher.fetch('https://example.com')
All fetchers return a unified Response object that inherits from the Selector class, providing immediate parsing capabilities.
Parser Layer
At the heart of Scrapling is the Selector class, built on top of lxml for high-performance HTML/XML parsing.
Key Features:
CSS and XPath selector support
Adaptive element relocation (survives page structure changes)
Rich text extraction with TextHandler
Attribute handling with AttributesHandler
Tree navigation (parent, children, siblings)
from scrapling import Selector
# From HTML string
html = '<div class="product"><h2>Item</h2></div>'
selector = Selector(html)
# CSS selectors
title = selector.css('.product h2::text').get()
# XPath selectors
title = selector.xpath('//div[@class="product"]/h2/text()').get()
# Find methods with filters
product = selector.find('div', class_='product')
Response System
The Response class extends Selector with HTTP-specific metadata:
response = Fetcher.get('https://httpbin.org/get')
# HTTP metadata
print(response.status)           # 200
print(response.headers)          # Response headers
print(response.cookies)          # Cookies dict
print(response.request_headers)  # Request headers
# Parsing capabilities (inherited from Selector)
title = response.css('title::text').get()
links = response.css('a::attr(href)').getall()
# JSON responses
data = response.json()
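The relationship between the two classes can be illustrated with a toy model. The sketch below is not Scrapling's implementation: `MiniSelector` and `MiniResponse` are hypothetical names, and the standard library's `xml.etree.ElementTree` stands in for lxml, so it only accepts well-formed markup.

```python
import xml.etree.ElementTree as ET


class MiniSelector:
    """Hypothetical stand-in for a parser class (not Scrapling's Selector)."""

    def __init__(self, markup: str):
        # ElementTree requires well-formed markup, unlike a real HTML parser.
        self._root = ET.fromstring(markup)

    def text_of(self, tag: str):
        """Return the text of the first <tag> element, or None if absent."""
        return self._root.findtext(f".//{tag}")

    def attrs_of(self, tag: str, attr: str):
        """Return the given attribute from every <tag> element that has it."""
        return [el.get(attr) for el in self._root.iter(tag) if el.get(attr) is not None]


class MiniResponse(MiniSelector):
    """Adds HTTP-style metadata on top of the inherited parsing helpers."""

    def __init__(self, markup: str, status: int, headers: dict):
        super().__init__(markup)
        self.status = status
        self.headers = headers


page = MiniResponse(
    "<html><head><title>Demo</title></head>"
    "<body><a href='/a'>A</a><a href='/b'>B</a></body></html>",
    status=200,
    headers={"content-type": "text/html"},
)
title = page.text_of("title")
links = page.attrs_of("a", "href")
```

Because the response subclasses the parser, calling code can read `page.status` and call the parsing helpers on the same object without a separate parse step.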
Architecture Diagram
Request Flow
User Code
↓
Fetcher (Fetcher/DynamicFetcher/StealthyFetcher)
↓
Engine Layer (curl_cffi/Playwright)
↓
Response (extends Selector)
↓
Selector (lxml-based parser)
Core Inheritance
Selector (base parser)
↑
Response (adds HTTP metadata)
↑
Spider Results (optional spider framework)
Engine Layer
Scrapling uses different engines depending on the fetcher type:
Static Engine (curl_cffi)
Used by Fetcher and AsyncFetcher for fast HTTP requests with browser impersonation:
# Located in: scrapling/engines/static.py
from scrapling.fetchers import Fetcher
response = Fetcher.get(
    'https://httpbin.org/get',
    impersonate='chrome',    # Browser fingerprint
    stealthy_headers=True,   # Auto-generate realistic headers
    timeout=30,
    retries=3
)
Features:
Browser impersonation (Chrome, Firefox, Safari, Edge)
HTTP/2 and HTTP/3 support
Automatic header generation
Connection pooling with sessions
Browser Engine (Playwright)
Used by DynamicFetcher and StealthyFetcher for JavaScript-heavy sites:
# Located in: scrapling/engines/_browsers/
from scrapling.fetchers import DynamicFetcher
response = DynamicFetcher.fetch(
    'https://example.com',
    headless=True,
    disable_resources=True,  # Block images, fonts, etc.
    network_idle=True,       # Wait for network idle
    load_dom=True            # Wait for DOM ready
)
Features:
Real browser automation (Chromium-based)
JavaScript execution
Network interception and resource blocking
Page pooling for sessions
Stealth mode with anti-detection evasion
Data Flow
1. Request Initialization
# User makes a request
response = Fetcher.get('https://example.com')
2. Configuration Merging
The fetcher merges default settings with request-specific parameters:
# From scrapling/engines/static.py
class _ConfigurationLogic:
    def _merge_request_args(self, **method_kwargs):
        # Merges session defaults with request params
        final_args = {
            'headers': self._headers_job(...),
            'timeout': self._get_param(method_kwargs, 'timeout', self._default_timeout),
            'impersonate': _select_random_browser(impersonate),
            # ... more args
        }
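The merging step can be sketched independently of the library. This is a minimal illustration, assuming a session object that stores defaults and a per-request call that overrides them; `RequestDefaults`, `merge`, and `_UNSET` are hypothetical names, not Scrapling's API.

```python
_UNSET = object()  # sentinel: distinguishes "not passed" from an explicit None


class RequestDefaults:
    """Hypothetical holder of session-level defaults (not Scrapling's class)."""

    def __init__(self, timeout=30, retries=3, headers=None):
        self.defaults = {
            "timeout": timeout,
            "retries": retries,
            "headers": dict(headers or {}),
        }

    def merge(self, **overrides):
        """Return a new dict where per-request values win over defaults."""
        merged = {}
        for key, default in self.defaults.items():
            value = overrides.get(key, _UNSET)
            merged[key] = default if value is _UNSET else value
        # Headers are combined rather than replaced, so default headers survive.
        merged["headers"] = {**self.defaults["headers"], **(overrides.get("headers") or {})}
        return merged


session = RequestDefaults(timeout=30, headers={"Accept": "text/html"})
args = session.merge(timeout=5, headers={"X-Debug": "1"})
```

Returning a fresh dictionary on each call keeps the session defaults untouched, so one request's overrides cannot leak into the next.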
3. Engine Execution
The appropriate engine executes the request:
# HTTP requests use curl_cffi
session.request(method, **request_args)
# Browser requests use Playwright
page.goto(url, referer=referer)
4. Response Creation
Raw responses are converted to unified Response objects:
# From scrapling/engines/toolbelt/convertor.py
class ResponseFactory:
    @staticmethod
    def from_http_request(response, selector_config, meta):
        return Response(
            url=str(response.url),
            content=response.content,
            status=response.status_code,
            # ... more fields
        )
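The factory idea, one conversion function per engine all producing the same output type, can be shown with plain dataclasses. The sketch below is illustrative only: `RawHttpResult`, `UnifiedResponse`, and `SketchResponseFactory` are hypothetical names, and the field set is an assumption rather than Scrapling's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RawHttpResult:
    """Hypothetical shape of what an HTTP client hands back."""
    url: str
    status_code: int
    content: bytes
    headers: dict = field(default_factory=dict)


@dataclass
class UnifiedResponse:
    """Hypothetical unified response type shared by every engine."""
    url: str
    status: int
    body: bytes
    headers: dict
    meta: dict


class SketchResponseFactory:
    @staticmethod
    def from_http_request(raw: RawHttpResult, meta: Optional[dict] = None) -> UnifiedResponse:
        # Normalise engine-specific field names into the shared schema.
        return UnifiedResponse(
            url=str(raw.url),
            status=raw.status_code,
            body=raw.content,
            headers=dict(raw.headers),
            meta=dict(meta or {}),
        )


raw = RawHttpResult("https://example.com", 200, b"<html></html>", {"Server": "demo"})
unified = SketchResponseFactory.from_http_request(raw, meta={"attempt": 1})
```

A browser engine would get its own `from_...` method on the same factory, mapping its page object onto the identical `UnifiedResponse` fields.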
5. Parsing Layer Access
The Response inherits all Selector parsing methods:
# User extracts data using Selector methods
title = response.css('title::text').get()
links = response.find_all('a', href=True)
Design Principles
Lazy Imports
Scrapling uses lazy imports for faster startup times:
# From scrapling/__init__.py
_LAZY_IMPORTS = {
    "Fetcher": ("scrapling.fetchers", "Fetcher"),
    "Selector": ("scrapling.parser", "Selector"),
    # Only imported when accessed
}
def __getattr__(name: str):
    if name in _LAZY_IMPORTS:
        module_path, class_name = _LAZY_IMPORTS[name]
        module = __import__(module_path, fromlist=[class_name])
        return getattr(module, class_name)
    # Unknown names must raise, otherwise every missing attribute would resolve to None
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
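The mechanism behind this is the module-level `__getattr__` hook from PEP 562, which Python calls only when a normal attribute lookup on the module fails. The following self-contained sketch reproduces the behaviour with standard-library targets; `lazy_pkg`, `_LAZY`, and `_lazy_getattr` are hypothetical names, and a `types.ModuleType` object stands in for a package's `__init__` module so the example runs as a single file.

```python
import importlib
import types

# Hypothetical mapping: public name -> (module path, attribute name)
_LAZY = {
    "OrderedDict": ("collections", "OrderedDict"),
    "dumps": ("json", "dumps"),
}

lazy_pkg = types.ModuleType("lazy_pkg")  # stand-in for a package's __init__ module


def _lazy_getattr(name: str):
    """Module-level __getattr__ (PEP 562): runs only when normal lookup fails."""
    if name not in _LAZY:
        raise AttributeError(f"module 'lazy_pkg' has no attribute {name!r}")
    module_path, attr = _LAZY[name]
    value = getattr(importlib.import_module(module_path), attr)
    setattr(lazy_pkg, name, value)  # cache so later lookups skip this hook
    return value


lazy_pkg.__getattr__ = _lazy_getattr

resolved = lazy_pkg.OrderedDict            # first access triggers the import
cached = "OrderedDict" in vars(lazy_pkg)   # later accesses hit the module dict
```

Caching the resolved object on the module is an optional extra in this sketch; without it the hook simply runs again on each access.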
Unified Response Interface
All fetchers return the same Response type, making it easy to switch between different fetching strategies:
# All return Response objects
response1 = Fetcher.get(url)             # HTTP client
response2 = DynamicFetcher.fetch(url)    # Browser automation
response3 = StealthyFetcher.fetch(url)   # Stealth browser
# Same API for all
data = response1.css('.product::text').get()
data = response2.css('.product::text').get()
data = response3.css('.product::text').get()
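One practical consequence is that extraction code can take the fetching strategy as a parameter. The sketch below uses stub fetchers so it runs offline; `PageLike`, `StubPage`, `http_fetch`, `browser_fetch`, and `scrape_title` are all hypothetical names used for illustration, not part of Scrapling.

```python
from typing import Callable, Protocol


class PageLike(Protocol):
    """Structural type: anything exposing `status` and `first_text`."""
    status: int

    def first_text(self, marker: str) -> str: ...


class StubPage:
    """Hypothetical response object returned by both stub fetchers below."""

    def __init__(self, body: str, status: int = 200):
        self.body = body
        self.status = status

    def first_text(self, marker: str) -> str:
        # Toy extraction: return the text following `marker` up to the next '<'.
        start = self.body.index(marker) + len(marker)
        return self.body[start:self.body.index("<", start)]


def http_fetch(url: str) -> StubPage:      # stands in for an HTTP client
    return StubPage("<h1>Widget</h1>")


def browser_fetch(url: str) -> StubPage:   # stands in for a browser engine
    return StubPage("<div><h1>Widget</h1></div>")


def scrape_title(fetch: Callable[[str], PageLike], url: str) -> str:
    """Extraction code depends only on the shared response interface."""
    page = fetch(url)
    return page.first_text("<h1>")


titles = [scrape_title(f, "https://example.com") for f in (http_fetch, browser_fetch)]
```

Swapping an HTTP fetcher for a browser-based one then changes a single argument rather than the extraction logic.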
Separation of Concerns
Each layer has a clear responsibility:
Fetchers: Network communication and browser control
Engines: Low-level HTTP/browser implementation
Parser: HTML/XML processing and navigation
Sessions: State management and connection pooling
Toolbelt: Shared utilities (proxy rotation, fingerprints, etc.)
Performance Optimizations
Cached Properties:
# From scrapling/parser.py
class Selector:
    @property
    def tag(self) -> str:
        if not self.__tag:
            self.__tag = str(self._root.tag)
        return self.__tag  # Computed once, cached
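For comparison, the standard library offers `functools.cached_property`, which gives the same compute-once behaviour without a hand-written guard. This is a swapped-in alternative shown for illustration, not what the excerpt above uses; `Node` and `compute_count` are hypothetical names.

```python
from functools import cached_property


class Node:
    """Hypothetical element wrapper used only to show the caching behaviour."""

    def __init__(self, raw_tag):
        self._raw_tag = raw_tag
        self.compute_count = 0  # instrumentation for the demo

    @cached_property
    def tag(self) -> str:
        # Runs on first access only; the result is stored on the instance.
        self.compute_count += 1
        return str(self._raw_tag)


node = Node("div")
first, second = node.tag, node.tag
```

One behavioural difference worth knowing: a guard of the form `if not self.__tag` recomputes whenever the cached value is falsy (for example an empty string), whereas `cached_property` stores whatever the first call returned.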
Pre-compiled XPath:
from lxml.etree import XPath
# Pre-compiled for efficiency
_find_all_elements = XPath(".//*")
_find_all_elements_with_spaces = XPath(".//*[normalize-space(text())]")
Element Conversion:
def __elements_convertor(self, elements):
    # Store config once, reuse for all elements
    url = self.url
    encoding = self.encoding
    adaptive = self.__adaptive_enabled
    return Selectors(
        Selector(root=el, url=url, encoding=encoding, adaptive=adaptive)
        for el in elements
    )
Extension Points
Scrapling is designed to be extensible:
Custom Storage System
Implement custom storage for adaptive element relocation:
from functools import lru_cache
from scrapling import Selector
from scrapling.core.storage import StorageSystemMixin
# lru_cache on a class memoizes construction: identical arguments return the same instance
@lru_cache(maxsize=128)
class RedisStorage(StorageSystemMixin):
    def save(self, element, identifier):
        # Custom save logic
        pass
    def retrieve(self, identifier):
        # Custom retrieve logic
        pass
# Use custom storage (html is an HTML string, as in the Parser Layer example)
selector = Selector(
    html,
    adaptive=True,
    storage=RedisStorage,
    storage_args={'host': 'localhost'}
)
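To make the `save`/`retrieve` contract concrete, here is a minimal in-memory sketch that runs without Scrapling or Redis. `StorageBase` is a hypothetical stand-in for the library's mixin, `InMemoryStorage` is an illustrative name, and the `url` constructor argument and the dict-shaped element are assumptions made for the demo.

```python
from abc import ABC, abstractmethod
from functools import lru_cache


class StorageBase(ABC):
    """Hypothetical stand-in for a storage interface with save/retrieve."""

    def __init__(self, url=None):
        self.url = url

    @abstractmethod
    def save(self, element, identifier: str) -> None: ...

    @abstractmethod
    def retrieve(self, identifier: str): ...


@lru_cache(maxsize=None)  # same constructor args -> the same instance is reused
class InMemoryStorage(StorageBase):
    def __init__(self, url=None):
        super().__init__(url)
        self._data = {}

    def save(self, element, identifier: str) -> None:
        # Key by (url, identifier) so different sites do not collide.
        self._data[(self.url, identifier)] = dict(element)

    def retrieve(self, identifier: str):
        return self._data.get((self.url, identifier))


store = InMemoryStorage(url="https://example.com")
store.save({"tag": "h2", "text": "Item"}, "product-title")
same_store = InMemoryStorage(url="https://example.com")
found = same_store.retrieve("product-title")
```

Because the class is wrapped in `lru_cache`, constructing it twice with the same arguments yields one shared instance, which is why the second lookup sees the data saved through the first.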
Custom Fetcher
Extend BaseFetcher for custom fetching logic:
from scrapling.engines.toolbelt.custom import BaseFetcher
class CustomFetcher(BaseFetcher):
    @classmethod
    def fetch(cls, url: str, **kwargs):
        # Custom fetching logic
        # Must return Response object
        pass
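As a rough illustration of the shape such a subclass takes, the sketch below implements a fetcher that serves canned pages from memory, which can be handy for offline tests. It does not use Scrapling: `FetcherBase`, `StubResponse`, and `FixtureFetcher` are hypothetical stand-ins, and a real implementation would return the library's Response object instead.

```python
from dataclasses import dataclass


@dataclass
class StubResponse:
    """Hypothetical response type; a real fetcher would return the library's Response."""
    url: str
    status: int
    body: str


class FetcherBase:
    """Hypothetical base class standing in for a fetcher base."""

    @classmethod
    def fetch(cls, url: str, **kwargs) -> StubResponse:
        raise NotImplementedError


class FixtureFetcher(FetcherBase):
    """Serves canned pages from memory instead of touching the network."""

    pages = {"https://example.com": "<title>Example</title>"}

    @classmethod
    def fetch(cls, url: str, **kwargs) -> StubResponse:
        body = cls.pages.get(url)
        if body is None:
            # Unknown URLs are reported as 404 rather than raising.
            return StubResponse(url=url, status=404, body="")
        return StubResponse(url=url, status=200, body=body)


hit = FixtureFetcher.fetch("https://example.com")
miss = FixtureFetcher.fetch("https://example.com/missing")
```

Keeping `fetch` a classmethod matches the call style used throughout this page, where fetchers are invoked on the class without instantiation.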
File Structure
scrapling/
├── __init__.py # Lazy imports, main exports
├── parser.py # Selector and Selectors classes
├── fetchers/
│ ├── __init__.py # Fetcher exports
│ ├── requests.py # HTTP-based fetchers
│ ├── chrome.py # DynamicFetcher
│ └── stealth_chrome.py # StealthyFetcher
├── engines/
│ ├── __init__.py
│ ├── static.py # curl_cffi engine
│ ├── constants.py # Browser arguments
│ ├── _browsers/ # Playwright engine
│ │ ├── _base.py # Session base classes
│ │ ├── _controllers.py # DynamicSession
│ │ ├── _stealth.py # StealthySession
│ │ └── _page.py # Page pooling
│ └── toolbelt/ # Shared utilities
│ ├── custom.py # Response, BaseFetcher
│ ├── convertor.py # ResponseFactory
│ ├── fingerprints.py # Header generation
│ └── proxy_rotation.py # ProxyRotator
└── core/
├── custom_types.py # TextHandler, AttributesHandler
├── storage.py # Adaptive storage system
├── translator.py # CSS to XPath conversion
└── utils/ # Utilities
Next Steps
Fetchers: Learn about different fetcher types
Parsing: Deep dive into the parsing system
Sessions: Understand session management