The Request class represents a single request in Scrapling’s spider framework. It encapsulates the URL, callback, priority, metadata, and session parameters for fetching and processing web pages.
## Class Definition

```python
from scrapling.spiders import Request


class Request:
    """Represents a request to be processed by a Spider."""
```
## Constructor

```python
def __init__(
    self,
    url: str,
    sid: str = "",
    callback: Callable[[Response], AsyncGenerator] | None = None,
    priority: int = 0,
    dont_filter: bool = False,
    meta: dict[str, Any] | None = None,
    _retry_count: int = 0,
    **kwargs: Any,
)
```
### Parameters

- **`url`** (`str`, required): The URL to fetch.
- **`sid`** (`str`, default `""`): Session ID to use for this request. If empty, the spider's default session is used.
- **`callback`** (`Callable | None`, default `None`): Async generator function to process the response. If `None`, the spider's `parse()` method is used.
- **`priority`** (`int`, default `0`): Request priority. Higher values are processed first.
- **`dont_filter`** (`bool`, default `False`): If `True`, this request won't be filtered by the duplicate filter, even if it's already been seen.
- **`meta`** (`dict[str, Any] | None`, default `None`): Arbitrary metadata dictionary to pass along with the request. Merged with `response.meta`.
- **`_retry_count`** (`int`, default `0`): Internal retry counter (managed automatically by the engine).
- **`**kwargs`** (`Any`): Additional session-specific keyword arguments (e.g., `headers`, `proxy`, `method`, `data`, `json`). These are passed to the session's fetch method.
## Attributes

- **`sid`**: Session ID for this request.
- **`callback`**: Response processing callback.
- **`priority`**: Request priority for scheduling.
- **`dont_filter`**: Whether to bypass duplicate filtering.
- **`domain`**: Cached property that extracts the domain from the URL (e.g., `"example.com"`).
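A cached property like `domain` is computed once on first access and then stored on the instance. As an illustrative sketch (a hypothetical stand-in, not Scrapling's actual implementation), the same behavior can be built with `functools.cached_property` and `urllib.parse`:

```python
from functools import cached_property
from urllib.parse import urlparse


class RequestSketch:
    """Minimal stand-in showing how a cached `domain` property works."""

    def __init__(self, url: str) -> None:
        self.url = url

    @cached_property
    def domain(self) -> str:
        # Parsed once on first access; the result is then cached on the instance
        return urlparse(self.url).netloc


req = RequestSketch("https://example.com/path?q=1")
print(req.domain)  # example.com
```

Because the value is cached per instance, repeated accesses (e.g., for per-domain rate limiting) cost nothing after the first parse.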
## Methods

### `copy`

```python
def copy(self) -> Request
```

Create a copy of this request. Useful when retrying or modifying requests.

**Returns:** A new `Request` instance with copied attributes.

Example:

```python
original = Request("https://example.com", priority=5)
retry = original.copy()
retry.priority = 10  # Increase priority for retry
```
### `update_fingerprint`

```python
def update_fingerprint(
    self,
    include_kwargs: bool = False,
    include_headers: bool = False,
    keep_fragments: bool = False,
) -> bytes
```

Generate a unique fingerprint for deduplication. The fingerprint is cached in `self._fp` after the first computation.

- **`include_kwargs`** (`bool`, default `False`): Include session kwargs (except `data`/`json`) in the fingerprint.
- **`include_headers`** (`bool`, default `False`): Include request headers in the fingerprint.
- **`keep_fragments`** (`bool`, default `False`): Keep URL fragments when canonicalizing the URL for fingerprinting.

**Returns:** SHA-1 hash bytes representing the unique fingerprint.

The fingerprint is based on: the canonicalized URL, session ID, HTTP method, request body (`data`/`json`), and optionally headers and kwargs.
## Special Methods

### Comparison

Requests can be compared for priority-based sorting:

```python
# Higher-priority requests are "greater than" lower-priority ones
req1 = Request("https://example.com", priority=5)
req2 = Request("https://example.com", priority=10)

req2 > req1  # True
req1 < req2  # True
```
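Order comparisons like these are what let a scheduler keep pending requests in a priority queue. A minimal sketch of that idea (not Scrapling's actual engine) using Python's `heapq`, which is a min-heap, so priorities are negated to pop the highest-priority item first:

```python
import heapq

# heapq is a min-heap, so store the negated priority to pop highest-priority first
queue: list[tuple[int, str]] = []
heapq.heappush(queue, (-5, "https://example.com/list"))
heapq.heappush(queue, (-10, "https://example.com/detail"))

_, url = heapq.heappop(queue)
print(url)  # https://example.com/detail
```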
### Equality

Requests are equal if they have the same fingerprint:

```python
req1 = Request("https://example.com")
req2 = Request("https://example.com")

# Fingerprints must be generated first
req1.update_fingerprint()
req2.update_fingerprint()

req1 == req2  # True (same URL, same session, same method)
```

You must call `update_fingerprint()` before comparing requests with `==`; otherwise a `RuntimeError` is raised.
### String Representation

```python
req = Request("https://example.com", priority=5, callback=spider.parse)

print(req)
# Output: https://example.com

print(repr(req))
# Output: <Request(https://example.com) priority=5 callback=parse>
```
### Serialization

Requests support pickling for checkpoint/resume functionality. The callback is stored as a method-name string and restored from the spider instance.

```python
import pickle

req = Request("https://example.com", callback=spider.parse)
serialized = pickle.dumps(req)
restored = pickle.loads(serialized)

# Callback is None after unpickling
restored._restore_callback(spider)  # Restore from spider
```
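The underlying pattern — replacing an unpicklable bound method with its name during pickling, then re-binding it from the spider afterwards — can be sketched like this (an illustrative stand-in with hypothetical names, not Scrapling's code):

```python
import pickle


class RequestSketch:
    """Shows the store-callback-as-name pickling pattern."""

    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback

    def __getstate__(self):
        state = self.__dict__.copy()
        # Bound methods don't pickle portably; keep only the method name
        cb = state.pop("callback")
        state["_callback_name"] = cb.__name__ if cb else None
        return state

    def __setstate__(self, state):
        self._callback_name = state.pop("_callback_name")
        self.__dict__.update(state)
        self.callback = None  # re-bound later from the spider instance

    def restore_callback(self, spider):
        if self._callback_name:
            self.callback = getattr(spider, self._callback_name)


class Spider:
    def parse(self, response):
        return f"parsed {response}"


spider = Spider()
req = RequestSketch("https://example.com", callback=spider.parse)
restored = pickle.loads(pickle.dumps(req))
restored.restore_callback(spider)
print(restored.callback("page"))  # parsed page
```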
## Usage Examples

### Basic Request

```python
from scrapling.spiders import Request

# Simple GET request
request = Request("https://api.example.com/data")
```

### POST Request with JSON

```python
request = Request(
    "https://api.example.com/search",
    method="POST",
    json={"query": "scrapling", "limit": 10},
    headers={"Authorization": "Bearer token"},
)
```
### Request with Custom Callback

```python
class MySpider(Spider):
    async def parse(self, response):
        # Extract detail page links
        for link in response.css("a.detail::attr(href)").getall():
            yield Request(
                response.urljoin(link),
                callback=self.parse_detail,
                priority=10,  # Higher priority for detail pages
            )

    async def parse_detail(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "content": response.css(".content::text").get(),
        }
```
### Passing Data with `meta`

```python
class MySpider(Spider):
    async def parse(self, response):
        # Pass data between callbacks using meta
        category = response.css(".category::text").get()
        for product in response.css(".product"):
            link = product.css("a::attr(href)").get()
            yield Request(
                response.urljoin(link),
                callback=self.parse_product,
                meta={"category": category, "page": 1},
            )

    async def parse_product(self, response):
        # Access metadata from response.meta
        yield {
            "name": response.css(".name::text").get(),
            "category": response.meta["category"],
            "page": response.meta["page"],
        }
```
### Request with a Different Session

```python
class MySpider(Spider):
    def configure_sessions(self, manager):
        from scrapling.fetchers import FetcherSession, AsyncStealthySession

        manager.add("default", FetcherSession())
        manager.add("stealth", AsyncStealthySession())

    async def parse(self, response):
        # Use the stealth session for sensitive pages
        yield Request(
            "https://example.com/protected",
            sid="stealth",
            callback=self.parse_protected,
        )
```
### Request with Proxy

```python
from scrapling.engines.toolbelt import ProxyRotator


class MySpider(Spider):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.proxy_rotator = ProxyRotator([
            "http://proxy1:8080",
            "http://proxy2:8080",
        ])

    async def parse(self, response):
        yield Request(
            "https://example.com/data",
            proxy=self.proxy_rotator.get_proxy(),
        )
```
### Bypassing the Duplicate Filter

```python
async def parse(self, response):
    # Normal request - will be filtered if seen before
    yield Request("https://example.com/data")

    # Force a re-fetch even if seen before
    yield Request(
        "https://example.com/data",
        dont_filter=True,
        meta={"reason": "forced_update"},
    )
```
## Internal Attributes

- **`_retry_count`**: Number of times this request has been retried (managed by `CrawlerEngine`).
- Session kwargs dictionary: keyword arguments passed to the session's fetch method.
- **`_fp`**: Cached fingerprint bytes; `None` until `update_fingerprint()` is called.
- Pickling helper: a temporary attribute that stores the callback's method name during pickling.
See Also