Introduction
Prerequisites
- You’ve completed or read the Fetchers basics page to understand the different fetcher types and when to use each one.
- You’ve completed or read the Main classes page to understand the Selector and Response classes.
- You’ve read the Architecture page for a high-level overview of how the spider system works.
Your First Spider
A spider is a class that defines how to crawl and extract data from websites. The simplest possible spider needs just three things:
- name — A unique identifier for the spider.
- start_urls — A list of URLs to start crawling from.
- parse() — An async generator method that processes each response and yields results.
The parse() method is where the magic happens. You use the same selection methods you'd use with Scrapling's Selector/Response classes, and yield dictionaries to output scraped items.
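As a sketch of that shape, here is a minimal spider. The Spider base class below is a stand-in so the snippet runs on its own (in real code you would import it from scrapling), and the site and CSS selectors are illustrative:

```python
import asyncio

class Spider:
    # Stand-in for the library's Spider base class, so this sketch
    # is self-contained; replace with the real import in your project.
    pass

class QuotesSpider(Spider):
    name = "quotes"                               # unique identifier
    start_urls = ["https://quotes.toscrape.com"]  # where crawling starts

    async def parse(self, response):
        # Async generator: yield one dict per scraped item.
        for quote in response.css(".quote"):
            yield {
                "text": quote.css(".text::text").get(),
                "author": quote.css(".author::text").get(),
            }
```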
Running the Spider
To run your spider, create an instance and call start().
The start() method handles all the async machinery internally — no need to worry about event loops. While the spider is running, everything that happens is logged to the terminal, and at the end of the crawl you get very detailed stats.
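To illustrate how a synchronous start() can hide the event loop entirely (this is a toy stand-in, not scrapling's actual implementation), such a method can wrap an async crawl coroutine in asyncio.run:

```python
import asyncio

class MiniSpider:
    # Toy illustration of the start() pattern: the caller never
    # touches asyncio directly.
    async def _crawl(self):
        # stand-in for the real fetch/parse loop
        return {"requests_count": 1}

    def start(self):
        # Spin up the event loop, run the crawl, return the result.
        return asyncio.run(self._crawl())

stats = MiniSpider().start()
```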
Those stats are in the returned CrawlResult object, which gives you everything you need:
Following Links
Most crawls need to follow links across multiple pages. Use response.follow() to create follow-up requests.
response.follow() handles relative URLs automatically — it joins them with the current page’s URL. It also sets the current page as the Referer header by default.
You can point follow-up requests at different callback methods for different page types.
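A hypothetical sketch of that routing: listing pages go to parse(), product pages to a dedicated parse_product(). The callback= parameter name is an assumption based on the text, and both methods are async generators, which you can verify with inspect:

```python
import inspect

class ShopSpider:
    # Hypothetical spider; selectors and the callback= parameter
    # name are illustrative assumptions.
    name = "shop"
    start_urls = ["https://example.com/products"]

    async def parse(self, response):
        for link in response.css("a.product::attr(href)"):
            # Hand product pages to a dedicated callback.
            yield response.follow(link, callback=self.parse_product)

    async def parse_product(self, response):
        yield {"title": response.css("h1::text").get()}

# Both callbacks are async generators, as required:
assert inspect.isasyncgenfunction(ShopSpider.parse)
assert inspect.isasyncgenfunction(ShopSpider.parse_product)
```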
All callback methods must be async generators (using async def and yield).
Exporting Data
The ItemList returned in result.items has built-in export methods.
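Whatever method names the library exposes, the underlying operation is serializing the list of yielded dicts; a standard-library sketch (independent of scrapling) of JSON and CSV export looks like this:

```python
import csv
import io
import json

items = [{"text": "To be...", "author": "Shakespeare"}]

# JSON: dump the list of dicts directly.
as_json = json.dumps(items, ensure_ascii=False)

# CSV: use the first item's keys as the header row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=items[0].keys())
writer.writeheader()
writer.writerows(items)
as_csv = buf.getvalue()
```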
Filtering Domains
Use allowed_domains to restrict the spider to specific domains. This prevents it from accidentally following links to external websites.
An entry like allowed_domains = {"example.com"} also allows subdomains such as sub.example.com and blog.example.com.
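The matching rule implied here is a dot-suffix check: a host passes if it equals an allowed domain or is a subdomain of one. A hypothetical helper (not scrapling's internals) makes the semantics concrete:

```python
ALLOWED = {"example.com"}

def is_allowed(host: str) -> bool:
    # Exact match, or subdomain via dot-suffix match; the dot prevents
    # "notexample.com" from matching "example.com".
    return any(host == d or host.endswith("." + d) for d in ALLOWED)

assert is_allowed("example.com")
assert is_allowed("blog.example.com")
assert not is_allowed("notexample.com")
```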
When a request is filtered out, it’s counted in stats.offsite_requests_count so you can see how many were dropped.
What’s Next
Now that you have the basics, you can explore:
- Requests & Responses — learn about request priority, deduplication, metadata, and more.
- Sessions — use multiple fetcher types (HTTP, browser, stealth) in a single spider.
- Advanced features — concurrency control, pause/resume, streaming, lifecycle hooks, and logging.