If you’re coming from Scrapy, you’ll feel right at home with Scrapling’s spider system. The API is intentionally familiar, but Scrapling brings modern Python async/await patterns, simplified session management, and built-in pause/resume capabilities. This guide will help you migrate your existing Scrapy spiders to Scrapling.

Core Concepts Comparison

| Scrapy Concept | Scrapling Equivalent | Notes |
| --- | --- | --- |
| `scrapy.Spider` | `scrapling.spiders.Spider` | Similar base class with `name`, `start_urls` |
| `scrapy.Request` | `scrapling.spiders.Request` | Similar API, but simpler |
| `scrapy.Response` | `scrapling.engines.Response` | Extends `Selector` with additional methods |
| `parse()` method | `parse()` method | Must be an async generator in Scrapling |
| `yield Request` | `yield Request` | Same pattern |
| `yield item` | `yield dict` | Just yield dictionaries |
| Item classes | Python dicts | No need for `Item` classes |
| Item Pipelines | `on_scraped_item()` hook | Simpler approach |
| Middlewares | Session configuration | Different architecture |
| `scrapy crawl` | `spider.start()` | Programmatic approach |
| Settings | Class attributes | Direct configuration |

Spider Structure Comparison

Basic Spider

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['https://quotes.toscrape.com']
    
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
Key differences:
  1. parse() must be an async generator in Scrapling
  2. Annotate the response parameter with Response for better IDE support
  3. The callback must be passed explicitly (callback=self.parse) in follow requests

Running the Spider

# Command line
scrapy crawl quotes

# Or programmatically
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(QuotesSpider)
process.start()

Advanced Features Comparison

Multiple Callbacks

import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ['https://example.com/products']
    
    def parse(self, response):
        for product_url in response.css('a.product::attr(href)').getall():
            yield response.follow(product_url, callback=self.parse_product)
    
    def parse_product(self, response):
        yield {
            'name': response.css('h1::text').get(),
            'price': response.css('.price::text').get(),
        }

Request Metadata

import scrapy

class MySpider(scrapy.Spider):
    name = "metadata"
    start_urls = ['https://example.com']
    
    def parse(self, response):
        for url in response.css('a::attr(href)').getall():
            yield scrapy.Request(
                url,
                callback=self.parse_page,
                meta={'category': 'electronics'}
            )
    
    def parse_page(self, response):
        yield {
            'url': response.url,
            'category': response.meta['category'],
            'title': response.css('h1::text').get(),
        }

Concurrency Control

# In settings.py
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_DELAY = 0.5

# Or in spider
class MySpider(scrapy.Spider):
    name = "my_spider"
    custom_settings = {
        'CONCURRENT_REQUESTS': 16,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 8,
        'DOWNLOAD_DELAY': 0.5,
    }

Allowed Domains

import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']
    
    def parse(self, response):
        # Links to other domains are automatically filtered
        for link in response.css('a::attr(href)').getall():
            yield response.follow(link)

Item Processing

Item Pipelines vs Hooks

# pipelines.py
class MyPipeline:
    def process_item(self, item, spider):
        # Clean price
        if 'price' in item:
            item['price'] = float(item['price'].replace('$', ''))
        return item

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}

Session Management (Middlewares Alternative)

Using Different Session Types

Scrapy uses middlewares for request/response processing. Scrapling uses a session-based architecture:
# Scrapy requires middleware for browser automation
# Usually requires additional libraries like scrapy-playwright

import scrapy
from scrapy_playwright.page import PageMethod

class MySpider(scrapy.Spider):
    name = "browser_spider"
    
    def start_requests(self):
        yield scrapy.Request(
            "https://example.com",
            meta=dict(
                playwright=True,
                playwright_page_methods=[
                    PageMethod("wait_for_selector", "div.content"),
                ],
            ),
        )

Proxy Configuration

# Via environment variables (picked up by Scrapy's built-in HttpProxyMiddleware)
# export https_proxy=http://proxy.example.com:8000

# Or per request, via Request.meta
class MySpider(scrapy.Spider):
    name = "proxy_spider"
    
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={'proxy': 'http://proxy.example.com:8000'}
            )

Pause & Resume

Scrapy requires jobs directory configuration and command-line management. Scrapling makes it simple:
# settings.py
JOBDIR = "crawls/myspider"

# Command line
scrapy crawl myspider
# Press Ctrl+C to pause
# Run again to resume:
scrapy crawl myspider

Lifecycle Hooks

import scrapy
from scrapy import signals

class MySpider(scrapy.Spider):
    name = "lifecycle"
    
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Setup code
    
    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Signal handlers must be connected explicitly in Scrapy
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider
    
    def spider_opened(self, spider):
        self.logger.info('Spider opened')
    
    def spider_closed(self, spider, reason):
        self.logger.info(f'Spider closed: {reason}')

Logging

# settings.py
LOG_LEVEL = 'INFO'
LOG_FILE = 'spider.log'

import scrapy

class MySpider(scrapy.Spider):
    name = "logging"
    
    def parse(self, response):
        self.logger.info('Processing page')
        self.logger.debug('Debug info')
        self.logger.warning('Warning message')

Selector Syntax

Good news! Scrapling uses the same selector syntax as Scrapy:
# Both work identically
response.css('div.quote span.text::text').get()
response.css('div.quote span.text::text').getall()
response.xpath('//div[@class="quote"]//span[@class="text"]/text()').get()

# Chaining
response.css('div.quote').css('span.text::text').getall()

Streaming Results

Scrapy doesn’t have built-in streaming. Scrapling does:
import asyncio
from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "streaming"
    start_urls = ['https://example.com']
    
    async def parse(self, response: Response):
        for item in response.css('.item'):
            yield {'title': item.css('h2::text').get()}

async def main():
    spider = MySpider()
    async for item in spider.stream():
        # Process items as they arrive
        print(f"Got item: {item}")
        # Check stats during crawl
        print(f"Progress: {spider.stats.items_scraped} items")

asyncio.run(main())

Complete Migration Example

Here’s a complete Scrapy spider migrated to Scrapling:
import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose

class Product(scrapy.Item):
    # Without processors, ItemLoader fields load as lists
    name = scrapy.Field(input_processor=MapCompose(str.strip), output_processor=TakeFirst())
    price = scrapy.Field(input_processor=MapCompose(str.strip), output_processor=TakeFirst())
    url = scrapy.Field(output_processor=TakeFirst())

class ProductSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/products']
    
    custom_settings = {
        'CONCURRENT_REQUESTS': 8,
        'DOWNLOAD_DELAY': 1,
    }
    
    def parse(self, response):
        for product in response.css('div.product'):
            loader = ItemLoader(item=Product(), selector=product)
            loader.add_css('name', 'h2::text')
            loader.add_css('price', 'span.price::text')
            loader.add_value('url', response.url)
            yield loader.load_item()
        
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

Key Advantages of Scrapling

  1. Modern Async/Await: Native async/await instead of Twisted deferreds
  2. Simpler Architecture: No need for separate settings.py, items.py, pipelines.py
  3. Built-in Sessions: Multiple fetcher types (HTTP, browser, stealth) in one spider
  4. Easy Pause/Resume: Just pass the crawldir parameter
  5. Real-time Streaming: Stream items as they’re scraped with spider.stream()
  6. Better Performance: Optimized parsing that’s faster than Scrapy’s Parsel
  7. Type Hints: Full type coverage for better IDE support
  8. Simpler API: Less boilerplate, more Pythonic

What Scrapling Doesn’t Have

  • No built-in commands system (like scrapy genspider)
  • No extensions system (use Python decorators/inheritance)
  • No contracts for testing (use standard Python testing)
  • No full project framework; Scrapling takes a deliberately leaner, library-style approach than Scrapy

Next Steps

Scrapling gives you the power of Scrapy with a modern, simpler API. Happy scraping!
