Migrating from Scrapy

If you’re coming from Scrapy, you’ll feel right at home with Scrapling’s spider system. The API is intentionally familiar, but Scrapling brings modern Python async/await patterns, simplified session management, and built-in pause/resume capabilities. This guide will help you migrate your existing Scrapy spiders to Scrapling.

Core Concepts Comparison

Scrapy Concept	Scrapling Equivalent	Notes
`scrapy.Spider`	`scrapling.spiders.Spider`	Similar base class with `name`, `start_urls`
`scrapy.Request`	`scrapling.spiders.Request`	Similar API, but simpler
`scrapy.http.Response`	`scrapling.engines.Response`	Extends `Selector` with additional methods
`parse()` method	`parse()` method	Must be async generator in Scrapling
`yield Request`	`yield Request`	Same pattern
`yield item`	`yield dict`	Just yield dictionaries
Item classes	Python dicts	No need for Item classes
Item Pipelines	`on_scraped_item()` hook	Simpler approach
Middlewares	Session configuration	Different architecture
`scrapy crawl`	`spider.start()`	Programmatic approach
Settings	Class attributes	Direct configuration

Spider Structure Comparison

Basic Spider

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['https://quotes.toscrape.com']
    
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

Key differences in the Scrapling version:

parse() must be an async generator (async def with yield) in Scrapling
Annotate the response parameter as Response for better IDE support
Follow requests must pass callback=self.parse explicitly
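
To make the first point concrete, here is a library-independent sketch of how an async-generator callback behaves. `FollowRequest`, `PAGES`, `parse`, and `crawl` are hypothetical stand-ins written for this illustration, not Scrapling classes. The only things taken from this guide are that the callback is an `async def` that yields dicts and follow-up requests, and that the follow-up names its callback explicitly.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Callable

# Hypothetical stand-in for a framework request object (not Scrapling's Request).
@dataclass
class FollowRequest:
    url: str
    callback: Callable[..., AsyncIterator]

# Fake site: page path -> (quotes on the page, next page or None).
PAGES = {
    "/page/1": (["quote A", "quote B"], "/page/2"),
    "/page/2": (["quote C"], None),
}

async def parse(url: str):
    """Async generator callback: yields items (dicts) and follow-up requests."""
    quotes, next_page = PAGES[url]
    for text in quotes:
        yield {"text": text, "page": url}
    if next_page:
        # The callback is named explicitly, mirroring callback=self.parse.
        yield FollowRequest(next_page, callback=parse)

async def crawl(start_url: str) -> list:
    """Tiny driver: consumes callbacks with `async for` and queues requests."""
    items, queue = [], [FollowRequest(start_url, callback=parse)]
    while queue:
        request = queue.pop(0)
        async for result in request.callback(request.url):
            if isinstance(result, FollowRequest):
                queue.append(result)
            else:
                items.append(result)
    return items

items = asyncio.run(crawl("/page/1"))
print([item["text"] for item in items])
```

A plain Scrapy `parse()` is a synchronous generator; the practical change when migrating is adding `async` to the definition, which lets the callback `await` other coroutines between yields.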

Running the Spider

# Command line
scrapy crawl quotes

# Or programmatically
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(QuotesSpider)
process.start()

Advanced Features Comparison

Multiple Callbacks

import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ['https://example.com/products']
    
    def parse(self, response):
        for product_url in response.css('a.product::attr(href)').getall():
            yield response.follow(product_url, callback=self.parse_product)
    
    def parse_product(self, response):
        yield {
            'name': response.css('h1::text').get(),
            'price': response.css('.price::text').get(),
        }

Request Metadata

import scrapy

class MySpider(scrapy.Spider):
    name = "metadata"
    start_urls = ['https://example.com']
    
    def parse(self, response):
        for url in response.css('a::attr(href)').getall():
            yield scrapy.Request(
                response.urljoin(url),  # hrefs may be relative; Request needs an absolute URL
                callback=self.parse_page,
                meta={'category': 'electronics'}
            )
    
    def parse_page(self, response):
        yield {
            'url': response.url,
            'category': response.meta['category'],
            'title': response.css('h1::text').get(),
        }

Concurrency Control

# In settings.py
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_DELAY = 0.5

# Or in spider
class MySpider(scrapy.Spider):
    name = "my_spider"
    custom_settings = {
        'CONCURRENT_REQUESTS': 16,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 8,
        'DOWNLOAD_DELAY': 0.5,
    }
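
Because Scrapling spiders run on asyncio, it can help to see what a concurrency cap means in that model. The sketch below uses only the standard library; `fetch` and `crawl` are hypothetical helpers and no network access happens. Scrapling's own attribute names for these limits are not shown in this section, so none are assumed here.

```python
import asyncio

async def fetch(url: str, limiter: asyncio.Semaphore, state: dict) -> str:
    """Pretend download guarded by a semaphore (no real network call)."""
    async with limiter:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # stands in for network latency
        state["active"] -= 1
    return url

async def crawl(urls: list, max_concurrent: int) -> dict:
    """Run all fetches, never allowing more than max_concurrent at once."""
    limiter = asyncio.Semaphore(max_concurrent)
    state = {"active": 0, "peak": 0}
    await asyncio.gather(*(fetch(url, limiter, state) for url in urls))
    return state

urls = [f"https://example.com/{i}" for i in range(20)]
state = asyncio.run(crawl(urls, max_concurrent=4))
print(f"peak concurrent fetches: {state['peak']}")
```

A per-request delay such as Scrapy's `DOWNLOAD_DELAY` is a separate knob: it spaces requests out in time, whereas the semaphore only bounds how many are in flight.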

Allowed Domains

import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']
    
    def parse(self, response):
        # Links to other domains are automatically filtered
        for link in response.css('a::attr(href)').getall():
            yield response.follow(link)
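
The rule behind `allowed_domains` is simple: a URL passes if its host is one of the listed domains or a subdomain of one. The helper below is a hypothetical, framework-independent sketch of that check, not code from either library.

```python
from urllib.parse import urlparse

def is_allowed(url: str, allowed_domains: list) -> bool:
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    # Relative links have no host; join them against the page URL before checking.
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in (d.lower() for d in allowed_domains)
    )

links = [
    "https://example.com/products",
    "https://shop.example.com/cart",
    "https://notexample.com/",
    "https://example.org/",
]
kept = [link for link in links if is_allowed(link, ["example.com"])]
print(kept)
```

Matching on `"." + domain` rather than a bare suffix is what keeps `notexample.com` out while still admitting `shop.example.com`.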

Item Processing

Item Pipelines vs Hooks

# pipelines.py
class MyPipeline:
    def process_item(self, item, spider):
        # Clean price
        if 'price' in item:
            item['price'] = float(item['price'].replace('$', ''))
        return item

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}
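
The cleaning logic itself does not depend on either framework. A minimal sketch, assuming prices arrive as strings like `$1,299.00`: `clean_price` and `process` are hypothetical names, and the function could be called from a Scrapy pipeline or from the `on_scraped_item()` hook listed in the comparison table (whose exact signature is not shown in this guide).

```python
import re
from typing import Optional

def clean_price(raw: Optional[str]) -> Optional[float]:
    """Turn a scraped price string such as '$1,299.00' into a float."""
    if raw is None:
        return None
    # Keep digits and the decimal point; drop currency symbols, commas, spaces.
    digits = re.sub(r"[^0-9.]", "", raw)
    try:
        return float(digits)
    except ValueError:
        # Nothing numeric was left (for example 'N/A' or an empty string).
        return None

def process(item: dict) -> dict:
    """Framework-agnostic item cleaner; returns a new dict."""
    cleaned = dict(item)
    if "price" in cleaned:
        cleaned["price"] = clean_price(cleaned["price"])
    return cleaned

print(process({"name": "Widget", "price": "$1,299.00"}))
```

Unlike a bare `float(item['price'].replace('$', ''))`, this version tolerates thousands separators and missing values instead of raising.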

Session Management (Middlewares Alternative)

Using Different Session Types

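Scrapy handles request/response processing through middlewares and download handlers, so browser automation needs a plugin such as scrapy-playwright. Scrapling configures fetcher sessions (HTTP, browser, stealth) on the spider instead.

# Scrapy: browser automation via the scrapy-playwright plugin.
# The plugin must also be enabled in settings.py (DOWNLOAD_HANDLERS and
# TWISTED_REACTOR), as described in its README.

import scrapy
from scrapy_playwright.page import PageMethod

class MySpider(scrapy.Spider):
    name = "browser_spider"
    
    def start_requests(self):
        yield scrapy.Request(
            "https://example.com",
            meta=dict(
                playwright=True,
                playwright_page_methods=[
                    PageMethod("wait_for_selector", "div.content"),
                ],
            ),
        )
    
    def parse(self, response):
        # Default callback for the request above
        yield {'title': response.css('title::text').get()}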

Proxy Configuration

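# Environment variable, read by Scrapy's built-in HttpProxyMiddleware
# (Scrapy has no HTTPS_PROXY setting in settings.py):
#   export https_proxy='http://proxy.example.com:8000'

# Or per request via meta
import scrapy

class MySpider(scrapy.Spider):
    name = "proxy_spider"
    start_urls = ['https://example.com']
    
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={'proxy': 'http://proxy.example.com:8000'}
            )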

Pause & Resume

Scrapy pauses and resumes a crawl through a job directory (JOBDIR), set in the settings or on the command line. Scrapling reduces this to a single crawldir parameter. The Scrapy side:

# settings.py
JOBDIR = "crawls/myspider"

# Command line
scrapy crawl myspider
# Press Ctrl+C to pause
# Run again to resume:
scrapy crawl myspider
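
Conceptually, a job directory stores enough state to continue where the crawl stopped: the requests still pending and the URLs already seen. The sketch below illustrates that idea with the standard library only. The file name `state.json` and both helper functions are hypothetical; neither Scrapy's nor Scrapling's on-disk format is described in this guide.

```python
import json
import tempfile
from pathlib import Path

STATE_FILE = "state.json"  # hypothetical file name, chosen for this sketch

def save_checkpoint(crawl_dir: Path, pending: list, seen: set) -> None:
    """Persist the pending queue and the set of already-visited URLs."""
    crawl_dir.mkdir(parents=True, exist_ok=True)
    state = {"pending": pending, "seen": sorted(seen)}
    (crawl_dir / STATE_FILE).write_text(json.dumps(state))

def load_checkpoint(crawl_dir: Path) -> tuple:
    """Return (pending, seen); both are empty when no checkpoint exists."""
    path = crawl_dir / STATE_FILE
    if not path.exists():
        return [], set()
    state = json.loads(path.read_text())
    return state["pending"], set(state["seen"])

# Simulate pausing after one page and resuming in a fresh run.
with tempfile.TemporaryDirectory() as tmp:
    crawl_dir = Path(tmp) / "crawls" / "myspider"
    fresh_pending, fresh_seen = load_checkpoint(crawl_dir)  # first run: nothing saved
    save_checkpoint(crawl_dir, pending=["/page/2", "/page/3"], seen={"/page/1"})
    pending, seen = load_checkpoint(crawl_dir)

print(pending, seen)
```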

Lifecycle Hooks

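import scrapy
from scrapy import signals

class MySpider(scrapy.Spider):
    name = "lifecycle"
    
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Setup code
    
    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Handlers run only once they are connected to the signals
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider
    
    def spider_opened(self, spider):
        self.logger.info('Spider opened')
    
    def spider_closed(self, spider, reason):
        self.logger.info(f'Spider closed: {reason}')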

Logging

# settings.py
LOG_LEVEL = 'INFO'
LOG_FILE = 'spider.log'

import scrapy

class MySpider(scrapy.Spider):
    name = "logging"
    
    def parse(self, response):
        self.logger.info('Processing page')
        self.logger.debug('Debug info')
        self.logger.warning('Warning message')

Selector Syntax

Good news! Scrapling uses the same selector syntax as Scrapy:

# Both work identically
response.css('div.quote span.text::text').get()
response.css('div.quote span.text::text').getall()
response.xpath('//div[@class="quote"]//span[@class="text"]/text()').get()

# Chaining
response.css('div.quote').css('span.text::text').getall()

Streaming Results

Scrapy doesn’t have built-in streaming. Scrapling does:

Scrapling Only

import asyncio
from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "streaming"
    start_urls = ['https://example.com']
    
    async def parse(self, response: Response):
        for item in response.css('.item'):
            yield {'title': item.css('h2::text').get()}

async def main():
    spider = MySpider()
    async for item in spider.stream():
        # Process items as they arrive
        print(f"Got item: {item}")
        # Check stats during crawl
        print(f"Progress: {spider.stats.items_scraped} items")

asyncio.run(main())

Complete Migration Example

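The Scrapy spider below combines the pieces covered above (an Item class, an ItemLoader, per-spider settings, and pagination) and is the starting point for a complete migration: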

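import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose

class Product(scrapy.Item):
    # Processors declared in Field metadata are picked up by ItemLoader;
    # without an output processor every field would be loaded as a list.
    name = scrapy.Field(input_processor=MapCompose(str.strip), output_processor=TakeFirst())
    price = scrapy.Field(input_processor=MapCompose(str.strip), output_processor=TakeFirst())
    url = scrapy.Field(output_processor=TakeFirst())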

class ProductSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/products']
    
    custom_settings = {
        'CONCURRENT_REQUESTS': 8,
        'DOWNLOAD_DELAY': 1,
    }
    
    def parse(self, response):
        for product in response.css('div.product'):
            loader = ItemLoader(item=Product(), selector=product)
            loader.add_css('name', 'h2::text')
            loader.add_css('price', 'span.price::text')
            loader.add_value('url', response.url)
            yield loader.load_item()
        
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
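
Since the comparison table maps Item classes to plain dicts, the loader step above reduces to ordinary Python. The sketch below is a hedged, library-independent illustration: `take_first` mimics the behaviour of itemloaders' `TakeFirst` (first value that is neither `None` nor an empty string), and `build_product` is a hypothetical helper. In a migrated `parse()`, the input lists would come from selector `.getall()` calls.

```python
from typing import Iterable, Optional

def take_first(values: Iterable[Optional[str]]) -> Optional[str]:
    """Return the first value that is neither None nor an empty string."""
    for value in values:
        if value is not None and value != "":
            return value
    return None

def build_product(names: list, prices: list, url: str) -> dict:
    """Build the item as a plain dict, stripping whitespace from text fields."""
    return {
        "name": take_first(name.strip() for name in names),
        "price": take_first(price.strip() for price in prices),
        "url": url,
    }

product = build_product(
    names=["  Widget "],
    prices=["", " $9.99 "],
    url="https://example.com/products",
)
print(product)
```

The trade-off is that a dict carries no field declarations, so a typo in a key goes unnoticed unless the item is validated elsewhere.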

Key Advantages of Scrapling

Modern Async/Await: Native async/await instead of Twisted deferreds
Simpler Architecture: No need for separate settings.py, items.py, pipelines.py
Built-in Sessions: Multiple fetcher types (HTTP, browser, stealth) in one spider
Easy Pause/Resume: Just pass a crawldir parameter
Real-time Streaming: Stream items as they’re scraped with spider.stream()
Better Performance: Optimized parsing that the project reports to be faster than Parsel, the selector library Scrapy uses
Type Hints: Full type coverage for better IDE support
Simpler API: Less boilerplate, more Pythonic

What Scrapling Doesn’t Have

No built-in commands system (like scrapy genspider)
No extensions system (use Python decorators/inheritance)
No contracts for testing (use standard Python testing)
No full project scaffolding (Scrapling is deliberately simpler than Scrapy’s framework approach)

Next Steps

Scrapling gives you the power of Scrapy with a modern, simpler API. Happy scraping!
