If you’re coming from Scrapy, you’ll feel right at home with Scrapling’s spider system. The API is intentionally familiar, but Scrapling brings modern Python async/await patterns, simplified session management, and built-in pause/resume capabilities. This guide will help you migrate your existing Scrapy spiders to Scrapling.
Core Concepts Comparison
| Scrapy Concept | Scrapling Equivalent | Notes |
|---|---|---|
| `scrapy.Spider` | `scrapling.spiders.Spider` | Similar base class with `name`, `start_urls` |
| `scrapy.Request` | `scrapling.spiders.Request` | Similar API, but simpler |
| `scrapy.Response` | `scrapling.engines.Response` | Extends `Selector` with additional methods |
| `parse()` method | `parse()` method | Must be an async generator in Scrapling |
| `yield Request` | `yield Request` | Same pattern |
| `yield item` | `yield dict` | Just yield dictionaries |
| Item classes | Python dicts | No need for Item classes |
| Item Pipelines | `on_scraped_item()` hook | Simpler approach |
| Middlewares | Session configuration | Different architecture |
| `scrapy crawl` | `spider.start()` | Programmatic approach |
| Settings | Class attributes | Direct configuration |
Spider Structure Comparison
Basic Spider
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
Key differences:
- parse() must be an async generator in Scrapling
- Type hint Response for better IDE support
- Must specify callback=self.parse explicitly in follow requests
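To make the async-generator point concrete, here is a minimal, framework-free sketch. It does not import Scrapling: `FakeRequest`, `FakeResponse`, `PAGES`, and `crawl` are hypothetical stand-ins invented for illustration, so treat it as a picture of the pattern (an `async def` callback that yields both items and follow-up requests with an explicit `callback=`), not as Scrapling's actual internals.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Callable, Optional, Union


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a follow-up request."""
    url: str
    callback: Callable


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a response with pre-parsed data."""
    url: str
    quotes: list
    next_page: Optional[str] = None

    def follow(self, url: str, callback: Callable) -> FakeRequest:
        return FakeRequest(url=url, callback=callback)


# Canned pages used instead of real HTTP traffic.
PAGES = {
    "/page/1": FakeResponse("/page/1", [{"text": "A", "author": "X"}], "/page/2"),
    "/page/2": FakeResponse("/page/2", [{"text": "B", "author": "Y"}]),
}


async def parse(response: FakeResponse) -> AsyncIterator[Union[dict, FakeRequest]]:
    # `async def` plus `yield` makes this an async generator.
    for quote in response.quotes:
        yield quote
    if response.next_page:
        # The callback is named explicitly rather than implied.
        yield response.follow(response.next_page, callback=parse)


async def crawl(start_url: str) -> list:
    """Tiny driver: runs callbacks, collects dicts, queues follow-up requests."""
    items, queue = [], [FakeRequest(start_url, parse)]
    while queue:
        request = queue.pop(0)
        async for result in request.callback(PAGES[request.url]):
            if isinstance(result, FakeRequest):
                queue.append(result)
            else:
                items.append(result)
    return items


print(asyncio.run(crawl("/page/1")))
```

The driver consumes the callback with `async for`, which is why a plain (non-async) generator would not work in its place.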
Running the Spider
# Command line
scrapy crawl quotes
# Or programmatically
from scrapy.crawler import CrawlerProcess
process = CrawlerProcess()
process.crawl(QuotesSpider)
process.start()
Advanced Features Comparison
Multiple Callbacks
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product_url in response.css('a.product::attr(href)').getall():
            yield response.follow(product_url, callback=self.parse_product)

    def parse_product(self, response):
        yield {
            'name': response.css('h1::text').get(),
            'price': response.css('.price::text').get(),
        }
# Passing metadata from one callback to the next via Request.meta
import scrapy

class MySpider(scrapy.Spider):
    name = "metadata"
    start_urls = ['https://example.com']

    def parse(self, response):
        for url in response.css('a::attr(href)').getall():
            # response.follow resolves relative hrefs; scrapy.Request rejects URLs without a scheme
            yield response.follow(
                url,
                callback=self.parse_page,
                meta={'category': 'electronics'}
            )

    def parse_page(self, response):
        yield {
            'url': response.url,
            'category': response.meta['category'],
            'title': response.css('h1::text').get(),
        }
Concurrency Control
# In settings.py
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_DELAY = 0.5

# Or in spider
class MySpider(scrapy.Spider):
    name = "my_spider"
    custom_settings = {
        'CONCURRENT_REQUESTS': 16,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 8,
        'DOWNLOAD_DELAY': 0.5,
    }
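For readers new to async crawling, the sketch below shows the general idea behind a concurrency cap and a per-request delay using only `asyncio`. It is an illustration, not how Scrapy or Scrapling implement throttling: `Throttle`, `fake_fetch`, and `main` are hypothetical names, and the delay here is applied per task rather than spaced per domain as Scrapy's `DOWNLOAD_DELAY` is.

```python
import asyncio


class Throttle:
    """Caps in-flight tasks and records the observed peak (illustrative helper)."""

    def __init__(self, max_concurrent: int, delay: float = 0.0):
        self._sem = asyncio.Semaphore(max_concurrent)
        self.delay = delay
        self.in_flight = 0
        self.peak = 0

    async def run(self, coro_fn, *args):
        async with self._sem:  # waits while max_concurrent tasks are already running
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                await asyncio.sleep(self.delay)  # politeness delay before the request
                return await coro_fn(*args)
            finally:
                self.in_flight -= 1


async def fake_fetch(url: str) -> str:
    await asyncio.sleep(0.01)  # stands in for network I/O
    return f"fetched {url}"


async def main(n_urls: int = 10, max_concurrent: int = 3):
    throttle = Throttle(max_concurrent=max_concurrent, delay=0.0)
    urls = [f"https://example.com/{i}" for i in range(n_urls)]
    results = await asyncio.gather(*(throttle.run(fake_fetch, u) for u in urls))
    return results, throttle.peak


results, peak = asyncio.run(main())
print(len(results), "pages fetched; peak concurrency:", peak)
```

Whatever the framework-level setting is called, the effect is the same: at most `max_concurrent` requests are in flight at once.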
Allowed Domains
import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    def parse(self, response):
        # Links to other domains are automatically filtered
        for link in response.css('a::attr(href)').getall():
            yield response.follow(link)
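If you ever need the same filtering outside a framework, or want to check which links an allow-list would keep, the logic can be approximated with the standard library. This is a rough sketch: `is_allowed` and `filter_links` are hypothetical helpers, and the subdomain rule (a host matches if it equals an allowed domain or ends with `.` plus that domain) is an assumption modelled on how domain allow-lists commonly behave.

```python
from urllib.parse import urljoin, urlparse


def is_allowed(url: str, allowed_domains: set) -> bool:
    """True if the URL's host is an allowed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def filter_links(base_url: str, hrefs: list, allowed_domains: set) -> list:
    """Resolve hrefs against base_url and keep only on-domain links."""
    absolute = (urljoin(base_url, href) for href in hrefs)
    return [url for url in absolute if is_allowed(url, allowed_domains)]


links = filter_links(
    "https://example.com/products/",
    ["item1", "https://other.org/x", "/about", "mailto:hi@example.com"],
    {"example.com"},
)
print(links)
```

Note that `notexample.com` is rejected because the suffix check requires a leading dot.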
Item Processing
Item Pipelines vs Hooks
# pipelines.py
class MyPipeline:
    def process_item(self, item, spider):
        # Clean price
        if 'price' in item:
            item['price'] = float(item['price'].replace('$', ''))
        return item

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}
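Because Scrapling items are plain dicts, the pipeline logic above can usually be moved into an ordinary function that a hook such as `on_scraped_item()` calls. The hook's exact signature is not shown in this guide, so the sketch below keeps the cleaning step framework-free. `clean_item` is a hypothetical helper, and returning `None` to mean "drop this item" is a convention assumed here, not a documented Scrapling behavior.

```python
from typing import Optional


def clean_item(item: dict) -> Optional[dict]:
    """Normalize a scraped item; return None to signal it should be dropped."""
    price = item.get("price")
    if price is None:
        return item  # nothing to clean
    text = str(price).replace("$", "").replace(",", "").strip()
    try:
        # Return a new dict instead of mutating the caller's item.
        return {**item, "price": float(text)}
    except ValueError:
        return None  # unparseable price (assumed policy: drop the item)


print(clean_item({"name": "Laptop", "price": "$1,299.50"}))
print(clean_item({"name": "Mystery box", "price": "N/A"}))
```

Keeping the logic in a pure function also makes it easy to unit-test without running a crawl.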
Session Management (Middlewares Alternative)
Using Different Session Types
Scrapy uses middlewares for request/response processing. Scrapling uses a session-based architecture:
# Scrapy requires middleware for browser automation
# Usually requires additional libraries like scrapy-playwright
import scrapy
from scrapy_playwright.page import PageMethod

class MySpider(scrapy.Spider):
    name = "browser_spider"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com",
            meta=dict(
                playwright=True,
                playwright_page_methods=[
                    PageMethod("wait_for_selector", "div.content"),
                ],
            ),
        )
Proxy Configuration
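# Shell: Scrapy's built-in HttpProxyMiddleware reads the standard proxy
# environment variables (there is no HTTPS_PROXY key in settings.py)
export https_proxy="http://proxy.example.com:8000"

# Or per request in the spider, via meta
class MySpider(scrapy.Spider):
    name = "proxy_spider"
    start_urls = ['https://example.com']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={'proxy': 'http://proxy.example.com:8000'}
            )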
Pause & Resume
Scrapy requires jobs directory configuration and command-line management. Scrapling makes it simple:
# settings.py
JOBDIR = "crawls/myspider"
# Command line
scrapy crawl myspider
# Press Ctrl+C to pause
# Run again to resume:
scrapy crawl myspider
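Conceptually, pause and resume comes down to persisting the crawl frontier (pending URLs plus the set of URLs already seen) and reloading it on the next run. The sketch below illustrates that idea with a JSON file. It is not how Scrapy's `JOBDIR` or Scrapling's crawl directory actually store state: `Checkpoint`, `crawl`, the `state.json` filename, and the `max_pages` cut-off (which simulates an interruption) are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


class Checkpoint:
    """Persists the crawl frontier to <directory>/state.json (illustrative format)."""

    def __init__(self, directory):
        self.path = Path(directory) / "state.json"

    def load(self):
        if self.path.exists():
            state = json.loads(self.path.read_text())
            return state["pending"], set(state["seen"])
        return None

    def save(self, pending, seen):
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps({"pending": list(pending), "seen": sorted(seen)}))


def crawl(start_urls, links, checkpoint, max_pages=None):
    """Visit pages breadth-first; max_pages simulates stopping early."""
    loaded = checkpoint.load()
    pending, seen = loaded if loaded else (list(start_urls), set(start_urls))
    visited = []
    while pending and (max_pages is None or len(visited) < max_pages):
        url = pending.pop(0)
        visited.append(url)
        for nxt in links.get(url, []):  # `links` stands in for fetched pages
            if nxt not in seen:
                seen.add(nxt)
                pending.append(nxt)
        checkpoint.save(pending, seen)  # state survives an interruption
    return visited


links = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
with tempfile.TemporaryDirectory() as tmp:
    cp = Checkpoint(tmp)
    first_run = crawl(["a"], links, cp, max_pages=2)  # interrupted after two pages
    second_run = crawl(["a"], links, cp)              # resumes from saved state
print(first_run, second_run)  # ['a', 'b'] ['c', 'd']
```

The second run ignores `start_urls` because saved state exists, which mirrors the "run the same command again to resume" workflow described above.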
Lifecycle Hooks
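import scrapy
from scrapy import signals

class MySpider(scrapy.Spider):
    name = "lifecycle"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Setup code

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Handlers only run once they are connected to the signals
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_opened(self, spider):
        self.logger.info('Spider opened')

    def spider_closed(self, spider, reason):
        self.logger.info(f'Spider closed: {reason}')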
Logging
# settings.py
LOG_LEVEL = 'INFO'
LOG_FILE = 'spider.log'

# spider
import scrapy

class MySpider(scrapy.Spider):
    name = "logging"

    def parse(self, response):
        self.logger.info('Processing page')
        self.logger.debug('Debug info')
        self.logger.warning('Warning message')
Selector Syntax
Good news! Scrapling uses the same selector syntax as Scrapy:
# Both work identically
response.css('div.quote span.text::text').get()
response.css('div.quote span.text::text').getall()
response.xpath('//div[@class="quote"]//span[@class="text"]/text()').get()
# Chaining
response.css('div.quote').css('span.text::text').getall()
Streaming Results
Scrapy doesn’t have built-in streaming. Scrapling does:
import asyncio
from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "streaming"
    start_urls = ['https://example.com']

    async def parse(self, response: Response):
        for item in response.css('.item'):
            yield {'title': item.css('h2::text').get()}

async def main():
    spider = MySpider()
    async for item in spider.stream():
        # Process items as they arrive
        print(f"Got item: {item}")
        # Check stats during crawl
        print(f"Progress: {spider.stats.items_scraped} items")

asyncio.run(main())
Complete Migration Example
Here’s a complete Scrapy spider migrated to Scrapling:
import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()

class ProductSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/products']
    custom_settings = {
        'CONCURRENT_REQUESTS': 8,
        'DOWNLOAD_DELAY': 1,
    }

    def parse(self, response):
        for product in response.css('div.product'):
            loader = ItemLoader(item=Product(), selector=product)
            loader.add_css('name', 'h2::text')
            loader.add_css('price', 'span.price::text')
            loader.add_value('url', response.url)
            yield loader.load_item()

        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
Key Advantages of Scrapling
- Modern Async/Await: Native async/await instead of Twisted deferreds
- Simpler Architecture: No need for separate settings.py, items.py, pipelines.py
- Built-in Sessions: Multiple fetcher types (HTTP, browser, stealth) in one spider
- Easy Pause/Resume: Just pass the crawldir parameter
- Real-time Streaming: Stream items as they’re scraped with spider.stream()
- Better Performance: Optimized parsing that’s faster than Scrapy’s Parsel
- Type Hints: Full type coverage for better IDE support
- Simpler API: Less boilerplate, more Pythonic
What Scrapling Doesn’t Have
- No built-in commands system (like scrapy genspider)
- No extensions system (use Python decorators/inheritance)
- No contracts for testing (use standard Python testing)
- Simpler than Scrapy’s full framework approach
Next Steps
Scrapling gives you the power of Scrapy with a modern, simpler API. Happy scraping!