Prerequisites

Make sure you have Scrapling installed. If not, see the Installation guide. For the examples below, you’ll need:
pip install "scrapling[fetchers]"
scrapling install

Your First Scrape

Let’s scrape a simple website using HTTP requests:

1. Import the Fetcher

from scrapling.fetchers import Fetcher

2. Fetch the page

page = Fetcher.get('https://quotes.toscrape.com/')

3. Extract data with CSS selectors

# Get all quotes
quotes = page.css('.quote .text::text').getall()
print(quotes)

# Get all authors
authors = page.css('.quote .author::text').getall()
print(authors)
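
The ::text pseudo-element extracts an element's text content, following Scrapy/Parsel selector syntax. The ::attr(name) pseudo-element, used later in this guide, extracts an attribute value in the same way.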
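
Because the two getall() calls return parallel lists, they can be zipped into records. The sketch below is plain Python and assumes both lists have the same length and order. pair_quotes is a hypothetical helper name, not part of Scrapling, and the sample strings stand in for the quotes and authors lists above.

```python
def pair_quotes(texts, authors):
    """Combine parallel lists of quote texts and author names into dicts."""
    if len(texts) != len(authors):
        raise ValueError("texts and authors must have the same length")
    records = []
    for text, author in zip(texts, authors):
        # Assumption: the site wraps each quote in curly quotation marks,
        # so they are stripped from both ends here.
        records.append({"text": text.strip("“”"), "author": author.strip()})
    return records


# Sample values standing in for the `quotes` and `authors` lists above
sample_texts = ["“Sample quote one.”", "“Sample quote two.”"]
sample_authors = ["Author A", " Author B "]
records = pair_quotes(sample_texts, sample_authors)
print(records)
```

Selecting each .quote element first and querying inside it, as the spider example later in this guide does, avoids relying on the two lists staying aligned.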

Using Sessions

For multiple requests to the same domain, use sessions to maintain cookies and state:
from scrapling.fetchers import FetcherSession

with FetcherSession(impersonate='chrome') as session:
    # First request
    page1 = session.get('https://quotes.toscrape.com/')
    quotes = page1.css('.quote .text::text').getall()
    
    # Follow pagination - cookies maintained
    page2 = session.get('https://quotes.toscrape.com/page/2/')
    more_quotes = page2.css('.quote .text::text').getall()
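
The two requests above generalize to a loop over page numbers. This is a minimal sketch that takes the session as a parameter, so it works with any object exposing get(url) whose result has status and css(). collect_quotes and the /page/N/ URL pattern are assumptions for illustration, based on the pagination URL shown above.

```python
def collect_quotes(session, base_url, last_page):
    """Fetch pages 1..last_page through one session and gather quote texts."""
    collected = []
    for number in range(1, last_page + 1):
        page = session.get(f"{base_url}/page/{number}/")
        if page.status != 200:
            break  # stop at the first page that does not load
        collected.extend(page.css('.quote .text::text').getall())
    return collected


# Hypothetical usage with the session from the example above:
# with FetcherSession(impersonate='chrome') as session:
#     quotes = collect_quotes(session, 'https://quotes.toscrape.com', 3)
```

Passing the session in keeps cookie handling in one place and makes the loop easy to exercise with a stub object.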

Stealthy Scraping

For websites with anti-bot protection, use the StealthyFetcher:

1. Import StealthyFetcher

from scrapling.fetchers import StealthyFetcher

2. Fetch protected pages

# Bypass Cloudflare automatically
page = StealthyFetcher.fetch(
    'https://nopecha.com/demo/cloudflare',
    solve_cloudflare=True,
    headless=True
)

3. Extract data

data = page.css('#padded_content a').getall()
print(page.status)  # 200

The same fetcher can also wait for network activity to settle before returning the page:
from scrapling.fetchers import StealthyFetcher

page = StealthyFetcher.fetch(
    'https://www.browserscan.net/bot-detection',
    headless=True,
    network_idle=True
)
print(f"Status: {page.status}")  # 200
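
Since a browser-based fetch is slower than a plain HTTP request, one common pattern is to try the fast fetcher first and escalate only when the response looks blocked. The sketch below takes both fetchers as callables. fetch_with_fallback and the set of "blocked" status codes are assumptions for illustration, not part of Scrapling.

```python
BLOCKED_STATUSES = frozenset({403, 429, 503})  # assumed signs of bot blocking


def fetch_with_fallback(url, fast_fetch, stealth_fetch, blocked=BLOCKED_STATUSES):
    """Try the cheap fetcher first; retry with the stealth fetcher if blocked."""
    page = fast_fetch(url)
    if page.status in blocked:
        # Escalate only when the fast response looks like a block page
        return stealth_fetch(url)
    return page


# Hypothetical usage with the fetchers shown in this guide:
# page = fetch_with_fallback(
#     'https://example.com',
#     Fetcher.get,
#     lambda u: StealthyFetcher.fetch(u, headless=True),
# )
```

Which status codes indicate blocking varies by site, so the set is left as a parameter.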

Building a Spider

For larger scraping projects, use Scrapling’s spider framework:

1. Create a spider class

from scrapling.spiders import Spider, Response

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    concurrent_requests = 10

2. Define the parse method

    async def parse(self, response: Response):
        # Extract quotes from current page
        for quote in response.css('.quote'):
            yield {
                "text": quote.css('.text::text').get(),
                "author": quote.css('.author::text').get(),
                "tags": quote.css('.tag::text').getall(),
            }
        
        # Follow pagination
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

3. Run the spider

# Run and get results
result = QuotesSpider().start()

print(f"Scraped {len(result.items)} quotes")

# Export to JSON
result.items.to_json("quotes.json")

# Or JSONL
result.items.to_jsonl("quotes.jsonl")
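
To read an exported JSONL file back, the standard library is enough. This sketch assumes the file follows the usual JSON Lines layout (one JSON object per line). load_jsonl is a hypothetical helper, and the demo writes a small stand-in file rather than using real spider output.

```python
import json
import os
import tempfile


def load_jsonl(path):
    """Read a JSON Lines file into a list of Python objects."""
    items = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                items.append(json.loads(line))
    return items


# Demo with a stand-in file; in practice pass the exported "quotes.jsonl"
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.jsonl")
    with open(sample, "w", encoding="utf-8") as handle:
        handle.write('{"text": "Sample", "author": "Author A"}\n')
    loaded = load_jsonl(sample)
    print(loaded)
```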

Complete Spider Example

from scrapling.spiders import Spider, Response

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    concurrent_requests = 10
    
    async def parse(self, response: Response):
        for quote in response.css('.quote'):
            yield {
                "text": quote.css('.text::text').get(),
                "author": quote.css('.author::text').get(),
                "tags": quote.css('.tag::text').getall(),
            }
        
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

# Run the spider
result = QuotesSpider().start()
print(f"Scraped {len(result.items)} quotes")
result.items.to_json("quotes.json")

Spiders support pause/resume, multiple session types, proxy rotation, and streaming mode. See the Spider Documentation for advanced features.

Multi-Session Spider

Use different session types in a single spider, so that a fast HTTP session handles most pages and a stealth browser session is reserved for protected ones:
from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]
    
    def configure_sessions(self, manager):
        # Fast HTTP session for most pages
        manager.add("fast", FetcherSession(impersonate="chrome"))
        # Stealth session for protected pages
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
    
    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            # Route protected pages through stealth session
            if "protected" in link:
                yield Request(link, sid="stealth", callback=self.parse_protected)
            else:
                yield Request(link, sid="fast")
    
    async def parse_protected(self, response: Response):
        # Handle protected pages
        data = response.css('.content::text').get()
        yield {"protected_data": data}

MultiSessionSpider().start()
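
The substring check in parse ("protected" in link) would also match a path such as /blog/unprotected-post. A slightly stricter routing rule can be sketched with the standard library. choose_session and the path prefixes are hypothetical, and the sketch assumes absolute URLs or root-relative paths.

```python
from urllib.parse import urlsplit

PROTECTED_PREFIXES = ("/protected", "/account")  # hypothetical paths


def choose_session(url, protected_prefixes=PROTECTED_PREFIXES):
    """Return the session ID ('stealth' or 'fast') to use for a URL."""
    path = urlsplit(url).path or "/"
    if any(path.startswith(prefix) for prefix in protected_prefixes):
        return "stealth"
    return "fast"


print(choose_session("https://example.com/protected/report"))
print(choose_session("https://example.com/blog/unprotected-post"))
```

The returned string can then be passed as the sid argument when building a Request, as in the spider above.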

Adaptive Scraping

Scrapling can automatically relocate elements when a website's structure changes:

1. Enable adaptive mode

from scrapling.fetchers import StealthyFetcher

StealthyFetcher.adaptive = True

2. Save element locations

page = StealthyFetcher.fetch('https://example.com', headless=True)

# Save element signatures for future use
products = page.css('.product', auto_save=True)

3. Relocate after changes

# Later, if website structure changes
page = StealthyFetcher.fetch('https://example.com', headless=True)

# Automatically find elements using saved signatures
products = page.css('.product', adaptive=True)
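
To build intuition for what "relocating" an element can mean, here is a toy illustration using only the standard library. It is not Scrapling's actual algorithm: it assumes an element can be summarized by its tag, classes, and text, and it picks the candidate most similar to a saved summary. All names and sample data are hypothetical.

```python
from difflib import SequenceMatcher


def signature(tag, classes, text):
    """Flatten an element's traits into one comparable string."""
    return f"{tag}|{' '.join(sorted(classes))}|{text.strip().lower()}"


def relocate(saved, candidates):
    """Return the candidate whose signature is most similar to the saved one."""
    def score(candidate):
        return SequenceMatcher(None, saved, signature(*candidate)).ratio()
    return max(candidates, key=score)


# Signature saved before a redesign
saved = signature("div", ["product"], "Blue Widget $10")

# Elements found after the redesign: the tag and classes changed, the text did not
candidates = [
    ("nav", ["menu"], "Home About Contact"),
    ("section", ["item", "card"], "Blue Widget $10"),
    ("footer", ["site-footer"], "Copyright"),
]
best = relocate(saved, candidates)
print(best)
```

A real implementation would weigh many more properties (attributes, position in the tree, neighbouring elements), but the idea of scoring candidates against a stored fingerprint is the same.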

Scrapling offers several ways to select elements and to navigate between them:
from scrapling.fetchers import Fetcher

page = Fetcher.get('https://quotes.toscrape.com/')

# Multiple selection methods
quotes = page.css('.quote')                              # CSS selector
quotes = page.xpath('//div[@class="quote"]')           # XPath
quotes = page.find_all('div', class_='quote')           # BeautifulSoup-style
quotes = page.find_by_text('quote', tag='div')          # Text search

# Element navigation
first_quote = page.css('.quote')[0]  # index a CSS result list explicitly
author = first_quote.css('.author::text').get()
parent = first_quote.parent
children = first_quote.children
siblings = first_quote.siblings

# Find similar elements
similar = first_quote.find_similar()
See Selection Methods for comprehensive selector documentation.

Command Line Usage

Scrape without writing code:
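scrapling shell

When a page is extracted to a file from the command line (the scrapling extract commands), the output file's extension selects the format: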
  • .txt extension extracts text content
  • .md extension extracts Markdown representation
  • .html extension extracts raw HTML
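
The extension rule in the list above can be expressed as a small lookup. This is only a sketch of the rule, not the CLI's implementation. output_format and the format labels are hypothetical names.

```python
from pathlib import Path

# Extension-to-format rule from the list above
FORMATS = {".txt": "text", ".md": "markdown", ".html": "html"}


def output_format(filename):
    """Map an output filename to the content format its extension selects."""
    suffix = Path(filename).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f"unsupported extension: {suffix!r}")
    return FORMATS[suffix]


print(output_format("content.md"))
print(output_format("page.HTML"))
```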

Next Steps

You’re now ready to explore Scrapling’s advanced features:

  • Selection Methods: Master CSS, XPath, regex, and text search
  • Choose Your Fetcher: Learn when to use each fetcher type
  • Build Advanced Spiders: Concurrent crawls with pause/resume
  • Proxy Rotation: Built-in proxy rotation strategies
  • Interactive Shell: Speed up development with IPython
  • MCP Server: AI-assisted web scraping

Common Patterns

Handling pagination

from scrapling.fetchers import Fetcher

page = Fetcher.get('https://quotes.toscrape.com/')

while True:
    # Extract data from current page
    quotes = page.css('.quote .text::text').getall()
    print(quotes)
    
    # Check for next page
    next_link = page.css('.next a::attr(href)').get()
    if not next_link:
        break
    
    # Fetch next page
    page = Fetcher.get(f'https://quotes.toscrape.com{next_link}')
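
Concatenating the host and next_link works here because the site uses root-relative links such as /page/2/. For sites that use page-relative or absolute links, resolving the href against the current page URL is more robust. A minimal sketch with the standard library follows. next_page_url is a hypothetical helper name.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Resolve a 'next' link against the page it was found on."""
    return urljoin(current_url, href) if href else None


# Root-relative link, as on quotes.toscrape.com
print(next_page_url("https://quotes.toscrape.com/page/1/", "/page/2/"))
# https://quotes.toscrape.com/page/2/

# Page-relative link on a hypothetical site
print(next_page_url("https://example.com/list/page1.html", "page2.html"))
# https://example.com/list/page2.html
```

Returning None when there is no href lets the loop's existing "if not next_link: break" check stay as it is.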

Extracting attributes

# Get href attributes
links = page.css('a::attr(href)').getall()

# Get data attributes
product_ids = page.css('.product::attr(data-id)').getall()

# Get multiple attributes
for link in page.css('a'):
    url = link.attrib.get('href')
    title = link.attrib.get('title')

Working with JSON

# Extract JSON from script tags
json_data = page.css('script#data::text').get()

# Parse JSON attributes
schema = page.css('[schema]')[0].attrib['schema'].json()  # take the first match, then parse its attribute

# Extract all text as JSON-ready
data = {
    "title": page.css('h1::text').get(),
    "price": page.css('.price::text').get(),
    "description": page.css('.description::text').get(),
}
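
The script-tag text extracted above is still a string. A defensive way to turn it into Python data, sketched with the standard library, is shown below. parse_embedded_json is a hypothetical helper that returns a default when the selector matched nothing or the content is not valid JSON.

```python
import json


def parse_embedded_json(raw, default=None):
    """Parse JSON text taken from a page; fall back to `default` on failure."""
    if not raw:
        return default  # selector matched nothing or the tag was empty
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return default


# Sample strings standing in for the `json_data` value above
parsed = parse_embedded_json('{"id": 7, "tags": ["a", "b"]}')
print(parsed["tags"])
print(parse_embedded_json(None, default={}))
print(parse_embedded_json("not json", default={}))
```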

Error handling

from scrapling.fetchers import Fetcher

try:
    page = Fetcher.get('https://example.com', timeout=10)
    
    if page.status != 200:
        print(f"Error: Status {page.status}")
    else:
        data = page.css('.content::text').get()
        
except Exception as e:
    print(f"Failed to fetch: {e}")
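
Transient failures are often worth retrying before giving up. The sketch below wraps any fetch callable with exponential backoff. fetch_with_retries, its parameters, and the choice to retry on 5xx statuses are assumptions for illustration, not a Scrapling feature.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on exceptions or 5xx statuses with backoff."""
    last_error = None
    for attempt in range(attempts):
        try:
            page = fetch(url)
            if page.status < 500:
                return page  # success, or a client error not worth retrying
            last_error = RuntimeError(f"server returned {page.status}")
        except Exception as exc:  # mirrors the broad catch in the example above
            last_error = exc
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise last_error


# Hypothetical usage with the fetcher from the example above:
# page = fetch_with_retries(lambda u: Fetcher.get(u, timeout=10), 'https://example.com')
```

The sleep function is a parameter so the backoff schedule can be checked without actually waiting.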
