
Overview

Adaptive parsing is Scrapling’s innovative feature that makes your scrapers resilient to website structure changes. Instead of breaking when a website updates its HTML, Scrapling can automatically relocate elements based on their unique characteristics.
It uses a similarity algorithm to match elements even when selectors change, making your scrapers more maintainable and reliable.

How It Works

Adaptive parsing works through the following steps (a usage sketch follows the list):
  1. Saving element signatures (tag, attributes, text, parent structure, siblings)
  2. Storing these signatures with an identifier
  3. Relocating elements when selectors fail by comparing stored signatures with current elements
  4. Scoring candidates based on similarity percentage
  5. Returning the best matches above a threshold
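In practice, an element's signature is saved on a first successful run and used to relocate the element on later runs when the original selector stops matching. A minimal sketch of that lifecycle (the URL and selector are placeholders):
from scrapling import Fetcher

# First run: the selector still matches, so the signature is saved under the identifier
page = Fetcher.fetch('https://example.com', adaptive=True)
button = page.css('button.buy-now', identifier='buy-button', auto_save=True)

# Later run: if 'button.buy-now' no longer matches, the saved signature is compared
# against the current elements and the best candidates above the threshold are returned
page = Fetcher.fetch('https://example.com', adaptive=True)
button = page.css('button.buy-now', identifier='buy-button', adaptive=True, percentage=50)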

Enabling Adaptive Mode

Enable adaptive parsing when creating a Selector:
from scrapling import Fetcher

# Fetch with adaptive mode enabled
page = Fetcher.fetch('https://example.com', adaptive=True)

# Or create Selector with adaptive mode
from scrapling import Selector

page = Selector(
    html_content,
    url='https://example.com',
    adaptive=True
)
The relevant Selector parameters are:
  • adaptive (bool, default: False): Globally enable adaptive features for all selector methods.
  • storage (StorageSystemMixin, default: SQLiteStorageSystem): The storage class to use for saving element signatures. Must be wrapped with the lru_cache decorator.
  • storage_args (Dict): Arguments to pass to the storage class constructor.
The adaptive parameter must be set during initialization. It cannot be changed later and takes priority over all adaptive-related arguments in selector methods.

Basic Usage

Auto-Save Mode

Automatically save element signatures when first found:
page = Fetcher.fetch('https://example.com', adaptive=True)

# First run: finds element and saves it automatically
product = page.css('.product-card', identifier='main-product', auto_save=True)

# Future runs: if '.product-card' fails, uses saved signature to relocate
product = page.css('.product-card', identifier='main-product', adaptive=True, auto_save=True)

Manual Save and Retrieve

Explicitly control when to save and retrieve:
page = Fetcher.fetch('https://example.com', adaptive=True)

# Find and save element
product = page.css('.product-card').first
if product:
    page.save(product, identifier='main-product')

# Later, on a changed page
page_new = Fetcher.fetch('https://example.com', adaptive=True)

# Try with adaptive mode
product = page_new.css('.product-card', identifier='main-product', adaptive=True)

# Or manually relocate
element_data = page_new.retrieve('main-product')
if element_data:
    matches = page_new.relocate(element_data, percentage=70)

Selector Methods with Adaptive Support

Both css() and xpath() support adaptive parameters.

css() with Adaptive

def css(
    selector: str,
    identifier: str = "",
    adaptive: bool = False,
    auto_save: bool = False,
    percentage: int = 0,
) -> Selectors
  • identifier (str, default: ""): Unique identifier for saving/retrieving element data. If not provided, the selector string is used. Always use explicit identifiers when you plan to change selectors in the future.
  • adaptive (bool, default: False): Enable adaptive relocation for this specific selector call.
  • auto_save (bool, default: False): Automatically save the first matched element with the identifier.
  • percentage (int, default: 0): Minimum similarity percentage required when relocating (0-100). Higher values are more strict.
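For example, a call that combines an explicit identifier with a stricter threshold (the selector and identifier are placeholders):
# Finds '.price-current' normally; if it fails, falls back to the signature
# saved under 'product-price' and only accepts matches scoring at least 60%
price = page.css(
    '.price-current',
    identifier='product-price',
    adaptive=True,
    auto_save=True,
    percentage=60,
)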

xpath() with Adaptive

def xpath(
    selector: str,
    identifier: str = "",
    adaptive: bool = False,
    auto_save: bool = False,
    percentage: int = 0,
    **kwargs: Any,
) -> Selectors
Accepts the same adaptive parameters as css(), plus:
  • **kwargs (Any): Additional keyword arguments passed as XPath variables.
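A sketch of an adaptive xpath() call that also passes an XPath variable through **kwargs (the selector, identifier, and variable name are illustrative):
# $cls is substituted from the keyword argument below
rows = page.xpath(
    '//div[@class=$cls]//span',
    identifier='product-rows',
    adaptive=True,
    auto_save=True,
    cls='product-row',
)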

Core Adaptive Methods

save()

Save an element’s signature to storage.
def save(element: HtmlElement, identifier: str) -> None
  • element (HtmlElement | Selector, required): The element to save. Can be a Selector or raw HtmlElement.
  • identifier (str, required): Unique identifier for retrieving the element later.
# Save element for later relocation
product = page.css('.product').first
if product:
    page.save(product, 'main-product')
    
# Save with descriptive identifier
price = page.css('.price').first
if price:
    page.save(price, 'product-price-v1')

retrieve()

Retrieve a saved element’s signature from storage.
def retrieve(identifier: str) -> Optional[Dict[str, Any]]
  • identifier (str, required): The identifier used when saving the element.
Returns a dictionary containing:
  • tag: Element tag name
  • text: Element text content
  • attributes: Element attributes
  • path: Element’s path in the DOM tree
  • parent_name: Parent element’s tag name
  • parent_attribs: Parent element’s attributes
  • parent_text: Parent element’s text
  • siblings: Information about sibling elements
# Retrieve saved element data
element_data = page.retrieve('main-product')

if element_data:
    print(f"Saved tag: {element_data['tag']}")
    print(f"Saved attributes: {element_data['attributes']}")

relocate()

Find elements matching a saved signature.
def relocate(
    element: Union[Dict, HtmlElement, Selector],
    percentage: int = 0,
    selector_type: bool = False,
) -> Union[List[HtmlElement], Selectors]
  • element (Dict | HtmlElement | Selector, required): The element signature to search for. Usually a dictionary from retrieve().
  • percentage (int, default: 0): Minimum similarity percentage (0-100). Only elements scoring above this are returned. The percentage calculation depends on page structure, so start with low values (0-30) and increase if needed.
  • selector_type (bool, default: False): If True, return results as a Selectors object instead of a raw HtmlElement list.
# Manual relocation workflow
element_data = page.retrieve('main-product')

if element_data:
    # Get matches as raw elements
    matches = page.relocate(element_data, percentage=70)
    
    # Or as Selectors
    matches = page.relocate(element_data, percentage=70, selector_type=True)
    
    if matches:
        print(f"Found {len(matches)} matches")
        best_match = matches[0] if isinstance(matches, list) else matches.first

Similarity Scoring

Scrapling calculates similarity based on multiple factors:

Scoring Factors

  1. Tag Name Match (exact match)
  2. Text Similarity (using SequenceMatcher)
  3. Attributes Similarity (keys and values)
  4. Class, ID, Href, Src (separate scoring for important attributes)
  5. Path Similarity (DOM tree path)
  6. Parent Structure (parent tag, attributes, text)
  7. Siblings Information (surrounding elements)

How Similarity is Calculated

# Internal scoring algorithm (simplified)
from difflib import SequenceMatcher

score = 0
checks = 0

# Exact tag match
score += 1 if same_tag else 0
checks += 1

# Text similarity
if has_text:
    score += SequenceMatcher(None, original_text, current_text).ratio()
    checks += 1

# Attribute similarity
score += calculate_dict_similarity(original_attrs, current_attrs)
checks += 1

# Important attributes (class, id, href, src)
for attr in ['class', 'id', 'href', 'src']:
    if attr in original:
        score += SequenceMatcher(None, original[attr], current[attr]).ratio()
        checks += 1

# Path similarity
score += SequenceMatcher(None, original_path, current_path).ratio()
checks += 1

# Parent and siblings...

final_score = (score / checks) * 100  # Convert to percentage
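The attribute step above relies on a dictionary-similarity helper. Scrapling's real implementation may differ; as an illustration only, such a helper could average per-key value similarity like this:
from difflib import SequenceMatcher

def calculate_dict_similarity(original: dict, current: dict) -> float:
    """Illustrative stand-in: score two attribute dicts between 0 and 1."""
    keys = set(original) | set(current)
    if not keys:
        return 1.0  # two empty dicts are identical
    total = 0.0
    for key in keys:
        if key in original and key in current:
            # Key present in both dicts: compare the values as strings
            total += SequenceMatcher(None, str(original[key]), str(current[key])).ratio()
        # Keys missing from either side contribute nothing
    return total / len(keys)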

Practical Examples

Example 1: Product Scraper

from scrapling import Fetcher

# Initial scraper development
page = Fetcher.fetch('https://shop.example.com/product/123', adaptive=True)

# Find and save key elements with auto_save
product_name = page.css(
    '.product-title',
    identifier='product-name',
    adaptive=True,
    auto_save=True
)

price = page.css(
    '.price-current',
    identifier='product-price',
    adaptive=True,
    auto_save=True
)

description = page.css(
    '.product-description',
    identifier='product-desc',
    adaptive=True,
    auto_save=True
)

# Later, even if website changes CSS classes:
# New page structure: .product-title → .prod-name
page_new = Fetcher.fetch('https://shop.example.com/product/123', adaptive=True)

# These will still work using adaptive relocation
product_name = page_new.css(
    '.product-title',  # Old selector
    identifier='product-name',
    adaptive=True,
    auto_save=True  # Updates saved signature
)

if product_name:
    print(product_name.first.text)

Example 2: News Article Scraper

class NewsScraper:
    def __init__(self, url):
        self.page = Fetcher.fetch(url, adaptive=True)
    
    def extract_article(self):
        """Extract article with adaptive fallbacks"""
        
        # Try primary selector, save if found
        title = self.page.css(
            'h1.article-title',
            identifier='article-title',
            adaptive=True,
            auto_save=True
        )
        
        # Try primary selector with adaptive fallback
        author = self.page.css(
            '.author-name',
            identifier='article-author',
            adaptive=True,
            auto_save=True
        )
        
        # Content with higher threshold for accuracy
        content = self.page.css(
            '.article-content',
            identifier='article-content',
            adaptive=True,
            auto_save=True,
            percentage=50  # Require 50% similarity
        )
        
        return {
            'title': title.first.text if title else 'N/A',
            'author': author.first.text if author else 'Unknown',
            'content': content.first.get_all_text() if content else ''
        }

# Use the scraper
scraper = NewsScraper('https://news.example.com/article/123')
article = scraper.extract_article()

Example 3: Monitoring Website Changes

import logging
from scrapling import Fetcher

logging.basicConfig(level=logging.DEBUG)

def monitor_element(url, identifier, current_selector):
    """Monitor if element can still be found"""
    page = Fetcher.fetch(url, adaptive=True)
    
    # Try to find element
    result = page.css(
        current_selector,
        identifier=identifier,
        adaptive=True,
        auto_save=True,
        percentage=30
    )
    
    if result:
        # Check if it was found via adaptive mode
        element_data = page.retrieve(identifier)
        if element_data:
            # Compare original selector with what we found
            original_attrs = element_data.get('attributes', {})
            current_attrs = result.first.attrib
            
            if original_attrs == current_attrs:
                print(f"✓ Element found with original selector")
            else:
                print(f"⚠ Element found via adaptive mode")
                print(f"  Original: {original_attrs}")
                print(f"  Current: {dict(current_attrs)}")
                return 'CHANGED'
    else:
        print(f"✗ Element not found at all")
        return 'NOT_FOUND'
    
    return 'OK'

# Monitor key elements
status = monitor_element(
    'https://example.com',
    'main-cta-button',
    'button.cta-primary'
)

if status == 'CHANGED':
    print("Website structure has changed, update selectors!")

Example 4: Multi-Page Scraper with Adaptive

from scrapling import Fetcher
import time

class ProductCatalogScraper:
    def __init__(self, base_url):
        self.base_url = base_url
        self.setup_selectors()
    
    def setup_selectors(self):
        """Initialize and save selectors from first page"""
        page = Fetcher.fetch(self.base_url, adaptive=True)
        
        # Find and save product card structure
        first_product = page.css('.product-item').first
        if first_product:
            page.save(first_product, 'product-card-template')
            
            # Save internal elements relative to product card
            name = first_product.css('.product-name').first
            if name:
                page.save(name, 'product-name-in-card')
            
            price = first_product.css('.price').first
            if price:
                page.save(price, 'product-price-in-card')
    
    def scrape_page(self, page_num):
        """Scrape a single page using adaptive selectors"""
        url = f"{self.base_url}?page={page_num}"
        page = Fetcher.fetch(url, adaptive=True)
        
        # Find products using adaptive mode
        card_data = page.retrieve('product-card-template')
        if card_data:
            products = page.relocate(
                card_data,
                percentage=40,
                selector_type=True
            )
            
            results = []
            for product in products:
                # Extract data from each product
                name_elem = product.css('.product-name', adaptive=True)
                price_elem = product.css('.price', adaptive=True)
                
                results.append({
                    'name': name_elem.first.text if name_elem else None,
                    'price': price_elem.first.text if price_elem else None,
                    'url': product.css('a').first['href'] if product.css('a') else None
                })
            
            return results
        return []
    
    def scrape_all(self, max_pages=10):
        """Scrape all pages"""
        all_products = []
        
        for page_num in range(1, max_pages + 1):
            print(f"Scraping page {page_num}...")
            products = self.scrape_page(page_num)
            
            if not products:
                print(f"No products found on page {page_num}, stopping")
                break
            
            all_products.extend(products)
            time.sleep(1)  # Be nice to the server
        
        return all_products

# Usage
scraper = ProductCatalogScraper('https://shop.example.com/products')
all_products = scraper.scrape_all(max_pages=5)
print(f"Scraped {len(all_products)} products")

Best Practices

Use Descriptive Identifiers

Use clear, versioned identifiers like 'product-price-v1' instead of relying on selectors as identifiers.
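For example (placeholder selectors):
# Without an identifier, the selector string itself is the key,
# so editing the selector also loses the link to the saved signature
page.css('.price', auto_save=True)

# A descriptive, versioned identifier keeps working after the selector changes
page.css('.price', identifier='product-price-v1', auto_save=True)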

Start with Auto-Save

Use auto_save=True during development to automatically build your element database.

Tune Percentage Carefully

Start with low percentage values (0-30) and increase only if you get too many false positives.

Monitor Adaptive Usage

Enable debug logging to see when adaptive mode is being used vs. direct selectors.
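For example, assuming Scrapling uses module-based logger names (i.e. loggers under the 'scrapling' namespace), you can scope debug output to the library:
import logging

logging.basicConfig(level=logging.INFO)
# Assumption: Scrapling's loggers live under the 'scrapling' namespace
logging.getLogger('scrapling').setLevel(logging.DEBUG)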

Custom Storage Backend

You can implement custom storage backends by extending StorageSystemMixin:
from functools import lru_cache
from scrapling.core.storage import StorageSystemMixin

@lru_cache(maxsize=128)
class RedisStorage(StorageSystemMixin):
    def __init__(self, redis_url, url, **kwargs):
        super().__init__(url=url)
        self.redis_client = connect_to_redis(redis_url)  # placeholder: your Redis client setup
    
    def save(self, element, identifier):
        # Implement save to Redis
        pass
    
    def retrieve(self, identifier):
        # Implement retrieve from Redis
        pass

# Use custom storage
page = Selector(
    html,
    adaptive=True,
    storage=RedisStorage,
    storage_args={'redis_url': 'redis://localhost:6379'}
)
Custom storage classes must:
  1. Be wrapped with @lru_cache decorator
  2. Inherit from StorageSystemMixin
  3. Accept url parameter

Limitations

  • Adaptive mode requires elements to have been saved before relocation
  • Very similar elements on the same page may cause false positives
  • Completely restructured pages may fall below similarity thresholds
  • Text nodes cannot be saved (their parent element is saved instead)

Troubleshooting

Element Not Found Even with Adaptive

# Enable debug logging to see similarity scores
import logging
logging.basicConfig(level=logging.DEBUG)

# Lower the percentage threshold
result = page.css(
    '.old-selector',
    identifier='my-element',
    adaptive=True,
    percentage=0  # Accept any match
)

# Check what was saved
element_data = page.retrieve('my-element')
print("Saved signature:", element_data)

Too Many False Positives

# Increase the percentage threshold
result = page.css(
    '.selector',
    identifier='my-element',
    adaptive=True,
    percentage=60  # Require 60% similarity
)

# Or use more specific identifiers
result = page.css(
    '.selector',
    identifier='page-product-listing-price-2024-01',
    adaptive=True
)

Updating Saved Elements

# Re-save with updated structure
new_element = page.css('.new-selector').first
if new_element:
    # Overwrites previous save
    page.save(new_element, 'my-element')
