
Overview

Scrapling provides multiple selection methods for finding elements in HTML documents: CSS3 selectors, XPath expressions, and text-content searches.

CSS Selectors

Search the DOM tree using CSS3 selectors.

from scrapling import Fetcher

page = Fetcher.fetch('https://example.com')

# Find elements with CSS
links = page.css('a.nav-link')
headers = page.css('h1, h2, h3')

Method Signature

def css(
    selector: str,
    identifier: str = "",
    adaptive: bool = False,
    auto_save: bool = False,
    percentage: int = 0,
) -> Selectors

  • selector (str, required): The CSS3 selector to be used.
  • identifier (str, default: ""): A string used to save/retrieve the element’s data in adaptive mode. If not provided, the selector itself is used as the identifier. Setting identifier is recommended if you plan to use a different selector later and still want to relocate the same element(s).
  • adaptive (bool, default: False): When enabled, the method tries to relocate the element if it was saved before.
  • auto_save (bool, default: False): Automatically save new elements for adaptive mode later.
  • percentage (int, default: 0): The minimum similarity percentage to accept while adaptive mode is working. The percentage calculation depends on the page structure.

Examples

# Select by class
products = page.css('.product-card')

# Select by ID
header = page.css('#main-header')

# Complex selectors
active_links = page.css('nav a.active[href*="/products/"]')

XPath Selectors

Search the DOM tree using XPath expressions. XPath provides more powerful querying capabilities than CSS.

Method Signature

def xpath(
    selector: str,
    identifier: str = "",
    adaptive: bool = False,
    auto_save: bool = False,
    percentage: int = 0,
    **kwargs: Any,
) -> Selectors

  • selector (str, required): The XPath selector to be used.
  • identifier (str, default: ""): A string used to save/retrieve the element’s data in adaptive mode. If not provided, the selector itself is used as the identifier.
  • adaptive (bool, default: False): When enabled, the method tries to relocate the element if it was saved before.
  • auto_save (bool, default: False): Automatically save new elements for adaptive mode later.
  • percentage (int, default: 0): The minimum similarity percentage to accept while adaptive mode is working.
  • **kwargs (Any): Additional keyword arguments are passed as XPath variables into the XPath expression.

Examples

# Find all links
links = page.xpath('//a')

# Find elements by attribute
products = page.xpath('//div[@class="product"]')

# Complex XPath
titles = page.xpath('//article//h2[contains(@class, "title")]')

Find Methods

Find elements using flexible filters including tag names, attributes, regex patterns, and custom functions.

find_all()

Find all elements matching the specified criteria.

def find_all(
    *args: str | Iterable[str] | Pattern | Callable | Dict[str, str],
    **kwargs: str,
) -> Selectors

  • args (str | Iterable[str] | Pattern | Callable | Dict[str, str]): Positional filters, any of:
      • tag name(s) as strings
      • an iterable of tag names
      • regex patterns to match against text content
      • a callable that takes a Selector and returns a bool
      • a dictionary of attribute name-value pairs
  • kwargs (str): Attribute names and their values to filter elements by. Use class_ for the class attribute and for_ for the for attribute.

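
Because class and for are reserved words in Python, they cannot be used as keyword arguments directly; the trailing underscore follows the usual Python convention. A minimal sketch of the kind of normalization a library might perform internally (the helper name is hypothetical, not Scrapling's actual code):

```python
def normalize_attr_kwargs(kwargs: dict) -> dict:
    """Map Pythonic kwarg names (class_, for_) back to HTML attribute names."""
    return {
        ("class" if k == "class_" else "for" if k == "for_" else k): v
        for k, v in kwargs.items()
    }

print(normalize_attr_kwargs({"class_": "main", "for_": "email", "id": "nav"}))
# {'class': 'main', 'for': 'email', 'id': 'nav'}
```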
# Find all div elements
divs = page.find_all('div')

# Find multiple tag types
headings = page.find_all('h1', 'h2', 'h3')

# Using iterable
tags = ['article', 'section']
elements = page.find_all(tags)
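
To build intuition for how mixed positional filters could be told apart, here is a plain-Python sketch that dispatches on filter type, mirroring the signature above. This is hypothetical logic with simplified inputs, not Scrapling's implementation; in the real API, a callable filter receives a Selector object.

```python
import re
from collections.abc import Iterable

def criterion_matches(tag: str, text: str, attrs: dict, criterion) -> bool:
    """Dispatch one positional filter by its type (illustrative only)."""
    if isinstance(criterion, str):          # single tag name
        return tag == criterion
    if isinstance(criterion, re.Pattern):   # regex against text content
        return bool(criterion.search(text))
    if isinstance(criterion, dict):         # attribute name-value pairs
        return all(attrs.get(k) == v for k, v in criterion.items())
    if callable(criterion):                 # custom predicate
        return bool(criterion(tag, text, attrs))
    if isinstance(criterion, Iterable):     # iterable of tag names
        return tag in criterion
    return False

# A <div class="product">$9.99</div> stand-in:
assert criterion_matches("div", "$9.99", {"class": "product"}, "div")
assert criterion_matches("div", "$9.99", {}, re.compile(r"\$\d+"))
assert criterion_matches("div", "", {"class": "product"}, {"class": "product"})
```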

find()

Find the first element matching the criteria, or return None.

def find(
    *args: str | Iterable[str] | Pattern | Callable | Dict[str, str],
    **kwargs: str,
) -> Optional[Selector]

Accepts the same parameters as find_all(), but returns only the first match.

# Find first matching element
header = page.find('header', class_="main")

if header:
    print(header.text)
else:
    print("Header not found")

Find elements by their text content.

find_by_text()

Find elements with matching text content.

def find_by_text(
    text: str,
    first_match: bool = True,
    partial: bool = False,
    case_sensitive: bool = False,
    clean_match: bool = True,
) -> Selector | Selectors

  • text (str, required): The text query to match.
  • first_match (bool, default: True): Return only the first element that matches the conditions.
  • partial (bool, default: False): If enabled, return elements whose text contains the input text.
  • case_sensitive (bool, default: False): If enabled, letter case is taken into consideration while matching.
  • clean_match (bool, default: True): If enabled, all whitespace and consecutive spaces are ignored while matching.

# Find element with exact text
button = page.find_by_text('Submit', first_match=True)
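
clean_match-style comparison can be pictured as collapsing runs of whitespace before matching. A rough stdlib sketch of that idea (illustrative, not Scrapling's exact cleaning logic):

```python
import re

def clean(text: str) -> str:
    """Collapse consecutive whitespace and trim, roughly what clean_match implies."""
    return re.sub(r"\s+", " ", text).strip()

# '  Submit \n form ' and 'Submit form' compare equal after cleaning
assert clean("  Submit \n form ") == clean("Submit form")
```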

find_by_regex()

Find elements whose text content matches a regex pattern.

def find_by_regex(
    query: str | Pattern[str],
    first_match: bool = True,
    case_sensitive: bool = False,
    clean_match: bool = True,
) -> Selector | Selectors

  • query (str | Pattern[str], required): The regex query/pattern to match.
  • first_match (bool, default: True): Return only the first element that matches the conditions.
  • case_sensitive (bool, default: False): If enabled, letter case is taken into consideration while matching.
  • clean_match (bool, default: True): If enabled, all whitespace and consecutive spaces are ignored while matching.

import re

# Find prices
price = page.find_by_regex(r'\$\d+\.\d{2}')

# Find all email addresses
emails = page.find_by_regex(
    r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
    first_match=False
)

# Case-sensitive pattern
code = page.find_by_regex(r'[A-Z]{3}-\d{4}', case_sensitive=True)
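
The patterns above can be verified with plain re, independent of any page:

```python
import re

# Price pattern: dollar sign, digits, exactly two decimal places
assert re.search(r"\$\d+\.\d{2}", "Total: $19.99 today").group() == "$19.99"

# Email pattern applied to a string with two addresses
assert re.findall(
    r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
    "Contact a@example.com or b@test.org",
) == ["a@example.com", "b@test.org"]

# Code pattern: three uppercase letters, a hyphen, four digits
assert re.search(r"[A-Z]{3}-\d{4}", "ref ABC-1234") is not None
```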

Advanced: Find Similar Elements

Find elements that are similar to the current element based on structure and attributes.

def find_similar(
    similarity_threshold: float = 0.2,
    ignore_attributes: List | Tuple = ("href", "src"),
    match_text: bool = False,
) -> Selectors

  • similarity_threshold (float, default: 0.2): The percentage threshold for attribute matching. Elements are pre-filtered by depth, tag name, and parent structure before attributes are compared.
  • ignore_attributes (List | Tuple, default: ("href", "src")): Attribute names to ignore while matching. URL attributes are ignored by default because they often differ between otherwise similar elements.
  • match_text (bool, default: False): If enabled, element text content is included in the similarity calculation.

This function is inspired by AutoScraper and is useful for finding repeated patterns, such as product cards in a listing.

# Find one product card
first_product = page.css('.product').first

# Find all similar product cards
all_products = first_product.find_similar(similarity_threshold=0.3)

for product in all_products:
    print(product.css('.title').text)
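
For intuition about similarity_threshold, here is a toy attribute-overlap score in the spirit of the description above: a Jaccard-style measure over attribute name-value pairs with URL attributes ignored. Scrapling's actual calculation may differ.

```python
def attribute_similarity(a: dict, b: dict, ignore=("href", "src")) -> float:
    """Share of attribute name-value pairs two elements have in common."""
    a_items = {(k, v) for k, v in a.items() if k not in ignore}
    b_items = {(k, v) for k, v in b.items() if k not in ignore}
    if not a_items and not b_items:
        return 1.0
    return len(a_items & b_items) / len(a_items | b_items)

card1 = {"class": "product", "data-id": "1", "href": "/p/1"}
card2 = {"class": "product", "data-id": "2", "href": "/p/2"}
# 'href' is ignored; only 'class' matches out of three distinct pairs
print(attribute_similarity(card1, card2))
```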

Selectors vs Selector

  • Selector: Represents a single element
  • Selectors: A list-like container of multiple Selector objects
Both classes have similar methods, with Selectors applying operations across all contained elements:

# Single element (Selector)
element = page.css('.container').first
text = element.text  # TextHandler

# Multiple elements (Selectors)
elements = page.css('.item')
texts = elements.getall()  # List of TextHandler
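
The single/multiple split can be sketched with a toy pair of classes. This is purely illustrative: the real Selector and Selectors classes carry far more functionality.

```python
class Selector:
    """Toy single-element stand-in exposing .text."""
    def __init__(self, text: str):
        self.text = text

class Selectors(list):
    """Toy list-like container that broadcasts an operation over its items."""
    def getall(self) -> list:
        return [sel.text for sel in self]

items = Selectors([Selector("first"), Selector("second")])
print(items.getall())  # ['first', 'second']
```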
