The Selector class is the core component of Scrapling that wraps HTML content and provides powerful methods for selecting and extracting data using CSS, XPath, or text-based queries.

Constructor

content
str | bytes
required
HTML content as either string or bytes
url
str
default:""
Store a URL with the HTML data for retrieving later
encoding
str
default:"utf-8"
The encoding type used in HTML parsing
huge_tree
bool
default:"true"
Should always be enabled when parsing large HTML documents. Controls the libxml2 feature that refuses to parse certain large documents as a protection against memory exhaustion
root
HtmlElement
default:"None"
Used internally to pass etree objects. Don’t use unless you know what you’re doing
keep_comments
bool
default:"false"
Whether to keep comments while parsing the HTML body. Comments are dropped by default
keep_cdata
bool
default:"false"
Whether to keep CDATA while parsing the HTML body. CDATA is dropped by default
adaptive
bool
default:"false"
Globally enables or disables the adaptive feature. When False, adaptive is turned off in all functions; this setting takes priority over all adaptive-related arguments/functions in the class
storage
Any
default:"SQLiteStorageSystem"
The storage class to be passed for adaptive functionalities
storage_args
Dict
default:"None"
A dictionary of argument-value pairs to be passed for the storage class. If empty, default values will be used

Properties

tag

@property
def tag(self) -> str
Get the tag name of the element. Returns: The element’s tag name as a string, or "#text" for text nodes

text

@property
def text(self) -> TextHandler
Get text content of the element. Returns: A TextHandler object containing the element’s text

attrib

@property
def attrib(self) -> AttributesHandler
Get attributes of the element. Returns: An AttributesHandler object containing the element’s attributes

html_content

@property
def html_content(self) -> TextHandler
Return the inner HTML code of the element. Returns: A TextHandler containing the element’s inner HTML

body

@property
def body(self) -> str | bytes
Return the raw body of the current Selector without any processing. Useful for binary and non-HTML requests. Returns: The raw body content

parent

@property
def parent(self) -> Optional[Selector]
Return the direct parent of the element. Returns: The parent Selector or None if there is no parent

children

@property
def children(self) -> Selectors
Return the children elements of the current element. Returns: A Selectors object containing child elements, or empty list if none

siblings

@property
def siblings(self) -> Selectors
Return other children of the current element’s parent. Returns: A Selectors object containing sibling elements, or empty list if none

below_elements

@property
def below_elements(self) -> Selectors
Return all elements under the current element in the DOM tree. Returns: A Selectors object containing all descendant elements

path

@property
def path(self) -> Selectors
Returns a list of type Selectors that contains the path leading to the current element from the root. Returns: A Selectors object representing the element’s path

next

@property
def next(self) -> Optional[Selector]
Returns the next element of the current element in the children of the parent. Returns: The next Selector or None if there isn’t one

previous

@property
def previous(self) -> Optional[Selector]
Returns the previous element of the current element in the children of the parent. Returns: The previous Selector or None if there isn’t one

Selection Methods

css()

def css(
    self,
    selector: str,
    identifier: str = "",
    adaptive: bool = False,
    auto_save: bool = False,
    percentage: int = 0,
) -> Selectors
Search the current tree with CSS3 selectors.
selector
str
required
The CSS3 selector to be used
identifier
str
default:""
A string used to save/retrieve the element’s data for the adaptive feature; if omitted, the selector itself is used. Recommended if you plan to use a different selector later and want to relocate the same element(s)
adaptive
bool
default:"false"
If enabled, the function tries to relocate the element if it was saved before
auto_save
bool
default:"false"
Automatically save new elements so the adaptive feature can relocate them later
percentage
int
default:"0"
The minimum percentage to accept while adaptive is working. Don’t play with this number unless you know what you’re doing
Returns: A Selectors object

xpath()

def xpath(
    self,
    selector: str,
    identifier: str = "",
    adaptive: bool = False,
    auto_save: bool = False,
    percentage: int = 0,
    **kwargs: Any,
) -> Selectors
Search the current tree with XPath selectors.
selector
str
required
The XPath selector to be used
identifier
str
default:""
A string used to save/retrieve the element’s data for the adaptive feature; if omitted, the selector itself is used. Recommended if you plan to use a different selector later and want to relocate the same element(s)
adaptive
bool
default:"false"
If enabled, the function tries to relocate the element if it was saved before
auto_save
bool
default:"false"
Automatically save new elements so the adaptive feature can relocate them later
percentage
int
default:"0"
The minimum percentage to accept while adaptive is working. Don’t play with this number unless you know what you’re doing
**kwargs
Any
Additional keyword arguments will be passed as XPath variables in the XPath expression
Returns: A Selectors object

find_all()

def find_all(
    self,
    *args: str | Iterable[str] | Pattern | Callable | Dict[str, str],
    **kwargs: str,
) -> Selectors
Find elements by filters of your creation.
args
str | Iterable[str] | Pattern | Callable | Dict[str, str]
Tag name(s), iterable of tag names, regex patterns, function, or a dictionary of elements’ attributes. Leave empty for selecting all
kwargs
str
The attributes you want to filter elements based on
Returns: A Selectors object of the elements or empty list

find()

def find(
    self,
    *args: str | Iterable[str] | Pattern | Callable | Dict[str, str],
    **kwargs: str,
) -> Optional[Selector]
Find elements by filters of your creation, then return the first result.
args
str | Iterable[str] | Pattern | Callable | Dict[str, str]
Tag name(s), iterable of tag names, regex patterns, function, or a dictionary of elements’ attributes. Leave empty for selecting all
kwargs
str
The attributes you want to filter elements based on
Returns: The first Selector object or None if nothing matched

find_by_text()

def find_by_text(
    self,
    text: str,
    first_match: bool = True,
    partial: bool = False,
    case_sensitive: bool = False,
    clean_match: bool = True,
) -> Union[Selectors, Selector]
Find elements that have text content fully/partially matching the input.
text
str
required
Text query to match
first_match
bool
default:"true"
If enabled, return only the first element that matches the conditions
partial
bool
default:"false"
If enabled, the function returns elements that contain the input text
case_sensitive
bool
default:"false"
If enabled, letter case will be taken into consideration
clean_match
bool
default:"true"
If enabled, this will ignore all whitespaces and consecutive spaces while matching
Returns: A Selector if first_match=True, otherwise Selectors

find_by_regex()

def find_by_regex(
    self,
    query: str | Pattern[str],
    first_match: bool = True,
    case_sensitive: bool = False,
    clean_match: bool = True,
) -> Union[Selectors, Selector]
Find elements whose text content matches the input regex pattern.
query
str | Pattern[str]
required
Regex query/pattern to match
first_match
bool
default:"true"
Return the first element that matches conditions
case_sensitive
bool
default:"false"
If enabled, the letters case will be taken into consideration in the regex
clean_match
bool
default:"true"
If enabled, this will ignore all whitespaces and consecutive spaces while matching
Returns: A Selector if first_match=True, otherwise Selectors

find_similar()

def find_similar(
    self,
    similarity_threshold: float = 0.2,
    ignore_attributes: List | Tuple = ("href", "src"),
    match_text: bool = False,
) -> Selectors
Find elements that are in the same tree depth with the same tag name and match the current element attributes with a percentage higher than the threshold.
similarity_threshold
float
default:"0.2"
The percentage to use while comparing element attributes. Don’t play with this number unless you’re getting unwanted results
ignore_attributes
List | Tuple
default:"('href', 'src')"
Attribute names to ignore while matching attributes. Default ignores href and src as URLs can change between elements
match_text
bool
default:"false"
If True, element text content will be taken into calculation while matching. Not recommended in normal cases
Returns: A Selectors container of Selector objects or empty list

Extraction Methods

get_all_text()

def get_all_text(
    self,
    separator: str = "\n",
    strip: bool = False,
    ignore_tags: Tuple = ("script", "style"),
    valid_values: bool = True,
) -> TextHandler
Get all child strings of this element, concatenated using the given separator.
separator
str
default:"\\n"
Strings will be concatenated using this separator
strip
bool
default:"false"
If True, strings will be stripped before being concatenated
ignore_tags
Tuple
default:"('script', 'style')"
A tuple of all tag names you want to ignore
valid_values
bool
default:"true"
If enabled, elements with text-content that is empty or only whitespaces will be ignored
Returns: A TextHandler object

get()

def get(self) -> TextHandler
Serialize this element to a string. For text nodes, returns the text value. For HTML elements, returns the outer HTML. Returns: A TextHandler containing the serialized string

getall()

def getall(self) -> TextHandlers
Return a single-element list containing this element’s serialized string. Returns: A TextHandlers list with one element

json()

def json(self) -> Dict
Return JSON response if the response is jsonable. Returns: A dictionary parsed from JSON. Raises: Exception if content is not valid JSON

re()

def re(
    self,
    regex: str | Pattern[str],
    replace_entities: bool = True,
    clean_match: bool = False,
    case_sensitive: bool = True,
) -> TextHandlers
Apply the given regex to the current text and return a list of strings with the matches.
regex
str | Pattern[str]
required
Can be either a compiled regular expression or a string
replace_entities
bool
default:"true"
If enabled, character entity references are replaced by their corresponding character
clean_match
bool
default:"false"
If enabled, this will ignore all whitespaces and consecutive spaces while matching
case_sensitive
bool
default:"true"
If disabled, the function will set the regex to ignore the letters case while compiling it
Returns: A TextHandlers list of matches

re_first()

def re_first(
    self,
    regex: str | Pattern[str],
    default=None,
    replace_entities: bool = True,
    clean_match: bool = False,
    case_sensitive: bool = True,
) -> TextHandler
Apply the given regex to text and return the first match if found, otherwise return the default value.
regex
str | Pattern[str]
required
Can be either a compiled regular expression or a string
default
Any
default:"None"
The default value to be returned if there is no match
replace_entities
bool
default:"true"
If enabled, character entity references are replaced by their corresponding character
clean_match
bool
default:"false"
If enabled, this will ignore all whitespaces and consecutive spaces while matching
case_sensitive
bool
default:"true"
If disabled, the function will set the regex to ignore the letters case while compiling it
Returns: A TextHandler with the first match or the default value

Utility Methods

prettify()

def prettify(self) -> TextHandler
Return a prettified version of the element’s inner HTML code. Returns: A TextHandler with formatted HTML

has_class()

def has_class(self, class_name: str) -> bool
Check if the element has a specific class.
class_name
str
required
The class name to check for
Returns: True if element has class with that name, otherwise False

urljoin()

def urljoin(self, relative_url: str) -> str
Join this Selector’s url with a relative url to form an absolute full URL.
relative_url
str
required
The relative URL to join
Returns: The absolute URL as a string

iterancestors()

def iterancestors(self) -> Generator[Selector, None, None]
Return a generator that loops over all ancestors of the element, starting with the element’s parent. Returns: A generator yielding Selector objects

find_ancestor()

def find_ancestor(self, func: Callable[[Selector], bool]) -> Optional[Selector]
Loop over all ancestors of the element until one matches the passed function.
func
Callable[[Selector], bool]
required
A function that takes each ancestor as an argument and returns True/False
Returns: The first ancestor that matches the function or None otherwise

Adaptive Methods

save()

def save(self, element: HtmlElement, identifier: str) -> None
Saves the element’s unique properties to the storage for retrieval and relocation later.
element
HtmlElement
required
The element itself to save to storage. Can be a Selector or pure HtmlElement
identifier
str
required
The identifier that will be used to retrieve the element later from the storage

retrieve()

def retrieve(self, identifier: str) -> Optional[Dict[str, Any]]
Using the identifier, search the storage and return the unique properties of the element.
identifier
str
required
The identifier used to retrieve the element from the storage
Returns: A dictionary of the unique properties or None

relocate()

def relocate(
    self,
    element: Union[Dict, HtmlElement, Selector],
    percentage: int = 0,
    selector_type: bool = False,
) -> Union[List[HtmlElement], Selectors]
Search again for the element in the page tree, used automatically on page structure change.
element
Union[Dict, HtmlElement, Selector]
required
The element to relocate in the tree
percentage
int
default:"0"
The minimum percentage to accept. Don’t play with this number unless you know what you’re doing
selector_type
bool
default:"false"
If True, the return result will be converted to Selectors object
Returns: List of pure HTML elements that got the highest matching score or Selectors object

Magic Methods

__getitem__()

def __getitem__(self, key: str) -> TextHandler
Get element attribute by key. Example:
selector['href']  # Returns the href attribute value

__contains__()

def __contains__(self, key: str) -> bool
Check if element has an attribute. Example:
'class' in selector  # Returns True if element has class attribute

Aliases

For compatibility with Scrapy/Parsel:
  • extract() - alias for getall()
  • extract_first() - alias for get()
