The Selector class is the core component of Scrapling. It wraps HTML content and provides methods for selecting and extracting data using CSS, XPath, or text-based queries.
Constructor
HTML content as either string or bytes
Store a URL with the HTML data for retrieving later
The encoding type used in HTML parsing
Should always be enabled when parsing large HTML documents. Controls the libxml2 feature that forbids parsing certain large documents to protect from possible memory exhaustion
Used internally to pass etree objects. Don’t use unless you know what you’re doing
Whether to drop comments while parsing the HTML body
Whether to drop CDATA while parsing the HTML body
Globally turns off the adaptive feature in all functions. Takes priority over all adaptive-related arguments and functions in the class
The storage class to be passed for adaptive functionalities
A dictionary of argument-value pairs to be passed for the storage class. If empty, default values will be used
Properties
tag
"#text" for text nodes
text
TextHandler object containing the element’s text
attrib
AttributesHandler object containing the element’s attributes
html_content
TextHandler containing the element’s inner HTML
body
parent
Selector or None if there is no parent
children
Selectors object containing child elements, or empty list if none
siblings
Selectors object containing sibling elements, or empty list if none
below_elements
Selectors object containing all descendant elements
path
Selectors containing the path leading to the current element from the root.
Returns: A Selectors object representing the element’s path
next
Selector or None if there isn’t one
previous
Selector or None if there isn’t one
Selection Methods
css()
The CSS3 selector to be used
A string that will be used to save/retrieve element’s data in adaptive, otherwise the selector will be used. Recommended if you plan to use a different selector later and want to relocate the same element(s)
Enabled will make the function try to relocate the element if it was saved before
Automatically save new elements for adaptive later
The minimum percentage to accept while adaptive is working. Don’t play with this number unless you know what you’re doing
Selectors object
xpath()
The XPath selector to be used
A string that will be used to save/retrieve element’s data in adaptive, otherwise the selector will be used. Recommended if you plan to use a different selector later and want to relocate the same element(s)
Enabled will make the function try to relocate the element if it was saved before
Automatically save new elements for adaptive later
The minimum percentage to accept while adaptive is working. Don’t play with this number unless you know what you’re doing
Additional keyword arguments will be passed as XPath variables in the XPath expression
Selectors object
find_all()
Tag name(s), iterable of tag names, regex patterns, function, or a dictionary of elements’ attributes. Leave empty for selecting all
The attributes you want to filter elements based on
Selectors object of the elements or empty list
find()
Tag name(s), iterable of tag names, regex patterns, function, or a dictionary of elements’ attributes. Leave empty for selecting all
The attributes you want to filter elements based on
Selector object or None if the result didn’t match
find_by_text()
Text query to match
Returns the first element that matches conditions
If enabled, the function returns elements that contain the input text
If enabled, letter case will be taken into consideration
If enabled, all whitespace and consecutive spaces will be ignored while matching
Selector if first_match=True, otherwise Selectors
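The interaction between the matching flags above can be hard to picture from the descriptions alone. The sketch below is a plain-Python illustration, not Scrapling's implementation: `text_matches` is a hypothetical helper, the parameter names `partial`, `case_sensitive`, and `clean_match` are assumed from the descriptions, and the whitespace handling (collapse runs, then trim) is only one plausible reading of "ignore all whitespace and consecutive spaces".

```python
import re


def text_matches(element_text, query, partial=False, case_sensitive=False, clean_match=False):
    """Hypothetical helper mirroring the flags described above (assumed semantics)."""
    if clean_match:
        # Collapse runs of whitespace to a single space and trim both ends.
        element_text = re.sub(r"\s+", " ", element_text).strip()
    if not case_sensitive:
        element_text = element_text.lower()
        query = query.lower()
    # partial=True accepts containment; otherwise the whole text must equal the query.
    return query in element_text if partial else element_text == query


samples = ["  Add to   cart ", "Add to cart", "ADD TO CART now"]

# Strict whole-text comparison, case-insensitive, no whitespace cleanup.
exact = [s for s in samples if text_matches(s, "add to cart")]

# Containment plus whitespace cleanup accepts all three samples.
loose = [s for s in samples if text_matches(s, "add to cart", partial=True, clean_match=True)]
```

Under these assumptions, `exact` keeps only the sample whose text equals the query apart from letter case, while `loose` also accepts the padded and the longer sample.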
find_by_regex()
Regex query/pattern to match
Return the first element that matches conditions
If enabled, letter case will be taken into consideration in the regex
If enabled, all whitespace and consecutive spaces will be ignored while matching
Selector if first_match=True, otherwise Selectors
find_similar()
The percentage to use while comparing element attributes. Don’t play with this number unless you’re getting unwanted results
Attribute names to ignore while matching attributes. By default, href and src are ignored, as URLs can change between elements
If True, element text content will be taken into account while matching. Not recommended in normal cases
Selectors container of Selector objects or empty list
Extraction Methods
get_all_text()
Strings will be concatenated using this separator
If True, strings will be stripped before being concatenated
A tuple of all tag names you want to ignore
If enabled, elements whose text content is empty or only whitespace will be ignored
TextHandler object
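To make the four options above concrete, here is a standard-library sketch of the same idea. It is not Scrapling's code: `all_text` and `TextCollector` are hypothetical names, and the parameter names and defaults (`separator`, `strip`, `ignore_tags`, `valid_values`) are assumptions inferred from the descriptions.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text chunks, skipping everything inside ignored tags."""

    def __init__(self, ignore_tags):
        super().__init__()
        self.ignore_tags = set(ignore_tags)
        self.skip_depth = 0  # > 0 while inside an ignored tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.ignore_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.ignore_tags and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def all_text(html, separator="\n", strip=False, ignore_tags=("script", "style"), valid_values=True):
    """Hypothetical stand-in for the behaviour described above (assumed defaults)."""
    collector = TextCollector(ignore_tags)
    collector.feed(html)
    collector.close()
    chunks = collector.chunks
    if valid_values:
        # Drop chunks that are empty or whitespace-only.
        chunks = [c for c in chunks if c.strip()]
    if strip:
        chunks = [c.strip() for c in chunks]
    return separator.join(chunks)


html_doc = "<div><h1> Title </h1><script>var x = 1;</script><p>Body</p> </div>"
result = all_text(html_doc, separator=" | ", strip=True)
```

In this sketch the script body is skipped, the whitespace-only chunk before `</div>` is dropped, and the remaining strings are stripped before being joined with the separator.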
get()
TextHandler containing the serialized string
getall()
TextHandlers list with one element
json()
re()
Can be either a compiled regular expression or a string
If enabled, character entity references are replaced by their corresponding character
If enabled, all whitespace and consecutive spaces will be ignored while matching
If disabled, the regex will be compiled to ignore letter case
TextHandlers list of matches
re_first()
Can be either a compiled regular expression or a string
The default value to be returned if there is no match
If enabled, character entity references are replaced by their corresponding character
If enabled, all whitespace and consecutive spaces will be ignored while matching
If disabled, the regex will be compiled to ignore letter case
TextHandler with the first match or the default value
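The flags shared by re() and re_first() can be sketched with the standard library alone. This is an illustration under stated assumptions rather than Scrapling's implementation: `regex_all` and `regex_first` are hypothetical helpers, the parameter names (`replace_entities`, `clean_match`, `case_sensitive`, `default`) are inferred from the descriptions above, and `html.unescape` stands in for whatever entity replacement the library actually performs.

```python
import html
import re


def regex_all(text, pattern, replace_entities=True, clean_match=False, case_sensitive=True):
    """Hypothetical helper: return every match as a string."""
    if clean_match:
        # Collapse whitespace runs and trim before matching.
        text = re.sub(r"\s+", " ", text).strip()
    if isinstance(pattern, str):
        # Strings are compiled here; case_sensitive=False adds IGNORECASE.
        pattern = re.compile(pattern, 0 if case_sensitive else re.IGNORECASE)
    matches = [m.group(0) for m in pattern.finditer(text)]
    if replace_entities:
        # Turn references such as &amp; into the characters they stand for.
        matches = [html.unescape(m) for m in matches]
    return matches


def regex_first(text, pattern, default=None, **flags):
    """Hypothetical helper: first match, or the default when nothing matches."""
    results = regex_all(text, pattern, **flags)
    return results[0] if results else default


text = "Tom &amp; Jerry, tom &amp; spike"
insensitive = regex_all(text, r"tom \S+ \w+", case_sensitive=False)
sensitive = regex_all(text, r"tom \S+ \w+")
raw = regex_all(text, r"tom \S+ \w+", replace_entities=False)
fallback = regex_first(text, r"\d+", default="n/a")
```

A compiled pattern passed in is used as-is in this sketch, so `case_sensitive` only affects string patterns.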
Utility Methods
prettify()
TextHandler with formatted HTML
has_class()
The class name to check for
True if element has class with that name, otherwise False
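A class check of this kind is usually a whole-token test against the whitespace-separated `class` attribute, not a substring test. The sketch below shows that distinction in plain Python; `class_in_attribute` is a hypothetical helper, and token-based matching is an assumption about the behaviour, not something the description above states.

```python
def class_in_attribute(class_attribute, class_name):
    """Hypothetical helper: True if class_name is one of the whitespace-separated tokens."""
    return class_name in class_attribute.split()


whole_token = class_in_attribute("btn btn-primary", "btn")
substring_only = class_in_attribute("btn-primary large", "btn")
```

Here `whole_token` is True, while `substring_only` is False because "btn" only appears inside the longer token "btn-primary".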
urljoin()
The relative URL to join
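Joining a relative URL against the URL stored with the page follows the usual resolution rules. The sketch below uses the standard library's `urllib.parse.urljoin` to show what those rules produce. Whether the method delegates to this exact function is an assumption, and `base_url` is a made-up example value standing in for the URL passed to the constructor.

```python
from urllib.parse import urljoin

# Assumed: the URL that was stored alongside the HTML content.
base_url = "https://example.com/catalog/page-2.html"

links = ["item-7.html", "/about", "../cart", "https://cdn.example.net/logo.png"]

# Each relative link is resolved against the base; absolute links pass through unchanged.
resolved = [urljoin(base_url, link) for link in links]
```

Note that a link starting with `/` replaces the whole path, while a bare file name is resolved against the base URL's directory.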
iterancestors()
Selector objects
find_ancestor()
A function that takes each ancestor as an argument and returns True/False
None otherwise
Adaptive Methods
save()
The element itself to save to storage. Can be a Selector or pure HtmlElement
The identifier that will be used to retrieve the element later from the storage
retrieve()
The identifier used to retrieve the element from the storage
None
relocate()
The element to relocate in the tree
The minimum percentage to accept. Don’t play with this number unless you know what you’re doing
If True, the return result will be converted to a Selectors object
Selectors object
Magic Methods
__getitem__()
__contains__()
Aliases
For compatibility with Scrapy/Parsel:
extract() - alias for getall()
extract_first() - alias for get()