TextHandler

The TextHandler class extends Python’s standard str class to provide enhanced text processing capabilities including regex operations, JSON parsing, cleaning, and more.

Overview

TextHandler is used throughout Scrapling to represent text content. It’s returned by properties like Selector.text and methods like get(), re(), etc. It maintains all standard string functionality while adding powerful text manipulation methods.

String Methods

All standard Python string methods are available and return TextHandler objects:

strip(), lstrip(), rstrip() - Remove whitespace
upper(), lower(), capitalize(), title(), swapcase(), casefold() - Case conversion
replace() - Replace substrings
split() - Split into list (returns list of TextHandler objects)
join() - Join iterable
center(), ljust(), rjust(), zfill() - Alignment and padding
expandtabs(), translate() - Text transformation
format(), format_map() - String formatting

And many more standard string methods.
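
To illustrate what "return TextHandler objects" means in practice, here is a minimal sketch of a str subclass that re-wraps results so the subclass type survives method calls. MiniText is a hypothetical stand-in written for this example; it is not Scrapling's implementation and only covers three methods.

```python
class MiniText(str):
    """Hypothetical stand-in for a str subclass (not Scrapling's TextHandler)."""

    def strip(self, chars=None):
        # The built-in returns a plain str, so re-wrap it in the subclass.
        return MiniText(super().strip(chars))

    def upper(self):
        return MiniText(super().upper())

    def split(self, sep=None, maxsplit=-1):
        # Wrap every piece so list items keep the subclass type too.
        return [MiniText(part) for part in super().split(sep, maxsplit)]


text = MiniText("  hello world  ")
stripped = text.strip()
print(type(stripped).__name__)         # MiniText
print(type(str.strip(text)).__name__)  # str (the unwrapped built-in result)
print([type(p).__name__ for p in stripped.split()])  # ['MiniText', 'MiniText']
```

Without the re-wrapping, the first method call would hand back a plain str and the enhanced methods would no longer be available on the result.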

Enhanced Methods

clean()

def clean(self, remove_entities: bool = False) -> TextHandler

Return a new version of the string with leading and trailing whitespace removed and runs of whitespace (spaces, tabs, newlines) collapsed into a single space.

Parameters:

remove_entities (bool, default: False) - If True, also replaces HTML entities with their corresponding characters

Returns: A cleaned TextHandler object

Example:

text = TextHandler("  Hello\n\n  world  \t  ")
clean = text.clean()
print(clean)  # Output: "Hello world"

html_text = TextHandler("Hello&nbsp;&amp;&nbsp;world")
clean = html_text.clean(remove_entities=True)
print(clean)  # Output: "Hello & world"
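
The behavior shown above can be approximated with the standard library alone. The following is a rough sketch under that assumption: normalize_whitespace is a hypothetical helper, and it uses html.unescape rather than whatever Scrapling uses internally, so edge cases may differ from clean().

```python
import html
import re


def normalize_whitespace(text: str, remove_entities: bool = False) -> str:
    """Hypothetical helper approximating clean(); not Scrapling's code."""
    if remove_entities:
        # Decode entities first so "&nbsp;" becomes a whitespace character.
        text = html.unescape(text)
    # Collapse every run of whitespace into one space, then trim the ends.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_whitespace("  Hello\n\n  world  \t  "))               # Hello world
print(normalize_whitespace("Tom &amp; Jerry", remove_entities=True))  # Tom & Jerry
```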

sort()

def sort(self, reverse: bool = False) -> TextHandler

Return a sorted version of the string.

Parameters:

reverse (bool, default: False) - If True, sort in descending order

Returns: A TextHandler with characters sorted

Example:

text = TextHandler("dcba")
sorted_text = text.sort()
print(sorted_text)  # Output: "abcd"

reverse_sorted = text.sort(reverse=True)
print(reverse_sorted)  # Output: "dcba"
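
Sorting here is character by character. A plain-Python sketch of the same idea, assuming ordinary code point ordering (sort_chars is a hypothetical helper, not part of Scrapling), shows that uppercase letters sort before lowercase ones:

```python
def sort_chars(text: str, reverse: bool = False) -> str:
    """Hypothetical helper: sort the characters of a string by code point."""
    return "".join(sorted(text, reverse=reverse))


print(sort_chars("dcba"))                # abcd
print(sort_chars("dcba", reverse=True))  # dcba
print(sort_chars("bAc"))                 # Abc
```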

json()

def json(self) -> Dict

Parse the string as JSON and return the result.

Returns: A dictionary parsed from the JSON string

Raises: Exception if the string is not valid JSON

Example:

json_str = TextHandler('{"name": "John", "age": 30}')
data = json_str.json()
print(data['name'])  # Output: John
print(data['age'])   # Output: 30
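
Because invalid input raises, scraping code often wraps the parse in a guard. The sketch below shows that pattern with the standard json module; parse_json_or_default is a hypothetical helper, and the exact exception type raised by TextHandler.json() is not specified here, so the except clause is an assumption tied to the stdlib parser only.

```python
import json
from typing import Any


def parse_json_or_default(text: str, default: Any = None) -> Any:
    """Hypothetical helper: return parsed JSON, or a fallback on bad input."""
    try:
        return json.loads(text)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError.
        return default


print(parse_json_or_default('{"name": "John", "age": 30}'))        # {'name': 'John', 'age': 30}
print(parse_json_or_default("<html>not json</html>", default={}))  # {}
```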

re()

def re(
    self,
    regex: str | Pattern,
    replace_entities: bool = True,
    clean_match: bool = False,
    case_sensitive: bool = True,
    check_match: bool = False,
) -> TextHandlers | bool

Apply the given regex to the current text and return a list of strings with the matches.

Parameters:

regex (str | Pattern, required) - Can be either a compiled regular expression or a string
replace_entities (bool, default: True) - If enabled, character entity references are replaced by their corresponding character in results
clean_match (bool, default: False) - If enabled, ignores all whitespace and consecutive spaces while matching
case_sensitive (bool, default: True) - If disabled, the regex will ignore letter case while matching
check_match (bool, default: False) - If enabled, only checks whether the regex matches, without processing the results, and returns a boolean instead of the matches

Returns: A TextHandlers list of matches, or a boolean if check_match=True

Example:

text = TextHandler("Price: $19.99, Discount: $5.00")
prices = text.re(r'\$([\d.]+)')
print(prices)  # Output: ['19.99', '5.00']

# Check if pattern exists
has_price = text.re(r'\$\d+', check_match=True)
print(has_price)  # Output: True

# Case insensitive
text = TextHandler("Hello WORLD")
matches = text.re(r'hello', case_sensitive=False)
print(matches)  # Output: ['Hello']
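
For readers who want to see how options like case_sensitive and check_match could map onto Python's re module, here is a simplified sketch. find_matches is a hypothetical function that assumes findall-style results and omits replace_entities and clean_match; it is not Scrapling's implementation.

```python
import re
from typing import List, Union


def find_matches(
    text: str,
    regex: Union[str, re.Pattern],
    case_sensitive: bool = True,
    check_match: bool = False,
) -> Union[List[str], bool]:
    """Hypothetical sketch of re()-like behavior using only the stdlib."""
    if isinstance(regex, str):
        # Flags only apply to string patterns; a compiled pattern keeps its own.
        regex = re.compile(regex, 0 if case_sensitive else re.IGNORECASE)
    if check_match:
        # Stop at the first hit instead of collecting every match.
        return regex.search(text) is not None
    # With one capturing group, findall returns the group contents.
    return regex.findall(text)


sample = "Price: $19.99, Discount: $5.00"
print(find_matches(sample, r"\$([\d.]+)"))                         # ['19.99', '5.00']
print(find_matches(sample, r"\$\d+", check_match=True))            # True
print(find_matches("Hello WORLD", r"hello", case_sensitive=False)) # ['Hello']
```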

re_first()

def re_first(
    self,
    regex: str | Pattern,
    default: Any = None,
    replace_entities: bool = True,
    clean_match: bool = False,
    case_sensitive: bool = True,
) -> TextHandler

Apply the given regex to text and return the first match if found, otherwise return the default value.

Parameters:

regex (str | Pattern, required) - Can be either a compiled regular expression or a string
default (Any, default: None) - The default value to be returned if there is no match
replace_entities (bool, default: True) - If enabled, character entity references are replaced by their corresponding character
clean_match (bool, default: False) - If enabled, ignores all whitespace and consecutive spaces while matching
case_sensitive (bool, default: True) - If disabled, the regex will ignore letter case while matching

Returns: A TextHandler with the first match or the default value

Example:

text = TextHandler("Product ID: ABC123")
product_id = text.re_first(r'ID: ([A-Z0-9]+)')
print(product_id)  # Output: ABC123

# With default
text = TextHandler("No price here")
price = text.re_first(r'\$(\d+)', '0.00')
print(price)  # Output: 0.00
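
The "first match or default" logic can be sketched with re.search. first_match below is a hypothetical stdlib-only helper that assumes the first capturing group is preferred when the pattern defines one; it leaves out replace_entities and clean_match and is not Scrapling's implementation.

```python
import re
from typing import Any


def first_match(text: str, regex: str, default: Any = None,
                case_sensitive: bool = True) -> Any:
    """Hypothetical sketch of re_first()-like behavior."""
    flags = 0 if case_sensitive else re.IGNORECASE
    match = re.search(regex, text, flags)
    if match is None:
        # No match: hand back the caller's fallback unchanged.
        return default
    # Prefer the first capturing group if the pattern has any groups.
    return match.group(1) if match.re.groups else match.group(0)


print(first_match("Product ID: ABC123", r"ID: ([A-Z0-9]+)"))  # ABC123
print(first_match("No price here", r"\$(\d+)", "0.00"))       # 0.00
```

Since the fallback is returned as-is, passing a default of the same type as a real match keeps later code (for example a chained string method) from failing on None.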

Compatibility Methods

For compatibility with Scrapy/Parsel:

get()

def get(self, default=None) -> TextHandler

Returns self (for Scrapy/Parsel compatibility).

Example:

text = TextHandler("Hello")
result = text.get()
print(result)  # Output: Hello

get_all()

def get_all(self) -> TextHandler

Returns self (for Scrapy/Parsel compatibility).

Example:

text = TextHandler("Hello")
result = text.get_all()
print(result)  # Output: Hello

Aliases

extract() - alias for get_all()
extract_first() - alias for get()
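
One common way to provide such aliases in Python is to bind the same function to a second name in the class body. The sketch below shows that pattern on a hypothetical CompatText class; how Scrapling actually defines its aliases is an assumption here.

```python
class CompatText(str):
    """Hypothetical str subclass with Parsel-style accessors."""

    def get(self, default=None):
        # A single text value is already "the result", so return it as-is.
        return self

    def get_all(self):
        return self

    # Aliases: both names point at the same function object.
    extract = get_all
    extract_first = get


text = CompatText("Hello")
print(text.extract_first())          # Hello
print(text.extract() is text.get())  # True
```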

Common Use Cases

Extract Numbers

text = TextHandler("Total: 1,234.56 USD")
number = text.re_first(r'([\d,]+\.\d+)').replace(',', '')
price = float(number)
print(price)  # Output: 1234.56

Clean and Normalize Text

text = TextHandler("  Product\n  Name:   \t  Widget  ")
clean = text.clean()
print(clean)  # Output: "Product Name: Widget"

Parse Structured Data

json_text = TextHandler('{"items": [{"id": 1}, {"id": 2}]}')
data = json_text.json()
for item in data['items']:
    print(f"ID: {item['id']}")

Extract Multiple Values

text = TextHandler("Colors: red, blue, green")
colors = text.re(r'(\w+)(?:,|$)')
print(colors)  # Output: ['red', 'blue', 'green']

Case-Insensitive Search

text = TextHandler("The Quick Brown Fox")
has_fox = text.re(r'fox', case_sensitive=False, check_match=True)
print(has_fox)  # Output: True

Handle HTML Entities

text = TextHandler("Price: &pound;19.99 &amp; free shipping")
clean = text.clean(remove_entities=True)
print(clean)  # Output: "Price: £19.99 & free shipping"

Indexing and Slicing

text = TextHandler("Hello World")

# Indexing returns TextHandler
first = text[0]
print(first)  # Output: H
print(type(first))  # Output: <class 'TextHandler'>

# Slicing returns TextHandler
word = text[:5]
print(word)  # Output: Hello
print(type(word))  # Output: <class 'TextHandler'>

String Operations

text = TextHandler("Hello")

# Concatenation
greeting = text + " World"
print(greeting)  # Output: Hello World

# Repetition
repeated = text * 3
print(repeated)  # Output: HelloHelloHello

# Membership
print('ell' in text)  # Output: True

# Comparison
print(text == "Hello")  # Output: True
print(text < "World")   # Output: True

Notes

TextHandler is a subclass of str, so all standard string operations work
Methods that modify strings return new TextHandler objects (strings are immutable)
The clean() method is particularly useful for normalizing scraped text
Regex methods use the re module internally and support all Python regex features
JSON parsing uses orjson for high performance
HTML entity replacement uses the w3lib library
