The TextHandler class extends Python’s standard str class to provide enhanced text processing capabilities including regex operations, JSON parsing, cleaning, and more.
Overview
TextHandler is used throughout Scrapling to represent text content. It’s returned by properties like Selector.text and methods like get(), re(), etc. It maintains all standard string functionality while adding powerful text manipulation methods.
String Methods
All standard Python string methods are available, and those that return strings return TextHandler objects instead:
strip(), lstrip(), rstrip() - Remove whitespace
upper(), lower(), capitalize(), title(), swapcase(), casefold() - Case conversion
replace() - Replace substrings
split() - Split into list (returns list of TextHandler objects)
join() - Join iterable
center(), ljust(), rjust(), zfill() - Alignment and padding
expandtabs(), translate() - Text transformation
format(), format_map() - String formatting
And many more standard string methods.
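As a rough illustration of what "return TextHandler objects" means, the sketch below shows one way a str subclass can keep its own type across method calls. MiniText is a hypothetical stand-in written for this example; it is not Scrapling's implementation and only covers two methods.

```python
class MiniText(str):
    """Hypothetical stand-in for a str subclass such as TextHandler."""

    def upper(self) -> "MiniText":
        # str.upper() returns a plain str, so wrap the result again.
        return MiniText(super().upper())

    def split(self, sep=None, maxsplit=-1) -> list:
        # Wrap every piece so chained calls keep the subclass type.
        return [MiniText(part) for part in super().split(sep, maxsplit)]


text = MiniText("hello world")
shouted = text.upper()
words = text.split()

print(type(shouted).__name__)              # MiniText
print([type(w).__name__ for w in words])   # ['MiniText', 'MiniText']
print(type(str.upper(text)).__name__)      # str (the unwrapped built-in)
```

Without the overrides, the built-in methods would hand back plain str objects and the enhanced methods would no longer be available on the result.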
Enhanced Methods
clean()
def clean(self, remove_entities: bool = False) -> TextHandler
Return a new copy of the string with leading and trailing whitespace removed and each run of whitespace (spaces, tabs, newlines) collapsed into a single space.
remove_entities: If True, also replaces HTML entities with their corresponding characters
Returns: A cleaned TextHandler object
Example:
text = TextHandler(" Hello\n\n world \t ")
clean = text.clean()
print(clean) # Output: "Hello world"
html_text = TextHandler("Hello &amp; world")
clean = html_text.clean(remove_entities=True)
print(clean) # Output: "Hello & world"
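For readers who want to see the normalization spelled out, here is a minimal standard-library approximation of the behavior shown in the example. The function name clean_sketch is made up for this sketch, html.unescape stands in for the library's entity replacement, and edge cases may differ from the real clean().

```python
import html
import re


def clean_sketch(text: str, remove_entities: bool = False) -> str:
    """Stdlib approximation of the behaviour shown above (not Scrapling's code)."""
    if remove_entities:
        # html.unescape is an assumed stand-in for the real entity replacement.
        text = html.unescape(text)
    # Collapse every run of whitespace to one space, then trim both ends.
    return re.sub(r"\s+", " ", text).strip()


print(clean_sketch("  Hello\n\n   world  \t "))                 # Hello world
print(clean_sketch("Hello &amp; world", remove_entities=True))  # Hello & world
```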
sort()
def sort(self, reverse: bool = False) -> TextHandler
Return a sorted version of the string.
reverse: If True, sort in descending order
Returns: A TextHandler with characters sorted
Example:
text = TextHandler("dcba")
sorted_text = text.sort()
print(sorted_text) # Output: "abcd"
reverse_sorted = text.sort(reverse=True)
print(reverse_sorted) # Output: "dcba"
json()
Parse the string as JSON and return the result.
Returns: The parsed Python object (a dictionary when the string holds a JSON object)
Raises: Exception if the string is not valid JSON
Example:
json_str = TextHandler('{"name": "John", "age": 30}')
data = json_str.json()
print(data['name']) # Output: John
print(data['age']) # Output: 30
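Because parsing raises on invalid input, scraped text of uncertain shape is usually parsed defensively. The sketch below shows the pattern with the standard-library json module as a stand-in for text.json(). parse_json_or is a hypothetical helper, and the exception type to catch with TextHandler depends on its JSON backend, so treat the except clause as an assumption.

```python
import json
from typing import Any


def parse_json_or(text: str, default: Any = None) -> Any:
    """Hypothetical helper: return parsed JSON, or `default` on invalid input."""
    try:
        # json.loads is a stdlib stand-in for the text.json() call.
        return json.loads(text)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError.
        return default


print(parse_json_or('{"name": "John", "age": 30}'))  # {'name': 'John', 'age': 30}
print(parse_json_or("not json", default={}))         # {}
print(parse_json_or("[1, 2, 3]"))                     # [1, 2, 3]
```

The last call is a reminder that valid JSON is not always an object, so the parsed value can be a list as well as a dictionary.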
re()
def re(
self,
regex: str | Pattern,
replace_entities: bool = True,
clean_match: bool = False,
case_sensitive: bool = True,
check_match: bool = False,
) -> TextHandlers | bool
Apply the given regex to the current text and return a list of strings with the matches.
regex: Can be either a compiled regular expression or a string
replace_entities: If enabled, character entity references are replaced by their corresponding character in results
clean_match: If enabled, whitespace and consecutive spaces are ignored while matching
case_sensitive: If disabled, the regex will ignore letter case while matching
check_match: If enabled, only check whether the regex matches, without processing the results, and return a boolean instead of the matches
Returns: A TextHandlers list of matches, or a boolean if check_match=True
Example:
text = TextHandler("Price: $19.99, Discount: $5.00")
prices = text.re(r'\$([\d.]+)')
print(prices) # Output: ['19.99', '5.00']
# Check if pattern exists
has_price = text.re(r'\$\d+', check_match=True)
print(has_price) # Output: True
# Case insensitive
text = TextHandler("Hello WORLD")
matches = text.re(r'hello', case_sensitive=False)
print(matches) # Output: ['Hello']
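The case_sensitive and check_match options map naturally onto the standard re module. The following is a simplified sketch of that mapping, assuming patterns with zero or one capture group; re_sketch is a hypothetical name, and the entity and whitespace handling of the real method are left out.

```python
import re
from typing import List, Union


def re_sketch(text: str, pattern: str, case_sensitive: bool = True,
              check_match: bool = False) -> Union[List[str], bool]:
    """Simplified stdlib sketch of the options above (not Scrapling's code)."""
    # Disabling case sensitivity corresponds to the IGNORECASE flag.
    flags = 0 if case_sensitive else re.IGNORECASE
    compiled = re.compile(pattern, flags)
    if check_match:
        # Only report whether the pattern occurs anywhere in the text.
        return compiled.search(text) is not None
    # With one capture group, findall returns the group contents.
    return compiled.findall(text)


text = "Price: $19.99, Discount: $5.00"
print(re_sketch(text, r"\$([\d.]+)"))                          # ['19.99', '5.00']
print(re_sketch(text, r"\$\d+", check_match=True))             # True
print(re_sketch("Hello WORLD", r"hello", case_sensitive=False))  # ['Hello']
```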
re_first()
def re_first(
self,
regex: str | Pattern,
default: Any = None,
replace_entities: bool = True,
clean_match: bool = False,
case_sensitive: bool = True,
) -> TextHandler
Apply the given regex to text and return the first match if found, otherwise return the default value.
regex: Can be either a compiled regular expression or a string
default: The default value to be returned if there is no match
replace_entities: If enabled, character entity references are replaced by their corresponding character
clean_match: If enabled, whitespace and consecutive spaces are ignored while matching
case_sensitive: If disabled, the regex will ignore letter case while matching
Returns: A TextHandler with the first match or the default value
Example:
text = TextHandler("Product ID: ABC123")
product_id = text.re_first(r'ID: ([A-Z0-9]+)')
print(product_id) # Output: ABC123
# With default
text = TextHandler("No price here")
price = text.re_first(r'\$(\d+)', '0.00')
print(price) # Output: 0.00
Compatibility Methods
For compatibility with Scrapy/Parsel:
get()
def get(self, default=None) -> TextHandler
Returns self (for Scrapy/Parsel compatibility).
Example:
text = TextHandler("Hello")
result = text.get()
print(result) # Output: Hello
get_all()
def get_all(self) -> TextHandler
Returns self (for Scrapy/Parsel compatibility).
Example:
text = TextHandler("Hello")
result = text.get_all()
print(result) # Output: Hello
Aliases
extract() - alias for get_all()
extract_first() - alias for get()
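One plausible way to picture these compatibility methods is a str subclass whose get() and get_all() hand back the string itself, with the aliases bound to the same functions. CompatText below is a hypothetical class written for illustration and makes no claim about how Scrapling defines them.

```python
class CompatText(str):
    """Hypothetical str subclass mimicking the compatibility methods above."""

    def get(self, default=None) -> "CompatText":
        # A single string is already the "first result", so return it as is.
        return self

    def get_all(self) -> "CompatText":
        return self

    # Class-level aliases: both names point at the same function object.
    extract = get_all
    extract_first = get


text = CompatText("Hello")
print(text.extract_first())      # Hello
print(text.extract() is text)    # True
```

Code written in the Scrapy/Parsel style can then call .get() or .extract_first() on a value without checking whether it is already a string.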
Common Use Cases
Extract Numbers
text = TextHandler("Total: 1,234.56 USD")
number = text.re_first(r'([\d,]+\.\d+)').replace(',', '')
price = float(number)
print(price) # Output: 1234.56
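The chained call above assumes the pattern matches. When it does not, re_first() returns its default (None unless one is given), and calling .replace() on None raises AttributeError. A guarded version of the same idea can be sketched with the standard library as follows; first_match and parse_price are hypothetical helpers, with first_match standing in for re_first().

```python
import re
from typing import Optional


def first_match(text: str, pattern: str, default: Optional[str] = None) -> Optional[str]:
    """Hypothetical stdlib stand-in for re_first(): first match or `default`."""
    match = re.search(pattern, text)
    if match is None:
        return default
    # Prefer the first capture group when the pattern defines one.
    return match.group(1) if match.groups() else match.group(0)


def parse_price(text: str) -> Optional[float]:
    number = first_match(text, r"([\d,]+\.\d+)")
    if number is None:
        # No match: skip the .replace() call, which would fail on None.
        return None
    return float(number.replace(",", ""))


print(parse_price("Total: 1,234.56 USD"))  # 1234.56
print(parse_price("No price here"))        # None
```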
Clean and Normalize Text
text = TextHandler(" Product\n Name: \t Widget ")
clean = text.clean()
print(clean) # Output: "Product Name: Widget"
Parse Structured Data
json_text = TextHandler('{"items": [{"id": 1}, {"id": 2}]}')
data = json_text.json()
for item in data['items']:
    print(f"ID: {item['id']}")
Extract Multiple Values
text = TextHandler("Colors: red, blue, green")
colors = text.re(r'(\w+)(?:,|$)')
print(colors) # Output: ['red', 'blue', 'green']
Case-Insensitive Search
text = TextHandler("The Quick Brown Fox")
has_fox = text.re(r'fox', case_sensitive=False, check_match=True)
print(has_fox) # Output: True
Handle HTML Entities
text = TextHandler("Price: &pound;19.99 &amp; free shipping")
clean = text.clean(remove_entities=True)
print(clean) # Output: "Price: £19.99 & free shipping"
Indexing and Slicing
text = TextHandler("Hello World")
# Indexing returns TextHandler
first = text[0]
print(first) # Output: H
print(type(first)) # Output: <class 'TextHandler'>
# Slicing returns TextHandler
word = text[:5]
print(word) # Output: Hello
print(type(word)) # Output: <class 'TextHandler'>
String Operations
text = TextHandler("Hello")
# Concatenation
greeting = text + " World"
print(greeting) # Output: Hello World
# Repetition
repeated = text * 3
print(repeated) # Output: HelloHelloHello
# Membership
print('ell' in text) # Output: True
# Comparison
print(text == "Hello") # Output: True
print(text < "World") # Output: True
Notes
- TextHandler is a subclass of str, so all standard string operations work
- Methods that modify strings return new TextHandler objects (strings are immutable)
- The clean() method is particularly useful for normalizing scraped text
- Regex methods use the re module internally and support all Python regex features
- JSON parsing uses orjson for high performance
- HTML entity replacement uses the w3lib library
See Also
- Selectors - List container that uses TextHandler for text operations
- Selector - Uses TextHandler for text content via the text property
- AttributesHandler - Uses TextHandler for attribute values