The Halgorithem.text_processing module provides NLP utility functions used internally by the Halgorithm class, but also accessible directly if you need lower-level text analysis.
Functions
clean_text
The raw input string to clean.
Cleaned plain-text string. Guaranteed to end with ., !, or ?.
- Strips Markdown via strip_markdown() (powered by markdown-it-py)
- Normalizes unicode, whitespace, and quotation marks via textacy
- Removes (, ), ;, :, and " punctuation
- Appends a period if the text does not already end with ., !, or ?
- Applies clean-text normalization: ASCII conversion, URL removal, email removal
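The cleanup steps listed above can be approximated with the standard library alone. The sketch below is illustrative only: clean_text_sketch is a hypothetical name, the Markdown-stripping step is skipped, and re and unicodedata stand in for textacy and clean-text. URL and email removal also runs first here so that stripping ":" does not break URLs, so the step order and the output may differ from the real clean_text().

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
# Map curly quotes to their straight ASCII equivalents.
QUOTE_MAP = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def clean_text_sketch(text: str) -> str:
    """Hypothetical stand-in for clean_text(); not the module's implementation."""
    # Drop URLs and email addresses before touching punctuation.
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    # Normalize unicode and quotation marks, then force ASCII
    # (characters with no ASCII form are simply dropped here).
    text = unicodedata.normalize("NFKC", text).translate(QUOTE_MAP)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Remove ( ) ; : and " characters.
    text = re.sub(r'[();:"]', "", text)
    # Collapse runs of whitespace.
    text = re.sub(r"\s+", " ", text).strip()
    # Guarantee terminal punctuation.
    if not text.endswith((".", "!", "?")):
        text += "."
    return text
```

Calling clean_text_sketch on a string that already ends in ., !, or ? leaves the ending untouched; any other input gains a trailing period.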
tokenize
The input string to tokenize.
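As a rough illustration of what tokenize() produces, the sketch below lowercases the input, drops punctuation and whitespace, and filters stop words. The function name, the regex, and the tiny stop-word set are assumptions made for the example; the real function filters against sklearn's ENGLISH_STOP_WORDS and may split tokens differently.

```python
import re

# Tiny illustrative stop-word set; NOT sklearn's ENGLISH_STOP_WORDS.
STOP_WORDS = frozenset({"a", "an", "and", "in", "is", "of", "the", "to"})


def tokenize_sketch(text):
    """Hypothetical stand-in for tokenize(): lowercase, no punctuation, no stop words."""
    # \w+ keeps runs of letters, digits, and underscores, so punctuation
    # and spaces never become tokens.
    words = re.findall(r"\w+", text.lower())
    return [word for word in words if word not in STOP_WORDS]
```

For example, tokenize_sketch("The cat is in the hat!") keeps only the two content words, cat and hat.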
List of lowercase token strings with punctuation, spaces, and sklearn ENGLISH_STOP_WORDS removed.
lemmatize_tokens
Like tokenize(), but returns the lemmatized form of each token rather than the surface form. Pronoun lemmas (-PRON-) are excluded.
The input string to lemmatize.
List of lowercase lemma strings, filtered the same way as tokenize() and with -PRON- lemmas removed.
extract_numbers
The input string to extract numbers from.
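The "primary results first, regex fallback second, no duplicates" behaviour of extract_numbers() can be sketched without quantulum3. In the example below, merge_numbers and its primary argument are hypothetical: primary stands in for whatever strings quantulum3 returned. The sketch deduplicates by exact string only, so it does not model how the real function decides that a digit was already caught by quantulum3.

```python
import re

# The fallback pattern: integers or decimals bounded by word boundaries.
NUMBER_RE = re.compile(r"\b\d+(?:\.\d+)?\b")


def merge_numbers(text, primary=()):
    """Hypothetical helper: primary results first, then regex matches, order-preserving dedup."""
    seen = set()
    merged = []
    for value in list(primary) + NUMBER_RE.findall(text):
        if value not in seen:
            seen.add(value)
            merged.append(value)
    return merged
```

Because the list is built in a single pass with a seen set, the first occurrence of each number string fixes its position in the result.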
Deduplicated list of number strings. quantulum3 results (e.g. "25000000000.0" for "$25 billion") appear first; bare digits not caught by quantulum3 are appended from the regex fallback \b\d+(?:\.\d+)?\b.
extract_entities
The input string to extract entities from.
List of tuples. Each tuple contains the lowercase, non-digit tokens of one recognized entity span (stop words and punctuation are removed via tokenize()).
get_synonyms
Results are cached with lru_cache(maxsize=4096). Underscores in lemma names are replaced with spaces.
The word to look up synonyms for.
Set of synonym strings. Returns an empty set if the WordNet corpus is unavailable.
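The caching and empty-set fallback can be pictured with a small self-contained sketch. Everything named here is an assumption for illustration: LEXICON_AVAILABLE stands in for an availability flag, _LEXICON is a made-up two-entry table in place of WordNet, and get_synonyms_sketch is not the module's function. The sketch returns a frozenset so cached results cannot be mutated by callers, whereas the text above describes the real return value as a set.

```python
from functools import lru_cache

# Stand-in availability flag; set to False to mimic a missing corpus.
LEXICON_AVAILABLE = True

# Made-up miniature lexicon in place of WordNet lemma names.
_LEXICON = {"car": ["auto", "motor_car"], "quick": ["fast", "speedy"]}


@lru_cache(maxsize=4096)
def get_synonyms_sketch(word):
    """Hypothetical cached lookup that degrades to an empty set instead of raising."""
    if not LEXICON_AVAILABLE:
        return frozenset()
    # Lemma names use underscores for multi-word entries; show them with spaces.
    return frozenset(name.replace("_", " ") for name in _LEXICON.get(word, []))
```

Repeated lookups for the same word are served from the cache, which get_synonyms_sketch.cache_info() makes visible.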
Synonym expansion depends on the WORDNET_AVAILABLE flag set at import time in nlp.py. If the WordNet corpus has not been downloaded (nltk.download("wordnet")), WORDNET_AVAILABLE is False and get_synonyms() silently returns an empty set without raising an error.
has_negation_mismatch
The claim text to check.
The reference chunk text to compare against.
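One way to picture a polarity check between a claim and a reference chunk is as an exclusive-or over the question "does this text contain a negation cue?". The sketch below assumes a small hand-picked cue list and a simple regex tokenizer; the function names are hypothetical, and the module's actual negation detection is not shown on this page and may work differently.

```python
import re

# Illustrative negation cues only; not the module's actual list.
NEGATION_WORDS = frozenset({"no", "not", "never", "none", "cannot", "without"})


def _has_negation(text):
    # Keep apostrophes so contractions such as "isn't" stay in one piece.
    words = re.findall(r"[a-z']+", text.lower())
    return any(word in NEGATION_WORDS or word.endswith("n't") for word in words)


def has_negation_mismatch_sketch(claim, chunk):
    """Hypothetical check: True only when exactly one of the two texts is negated."""
    return _has_negation(claim) != _has_negation(chunk)
```

Two texts that are both negated, or both affirmative, compare as matching under this scheme, so only a one-sided negation is flagged.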
True if one text contains negated tokens and the other does not, indicating a polarity mismatch.
strip_markdown
The output includes code_inline, fence, and code_block token content.
The Markdown string to convert.
Plain-text string. Returns the original text unchanged if no extractable tokens are found.