The convert_html_to_tf function converts a directory of .html or .htm files into a Text-Fabric dataset. It supports two modes: a basic mode that produces a flat document → element → word hierarchy, and an advanced mode that promotes semantic elements such as paragraphs, links, and tables to distinct, queryable node types. Both modes store HTML attributes as TF features on every node.
Node Hierarchies
Basic mode (default, advanced=False): all HTML tags become generic element nodes.
Advanced mode (advanced=True): semantic elements get their own node types; all other HTML tags still become generic element nodes.
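The basic document → element → word hierarchy can be pictured with a small standalone sketch. This is not the converter's implementation: it uses Python's standard html.parser to walk one HTML string and record a document node, one generic element node per tag, and one word per whitespace-separated token. The names OutlineParser and outline, the tuple layout, and the way depth is counted are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collect (node_type, label, depth) tuples for a single HTML document."""

    # Void elements have no closing tag, so they must not increase the depth.
    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.nodes = [("document", None, 0)]
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        # Every tag becomes a generic "element" node, as in basic mode.
        self.nodes.append(("element", tag, self.depth + 1))
        if tag not in self.VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in self.VOID and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Each whitespace-separated token becomes one word under the open element.
        for word in data.split():
            self.nodes.append(("word", word, self.depth + 1))


def outline(html):
    """Hypothetical helper: return the node tuples for one HTML string."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


nodes = outline("<div><p>In the <a href='#'>beginning</a></p></div>")
for node_type, label, depth in nodes:
    print("  " * depth + f"{node_type}: {label}")
```

In advanced mode the same walk would label tags such as p, a, and table with their own node types instead of the generic element type; the exact type names are not given here, so the sketch does not guess them.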
Function Signature
Parameters
Directory containing .html or .htm source files. The converter processes all matching files sorted by filename.
Directory where the generated .tf files will be written. Created automatically if it does not exist.
Name assigned to the corpus in the TF metadata. Default: "HTMLCorpus".
Version string embedded in the TF metadata. Default: "1.0".
When advanced is True, the function uses AdvancedHTMLToTFConverter, which promotes p, div, section, article, a, table, tr, and td/th to their own node types and also extracts <title> and <meta> tags as document-level features. Default: False.
Passed via **kwargs. When True (default), text is split on whitespace and each word becomes a separate slot. When False, each contiguous text run is emitted as a single slot.
Passed via **kwargs. When True, whitespace inside text nodes is preserved as-is. When False (default), runs of whitespace are collapsed to a single space before tokenisation.
Returns
Path: the path to the generated TF directory.
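The two **kwargs options described above interact, and a short sketch may make the combinations easier to see. This is an illustration only, not the library's code: the helper split_slots and the keyword names tokenize and preserve_whitespace are placeholders, since the real keyword names are not shown on this page.

```python
import re


def split_slots(text, tokenize=True, preserve_whitespace=False):
    """Return the slot strings for one text run (hypothetical helper).

    `tokenize` and `preserve_whitespace` are placeholder names for the two
    **kwargs options; the real keyword names may differ.
    """
    if not preserve_whitespace:
        # Collapse every run of whitespace to a single space.
        text = re.sub(r"\s+", " ", text)
    if tokenize:
        # One slot per whitespace-separated word.
        return text.split()
    # One slot for the whole contiguous run; whitespace-only runs yield no slot.
    return [text] if text.strip() else []


sample = "In  the\nbeginning"
print(split_slots(sample))
print(split_slots(sample, tokenize=False))
print(split_slots(sample, tokenize=False, preserve_whitespace=True))
```

Under these assumptions, whitespace preservation only changes the result when word splitting is off, because splitting on whitespace discards the whitespace either way.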
Basic Example
Advanced Mode
Use advanced=True when your HTML has meaningful block structure that you want to query as typed nodes.
In advanced mode, the converter also extracts <title> and <meta> tags from the document <head> and stores them as features on the document node (e.g. title, meta_description, meta_author).
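As a rough picture of the <title> and <meta> extraction, the sketch below collects those tags with Python's standard html.parser and returns them as a dictionary of document-level features. It is not AdvancedHTMLToTFConverter itself: the class HeadMetadataParser, the function head_features, and the rule "meta_" plus the meta name are assumptions inferred from the example feature names title, meta_description, and meta_author.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.features = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            # Assumed naming scheme: "meta_" + the meta name, e.g. meta_author.
            key = "meta_" + attrs["name"].lower().replace("-", "_")
            self.features[key] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            # Title text may arrive in several chunks, so append rather than overwrite.
            self.features["title"] = self.features.get("title", "") + data


def head_features(html):
    """Hypothetical helper: return document-level features for one HTML string."""
    parser = HeadMetadataParser()
    parser.feed(html)
    parser.close()
    return parser.features


page = (
    "<html><head><title>Genesis</title>"
    "<meta charset='utf-8'>"
    "<meta name='description' content='Chapter one'>"
    "<meta name='author' content='Anon'>"
    "</head><body><p>Text</p></body></html>"
)
print(head_features(page))
```

Meta tags without a name attribute, such as the charset declaration, are skipped in this sketch; how the real converter treats them is not stated here.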
Features Stored on Nodes
| Feature | Description |
|---|---|
| tag | HTML tag name |
| class | CSS class names |
| id | HTML id attribute |
| href | Link URL |
| src | Source URL |
| alt | Alt text |
| title | title attribute |
| text | Word text (on word nodes) |
| filename | Source HTML filename |
| depth | Nesting depth in HTML |
Attribute names that contain hyphens are stored with underscores instead (e.g. data-verse → data_verse).
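The renaming in the data-verse → data_verse example can be sketched in a few lines. The helper attribute_features is hypothetical, and only the hyphen-to-underscore step is taken from this page; whether the converter normalises attribute names in any other way is not stated.

```python
def attribute_features(attrs):
    """Map HTML attribute names to feature names (hypothetical helper).

    Only the hyphen-to-underscore replacement is documented; values are
    passed through unchanged in this sketch.
    """
    return {name.replace("-", "_"): value for name, value in attrs.items()}


features = attribute_features({"class": "verse", "data-verse": "3", "id": "v3"})
print(features)
```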