The convert_epub_to_tf function converts EPUB files into queryable Text-Fabric datasets. It extracts EPUB metadata, walks the spine item by item, parses each page’s HTML content into semantic nodes, and optionally tokenises the text into individual word slots. The result is a valid TF directory that can be loaded directly by the MCP server or any cfabric-compatible tool.
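The metadata and spine steps described above can be sketched with the standard library alone. This is an illustration of the general EPUB package (OPF) layout, not the converter's actual code: the `read_opf` helper and the `SAMPLE_OPF` document are hypothetical names introduced for this example.

```python
import xml.etree.ElementTree as ET

# XML namespaces used by EPUB package (OPF) documents.
NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, minimal OPF document used only for this illustration.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Sample Book</dc:title>
    <dc:creator>Jane Doe</dc:creator>
    <dc:language>en</dc:language>
  </metadata>
  <manifest>
    <item id="ch1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
    <item id="ch2" href="chapter2.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="ch1"/>
    <itemref idref="ch2"/>
  </spine>
</package>"""


def read_opf(opf_xml: str):
    """Return (metadata dict, list of (item id, href) in spine order)."""
    root = ET.fromstring(opf_xml)

    # Collect whichever Dublin Core fields are present.
    metadata = {}
    for field in ("title", "creator", "publisher", "language", "identifier"):
        element = root.find(f"opf:metadata/dc:{field}", NS)
        if element is not None and element.text:
            metadata[field] = element.text.strip()

    # The manifest maps item ids to files; the spine gives the reading order.
    manifest = {
        item.get("id"): item.get("href")
        for item in root.findall("opf:manifest/opf:item", NS)
    }
    spine = [
        (ref.get("idref"), manifest[ref.get("idref")])
        for ref in root.findall("opf:spine/opf:itemref", NS)
    ]
    return metadata, spine
```

Calling `read_opf(SAMPLE_OPF)` yields the three Dublin Core fields present in the sample and the two chapters in reading order.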
Node Hierarchy Produced
The converter maps every EPUB document into a five-level node hierarchy. Links (<a>) and tables (table → row → cell) are also emitted as distinct node types when the converter encounters them during the HTML walk.
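To make the HTML walk concrete, here is a minimal sketch built on the standard library's `html.parser`. The tag-to-node-type mapping, the `NodeCollector` class, and the `collect_nodes` helper are assumptions made for illustration; the converter's real mapping may differ.

```python
from html.parser import HTMLParser

# Assumed mapping from HTML tags to node types; other tags become "element".
NODE_TYPES = {
    "p": "paragraph",
    "a": "link",
    "table": "table",
    "tr": "row",
    "td": "cell",
    "th": "cell",
}
# Void tags never get a closing tag, so they must not change the depth.
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}
KEPT_ATTRS = ("class", "id", "href", "src", "alt")


class NodeCollector(HTMLParser):
    """Record one dict per opening tag, with its node type and nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        node = {"type": NODE_TYPES.get(tag, "element"), "tag": tag, "depth": self.depth}
        node.update({key: value for key, value in attrs if key in KEPT_ATTRS})
        self.nodes.append(node)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(0, self.depth - 1)


def collect_nodes(html: str):
    """Parse an HTML fragment and return the collected node records."""
    collector = NodeCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes
```

Each record carries the tag, depth, and selected attributes, which mirrors the kind of per-node features listed later on this page.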
Function Signature
Parameters
Path or URL to the source EPUB file. Accepts a local filesystem path or an HTTP(S) URL to a remote EPUB.
Directory where the generated .tf files will be written. Created automatically if it does not exist.
Name assigned to the corpus in otext.tf. Defaults to the EPUB’s Dublin Core title field, or "EPUBCorpus" if no title is present.
Version string embedded in the TF metadata. Default: "1.0".
When True (default), text is split on whitespace and each word becomes a separate slot node. When False, each contiguous text run is emitted as a single slot.
Optional progress callback invoked after each page is processed. Receives (current, total, percent), where percent is a float between 0 and 100.
Returns
Path: the path to the generated TF directory. Pass this directly to corpus_manager.load() or the cf-mcp CLI.
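The two tokenisation modes described in the parameters can be illustrated with a small sketch. The `text_to_slots` function and its `tokenise` argument are hypothetical names, and stripping the surrounding whitespace in the untokenised case is an assumption rather than documented behaviour.

```python
def text_to_slots(text: str, tokenise: bool = True) -> list[str]:
    """Turn one contiguous text run into a list of slot strings."""
    if tokenise:
        # Whitespace split: every word becomes its own slot.
        return text.split()
    # Untokenised: the whole run is a single slot (empty runs yield no slot).
    stripped = text.strip()
    return [stripped] if stripped else []
```

With tokenisation enabled, "In the beginning" produces three slots; with it disabled, the same run produces one.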
Full Example
Load and Query the Result
Features Stored on Nodes
Every node in the output dataset carries one or more of the following features:

| Feature | Node Types | Description |
|---|---|---|
| title | book, chapter | Title or chapter name |
| creator | book | Author/Creator |
| publisher | book | Publisher |
| language | book | Language code |
| identifier | book | ISBN or other identifier |
| chapter_index | chapter | 0-based chapter index |
| chapter_id | chapter | ID from EPUB spine |
| chapter_name | chapter | Filename inside EPUB |
| tag | element, paragraph, link, table, row, cell | HTML tag name |
| class | element, paragraph, link, cell | CSS class names |
| id | element, paragraph | HTML id attribute |
| href | link | Link URL |
| src | element | Source URL (e.g. <img>) |
| alt | element | Alt text (e.g. <img>) |
| depth | element, paragraph, link, table | Nesting depth in HTML |
| text | word | The word text |
Class-Based Usage
For more control, for example to inject a live progress bar, use EPUBToTFConverter directly.
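As a sketch of what a progress-bar callback could look like, the example below builds a function with the (current, total, percent) shape described in the parameters. It does not call EPUBToTFConverter; `make_bar_callback` and `simulate_pages` are hypothetical helpers, and the loop only stands in for a converter invoking the callback after each page.

```python
def make_bar_callback(width: int = 20, sink=print):
    """Build a callback that renders a text progress bar for each update."""

    def on_progress(current: int, total: int, percent: float) -> None:
        filled = int(width * percent / 100)
        bar = "#" * filled + "-" * (width - filled)
        sink(f"[{bar}] {current}/{total} pages ({percent:.1f}%)")

    return on_progress


def simulate_pages(total: int, callback) -> None:
    """Stand-in for a converter: report progress after each 'page'."""
    for current in range(1, total + 1):
        callback(current, total, 100.0 * current / total)


if __name__ == "__main__":
    simulate_pages(4, make_bar_callback(width=8))
```

Passing a different `sink` (for example a logger method or a list's `append`) redirects the rendered lines without changing the callback's signature.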
EPUB to TEI XML (Alternative Output)
If you need TEI XML instead of a Text-Fabric dataset, for example to feed into a separate XML pipeline, use the convert_epub_to_tei function from the XML converter module.
The output contains a <teiHeader> holding the Dublin Core metadata and a <text><body> holding the full document content mapped to TEI elements (<p>, <div>, <head>, <ref>, <table>, etc.).
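The overall shape of such a TEI document can be sketched with `xml.etree.ElementTree`. This is a simplified illustration, not the output of convert_epub_to_tei: the `build_tei` helper, the choice of header fields, and the placeholder source description are all assumptions.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def build_tei(metadata: dict, paragraphs: list[str]) -> ET.Element:
    """Build a minimal TEI tree: a header with metadata and a body of <p> elements."""
    # Serialise the TEI namespace as the default namespace (no prefix).
    ET.register_namespace("", TEI_NS)

    def q(name: str) -> str:
        return f"{{{TEI_NS}}}{name}"

    tei = ET.Element(q("TEI"))

    # Header: title and author go in titleStmt, publisher in publicationStmt.
    header = ET.SubElement(tei, q("teiHeader"))
    file_desc = ET.SubElement(header, q("fileDesc"))
    title_stmt = ET.SubElement(file_desc, q("titleStmt"))
    ET.SubElement(title_stmt, q("title")).text = metadata.get("title", "")
    ET.SubElement(title_stmt, q("author")).text = metadata.get("creator", "")
    pub_stmt = ET.SubElement(file_desc, q("publicationStmt"))
    ET.SubElement(pub_stmt, q("publisher")).text = metadata.get("publisher", "")
    source_desc = ET.SubElement(file_desc, q("sourceDesc"))
    ET.SubElement(source_desc, q("p")).text = "Converted from EPUB."

    # Body: one <p> per paragraph of content.
    text = ET.SubElement(tei, q("text"))
    body = ET.SubElement(text, q("body"))
    for paragraph in paragraphs:
        ET.SubElement(body, q("p")).text = paragraph
    return tei
```

Serialising the result with `ET.tostring(tree, encoding="unicode")` gives a single TEI root element containing the header and body described above.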