The exegia.utils module provides four converter functions that turn external documents into queryable Text-Fabric corpora, TEI XML, or distributable bundles. Each function handles one conversion path: an EPUB ebook to Text-Fabric, a directory of HTML files to Text-Fabric, an EPUB ebook to TEI XML, or a Text-Fabric dataset to an .exg bundle. Three of them return the output Path; convert_epub_to_tei returns the XML as a string.

convert_epub_to_tf

from exegia.utils.convert_epub_to_tf import convert_epub_to_tf

def convert_epub_to_tf(
    epub_path: str | Path,
    output_dir: str | Path,
    corpus_name: str | None = None,
    version: str = "1.0",
    tokenize: bool = True,
    on_progress: Callable[[int, int, float], None] | None = None,
) -> Path
Converts an EPUB file to a Text-Fabric (TF) dataset. The converter extracts Dublin Core metadata, iterates over every spine item (one page or chapter each), parses the HTML content with BeautifulSoup, and writes .tf feature files through the tf.convert.walker API.
Node hierarchy produced:
book
  └─ chapter
       └─ element | paragraph | link
            └─ word  (slot)
epub_path
str | Path
required
File path or URL to the source EPUB file.
output_dir
str | Path
required
Directory where the TF feature files will be written. Created automatically if it does not exist.
corpus_name
str | None
Name for the corpus. Defaults to the title field from the EPUB Dublin Core metadata, falling back to "EPUBCorpus".
version
str
Version string embedded in the TF metadata. Defaults to "1.0".
tokenize
bool
When True (default), text content is split on whitespace into individual word slots. When False, each text run is stored as a single slot.
on_progress
Callable[[int, int, float], None] | None
Optional progress callback invoked during page extraction. Receives (current: int, total: int, percent: float).
returns
Path
Path to the generated TF directory.
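
The on_progress callback can be wired up roughly as follows. This is a minimal sketch based only on the signature above: format_progress and convert_with_progress are hypothetical helper names, and the exegia import is deferred so the formatting helper can be used on its own.

```python
from pathlib import Path


def format_progress(current: int, total: int, percent: float) -> str:
    """Render one progress update as a single status line."""
    return f"page {current}/{total} ({percent:.1f}%)"


def convert_with_progress(epub_path: str, output_dir: str) -> Path:
    """Convert an EPUB and print a status line for each progress update."""
    # Imported lazily so this module still loads where exegia is not installed.
    from exegia.utils.convert_epub_to_tf import convert_epub_to_tf

    return convert_epub_to_tf(
        epub_path,
        output_dir,
        tokenize=True,
        # The callback receives (current, total, percent), per the parameter list.
        on_progress=lambda cur, tot, pct: print(format_progress(cur, tot, pct)),
    )
```

Keeping the formatting in its own function makes it easy to swap print for a logger or a progress bar without touching the conversion call.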

EPUBToTFConverter

For more control, use the underlying class directly instead of the convenience function.
from exegia.utils.convert_epub_to_tf import EPUBToTFConverter

converter = EPUBToTFConverter(
    epub_path="book.epub",
    output_dir="tf_output/",
    corpus_name="MyBook",
    version="1.0",
    tokenize=True,
    on_progress=lambda cur, tot, pct: print(f"{pct:.1f}%")
)
tf_path = converter.convert()
The constructor accepts the same parameters as convert_epub_to_tf. After calling convert(), two public attributes are populated:
metadata
dict
Dublin Core metadata extracted from the EPUB (keys include title, creator, publisher, language, identifier). Each value is a list of strings.
pages
list[dict]
List of page dictionaries extracted from the EPUB spine. Each dict contains index, id, name, and html keys.
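
Because each metadata value is a list of strings, reading a single field takes one extra step. The sketch below assumes only the shape described above; first_value is a hypothetical helper, and sample_metadata is hand-written illustrative data, not real converter output.

```python
def first_value(metadata: dict, key: str, default: str = "") -> str:
    """Return the first string stored under key, or default if absent or empty."""
    values = metadata.get(key) or []
    return values[0] if values else default


# Illustrative data shaped like the metadata attribute described above.
sample_metadata = {
    "title": ["An Example Book"],
    "creator": ["First Author", "Second Author"],
    "language": [],
}

# Single-valued fields: take the first entry, with a fallback.
title = first_value(sample_metadata, "title", default="EPUBCorpus")
# Multi-valued fields: join every entry.
authors = ", ".join(sample_metadata.get("creator", []))
```

After converter.convert() has run, converter.metadata could be passed in place of sample_metadata.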

convert_html_to_tf

from exegia.utils.convert_html_to_tf import convert_html_to_tf

def convert_html_to_tf(
    input_dir: str | Path,
    output_dir: str | Path,
    corpus_name: str = "HTMLCorpus",
    version: str = "1.0",
    advanced: bool = False,
    **kwargs,
) -> Path
Converts a directory of .html / .htm files to a Text-Fabric dataset. Files are sorted and processed in alphabetical order, and each file becomes a document root node.
Node hierarchy in standard mode (advanced=False):
document
  └─ element
       └─ word  (slot)
Node hierarchy in advanced mode (advanced=True):
document
  ├─ paragraph           (from p, div, section, article)
  │    └─ word  (slot)
  ├─ link                (from a)
  │    └─ word  (slot)
  ├─ table
  │    └─ row
  │         └─ cell
  │              └─ word  (slot)
  └─ element             (all other tags)
       └─ word  (slot)
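
The advanced hierarchy above amounts to a mapping from HTML tag names to node types. The sketch below is one way to express that mapping, not the library's implementation: node_type_for is a hypothetical name, and the tr/td/th entries are an assumption, since the hierarchy names row and cell nodes without listing their source tags.

```python
# Tags that the advanced hierarchy maps to paragraph nodes.
PARAGRAPH_TAGS = {"p", "div", "section", "article"}

# Assumption: rows and cells come from the usual HTML table tags.
TABLE_TAGS = {"table": "table", "tr": "row", "td": "cell", "th": "cell"}


def node_type_for(tag: str) -> str:
    """Return the node type an HTML tag would map to in advanced mode."""
    tag = tag.lower()
    if tag in PARAGRAPH_TAGS:
        return "paragraph"
    if tag == "a":
        return "link"
    if tag in TABLE_TAGS:
        return TABLE_TAGS[tag]
    # Every other tag falls through to the generic element node.
    return "element"
```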
input_dir
str | Path
required
Directory containing the HTML files to convert.
output_dir
str | Path
required
Directory where TF feature files will be written.
corpus_name
str
Name for the corpus. Defaults to "HTMLCorpus".
version
str
Version string embedded in TF metadata. Defaults to "1.0".
advanced
bool
When False (default), uses HTMLToTFConverter with a flat document → element → word hierarchy. When True, uses AdvancedHTMLToTFConverter, which produces semantic nodes for paragraphs, links, and tables and extracts <head> metadata.
**kwargs
any
Additional keyword arguments forwarded to the converter. Supported keys: tokenize (bool), preserve_whitespace (bool).
returns
Path
Path to the generated TF directory.
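
Since files are processed in alphabetical order, it can help to preview that order before converting. The following is a sketch under assumptions: list_html_inputs is a hypothetical helper, and whether the converter recurses into subdirectories or matches suffixes case-insensitively is not stated above, so this version looks only at the top level and ignores suffix case.

```python
from __future__ import annotations

from pathlib import Path

HTML_SUFFIXES = {".html", ".htm"}


def list_html_inputs(input_dir: str | Path) -> list[Path]:
    """Return the HTML files directly inside input_dir, sorted by file name."""
    directory = Path(input_dir)
    return sorted(
        (
            path
            for path in directory.iterdir()
            if path.is_file() and path.suffix.lower() in HTML_SUFFIXES
        ),
        key=lambda path: path.name,
    )
```

Zero-padded file names such as 01-intro.html and 02-body.html keep the alphabetical order aligned with the intended reading order.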

convert_epub_to_tei

from exegia.utils.convert_epub_to_xml import convert_epub_to_tei

def convert_epub_to_tei(
    epub_path: str,
    output_path: str | None = None,
) -> str
Converts an EPUB file to a TEI P5 XML string. The function reads the EPUB with ebooklib, builds the XML tree with lxml, and serialises it to a UTF-8 string with an XML declaration. If output_path is given, the result is also written to that file.
TEI document structure produced:
<TEI>
  <teiHeader>
    <fileDesc>      ← title, authors, publisher, date, identifier
    <encodingDesc>  ← conversion note
    <profileDesc>   ← abstract and subject keywords (when present)
  </teiHeader>
  <text>
    <body>
      <div>         ← one per EPUB document item
        <head>      ← first heading extracted from the document
        <p> / <div> / <quote> / <list> / <table> ...
epub_path
str
required
File path to the source EPUB file.
output_path
str | None
Optional file path where the TEI XML will be saved. When omitted, the XML is returned only as a string.
returns
str
The complete TEI XML document as a string, including the <?xml?> declaration.
Raises
| Exception | Condition |
| --- | --- |
| FileNotFoundError | epub_path does not exist |
| ImportError | ebooklib is not installed |
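
The returned string can be inspected with the standard library. The sketch below lists the heading of each top-level div in the body; chapter_heads and local_name are hypothetical helpers, SAMPLE_TEI is a hand-written stand-in for real output, and the TEI namespace on the root element is an assumption, which is why tags are compared by local name only.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def chapter_heads(tei_xml: str) -> list[str]:
    """Return the <head> text of each top-level <div> inside <body>."""
    # Parse bytes so the encoding named in the XML declaration is honoured.
    root = ET.fromstring(tei_xml.encode("utf-8"))
    body = next((el for el in root.iter() if local_name(el.tag) == "body"), None)
    if body is None:
        return []
    heads = []
    for div in body:
        if local_name(div.tag) != "div":
            continue
        head = next((c for c in div if local_name(c.tag) == "head"), None)
        heads.append("".join(head.itertext()).strip() if head is not None else "")
    return heads


# Hand-written sample in the shape shown above, not real converter output.
SAMPLE_TEI = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader/><text><body>'
    "<div><head>One</head><p>First.</p></div>"
    "<div><head>Two</head><p>Second.</p></div>"
    "</body></text></TEI>"
)
```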

convert_to_exg

from exegia.utils.convert_to_exg import convert_to_exg
from pathlib import Path

def convert_to_exg(
    dataset_dir: Path,
    destination: Path,
) -> Path
Packages an existing Text-Fabric dataset directory into a portable .exg bundle: a zip archive that contains a manifest, a file index, an empty git repository stub for future versioning, and the original TF data compressed as corpus.exgc.
Bundle layout:
{name}.exg                ← final deliverable (zip)
  ├── manifest.json       ← corpus metadata parsed from otext.tf / otype.tf
  ├── index.json          ← list of all .tf files with sizes
  ├── .git                ← empty git repository stub
  └── corpus.exgc         ← the original .tf dataset (zip)
dataset_dir
Path
required
Path to the folder containing .tf files. Must contain both otext.tf and otype.tf.
destination
Path
required
Directory where the final .exg file will be saved. Created automatically if it does not exist.
returns
Path
Path to the produced .exg file (named after dataset_dir.name).
Raises
| Exception | Condition |
| --- | --- |
| FileNotFoundError | dataset_dir does not exist or is not a directory |
| ValueError | otext.tf or otype.tf is missing from dataset_dir |
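
Because the bundle is a zip archive, it can be inspected with the standard library. This is a minimal sketch, assuming manifest.json sits at the archive root as the layout above suggests; inspect_bundle is a hypothetical helper, and no assumption is made about which keys the manifest contains.

```python
from __future__ import annotations

import json
import zipfile
from pathlib import Path


def inspect_bundle(exg_path: str | Path) -> tuple[list[str], dict]:
    """Return the archive's entry names and its parsed manifest.json."""
    with zipfile.ZipFile(exg_path) as bundle:
        names = bundle.namelist()
        # Assumption: the manifest is UTF-8 JSON stored at the archive root.
        manifest = json.loads(bundle.read("manifest.json").decode("utf-8"))
    return names, manifest
```

Calling inspect_bundle on the Path returned by convert_to_exg would give a quick check that the expected entries are present.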

Function summary

| Function | Input | Output | Format |
| --- | --- | --- | --- |
| convert_epub_to_tf | EPUB file | TF directory | Text-Fabric |
| convert_html_to_tf | HTML directory | TF directory | Text-Fabric |
| convert_epub_to_tei | EPUB file | TEI XML string / file | TEI P5 XML |
| convert_to_exg | TF directory | .exg archive | Distributable bundle |
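
The summary shows that the TF directory produced by one converter is the input of convert_to_exg, so the two can be chained. The sketch below takes the converters as arguments so it runs without exegia installed; epub_to_bundle is a hypothetical helper, and it assumes the returned TF directory directly contains otext.tf and otype.tf, which is not confirmed above.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable


def epub_to_bundle(
    epub_path: Path,
    work_dir: Path,
    destination: Path,
    to_tf: Callable[[Path, Path], Path],
    to_exg: Callable[[Path, Path], Path],
) -> Path:
    """Convert an EPUB to a TF dataset, then package that dataset."""
    # Step 1: EPUB file -> TF directory.
    tf_dir = to_tf(epub_path, work_dir)
    # Step 2: TF directory -> bundle written under destination.
    return to_exg(Path(tf_dir), destination)
```

With exegia installed, convert_epub_to_tf and convert_to_exg could be passed as to_tf and to_exg, since both accept their first two parameters positionally.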
