The exegia.utils module provides four converter functions for importing external documents as queryable corpora or structured XML. Whether you are converting EPUB ebooks, processing directories of HTML files, or building a distributable corpus bundle, each function handles one conversion path and returns either an output Path or a string.
convert_epub_to_tf
Converts an EPUB file to Text-Fabric .tf feature files using the tf.convert.walker API.
Node hierarchy produced:
File path or URL to the source EPUB file.
Directory where the TF feature files will be written. Created automatically if it does not exist.
Name for the corpus. Defaults to the title field from the EPUB Dublin Core metadata, falling back to "EPUBCorpus".
Version string embedded in the TF metadata. Defaults to "1.0".
When True (default), text content is split on whitespace into individual word slots. When False, each text run is stored as a single slot.
Optional progress callback invoked during page extraction. Receives (current: int, total: int, percent: float).
Returns the path to the generated TF directory.
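The progress callback described above can be any callable that accepts the three documented arguments. The following is a minimal sketch of such a callback, exercised with a simulated loop rather than a real conversion. The helper name make_progress_logger is hypothetical, and the sketch assumes percent is on a 0 to 100 scale, which the documentation does not state.

```python
from typing import Callable, List


def make_progress_logger(lines: List[str]) -> Callable[[int, int, float], None]:
    """Return a callback with the documented (current, total, percent) shape."""

    def on_progress(current: int, total: int, percent: float) -> None:
        # Collect a formatted line instead of printing, so the caller
        # decides where progress output goes (console, log file, UI).
        lines.append(f"page {current}/{total} ({percent:.1f}%)")

    return on_progress


log: List[str] = []
callback = make_progress_logger(log)

# Simulate what a converter might do while extracting three pages.
# Assumption: percent is reported on a 0-100 scale.
total_pages = 3
for page in range(1, total_pages + 1):
    callback(page, total_pages, 100.0 * page / total_pages)

print("\n".join(log))
```

Keeping the callback free of side effects other than recording progress makes it easy to reuse the same function for console output, logging, or a progress bar.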
EPUBToTFConverter
For more control, use the underlying class directly instead of the convenience function convert_epub_to_tf. After calling convert(), two public attributes are populated:
Dublin Core metadata extracted from the EPUB (keys include title, creator, publisher, language, identifier). Each value is a list of strings.
List of page dictionaries extracted from the EPUB spine. Each dict contains index, id, name, and html keys.
convert_html_to_tf
Converts a directory of .html / .htm files to a Text-Fabric dataset. Files are sorted and processed in alphabetical order. Each file becomes a document root node.
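Because files are processed alphabetically, numeric file names may not come out in reading order. The sketch below previews the order using only the standard library. The helper list_html_inputs is hypothetical and is not part of exegia.utils; it assumes a non-recursive scan and a plain sort on file names, which may differ from what the converter does internally.

```python
import tempfile
from pathlib import Path
from typing import List


def list_html_inputs(input_dir: Path) -> List[Path]:
    """List .html/.htm files in plain alphabetical order by file name."""
    return sorted(
        (
            p
            for p in input_dir.iterdir()
            if p.is_file() and p.suffix.lower() in {".html", ".htm"}
        ),
        key=lambda p: p.name,
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # notes.txt is ignored; the two chapter files show the ordering pitfall.
    for name in ["chapter2.html", "chapter10.html", "intro.htm", "notes.txt"]:
        (root / name).write_text("<p>placeholder</p>", encoding="utf-8")
    ordered_names = [p.name for p in list_html_inputs(root)]

print(ordered_names)
```

In this sketch chapter10.html sorts before chapter2.html. Zero-padding the numbers (chapter02.html, chapter10.html) keeps alphabetical order and reading order aligned.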
Node hierarchy, standard mode (advanced=False):
Node hierarchy, advanced mode (advanced=True):
Directory containing the HTML files to convert.
Directory where TF feature files will be written.
Name for the corpus. Defaults to "HTMLCorpus".
Version string embedded in TF metadata. Defaults to "1.0".
When False (default), uses HTMLToTFConverter with a flat document → element → word hierarchy. When True, uses AdvancedHTMLToTFConverter, which produces semantic nodes for paragraphs, links, and tables and extracts <head> metadata.
Additional keyword arguments forwarded to the converter. Supported keys: tokenize (bool), preserve_whitespace (bool).
Returns the path to the generated TF directory.
convert_epub_to_tei
Converts an EPUB file to a TEI XML document. Uses ebooklib to read the EPUB and lxml to build the XML tree, then serialises the result to a UTF-8 string with an XML declaration. Optionally writes the result to a file.
TEI document structure produced:
File path to the source EPUB file.
Optional file path where the TEI XML will be saved. When omitted, the XML is returned only as a string.
Returns the complete TEI XML document as a string, including the <?xml?> declaration.
| Exception | Condition |
|---|---|
| FileNotFoundError | epub_path does not exist |
| ImportError | ebooklib is not installed |
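Since the returned string carries an XML declaration, it is safest to encode it to bytes before parsing; lxml in particular rejects str input that contains an encoding declaration. The following is a minimal sketch of reading such a string with the standard library. The tei_xml value is a hand-written stand-in, not real converter output, and it assumes the document uses the standard TEI namespace and a teiHeader/text/body layout.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace; assumed (not confirmed) to be used by the converter.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical stand-in for the returned string. It is deliberately minimal
# and not schema-valid; real output will be larger.
tei_xml = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    "<teiHeader><fileDesc><titleStmt><title>Sample Book</title>"
    "</titleStmt></fileDesc></teiHeader>"
    "<text><body><div><p>First paragraph.</p><p>Second paragraph.</p>"
    "</div></body></text>"
    "</TEI>"
)

# Parse from bytes so the encoding declaration is handled by the parser.
root = ET.fromstring(tei_xml.encode("utf-8"))

title = root.findtext(".//tei:titleStmt/tei:title", namespaces=TEI_NS)
paragraphs = [p.text for p in root.iterfind(".//tei:p", TEI_NS)]

print(title)
print(paragraphs)
```

Every element lookup needs the namespace mapping; a bare path such as .//title would find nothing in a namespaced TEI document.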
convert_to_exg
Packages a Text-Fabric dataset directory into a distributable .exg bundle: a zip archive that includes a manifest, a file index, an empty git repository stub for future versioning, and the original TF data compressed as corpus.exgc.
Bundle layout:
Path to the folder containing .tf files. Must contain both otext.tf and otype.tf.
Directory where the final .exg file will be saved. Created automatically if it does not exist.
Returns the path to the produced .exg file (named after dataset_dir.name).
| Exception | Condition |
|---|---|
| FileNotFoundError | dataset_dir does not exist or is not a directory |
| ValueError | otext.tf or otype.tf is missing from dataset_dir |
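The documented preconditions can be checked up front, for example before queuing a long batch of conversions. The sketch below mirrors the two error conditions in the table using only the standard library. The helper check_tf_dataset is hypothetical and is not the library's implementation, and the expected file name of the form dataset_dir.name plus ".exg" is an assumption based on the return description.

```python
import tempfile
from pathlib import Path

REQUIRED_FILES = ("otext.tf", "otype.tf")


def check_tf_dataset(dataset_dir: Path, output_dir: Path) -> Path:
    """Validate a TF folder and return the bundle path we expect to be produced."""
    if not dataset_dir.is_dir():
        raise FileNotFoundError(f"{dataset_dir} does not exist or is not a directory")
    missing = [name for name in REQUIRED_FILES if not (dataset_dir / name).is_file()]
    if missing:
        raise ValueError(f"missing required TF files: {', '.join(missing)}")
    # Assumption: the bundle is named "<dataset_dir.name>.exg".
    return output_dir / f"{dataset_dir.name}.exg"


first_error = ""
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "my_corpus"
    dataset.mkdir()
    (dataset / "otype.tf").write_text("@node\n", encoding="utf-8")

    # Only otype.tf exists so far, so the check should fail.
    try:
        check_tf_dataset(dataset, Path(tmp) / "dist")
    except ValueError as exc:
        first_error = str(exc)

    # Add the second required file and the check passes.
    (dataset / "otext.tf").write_text("@config\n", encoding="utf-8")
    expected_name = check_tf_dataset(dataset, Path(tmp) / "dist").name

print(first_error)
print(expected_name)
```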
Parameter summary
| Function | Input | Output | Format |
|---|---|---|---|
| convert_epub_to_tf | EPUB file | TF directory | Text-Fabric |
| convert_html_to_tf | HTML directory | TF directory | Text-Fabric |
| convert_epub_to_tei | EPUB file | TEI XML string / file | TEI P5 XML |
| convert_to_exg | TF directory | .exg archive | Distributable bundle |
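Because convert_epub_to_tf returns a TF directory and convert_to_exg consumes one, the two can be chained. The following is a rough sketch of that chain, not a confirmed usage pattern. The wrapper epub_to_bundle and its to_tf / to_exg parameters are hypothetical. The sketch passes arguments positionally in the order the parameters are described on this page (source first, output directory second), and the accepted argument types (str versus Path) are assumptions, so check the actual signatures before relying on it.

```python
from pathlib import Path
from typing import Callable, Optional


def epub_to_bundle(
    epub_path: Path,
    work_dir: Path,
    to_tf: Optional[Callable[..., Path]] = None,
    to_exg: Optional[Callable[..., Path]] = None,
) -> Path:
    """Chain EPUB -> TF directory -> .exg bundle."""
    if to_tf is None or to_exg is None:
        # Imported lazily so the sketch can be read and exercised with
        # stand-in converters when the package is not installed.
        from exegia.utils import convert_epub_to_tf, convert_to_exg

        to_tf = to_tf or convert_epub_to_tf
        to_exg = to_exg or convert_to_exg

    # Assumption: source comes first and the output directory second.
    tf_dir = to_tf(str(epub_path), str(work_dir / "tf"))
    return Path(to_exg(Path(tf_dir), work_dir / "dist"))


# Example call (requires exegia and a real EPUB file):
# bundle = epub_to_bundle(Path("book.epub"), Path("build"))
```

Making the converters injectable keeps the wrapper testable in isolation; in normal use the two optional parameters are simply omitted.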