Converter Functions: EPUB, HTML, and TEI XML Output

The exegia.utils module provides four converter functions. Three import external documents (EPUB ebooks or directories of HTML files) as queryable Text-Fabric corpora or as TEI XML, and one packages an existing Text-Fabric dataset as a distributable corpus bundle. Each function handles one conversion path and returns either an output Path or, for convert_epub_to_tei, a string.

convert_epub_to_tf

from exegia.utils.convert_epub_to_tf import convert_epub_to_tf

def convert_epub_to_tf(
    epub_path: str | Path,
    output_dir: str | Path,
    corpus_name: str | None = None,
    version: str = "1.0",
    tokenize: bool = True,
    on_progress: Callable[[int, int, float], None] | None = None,
) -> Path

Converts an EPUB file to a Text-Fabric (TF) dataset. The converter extracts Dublin Core metadata, iterates over every spine item (page or chapter), parses the HTML content with BeautifulSoup, and writes .tf feature files through the tf.convert.walker API. Node hierarchy produced:

book
  └─ chapter
       └─ element | paragraph | link
            └─ word  (slot)
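The word slots at the bottom of this hierarchy depend on the tokenize parameter documented in the parameter list for this function. As a rough illustration of the two modes, the sketch below splits a text run on whitespace or keeps it whole. text_to_slots is a hypothetical helper written for this page, not the converter's own code, and the real implementation may treat punctuation or empty runs differently.

```python
def text_to_slots(text: str, tokenize: bool = True) -> list:
    """Illustrate how one text run could map to slot values.

    Assumption: tokenize=True means a plain whitespace split, as the
    parameter description states; anything beyond that is a guess.
    """
    if tokenize:
        # str.split() with no arguments collapses runs of whitespace.
        return text.split()
    # One slot per text run; an empty run produces no slot in this sketch.
    return [text] if text else []


run = "In the  beginning"
print(text_to_slots(run))
print(text_to_slots(run, tokenize=False))
```

With tokenize=True the run yields three slots; with tokenize=False it yields one slot that keeps the original spacing.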

epub_path

str | Path

required

File path or URL to the source EPUB file.

output_dir

str | Path

required

Directory where the TF feature files will be written. Created automatically if it does not exist.

corpus_name

str | None

Name for the corpus. Defaults to the title field from the EPUB Dublin Core metadata, falling back to "EPUBCorpus".

version

str

Version string embedded in the TF metadata. Defaults to "1.0".

tokenize

bool

When True (default), text content is split on whitespace into individual word slots. When False, each text run is stored as a single slot.

on_progress

Callable[[int, int, float], None] | None

Optional progress callback invoked during page extraction. Receives (current: int, total: int, percent: float).

returns

Path

Path to the generated TF directory.
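As a minimal sketch of the on_progress hook, the callback below records one line per reported page. The helper names make_progress_logger and convert_book are hypothetical, and the sketch assumes percent is reported on a 0 to 100 scale, which the signature alone does not confirm.

```python
from pathlib import Path
from typing import Callable, List


def make_progress_logger(lines: List[str]) -> Callable[[int, int, float], None]:
    """Build a callback with the documented (current, total, percent) signature."""

    def on_progress(current: int, total: int, percent: float) -> None:
        # Record one line per call; swap in print() or a logger as needed.
        lines.append(f"page {current}/{total} ({percent:.1f}%)")

    return on_progress


def convert_book(epub_path: str, output_dir: str, lines: List[str]) -> Path:
    """Hypothetical wrapper that runs the converter with the logging callback."""
    # Imported lazily so the sketch loads even where exegia is not installed.
    from exegia.utils.convert_epub_to_tf import convert_epub_to_tf

    return convert_epub_to_tf(
        epub_path,
        output_dir,
        corpus_name="MyBook",
        on_progress=make_progress_logger(lines),
    )
```

Collecting the messages in a list rather than printing them keeps the callback easy to test and lets the caller decide how to display progress.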

EPUBToTFConverter

For more control, use the underlying class directly instead of the convenience function.

from exegia.utils.convert_epub_to_tf import EPUBToTFConverter

converter = EPUBToTFConverter(
    epub_path="book.epub",
    output_dir="tf_output/",
    corpus_name="MyBook",
    version="1.0",
    tokenize=True,
    on_progress=lambda cur, tot, pct: print(f"{pct:.1f}%")
)
tf_path = converter.convert()

The constructor accepts the same parameters as convert_epub_to_tf. After calling convert(), two public attributes are populated:

metadata

dict

Dublin Core metadata extracted from the EPUB (keys include title, creator, publisher, language, identifier). Each value is a list of strings.

pages

list[dict]

List of page dictionaries extracted from the EPUB spine. Each dict contains index, id, name, and html keys.
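Because each metadata value is a list of strings, reading a single title or author takes one extra step. The sketch below shows one way to do that. first_value and ordered_page_names are hypothetical helpers, and the sample dictionaries are made-up data in the documented shape, not output from a real EPUB.

```python
def first_value(metadata: dict, key: str, default: str = "") -> str:
    """Return the first string stored under key, or default if absent or empty."""
    values = metadata.get(key) or []
    return values[0] if values else default


def ordered_page_names(pages: list) -> list:
    """Return the page names sorted by each page's index key."""
    return [page["name"] for page in sorted(pages, key=lambda page: page["index"])]


# Made-up sample data in the documented shape (not real converter output).
metadata = {"title": ["My Book"], "creator": ["A. Author", "B. Author"], "publisher": []}
pages = [
    {"index": 1, "id": "ch2", "name": "chapter2.xhtml", "html": "<p>...</p>"},
    {"index": 0, "id": "ch1", "name": "chapter1.xhtml", "html": "<p>...</p>"},
]

print(first_value(metadata, "title"))
print(first_value(metadata, "publisher", "unknown"))
print(ordered_page_names(pages))
```

In real use, the two dictionaries would come from converter.metadata and converter.pages after convert() has returned.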

convert_html_to_tf

from exegia.utils.convert_html_to_tf import convert_html_to_tf

def convert_html_to_tf(
    input_dir: str | Path,
    output_dir: str | Path,
    corpus_name: str = "HTMLCorpus",
    version: str = "1.0",
    advanced: bool = False,
    **kwargs,
) -> Path

Converts a directory of .html / .htm files to a Text-Fabric dataset. Files are sorted alphabetically and processed in that order, and each file becomes a document root node. Node hierarchy in standard mode (advanced=False):

document
  └─ element
       └─ word  (slot)

Node hierarchy in advanced mode (advanced=True):

document
  ├─ paragraph           (from p, div, section, article)
  │    └─ word  (slot)
  ├─ link                (from a)
  │    └─ word  (slot)
  ├─ table
  │    └─ row
  │         └─ cell
  │              └─ word  (slot)
  └─ element             (all other tags)
       └─ word  (slot)
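The two hierarchies above amount to a lookup from HTML tag to node type. The sketch below restates that mapping as a plain dictionary. It is an illustration only: node_type_for_tag is a hypothetical helper, and the assignment of tr to row and of td and th to cell is an assumption, since the diagram names the node types but not the tags behind them.

```python
# Tag-to-node-type mapping read off the advanced-mode diagram.
ADVANCED_NODE_TYPES = {
    "p": "paragraph",
    "div": "paragraph",
    "section": "paragraph",
    "article": "paragraph",
    "a": "link",
    "table": "table",
    "tr": "row",    # assumption: the diagram does not name the row tag
    "td": "cell",   # assumption: the diagram does not name the cell tags
    "th": "cell",   # assumption
}


def node_type_for_tag(tag: str, advanced: bool = False) -> str:
    """Return the node type a tag would map to in this sketch."""
    if not advanced:
        # Standard mode is described as flat: every tag becomes an element.
        return "element"
    return ADVANCED_NODE_TYPES.get(tag.lower(), "element")


for tag in ("p", "a", "table", "span"):
    print(tag, node_type_for_tag(tag), node_type_for_tag(tag, advanced=True))
```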

input_dir

str | Path

required

Directory containing the HTML files to convert.

output_dir

str | Path

required

Directory where TF feature files will be written.

corpus_name

str

Name for the corpus. Defaults to "HTMLCorpus".

version

str

Version string embedded in TF metadata. Defaults to "1.0".

advanced

bool

When False (default), uses HTMLToTFConverter with a flat document → element → word hierarchy. When True, uses AdvancedHTMLToTFConverter, which produces semantic nodes for paragraphs, links, and tables and extracts <head> metadata.

**kwargs

any

Additional keyword arguments forwarded to the converter. Supported keys: tokenize (bool), preserve_whitespace (bool).

returns

Path

Path to the generated TF directory.
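Since the converter processes files in alphabetical order, it can help to preview that order before converting. The sketch below lists the candidate files with pathlib and then calls the converter. list_html_files and convert_site are hypothetical helpers, and the sketch assumes a non-recursive scan, case-insensitive suffix matching, and sorting by file name. The converter's own discovery rules may differ.

```python
from pathlib import Path
from typing import List

HTML_SUFFIXES = {".html", ".htm"}


def list_html_files(input_dir: Path) -> List[Path]:
    """Return the .html/.htm files directly inside input_dir, sorted by name."""
    candidates = (p for p in input_dir.iterdir() if p.is_file())
    return sorted(
        (p for p in candidates if p.suffix.lower() in HTML_SUFFIXES),
        key=lambda p: p.name,
    )


def convert_site(input_dir: Path, output_dir: Path) -> Path:
    """Hypothetical wrapper: fail early on an empty input, then convert."""
    if not list_html_files(input_dir):
        raise ValueError(f"no .html or .htm files found in {input_dir}")
    # Imported lazily so the sketch loads even where exegia is not installed.
    from exegia.utils.convert_html_to_tf import convert_html_to_tf

    return convert_html_to_tf(
        input_dir,
        output_dir,
        corpus_name="MySite",
        advanced=True,
        tokenize=True,  # forwarded to the converter through **kwargs
    )
```

Zero-padding file names (for example 01-intro.html, 02-methods.html) keeps alphabetical order aligned with reading order.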

convert_epub_to_tei

from exegia.utils.convert_epub_to_xml import convert_epub_to_tei

def convert_epub_to_tei(
    epub_path: str,
    output_path: str | None = None,
) -> str

Converts an EPUB file to a TEI P5 XML string. Uses ebooklib to read the EPUB and lxml to build the XML tree, then serialises the tree to a UTF-8 string with an XML declaration. If output_path is given, the result is also written to that file. TEI document structure produced:

<TEI>
  <teiHeader>
    <fileDesc>      ← title, authors, publisher, date, identifier
    <encodingDesc>  ← conversion note
    <profileDesc>   ← abstract and subject keywords (when present)
  </teiHeader>
  <text>
    <body>
      <div>         ← one per EPUB document item
        <head>      ← first heading extracted from the document
        <p> / <div> / <quote> / <list> / <table> ...

epub_path

str

required

File path to the source EPUB file.

output_path

str | None

Optional file path where the TEI XML will be saved. When omitted, the XML is returned only as a string.

returns

str

The complete TEI XML document as a string, including the <?xml?> declaration.

Raises

Exception	Condition
`FileNotFoundError`	`epub_path` does not exist
`ImportError`	`ebooklib` is not installed
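Because the function returns the document as a string, the result can be inspected with any XML parser. The sketch below collects the head text of each div under body using the standard library. chapter_heads and local_name are hypothetical helpers, and the sketch matches on local tag names because this page does not state whether the output declares the TEI namespace.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop a '{namespace}' prefix, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def chapter_heads(tei_xml: str) -> list:
    """Return the <head> text of each <div> that is a direct child of <body>."""
    # Encode first: some parsers reject str input that carries an
    # encoding declaration, and bytes are accepted everywhere.
    root = ET.fromstring(tei_xml.encode("utf-8"))
    heads = []
    for body in root.iter():
        if local_name(body.tag) != "body":
            continue
        for div in body:
            if local_name(div.tag) != "div":
                continue
            head = next((c for c in div if local_name(c.tag) == "head"), None)
            # A div without a heading contributes an empty string.
            heads.append("".join(head.itertext()).strip() if head is not None else "")
    return heads


def epub_chapter_heads(epub_path: str) -> list:
    """Hypothetical wrapper: convert an EPUB and list its chapter headings."""
    # Imported lazily so the sketch loads even where exegia is not installed.
    from exegia.utils.convert_epub_to_xml import convert_epub_to_tei

    return chapter_heads(convert_epub_to_tei(epub_path))
```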

convert_to_exg

from exegia.utils.convert_to_exg import convert_to_exg
from pathlib import Path

def convert_to_exg(
    dataset_dir: Path,
    destination: Path,
) -> Path

Packages an existing Text-Fabric dataset directory into a portable .exg bundle. The bundle is a zip archive that contains a manifest, a file index, an empty git repository stub for future versioning, and the original TF data compressed as corpus.exgc. Bundle layout:

{name}.exg                ← final deliverable (zip)
  ├── manifest.json       ← corpus metadata parsed from otext.tf / otype.tf
  ├── index.json          ← list of all .tf files with sizes
  ├── .git                ← empty git repository stub
  └── corpus.exgc         ← the original .tf dataset (zip)
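Since the bundle is described as an ordinary zip archive, its layout can be checked with the standard zipfile module. The sketch below reports which of the four documented entries are absent. missing_entries is a hypothetical helper, and the sketch assumes the entries sit at the archive root and that .git may be stored either as a single entry or as a directory prefix.

```python
import zipfile
from pathlib import Path

# Entry names taken from the bundle layout above.
EXPECTED_ENTRIES = ("manifest.json", "index.json", ".git", "corpus.exgc")


def missing_entries(bundle_path: Path) -> list:
    """Return the expected entries that are not present in the bundle."""
    with zipfile.ZipFile(bundle_path) as bundle:
        names = bundle.namelist()
    missing = []
    for expected in EXPECTED_ENTRIES:
        # Accept an exact match or a directory-style prefix such as ".git/HEAD".
        present = any(n == expected or n.startswith(expected + "/") for n in names)
        if not present:
            missing.append(expected)
    return missing
```

An empty list means all four entries were found. The sketch does not validate their contents.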

dataset_dir

Path

required

Path to the folder containing .tf files. Must contain both otext.tf and otype.tf.

destination

Path

required

Directory where the final .exg file will be saved. Created automatically if it does not exist.

returns

Path

Path to the produced .exg file (named after dataset_dir.name).

Raises

Exception	Condition
`FileNotFoundError`	`dataset_dir` does not exist or is not a directory
`ValueError`	`otext.tf` or `otype.tf` is missing from `dataset_dir`
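The two error conditions above can also be checked up front, for example to show a friendlier message before packaging starts. The sketch below mirrors the documented conditions. check_dataset_dir and package_dataset are hypothetical helpers, and the error messages are illustrative rather than the library's own wording.

```python
from pathlib import Path

# Files the documentation says must be present in dataset_dir.
REQUIRED_FILES = ("otext.tf", "otype.tf")


def check_dataset_dir(dataset_dir: Path) -> None:
    """Raise the documented exception types if dataset_dir is unusable."""
    if not dataset_dir.is_dir():
        raise FileNotFoundError(f"{dataset_dir} does not exist or is not a directory")
    missing = [name for name in REQUIRED_FILES if not (dataset_dir / name).is_file()]
    if missing:
        raise ValueError(f"missing from {dataset_dir}: {', '.join(missing)}")


def package_dataset(dataset_dir: Path, destination: Path) -> Path:
    """Hypothetical wrapper: validate first, then build the .exg bundle."""
    check_dataset_dir(dataset_dir)
    # Imported lazily so the sketch loads even where exegia is not installed.
    from exegia.utils.convert_to_exg import convert_to_exg

    return convert_to_exg(dataset_dir, destination)
```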

Parameter summary

Function	Input	Output	Format
`convert_epub_to_tf`	EPUB file	TF directory	Text-Fabric
`convert_html_to_tf`	HTML directory	TF directory	Text-Fabric
`convert_epub_to_tei`	EPUB file	TEI XML string / file	TEI P5 XML
`convert_to_exg`	TF directory	`.exg` archive	Distributable bundle
