The .exg format is a zip archive that bundles a Text-Fabric dataset with structured metadata for distribution and future versioning. Each archive contains four items: manifest.json, which holds corpus metadata extracted from otext.tf and otype.tf; index.json, which lists every .tf file with its size; an empty git repository reserved for future diff and versioning features; and corpus.exgc, which is the compressed TF dataset itself.

Archive Layout

{name}.exg          (zip archive)
├── manifest.json   (corpus metadata extracted from otext.tf / otype.tf)
├── index.json      (list of .tf files with sizes)
├── .git/           (empty git repo for versioning)
└── corpus.exgc     (zip of the original .tf dataset directory)
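Because an .exg file is an ordinary zip archive, it can be inspected with Python's standard zipfile module. The following is a minimal sketch, assuming manifest.json sits at the root of the archive under exactly that member name, as the layout above suggests. The helpers read_exg_manifest, list_exg_members, and build_demo_exg are hypothetical names introduced here for illustration, and the demo archive holds made-up values rather than a real corpus.

```python
import json
import tempfile
import zipfile
from pathlib import Path


def read_exg_manifest(exg_path: Path) -> dict:
    """Return the parsed manifest.json stored at the root of an .exg archive."""
    with zipfile.ZipFile(exg_path) as archive:
        return json.loads(archive.read("manifest.json"))


def list_exg_members(exg_path: Path) -> list[str]:
    """Return the sorted member names of an .exg archive."""
    with zipfile.ZipFile(exg_path) as archive:
        return sorted(archive.namelist())


def build_demo_exg(exg_path: Path) -> None:
    """Write a tiny stand-in archive with made-up values (demonstration only)."""
    with zipfile.ZipFile(exg_path, "w") as archive:
        archive.writestr("manifest.json", json.dumps({"name": "demo", "version": "0.1"}))
        archive.writestr("index.json", json.dumps([]))
        archive.writestr("corpus.exgc", b"")


with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.exg"
    build_demo_exg(demo_path)
    members = list_exg_members(demo_path)
    manifest = read_exg_manifest(demo_path)
    print(members)
    print(manifest["name"])
```

With a real package, pass the path returned by convert_to_exg to read_exg_manifest instead of the demo archive.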

Function Signature

from exegia.utils.convert_to_exg import convert_to_exg
from pathlib import Path

def convert_to_exg(
    dataset_dir: Path,
    destination: Path,
) -> Path:
    ...

Parameters

dataset_dir
Path
required
Path to the Text-Fabric dataset directory. Must contain both otext.tf and otype.tf at the top level. The directory name becomes the stem of the output .exg filename.
destination
Path
required
Directory where the produced .exg file will be saved. Created automatically if it does not exist.

Returns

Path — path to the produced .exg file, e.g. ./dist/BHSA.exg.

Raises

  • FileNotFoundError — if dataset_dir does not exist or is not a directory.
  • ValueError — if either otext.tf or otype.tf is missing from dataset_dir.
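Callers who would rather report problems up front than catch exceptions can mirror these documented checks before converting. The sketch below is not part of the exegia API: preflight_problems and git_on_path are hypothetical helpers, and the message wording is invented. It checks only what this page documents, namely that the directory exists, that otext.tf and otype.tf are present at the top level, and (separately) that git can be found on PATH.

```python
import shutil
import tempfile
from pathlib import Path

# File names the documentation says must exist at the top level of dataset_dir.
REQUIRED_FILES = ("otext.tf", "otype.tf")


def preflight_problems(dataset_dir: Path) -> list[str]:
    """Collect reasons a conversion would likely fail, without raising."""
    if not dataset_dir.is_dir():
        return [f"{dataset_dir} does not exist or is not a directory"]
    return [
        f"{name} is missing from {dataset_dir}"
        for name in REQUIRED_FILES
        if not (dataset_dir / name).is_file()
    ]


def git_on_path() -> bool:
    """Report whether a git executable can be found on PATH."""
    return shutil.which("git") is not None


with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "demo"
    dataset.mkdir()
    (dataset / "otext.tf").write_bytes(b"@config\n")
    incomplete = preflight_problems(dataset)   # otype.tf not written yet
    (dataset / "otype.tf").write_bytes(b"@node\n")
    complete = preflight_problems(dataset)     # both required files present
    missing_dir = preflight_problems(Path(tmp) / "nope")
    print(incomplete, complete, missing_dir, git_on_path())
```

Code that prefers exceptions can skip the pre-flight step and wrap the convert_to_exg call in a try/except for FileNotFoundError and ValueError.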

Example

from pathlib import Path
from exegia.utils.convert_to_exg import convert_to_exg

exg_path = convert_to_exg(
    dataset_dir=Path("~/.exegia/datasets/bibles/BHSA").expanduser(),
    destination=Path("./dist/"),
)
print(f"Package created: {exg_path}")
# Package created: dist/BHSA.exg

Packaging Steps

1. Parse TF headers → manifest.json
   Reads the @key=value header lines from otext.tf and otype.tf and assembles a structured manifest.json. Node types are extracted from the data section of otype.tf. Filesystem stats (file count, total size) are appended.

2. List .tf files → index.json
   Recursively finds every .tf file under dataset_dir and writes an index.json array with the relative path and byte size of each file.

3. Initialise empty git repository
   Runs git init in the staging directory to create a .git/ stub. This is reserved for future corpus diff and version-tracking features.

4. Compress the dataset → corpus.exgc
   Zips the entire dataset_dir into corpus.exgc. This is a standard zip file with a .exgc extension, so you can unzip it with any zip tool if you need to inspect or extract the raw .tf files.

5. Bundle everything → {name}.exg
   Zips the staging directory (containing manifest.json, index.json, .git/, and corpus.exgc) into the final {dataset_dir.name}.exg file in your destination directory.
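The file-handling parts of these steps (2, 4, and 5) can be approximated with the standard library alone. This is a sketch under stated assumptions, not the library's implementation: the index.json key names "path" and "size" are guesses, member paths inside corpus.exgc are assumed to be relative to dataset_dir, and manifest generation and git init are left out. The names build_index, zip_directory, and package_sketch are hypothetical. The demo at the end unzips the result twice to show that the raw .tf files can be recovered from the nested archive.

```python
import io
import json
import tempfile
import zipfile
from pathlib import Path


def build_index(dataset_dir: Path) -> list[dict]:
    """Step 2: relative path and byte size of every .tf file, found recursively."""
    return [
        {"path": tf.relative_to(dataset_dir).as_posix(), "size": tf.stat().st_size}
        for tf in sorted(dataset_dir.rglob("*.tf"))
    ]


def zip_directory(source_dir: Path, target: Path) -> None:
    """Zip every file under source_dir into target, with paths relative to source_dir."""
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        for item in sorted(source_dir.rglob("*")):
            if item.is_file():
                archive.write(item, item.relative_to(source_dir).as_posix())


def package_sketch(dataset_dir: Path, destination: Path) -> Path:
    """Stage index.json and corpus.exgc, then bundle them into {name}.exg."""
    destination.mkdir(parents=True, exist_ok=True)
    exg_path = destination / f"{dataset_dir.name}.exg"
    with tempfile.TemporaryDirectory() as staging_name:
        staging = Path(staging_name)
        index = build_index(dataset_dir)
        (staging / "index.json").write_text(json.dumps(index, indent=2), encoding="utf-8")
        zip_directory(dataset_dir, staging / "corpus.exgc")  # step 4
        zip_directory(staging, exg_path)                     # step 5
    return exg_path


with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "demo"
    dataset.mkdir()
    (dataset / "otext.tf").write_bytes(b"@config\n@name=demo\n")
    (dataset / "otype.tf").write_bytes(b"@node\n\n1-4\tword\n")
    exg_path = package_sketch(dataset, Path(tmp) / "dist")
    exg_name = exg_path.name
    with zipfile.ZipFile(exg_path) as outer:
        outer_names = sorted(outer.namelist())
        index = json.loads(outer.read("index.json"))
        # corpus.exgc is itself a zip, so it can be opened straight from memory.
        with zipfile.ZipFile(io.BytesIO(outer.read("corpus.exgc"))) as inner:
            inner_names = sorted(inner.namelist())
    print(exg_name, outer_names, inner_names)
```

Staging in a temporary directory keeps partial output out of the destination until the final archive is written.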

manifest.json Fields

The manifest is extracted automatically from the TF file headers. All fields are strings or lists of strings:
| Field | Source | Description |
|---|---|---|
| name | otext.tf @name | Corpus name |
| version | otext.tf @version | Corpus version |
| description | otext.tf @description | Description |
| written_by | otext.tf @writtenBy | Tool that wrote the TF files |
| date_written | otext.tf @dateWritten | Date the TF files were written |
| section_types | otext.tf @sectionTypes | Section hierarchy (comma-separated, split to list) |
| section_features | otext.tf @sectionFeatures | Section feature names (comma-separated, split to list) |
| text_formats | otext.tf @fmt:* | Available text format templates keyed by format name |
| node_types | otype.tf data section | List of unique node types in order of appearance |
| source_folder | Filesystem | Name of the source dataset directory |
| tf_file_count | Filesystem scan | Number of .tf files in the dataset |
| total_size_bytes | Filesystem scan | Total size of all files in the dataset directory |

The .exg packaging step requires git to be installed and available on PATH. If git is not found, the git init call raises FileNotFoundError. If git is present but the init command fails, it raises subprocess.CalledProcessError.
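To make the mapping from TF headers to manifest fields concrete, here is a rough sketch of how such a manifest could be assembled from file contents. It is not the library's parser. It assumes that header lines start with @, that a blank line separates the header from the data section, and that each otype.tf data line ends with the node type after an optional tab-separated node range. Representing text_formats as a dict is also an assumption, based on the "keyed by format name" wording. The function names and the sample file contents are invented for illustration, and only a subset of the fields is built.

```python
def parse_tf_headers(text: str) -> dict[str, str]:
    """Collect @key=value pairs from the leading header block of a .tf file."""
    headers = {}
    for line in text.splitlines():
        if not line.startswith("@"):
            break  # the header block ends at the first non-@ line
        key, sep, value = line[1:].partition("=")
        if sep:  # skip bare markers such as @node or @config
            headers[key] = value
    return headers


def parse_node_types(otype_text: str) -> list[str]:
    """Return unique node types from the data section, in order of appearance."""
    _, _, data = otype_text.partition("\n\n")
    seen = []
    for line in data.splitlines():
        value = line.split("\t")[-1].strip()
        if value and value not in seen:
            seen.append(value)
    return seen


def split_list(value: str) -> list[str]:
    """Split a comma-separated header value into a list of trimmed items."""
    return [part.strip() for part in value.split(",") if part.strip()]


def manifest_sketch(otext_text: str, otype_text: str) -> dict:
    """Build a subset of the manifest fields listed in the table above."""
    headers = parse_tf_headers(otext_text)
    return {
        "name": headers.get("name", ""),
        "version": headers.get("version", ""),
        "section_types": split_list(headers.get("sectionTypes", "")),
        "section_features": split_list(headers.get("sectionFeatures", "")),
        "text_formats": {
            key[len("fmt:"):]: value
            for key, value in headers.items()
            if key.startswith("fmt:")
        },
        "node_types": parse_node_types(otype_text),
    }


# Invented sample contents, not taken from a real corpus.
OTEXT = (
    "@config\n@name=demo\n@version=0.1\n"
    "@sectionTypes=book,chapter,verse\n@sectionFeatures=book,chapter,verse\n"
    "@fmt:text-orig-full={word}{trailer}\n\n"
)
OTYPE = "@node\n@valueType=str\n\n1-4\tword\n5-6\tverse\n7\tchapter\n8\tbook\n"

manifest = manifest_sketch(OTEXT, OTYPE)
print(manifest)
```

The filesystem-derived fields (source_folder, tf_file_count, total_size_bytes) are omitted here because they depend on the directory on disk rather than on header text.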
