Create Distributable .exg Archives from TF Datasets

The .exg format is a zip archive that bundles a Text-Fabric (TF) dataset with structured metadata for distribution and future versioning. The archive contains four entries: manifest.json, with corpus metadata extracted from otext.tf and otype.tf; index.json, listing every .tf file with its size; an empty git repository reserved for future diff and versioning features; and corpus.exgc, the compressed TF dataset itself.

Archive Layout

{name}.exg          (zip archive)
├── manifest.json   (corpus metadata extracted from otext.tf / otype.tf)
├── index.json      (list of .tf files with sizes)
├── .git/           (empty git repo for versioning)
└── corpus.exgc     (zip of the original .tf dataset directory)

Function Signature

from pathlib import Path

# Import with: from exegia.utils.convert_to_exg import convert_to_exg
def convert_to_exg(
    dataset_dir: Path,   # TF dataset directory to package
    destination: Path,   # directory that receives the .exg file
) -> Path:
    ...

Parameters

dataset_dir (Path, required)

Path to the Text-Fabric dataset directory. Must contain both otext.tf and otype.tf at the top level. The directory name becomes the stem of the output .exg filename.

destination (Path, required)

Directory where the produced .exg file will be saved. Created automatically if it does not exist.

Returns

Path — path to the produced .exg file, e.g. ./dist/BHSA.exg.

Raises

FileNotFoundError — if dataset_dir does not exist or is not a directory.
ValueError — if either otext.tf or otype.tf is missing from dataset_dir.
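
The two documented failure modes can be mirrored in a small pre-flight check. This is only a sketch of the behaviour described above, not the library's own code: check_dataset_dir and REQUIRED_FILES are hypothetical names, and the error messages are made up for illustration.

```python
from pathlib import Path

# The two files the documentation says must exist at the top level.
REQUIRED_FILES = ("otext.tf", "otype.tf")


def check_dataset_dir(dataset_dir: Path) -> None:
    """Raise the same exception types the docs list for convert_to_exg."""
    # A missing path and a path that is a regular file both fail here.
    if not dataset_dir.is_dir():
        raise FileNotFoundError(f"Not a dataset directory: {dataset_dir}")

    missing = [name for name in REQUIRED_FILES
               if not (dataset_dir / name).is_file()]
    if missing:
        raise ValueError(f"Missing required TF files: {', '.join(missing)}")
```

Running a check like this before packaging lets a caller report a bad path early, without waiting for the conversion itself to fail.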

Example

from pathlib import Path
from exegia.utils.convert_to_exg import convert_to_exg

exg_path = convert_to_exg(
    dataset_dir=Path("~/.exegia/datasets/bibles/BHSA").expanduser(),
    destination=Path("./dist/"),
)
print(f"Package created: {exg_path}")
# Package created: dist/BHSA.exg

Packaging Steps

Parse TF headers → manifest.json

Reads the @key=value header lines from otext.tf and otype.tf and assembles a structured manifest.json. Node types are extracted from the data section of otype.tf. Filesystem stats (file count, total size) are appended.
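
As a rough illustration of this step, the sketch below reads the leading @key=value lines of a .tf file and collects the distinct node types from the data section of otype.tf. The function names are hypothetical, and two assumptions are baked in: the header ends at the first line that does not start with "@", and the node type is the last tab-separated field of each data line. The actual parser may handle more cases.

```python
from pathlib import Path


def read_tf_header(tf_path: Path) -> dict[str, str]:
    """Collect @key=value pairs from the top of a .tf file."""
    header: dict[str, str] = {}
    with tf_path.open(encoding="utf-8") as handle:
        for raw in handle:
            line = raw.rstrip("\n")
            if not line.startswith("@"):
                break  # assumed end of the header block
            key, sep, value = line[1:].partition("=")
            if sep:  # marker lines such as "@node" carry no value
                header[key] = value
    return header


def read_node_types(otype_path: Path) -> list[str]:
    """Return unique node types in order of first appearance."""
    seen: dict[str, None] = {}  # dict preserves insertion order
    in_data = False
    with otype_path.open(encoding="utf-8") as handle:
        for raw in handle:
            line = raw.rstrip("\n")
            if not in_data:
                if line.startswith("@"):
                    continue
                in_data = True  # first non-header line reached
            if not line:
                continue  # skip blank separator lines
            # Assumption: the value is the last tab-separated field.
            seen.setdefault(line.rsplit("\t", 1)[-1], None)
    return list(seen)
```

Splitting on the first "=" only keeps values that themselves contain "=" intact, which matters for format templates.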

List .tf files → index.json

Recursively finds every .tf file under dataset_dir and writes an index.json array with the relative path and byte size of each file.
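
A minimal sketch of how such an index could be assembled with pathlib. The key names "path" and "size" are assumptions, since the exact schema of index.json is not documented here, and build_index is a hypothetical helper.

```python
import json
from pathlib import Path


def build_index(dataset_dir: Path) -> list[dict[str, object]]:
    """List every .tf file under dataset_dir with its relative path and size."""
    entries: list[dict[str, object]] = []
    for tf_file in sorted(dataset_dir.rglob("*.tf")):
        if not tf_file.is_file():
            continue  # ignore directories that happen to end in .tf
        entries.append({
            # POSIX-style separators keep the index portable across platforms.
            "path": tf_file.relative_to(dataset_dir).as_posix(),
            "size": tf_file.stat().st_size,  # size in bytes
        })
    return entries


def write_index(dataset_dir: Path, index_path: Path) -> None:
    """Serialise the index as a JSON array."""
    index_path.write_text(json.dumps(build_index(dataset_dir), indent=2),
                          encoding="utf-8")
```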

Initialise empty git repository

Runs git init in the staging directory to create a .git/ stub. This is reserved for future corpus diff and version-tracking features.
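
One way this step might look with the standard library, assuming it is driven through subprocess.run. The function name init_git_stub and the --quiet flag are illustrative choices, not confirmed details of the implementation.

```python
import subprocess
from pathlib import Path


def init_git_stub(staging_dir: Path) -> None:
    """Create an empty .git/ directory inside staging_dir."""
    # subprocess.run raises FileNotFoundError when the git executable is
    # not on PATH; check=True raises subprocess.CalledProcessError when
    # git exits with a non-zero status.
    subprocess.run(
        ["git", "init", "--quiet", str(staging_dir)],
        check=True,
    )
```

Passing the command as a list avoids shell quoting issues with directory names that contain spaces.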

Compress the dataset → corpus.exgc

Zips the entire dataset_dir into corpus.exgc. This is a standard zip file with a .exgc extension — you can unzip it with any zip tool if you need to inspect or extract the raw .tf files.
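
Because corpus.exgc is an ordinary zip file, the compression step can be approximated with the zipfile module. The sketch below uses a hypothetical zip_directory helper and stores entries relative to the dataset root; whether the real archive prefixes entries with the directory name is not stated, so treat that layout as an assumption.

```python
import zipfile
from pathlib import Path


def zip_directory(source_dir: Path, archive_path: Path) -> Path:
    """Write every file under source_dir into a zip at archive_path."""
    # archive_path should live outside source_dir, otherwise the archive
    # would try to include itself.
    with zipfile.ZipFile(archive_path, "w",
                         compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                # Assumption: entry names are relative to the dataset root.
                archive.write(
                    path,
                    arcname=path.relative_to(source_dir).as_posix(),
                )
    return archive_path
```

Calling zip_directory(dataset_dir, staging_dir / "corpus.exgc") would produce a file that any zip tool can open, regardless of the .exgc extension.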

Bundle everything → {name}.exg

Zips the staging directory (containing manifest.json, index.json, .git/, and corpus.exgc) into the final {dataset_dir.name}.exg file in your destination directory.
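
Since both layers are plain zip files, the raw .tf files can be recovered from a finished package without any Exegia tooling. The following is a sketch under the assumption that corpus.exgc sits at the root of the .exg archive, as the layout above shows; extract_tf_files is a hypothetical helper.

```python
import io
import zipfile
from pathlib import Path


def extract_tf_files(exg_path: Path, target_dir: Path) -> list[str]:
    """Unpack the nested corpus.exgc from an .exg file into target_dir."""
    # Outer layer: the .exg package itself.
    with zipfile.ZipFile(exg_path) as outer:
        inner_bytes = outer.read("corpus.exgc")

    # Inner layer: the zipped dataset, opened from memory.
    with zipfile.ZipFile(io.BytesIO(inner_bytes)) as inner:
        inner.extractall(target_dir)
        return inner.namelist()
```

Reading the inner archive into memory is simple but may be unsuitable for very large corpora; extracting corpus.exgc to disk first is the obvious alternative.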

manifest.json Fields

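The manifest is built automatically from the TF file headers and a scan of the dataset directory. Header-derived fields are strings or lists of strings, except text_formats, which is keyed by format name; the two filesystem counts are numbers: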

Field	Source	Description
`name`	`otext.tf @name`	Corpus name
`version`	`otext.tf @version`	Corpus version
`description`	`otext.tf @description`	Description
`written_by`	`otext.tf @writtenBy`	Tool that wrote the TF files
`date_written`	`otext.tf @dateWritten`	Date the TF files were written
`section_types`	`otext.tf @sectionTypes`	Section hierarchy (comma-separated, split to list)
`section_features`	`otext.tf @sectionFeatures`	Section feature names (comma-separated, split to list)
`text_formats`	`otext.tf @fmt:*`	Available text format templates keyed by format name
`node_types`	`otype.tf` data section	List of unique node types in order of appearance
`source_folder`	Filesystem	Name of the source dataset directory
`tf_file_count`	Filesystem scan	Number of `.tf` files in the dataset
`total_size_bytes`	Filesystem scan	Total size of all files in the dataset directory
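
To inspect these fields in a packaged corpus, the manifest can be read straight out of the archive. This sketch assumes manifest.json is stored at the root of the .exg file, as in the layout above; read_manifest is a hypothetical helper, not part of the documented API.

```python
import json
import zipfile
from pathlib import Path


def read_manifest(exg_path: Path) -> dict:
    """Load manifest.json from an .exg archive without unpacking it."""
    with zipfile.ZipFile(exg_path) as archive:
        with archive.open("manifest.json") as handle:
            # json.load accepts the binary stream returned by ZipFile.open.
            return json.load(handle)
```

A call such as read_manifest(Path("dist/BHSA.exg"))["node_types"] would then return the list described in the table, provided the field is present.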

The .exg packaging step requires git to be installed and available on PATH. If git is not found, the git init call raises FileNotFoundError. If git is present but the init command fails, it raises subprocess.CalledProcessError.
