The .exg format is a zip archive that bundles a Text-Fabric dataset with structured metadata for distribution and future versioning. The archive contains four items: manifest.json, which holds corpus metadata extracted from otext.tf and otype.tf; index.json, which lists every .tf file with its size; an empty git repository reserved for future diff and versioning features; and corpus.exgc, which is the compressed TF dataset itself.
Archive Layout
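Because an .exg file is an ordinary zip archive, the layout described above can be checked with the standard library alone. The following is a minimal sketch, not part of the package API: `describe_exg` is a hypothetical helper, and it assumes that manifest.json, index.json, and corpus.exgc sit at the root of the archive and that index.json is a JSON array.

```python
import json
import zipfile
from pathlib import Path


def describe_exg(exg_path: Path) -> dict:
    """Summarise the entries of an .exg archive (hypothetical helper)."""
    with zipfile.ZipFile(exg_path) as archive:
        names = archive.namelist()
        # Assumption: both JSON files are stored at the archive root.
        manifest = json.loads(archive.read("manifest.json"))
        index = json.loads(archive.read("index.json"))
    return {
        "members": sorted(names),
        "corpus_name": manifest.get("name"),
        "tf_files_indexed": len(index),
        "has_corpus_payload": "corpus.exgc" in names,
    }
```

Calling `describe_exg(Path("./dist/BHSA.exg"))` on a packaged corpus would return the member list together with the corpus name read from the manifest.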
Function Signature
Parameters
Path to the Text-Fabric dataset directory. Must contain both otext.tf and otype.tf at the top level. The directory name becomes the stem of the output .exg filename.
Directory where the produced .exg file will be saved. Created automatically if it does not exist.
Returns
Path — path to the produced .exg file, e.g. ./dist/BHSA.exg.
Raises
FileNotFoundError: if dataset_dir does not exist or is not a directory.
ValueError: if either otext.tf or otype.tf is missing from dataset_dir.
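The two error conditions above amount to a short pre-flight check on the dataset directory. The sketch below shows one way such a check could be written; `validate_dataset_dir` and `REQUIRED_FILES` are hypothetical names used for illustration, and the error messages are invented rather than taken from the library.

```python
from pathlib import Path

# The two files the documentation says must exist at the top level.
REQUIRED_FILES = ("otext.tf", "otype.tf")


def validate_dataset_dir(dataset_dir: Path) -> Path:
    """Hypothetical helper mirroring the documented error conditions."""
    dataset_dir = Path(dataset_dir)
    if not dataset_dir.is_dir():
        # Covers both "does not exist" and "exists but is not a directory".
        raise FileNotFoundError(f"Dataset directory not found: {dataset_dir}")
    missing = [name for name in REQUIRED_FILES if not (dataset_dir / name).is_file()]
    if missing:
        raise ValueError(f"Missing required TF files: {', '.join(missing)}")
    return dataset_dir
```

Running a check like this before any staging work means a bad path fails fast, before anything is written to the output directory.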
Example
Packaging Steps
Parse TF headers → manifest.json
Reads the @key=value header lines from otext.tf and otype.tf and assembles a structured manifest.json. Node types are extracted from the data section of otype.tf. Filesystem stats (file count, total size) are appended.
List .tf files → index.json
Recursively finds every .tf file under dataset_dir and writes an index.json array with the relative path and byte size of each file.
Initialise empty git repository
Runs git init in the staging directory to create a .git/ stub. This is reserved for future corpus diff and version-tracking features.
Compress the dataset → corpus.exgc
Zips the entire dataset_dir into corpus.exgc. This is a standard zip file with a .exgc extension, so you can unzip it with any zip tool if you need to inspect or extract the raw .tf files.
manifest.json Fields
The manifest is extracted automatically from the TF file headers. All fields are strings or lists of strings:

| Field | Source | Description |
|---|---|---|
| name | otext.tf @name | Corpus name |
| version | otext.tf @version | Corpus version |
| description | otext.tf @description | Description |
| written_by | otext.tf @writtenBy | Tool that wrote the TF files |
| date_written | otext.tf @dateWritten | Date the TF files were written |
| section_types | otext.tf @sectionTypes | Section hierarchy (comma-separated, split to list) |
| section_features | otext.tf @sectionFeatures | Section feature names (comma-separated, split to list) |
| text_formats | otext.tf @fmt:* | Available text format templates keyed by format name |
| node_types | otype.tf data section | List of unique node types in order of appearance |
| source_folder | Filesystem | Name of the source dataset directory |
| tf_file_count | Filesystem scan | Number of .tf files in the dataset |
| total_size_bytes | Filesystem scan | Total size of all files in the dataset directory |
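
The header-to-field mapping in the table can be illustrated with a small parser. This is a sketch under stated assumptions, not the library's implementation: `parse_tf_header`, `split_list`, and `build_manifest` are hypothetical names, the header is assumed to end at the first line that does not start with `@`, and each data line of otype.tf is assumed to carry the node type as its last tab-separated field. The three filesystem fields are left out because they come from a directory scan rather than from the headers.

```python
def parse_tf_header(text: str) -> tuple[dict[str, str], list[str]]:
    """Split TF file text into @key=value metadata and non-blank data lines."""
    header: dict[str, str] = {}
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if not line.startswith("@"):
            # Assumption: the first non-@ line ends the header block.
            return header, [row for row in lines[i:] if row.strip()]
        key, sep, value = line[1:].partition("=")
        if sep:  # bare markers such as @node or @config carry no value
            header[key] = value
    return header, []


def split_list(value: str) -> list[str]:
    """Turn a comma-separated header value into a list of names."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_manifest(otext: str, otype: str) -> dict:
    """Assemble the header-derived manifest fields (hypothetical helper)."""
    meta, _ = parse_tf_header(otext)
    _, data = parse_tf_header(otype)
    # dict.fromkeys keeps the first occurrence of each type, in order.
    node_types = list(dict.fromkeys(row.split("\t")[-1] for row in data))
    return {
        "name": meta.get("name", ""),
        "version": meta.get("version", ""),
        "description": meta.get("description", ""),
        "written_by": meta.get("writtenBy", ""),
        "date_written": meta.get("dateWritten", ""),
        "section_types": split_list(meta.get("sectionTypes", "")),
        "section_features": split_list(meta.get("sectionFeatures", "")),
        "text_formats": {
            key[len("fmt:"):]: value
            for key, value in meta.items()
            if key.startswith("fmt:")
        },
        "node_types": node_types,
    }
```

Splitting each header line on the first `=` only matters for the `@fmt:*` entries, whose template values may themselves contain further `=` characters.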
The .exg packaging step requires git to be installed and available on PATH. If git is not found, the git init call raises FileNotFoundError. If git is present but the init command fails, it raises subprocess.CalledProcessError.
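
To show where the git requirement sits among the packaging steps, the sketch below walks through indexing, `git init`, and compression with the standard library. All three function names are hypothetical, the `path` and `size` keys in index.json are assumed rather than documented, and storing archive entries relative to dataset_dir is likewise an assumption about how corpus.exgc is laid out.

```python
import json
import subprocess
import zipfile
from pathlib import Path


def write_index(dataset_dir: Path, staging_dir: Path) -> list[dict]:
    """Record the relative path and byte size of every .tf file."""
    entries = [
        # Assumption: the real index.json may use different key names.
        {"path": p.relative_to(dataset_dir).as_posix(), "size": p.stat().st_size}
        for p in sorted(dataset_dir.rglob("*.tf"))
        if p.is_file()
    ]
    (staging_dir / "index.json").write_text(json.dumps(entries, indent=2), encoding="utf-8")
    return entries


def init_git_stub(staging_dir: Path, git: str = "git") -> None:
    """Run `git init` in the staging directory."""
    # FileNotFoundError propagates when the executable is not on PATH;
    # check=True turns a non-zero exit status into CalledProcessError.
    subprocess.run([git, "init"], cwd=staging_dir, check=True, capture_output=True)


def compress_dataset(dataset_dir: Path, staging_dir: Path) -> Path:
    """Zip the dataset directory into corpus.exgc."""
    target = staging_dir / "corpus.exgc"
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for p in sorted(dataset_dir.rglob("*")):
            if p.is_file():
                archive.write(p, p.relative_to(dataset_dir).as_posix())
    return target
```

If you would rather fail with a clearer message than a bare FileNotFoundError, checking `shutil.which("git")` before calling the packaging function is one option.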