fetch_datasets_from_git shallow-clones a git repository into a temporary directory, recursively scans it for Text-Fabric dataset directories (any folder containing both otext.tf and otype.tf side by side), and returns the list of matching paths. It is a quick way to pull a public corpus, such as the ETCBC Hebrew Bible, onto your machine without manually navigating the repository layout.
Function Signature
Parameters
git_url
URL of the git repository to clone. Accepts any URL that git clone accepts (HTTPS, SSH, or a local file path).
temp_base
Base directory under which the temporary clone is created. The clone lands at temp_base/.temp/<tmpXXX>/. Defaults to the current working directory (Path.cwd()).
Returns
list[Path] — one entry for each directory inside the cloned repository that contains both otext.tf and otype.tf. A single repository may yield multiple paths when it bundles several dataset versions (e.g. tf/2021, tf/2023).
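Because one repository can yield several dataset directories, a caller usually has to pick one. The sketch below is illustrative only: pick_dataset is a hypothetical helper, not part of the library, and it assumes the version label is the final path component, as in the tf/2021 and tf/2023 example.

```python
from __future__ import annotations

from pathlib import Path


def pick_dataset(paths: list[Path], version: str | None = None) -> Path:
    """Choose one dataset directory from the returned candidates.

    Hypothetical helper: assumes the version label is the last path
    component (for example ".../tf/2023").
    """
    if not paths:
        raise ValueError("no Text-Fabric datasets were found")
    if version is not None:
        for path in paths:
            if path.name == version:
                return path
        raise ValueError(f"no dataset directory named {version!r}")
    # Without an explicit version, fall back to the lexicographically
    # greatest directory name, which is the newest for year-style names.
    return max(paths, key=lambda p: p.name)
```

Raising on an empty list keeps "the repository contained no datasets" distinct from "the requested version is missing".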
Raises
subprocess.CalledProcessError: if git clone exits with a non-zero status (bad URL, network failure, authentication error, etc.).
FileNotFoundError: if git is not found on PATH.
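One way to handle both documented exceptions is sketched below. The wrapper name fetch_or_report is hypothetical, and the fetch function is passed in as an argument so the sketch does not have to assume an import path for fetch_datasets_from_git.

```python
from __future__ import annotations

import subprocess
from pathlib import Path
from typing import Callable


def fetch_or_report(
    fetch: Callable[[str], list[Path]], git_url: str
) -> list[Path]:
    """Call fetch(git_url) and turn the documented errors into messages.

    `fetch` stands in for fetch_datasets_from_git.
    """
    try:
        return fetch(git_url)
    except FileNotFoundError:
        # Raised when the git executable is not on PATH.
        print("git is not installed or not on PATH")
    except subprocess.CalledProcessError as exc:
        # Raised when git clone exits with a non-zero status.
        print(f"git clone failed with exit code {exc.returncode}")
    return []
```

Returning an empty list on failure is a design choice of this sketch; code that needs to distinguish "clone failed" from "no datasets found" should let the exceptions propagate instead.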
Example
Loading the Fetched Dataset
After fetching, pass the returned paths straight to corpus_manager.
How It Works
Shallow clone
Runs git clone --depth=1 <git_url> <clone_dir> with subprocess.run(..., check=True). The --depth=1 flag fetches only the latest commit, keeping the download small even for large repositories.
Locate datasets
Recursively walks the cloned directory with Path.rglob("otext.tf"). For each otext.tf found, it checks whether otype.tf exists in the same directory. Matching directories are collected into the result list.
The clone is placed at temp_base/.temp/<tmpXXX>/. The caller is responsible for moving or copying the dataset directories to a permanent location before the temporary directory is cleaned up by the operating system or your own cleanup logic.
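The two steps, plus the copy the caller is expected to make, can be approximated as follows. This is a minimal sketch based only on the description above, not the library's own code. The function names are hypothetical, the parameter names git_url and temp_base are taken from this page, and the use of tempfile.mkdtemp is an assumption suggested by the <tmpXXX> naming.

```python
from __future__ import annotations

import shutil
import subprocess
import tempfile
from pathlib import Path


def find_tf_datasets(root: Path) -> list[Path]:
    """Return every directory under root holding otext.tf and otype.tf."""
    return sorted(
        otext.parent
        for otext in root.rglob("otext.tf")
        if (otext.parent / "otype.tf").is_file()
    )


def fetch_datasets_sketch(
    git_url: str, temp_base: Path | None = None
) -> list[Path]:
    """Approximate the documented behaviour: shallow clone, then scan."""
    base = (temp_base or Path.cwd()) / ".temp"
    base.mkdir(parents=True, exist_ok=True)
    # Assumption: a uniquely named directory created with mkdtemp.
    clone_dir = Path(tempfile.mkdtemp(dir=base))
    # check=True raises CalledProcessError on a non-zero exit status;
    # a missing git executable raises FileNotFoundError.
    subprocess.run(
        ["git", "clone", "--depth=1", git_url, str(clone_dir)],
        check=True,
    )
    return find_tf_datasets(clone_dir)


def persist_dataset(dataset_dir: Path, destination: Path) -> Path:
    """Copy one dataset directory out of the temporary clone."""
    # copytree fails if destination already exists, which avoids
    # silently overwriting an earlier copy.
    shutil.copytree(dataset_dir, destination)
    return destination
```

Keeping the scan in its own function means it can also be run against a directory that was cloned or downloaded by other means.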