TheDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/lean-dojo/LeanDojo-v2/llms.txt
Use this file to discover all available pages before exploring further.
lean_dojo_v2.lean_dojo.data_extraction module is the engine that turns a raw Lean 4 repository into a richly annotated dataset. It builds an abstract syntax tree for every .lean file, records tactic states before and after each step, resolves premise provenance, and stores everything in a persistent cache so the expensive build-and-extract pass only ever runs once per commit.
Tracing takes approximately 1.5× the normal Lake compile time for a repository.
Results are stored in
CACHE_DIR (default ~/.cache/lean_dojo) and a remote
cache at https://dl.fbaipublicfiles.com/lean-dojo. Subsequent calls for the
same repo and commit return immediately from cache.trace()
The primary entry-point for tracing a Lean 4 repository. Checks the local and remote caches first; only runs the full extraction pipeline on a cache miss.
The repository to trace. Must pass
repo.exists() — the commit must be
accessible at the given URL.Directory where the traced output is copied after caching. When
None, the
traced repo lives only inside CACHE_DIR and the returned TracedRepo points
there. When provided, the directory is created and the cache contents are
copied into dst_dir/<cache_dirname>.When
True, dependency packages declared in lakefile.lean / lakefile.toml
are also traced. This is required for dependency-aware premise retrieval
(e.g., LeanAgent). When False, lake exe cache get is attempted first to
skip recompiling dependencies.A
TracedRepo object loaded from cache (or from dst_dir if supplied).
check_sanity() is called automatically before returning.get_traced_repo_path()
Returns the absolute cache path of a traced repository, running the full trace pipeline if the repo is not yet cached.
The repository whose cache path to look up or compute.
If
True, looks for (or creates) the <cache_dirname>/<name>_d variant that
includes traced dependency packages. If a _d entry already exists in cache
and build_deps=False, it is reused automatically.Absolute path to the cached traced repo directory, e.g.
~/.cache/lean_dojo/durant42040-lean4-example-3e23ab0.../lean4-example.is_available_in_cache()
Lightweight check — does not trace anything. Queries both the local cache directory and the remote cache URL.
The repository to check.
True if a cached traced version exists locally or is available for download
from the remote cache. False otherwise.TracedRepo Class
TracedRepo is a frozen dataclass that represents the fully traced version of a LeanGitRepo. It holds all TracedFile objects produced by the extraction pipeline, an optional dependency graph between files, and a reference back to the original LeanGitRepo.
Core Fields
| Field | Type | Description |
|---|---|---|
repo | LeanGitRepo | The original repository that was traced. |
dependencies | Dict[str, LeanGitRepo] | Maps dependency package names to their LeanGitRepo objects. |
root_dir | Path | Absolute path to the traced repo root on disk. |
traced_files | List[TracedFile] | All TracedFile objects parsed from *.ast.json or loaded from *.trace.xml. |
traced_files_graph | Optional[nx.DiGraph] | Dependency DAG between files (X → Y means X imports Y). None when build_deps=False. |
TracedRepo.from_traced_files()
Class method that builds a TracedRepo by parsing all *.ast.json and *.dep_paths files produced by ExtractData.lean in a traced directory. This is called internally by the tracing pipeline; you rarely need to call it directly.
Absolute path to the root of the traced repository (must be a Git repo).
Whether to construct the inter-file dependency graph.
A fully populated
TracedRepo. All TracedFile objects have their
traced_repo back-reference set.TracedRepo.load_from_disk()
Class method that rehydrates a TracedRepo from serialised *.trace.xml files previously written by save_to_disk(). Faster than re-parsing *.ast.json for large repositories.
Path to the directory containing
*.trace.xml files.Whether to rebuild the inter-file dependency graph from the loaded files.
A
TracedRepo populated from the XML files.TracedRepo.save_to_disk()
Serialises every TracedFile in the repo to a *.trace.xml file alongside the corresponding .lean source. These XML files can be loaded later with load_from_disk() without rerunning ExtractData.lean.
TracedRepo.check_sanity()
Validates the internal consistency of the traced repository. Checks that:
- All
TracedFileobjects have the correctroot_dir. - Every
.leanfile expected in the dependency graph has a corresponding*.ast.jsonand*.dep_pathsfile. - No cycles exist in the import dependency graph.
trace() before returning.
TracedRepo.get_traced_theorems()
Returns every TracedTheorem across all traced files in the repository.
Flattened list of all
TracedTheorem objects. Theorems from the Lean 4
standard library are included when build_deps=True.TracedRepo.get_traced_theorem()
Retrieves a single TracedTheorem by its Theorem key.
A
Theorem dataclass (from lean_dojo_v2.lean_dojo.data_extraction.lean)
identifying the theorem by repo, file_path, and full_name.The corresponding
TracedTheorem, or None if not found.Caching Layer
LeanDojo v2 maintains a two-level cache so that tracing a repository only ever runs once.Local cache
The default cache directory is~/.cache/lean_dojo. Override it by setting the CACHE_DIR environment variable:
Remote cache
Pre-traced repositories for popular Lean 4 projects are hosted at:.tar.gz archive for the repo exists at the remote URL and downloads it automatically. To disable remote cache downloads and force a local build, set:
| Environment variable | Effect |
|---|---|
CACHE_DIR | Override local cache root directory (default: ~/.cache/lean_dojo). |
DISABLE_REMOTE_CACHE | Set to any value to skip remote cache checks and always build locally. |
TMP_DIR | Override the temporary directory used during tracing. |
NUM_PROCS | Number of parallel processes for ExtractData.lean (default: CPU count, capped at 32). |