Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/lean-dojo/LeanDojo-v2/llms.txt

Use this file to discover all available pages before exploring further.

The lean_dojo_v2.lean_dojo.data_extraction module is the engine that turns a raw Lean 4 repository into a richly annotated dataset. It builds an abstract syntax tree for every .lean file, records tactic states before and after each step, resolves premise provenance, and stores everything in a persistent cache so the expensive build-and-extract pass only ever runs once per commit.
from lean_dojo_v2.lean_dojo.data_extraction.lean import LeanGitRepo
from lean_dojo_v2.lean_dojo.data_extraction.trace import (
    trace,
    get_traced_repo_path,
    is_available_in_cache,
)
from lean_dojo_v2.lean_dojo.data_extraction.traced_data import TracedRepo
Tracing takes approximately 1.5× the normal Lake compile time for a repository. Results are stored in CACHE_DIR (default ~/.cache/lean_dojo) and a remote cache at https://dl.fbaipublicfiles.com/lean-dojo. Subsequent calls for the same repo and commit return immediately from cache.

trace()

The primary entry-point for tracing a Lean 4 repository. Checks the local and remote caches first; only runs the full extraction pipeline on a cache miss.
from lean_dojo_v2.lean_dojo.data_extraction.lean import LeanGitRepo
from lean_dojo_v2.lean_dojo.data_extraction.trace import trace

repo = LeanGitRepo(
    url="https://github.com/durant42040/lean4-example",
    commit="3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2",
)
traced = trace(repo, dst_dir="traced_output", build_deps=False)
repo
LeanGitRepo
required
The repository to trace. Must pass repo.exists() — the commit must be accessible at the given URL.
dst_dir
Optional[Union[str, Path]]
default:"None"
Directory where the traced output is copied after caching. When None, the traced repo lives only inside CACHE_DIR and the returned TracedRepo points there. When provided, the directory is created and the cache contents are copied into dst_dir/<cache_dirname>.
build_deps
bool
default:"False"
When True, dependency packages declared in lakefile.lean / lakefile.toml are also traced. This is required for dependency-aware premise retrieval (e.g., LeanAgent). When False, lake exe cache get is attempted first to skip recompiling dependencies.
return
TracedRepo
A TracedRepo object loaded from cache (or from dst_dir if supplied). check_sanity() is called automatically before returning.
dst_dir must not already exist. If the path exists when trace() is called, it raises an AssertionError. Remove or rename the directory first.

get_traced_repo_path()

Returns the absolute cache path of a traced repository, running the full trace pipeline if the repo is not yet cached.
from lean_dojo_v2.lean_dojo.data_extraction.trace import get_traced_repo_path

path = get_traced_repo_path(repo, build_deps=False)
print(path)
# /home/user/.cache/lean_dojo/durant42040-lean4-example-3e23ab0.../lean4-example
repo
LeanGitRepo
required
The repository whose cache path to look up or compute.
build_deps
bool
default:"False"
If True, looks for (or creates) the <cache_dirname>/<name>_d variant that includes traced dependency packages. If a _d entry already exists in cache and build_deps=False, it is reused automatically.
return
Path
Absolute path to the cached traced repo directory, e.g. ~/.cache/lean_dojo/durant42040-lean4-example-3e23ab0.../lean4-example.

is_available_in_cache()

Lightweight check — does not trace anything. Queries both the local cache directory and the remote cache URL.
from lean_dojo_v2.lean_dojo.data_extraction.trace import is_available_in_cache

if is_available_in_cache(repo):
    print("Cache hit — tracing will be instant.")
else:
    print("Cache miss — expect a full trace run.")
repo
LeanGitRepo
required
The repository to check.
return
bool
True if a cached traced version exists locally or is available for download from the remote cache. False otherwise.

TracedRepo Class

TracedRepo is a frozen dataclass that represents the fully traced version of a LeanGitRepo. It holds all TracedFile objects produced by the extraction pipeline, an optional dependency graph between files, and a reference back to the original LeanGitRepo.
from lean_dojo_v2.lean_dojo.data_extraction.traced_data import TracedRepo

Core Fields

FieldTypeDescription
repoLeanGitRepoThe original repository that was traced.
dependenciesDict[str, LeanGitRepo]Maps dependency package names to their LeanGitRepo objects.
root_dirPathAbsolute path to the traced repo root on disk.
traced_filesList[TracedFile]All TracedFile objects parsed from *.ast.json or loaded from *.trace.xml.
traced_files_graphOptional[nx.DiGraph]Dependency DAG between files (X → Y means X imports Y). None when build_deps=False.

TracedRepo.from_traced_files()

Class method that builds a TracedRepo by parsing all *.ast.json and *.dep_paths files produced by ExtractData.lean in a traced directory. This is called internally by the tracing pipeline; you rarely need to call it directly.
root_dir
Union[str, Path]
required
Absolute path to the root of the traced repository (must be a Git repo).
build_deps
bool
default:"False"
Whether to construct the inter-file dependency graph.
return
TracedRepo
A fully populated TracedRepo. All TracedFile objects have their traced_repo back-reference set.

TracedRepo.load_from_disk()

Class method that rehydrates a TracedRepo from serialised *.trace.xml files previously written by save_to_disk(). Faster than re-parsing *.ast.json for large repositories.
root_dir
Union[str, Path]
required
Path to the directory containing *.trace.xml files.
build_deps
bool
default:"False"
Whether to rebuild the inter-file dependency graph from the loaded files.
return
TracedRepo
A TracedRepo populated from the XML files.

TracedRepo.save_to_disk()

Serialises every TracedFile in the repo to a *.trace.xml file alongside the corresponding .lean source. These XML files can be loaded later with load_from_disk() without rerunning ExtractData.lean.
traced_repo.save_to_disk()

TracedRepo.check_sanity()

Validates the internal consistency of the traced repository. Checks that:
  • All TracedFile objects have the correct root_dir.
  • Every .lean file expected in the dependency graph has a corresponding *.ast.json and *.dep_paths file.
  • No cycles exist in the import dependency graph.
Raises exceptions on failure. Called automatically by trace() before returning.
traced_repo.check_sanity()  # raises on inconsistency

TracedRepo.get_traced_theorems()

Returns every TracedTheorem across all traced files in the repository.
return
List[TracedTheorem]
Flattened list of all TracedTheorem objects. Theorems from the Lean 4 standard library are included when build_deps=True.

TracedRepo.get_traced_theorem()

Retrieves a single TracedTheorem by its Theorem key.
thm
Theorem
required
A Theorem dataclass (from lean_dojo_v2.lean_dojo.data_extraction.lean) identifying the theorem by repo, file_path, and full_name.
return
Optional[TracedTheorem]
The corresponding TracedTheorem, or None if not found.

Caching Layer

LeanDojo v2 maintains a two-level cache so that tracing a repository only ever runs once.

Local cache

The default cache directory is ~/.cache/lean_dojo. Override it by setting the CACHE_DIR environment variable:
export CACHE_DIR=/mnt/fast-disk/.lean_dojo_cache
Each traced repo is stored at:
$CACHE_DIR/<owner>-<repo>-<commit>/          # parent tarball
$CACHE_DIR/<owner>-<repo>-<commit>/<repo>/   # actual traced tree

Remote cache

Pre-traced repositories for popular Lean 4 projects are hosted at:
https://dl.fbaipublicfiles.com/lean-dojo
When a local cache miss occurs, LeanDojo checks whether a .tar.gz archive for the repo exists at the remote URL and downloads it automatically. To disable remote cache downloads and force a local build, set:
export DISABLE_REMOTE_CACHE=1
Environment variableEffect
CACHE_DIROverride local cache root directory (default: ~/.cache/lean_dojo).
DISABLE_REMOTE_CACHESet to any value to skip remote cache checks and always build locally.
TMP_DIROverride the temporary directory used during tracing.
NUM_PROCSNumber of parallel processes for ExtractData.lean (default: CPU count, capped at 32).

Complete Examples

Trace a GitHub repository

from lean_dojo_v2.lean_dojo.data_extraction.lean import LeanGitRepo
from lean_dojo_v2.lean_dojo.data_extraction.trace import trace

repo = LeanGitRepo(
    url="https://github.com/durant42040/lean4-example",
    commit="3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2",
)

# Trace and copy output to ./traced_output/<cache_dirname>/
traced = trace(repo, dst_dir="traced_output", build_deps=False)

print(f"Traced {len(traced.traced_files)} files")
print(f"Found {len(traced.get_traced_theorems())} theorems")

Trace a local repository

from lean_dojo_v2.lean_dojo.data_extraction.lean import LeanGitRepo
from lean_dojo_v2.lean_dojo.data_extraction.trace import trace

# Infer commit from local HEAD
local_repo = LeanGitRepo.from_path("path/to/lean4-example")

traced = trace(local_repo, build_deps=False)
for thm in traced.get_traced_theorems():
    print(thm.theorem.full_name, "—", len(thm.get_traced_tactics()), "tactics")

Check cache before tracing

from lean_dojo_v2.lean_dojo.data_extraction.lean import LeanGitRepo
from lean_dojo_v2.lean_dojo.data_extraction.trace import (
    is_available_in_cache,
    get_traced_repo_path,
)

repo = LeanGitRepo(
    url="https://github.com/durant42040/lean4-example",
    commit="3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2",
)

if is_available_in_cache(repo):
    path = get_traced_repo_path(repo)          # returns instantly from cache
    print(f"Loaded from cache: {path}")
else:
    print("Not cached — will trace (this may take a while).")
    path = get_traced_repo_path(repo)          # runs full trace pipeline

Load a previously saved TracedRepo

from lean_dojo_v2.lean_dojo.data_extraction.traced_data import TracedRepo

# Load from *.trace.xml files written by save_to_disk()
traced = TracedRepo.load_from_disk("/home/user/.cache/lean_dojo/.../lean4-example")
traced.check_sanity()

for thm in traced.get_traced_theorems():
    tactics = thm.get_traced_tactics()
    for tac in tactics:
        print(tac.state_before)
        print("  →", tac.tactic)
        print(tac.state_after)

Build docs developers (and LLMs) love