Tracing is the first stage of the LeanDojo v2 pipeline. It instruments a Lean 4 repository with a custom Lean module (Documentation Index
Fetch the complete documentation index at: https://mintlify.com/lean-dojo/LeanDojo-v2/llms.txt
Use this file to discover all available pages before exploring further.
ExtractData.lean), compiles the project with lake build, then re-runs the extractor to capture structured JSON artifacts for every compiled file. The result is a fully typed Python object tree — TracedRepo → TracedFile → TracedTheorem → TracedTactic — that subsequent pipeline stages consume for dataset generation and theorem proving.
A repo only needs to be traced once. LeanDojo v2 caches the result locally at
~/.cache/lean_dojo (configurable via the CACHE_DIR environment variable) and also checks a remote cache at https://dl.fbaipublicfiles.com/lean-dojo before starting a local build. Expect tracing to take roughly 1.5× the time it takes to compile the repo with lake build.Describing a Repository: LeanGitRepo
LeanGitRepo is a frozen dataclass that fully specifies a repository and commit to trace.
| Field | Type | Description |
|---|---|---|
url | str | GitHub URL (https://), SSH URL (git@github.com:), or local path. SSH is converted to HTTPS automatically. |
commit | str | A 40-character commit SHA or a tag/branch name. Tags are resolved to SHAs on construction. |
repo_type | RepoType | Auto-detected: GITHUB, REMOTE, or LOCAL. |
lean_version | str | Parsed from the repo’s lean-toolchain file on construction. |
RepoType enum values:
RepoType.GITHUB— agithub.comHTTPS URL (most common).RepoType.REMOTE— any other reachable Git URL.RepoType.LOCAL— a local path that is a valid Git repository.
The trace() Function
trace() is the main public entry point in lean_dojo_v2.lean_dojo.data_extraction.trace.
| Parameter | Default | Description |
|---|---|---|
repo | — | The LeanGitRepo to trace. |
dst_dir | None | If provided, a copy of the traced directory is written here. Must not already exist. When None the result lives only in the cache. |
build_deps | False | When True, also traces all dependency packages (Lean stdlib included). Required for LeanAgent and RetrievalProver because retrieval needs premises from imported libraries. |
What Happens Internally
Cache check
get_traced_repo_path(repo, build_deps) checks the local cache at ~/.cache/lean_dojo. If a matching entry exists (keyed by a directory name like durant42040-lean4-example-3e23ab0b...), the step is skipped and the cached path is returned immediately.Clone and build
repo.clone_and_checkout() clones the repo into a temporary working directory and checks out the exact commit. lake build is then run; if build_deps=False, lake exe cache get is attempted first to reuse precompiled .olean files.Copy Lean stdlib
The output of
lean --print-prefix locates the active Lean toolchain. The stdlib is copied into the packages directory so ExtractData.lean can resolve all imports.The packages path depends on the Lean version:- Lean ≥ 4.3.0-rc2 →
.lake/packages/ - Lean < 4.3.0-rc2 →
lake-packages/
is_new_version(v).Run ExtractData.lean
ExtractData.lean is copied into the working directory and executed via:noDeps flag is passed when build_deps=False. This step populates build/ir/**/*.ast.json and build/ir/**/*.dep_paths for every compiled Lean file.Loading from Cache
Two class-methods onTracedRepo cover the two loading paths:
Key Data Structures
TracedRepo
Holds the list of TracedFile objects and the inter-file dependency DiGraph (when build_deps=True).
TracedFile
Wraps a single .lean source file together with its parsed AST (FileNode) and all extracted comments. Provides get_traced_theorems() and get_premise_definitions().
TracedTheorem
Represents one theorem or lemma definition. Useful properties:
TracedTactic
One tactic step annotated with proof states and premise provenance:
Version Compatibility
is_new_version(v) determines whether the active Lean version is at least 4.3.0-rc2. This affects the path layout inside the traced directory:
4.3.0-rc2 to 4.8.0-rc1. Tracing a repo that requires a version outside this range will emit a warning but may still succeed if the Lean toolchain is installed locally.