LeanDojo v2 can trace any Lean 4 repository — hosted on GitHub or stored locally — and extract a rich dataset of theorem proof states and tactics. The tracing pipeline instruments Lean’s compilation process to capture every tactic invocation, its goal state before and after execution, and the premises it references. The resulting data flows intoDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/lean-dojo/LeanDojo-v2/llms.txt
Use this file to discover all available pages before exploring further.
DynamicDatabase, which manages repository metadata, curriculum ordering, and deduplication across multiple tracing runs.
Overview
Trace GitHub Repo
Point at a GitHub URL and commit hash to download, build, and trace a remote repository.
Trace Local Repo
Trace a Lean project you already have checked out on disk without any network calls.
Discover GitHub Repos
Let LeanDojo search GitHub for popular Lean 4 repos and queue them for tracing automatically.
Workflow 1 — Trace a GitHub Repository
UseDynamicDatabase.trace_repository with a GitHub URL and a specific commit SHA:
Workflow 2 — Trace a Local Repository
If you already have a Lean 4 project on disk, pass its filesystem path instead of a URL:For local repos,
commit must still be a valid git commit SHA that exists in the repository’s history. LeanDojo uses the commit hash to version-stamp the output directory, ensuring that re-tracing a later commit does not overwrite earlier data.Workflow 3 — Discover GitHub Repositories
DynamicDatabase.discover_repositories searches GitHub for popular Lean 4 repositories, filters out known-problematic ones, and queues them for tracing. When curriculum_learning=True, repositories are sorted by ascending difficulty before tracing so that the easiest repos are processed first:
raid/data/repo_info_compatible.json and excludes repos in the built-in KNOWN_REPOSITORIES blocklist (repos with trace problems, no theorems, or irrelevant content).
The build_deps Parameter
The build_deps flag controls whether LeanDojo also traces all of the repository’s transitive Lean dependencies:
build_deps=False (default)
Only the target repository is traced. Dependencies (e.g.,
Mathlib) are loaded from cache or compiled but not instrumented. Tracing is faster and the output dataset covers only the target repo’s theorems.build_deps=True
All dependency packages under
.lake/packages/ are also traced. This is required for LeanAgent, whose retrieval-augmented prover needs premise embeddings from the full dependency tree to generate accurate tactic suggestions.Dataset Format
After tracing, the data underraid/data/<repo_name>_<commit>/ follows this layout:
random (theorems split randomly) and novel_premises (val/test theorems use premises not seen in training). Each strategy produces its own train.json, val.json, and test.json.
Tactic split files (train.json, val.json, test.json)
Each file contains a list of theorem objects. Every theorem object carries the repository URL, commit, file location, and the full proof state sequence for each tactic that was executed:
| Field | Description |
|---|---|
url | GitHub URL of the traced repository. |
commit | Commit SHA that was traced. |
file_path | Source file relative to the repo root (or .lake/packages/<dep>/... for dependency theorems). |
full_name | Fully qualified Lean name of the theorem. |
theorem_statement | The theorem’s statement string, if available. |
start | [line, column] of the theorem’s first character in the source file. |
end | [line, column] of the theorem’s last character in the source file. |
traced_tactics | Ordered list of tactic steps with before/after proof states. |
tactic | The Lean tactic string as written in the source. |
annotated_tactic | The tactic paired with a list of annotated premise references. |
state_before | The proof state (hypotheses + goal) before this tactic ran. |
state_after | The proof state after this tactic ran, or "goals accomplished ✓" for closing tactics. |
Tactics where
state_before == "no goals" and tactics containing · (focus dot) are excluded from traced_tactics during export to avoid noisy training examples.Premise corpus (corpus.jsonl)
A JSONL file containing one entry per traced file, with all premise definitions exported in topological import order. Used by LeanAgent’s retriever:
Metadata (metadata.json)
Repository metadata including the URL, commit, LeanDojo version, and tracing timestamp.
Sorry Filtering
SFTDataset and GRPODataset automatically skip any tactic record where tactic == "sorry" during training. Sorrys are kept in the raw traced data so that the database can identify which theorems still need proofs, but they never appear in the fine-tuning batches.Tracing Performance Tips
Estimating tracing time
Estimating tracing time
A good rule of thumb from
trace.py: tracing takes approximately 1.5× the time required to compile the repository with lake build. For lean4-example this is a few seconds; for mathlib4 it can be several hours.Traces are cached in CACHE_DIR (default ~/.cache/lean_dojo) after the first run. Subsequent imports of the same (url, commit) pair are nearly instantaneous. A pre-traced cache for many popular repositories is also available at https://dl.fbaipublicfiles.com/lean-dojo and is downloaded automatically unless DISABLE_REMOTE_CACHE is set.Parallelism
Parallelism
The number of parallel processes used during tracing is controlled by
NUM_PROCS (default: min(cpu_count, 32)). Override it via the environment variable:Memory and CPU limits for tactic execution
Memory and CPU limits for tactic execution
When interacting with Lean during tracing, each tactic execution is bounded by:
TACTIC_CPU_LIMIT— number of CPUs (default1)TACTIC_MEMORY_LIMIT— memory cap in gigabytes, e.g."32g"(default)