Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/lean-dojo/LeanDojo-v2/llms.txt

Use this file to discover all available pages before exploring further.

Tracing is the first stage of the LeanDojo v2 pipeline. It instruments a Lean 4 repository with a custom Lean module (ExtractData.lean), compiles the project with lake build, then re-runs the extractor to capture structured JSON artifacts for every compiled file. The result is a fully typed Python object tree — TracedRepoTracedFileTracedTheoremTracedTactic — that subsequent pipeline stages consume for dataset generation and theorem proving.
A repo only needs to be traced once. LeanDojo v2 caches the result locally at ~/.cache/lean_dojo (configurable via the CACHE_DIR environment variable) and also checks a remote cache at https://dl.fbaipublicfiles.com/lean-dojo before starting a local build. Expect tracing to take roughly 1.5× the time it takes to compile the repo with lake build.

Describing a Repository: LeanGitRepo

LeanGitRepo is a frozen dataclass that fully specifies a repository and commit to trace.
from lean_dojo_v2.lean_dojo.data_extraction.lean import LeanGitRepo

repo = LeanGitRepo(
    url="https://github.com/durant42040/lean4-example",
    commit="3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2",
)
Fields:
FieldTypeDescription
urlstrGitHub URL (https://), SSH URL (git@github.com:), or local path. SSH is converted to HTTPS automatically.
commitstrA 40-character commit SHA or a tag/branch name. Tags are resolved to SHAs on construction.
repo_typeRepoTypeAuto-detected: GITHUB, REMOTE, or LOCAL.
lean_versionstrParsed from the repo’s lean-toolchain file on construction.
RepoType enum values:
  • RepoType.GITHUB — a github.com HTTPS URL (most common).
  • RepoType.REMOTE — any other reachable Git URL.
  • RepoType.LOCAL — a local path that is a valid Git repository.
For local repos, use the convenience constructor:
from lean_dojo_v2.lean_dojo.data_extraction.lean import LeanGitRepo

# Reads HEAD commit automatically
repo = LeanGitRepo.from_path("/path/to/my-lean-project")
GitHub repos require GITHUB_ACCESS_TOKEN to be set in your environment. The tracer uses the GitHub API to resolve tags, read lean-toolchain, and check lake-manifest.json. Without a token you will hit the unauthenticated rate limit (60 requests/hour) and tracing will fail.
export GITHUB_ACCESS_TOKEN=ghp_...

The trace() Function

trace() is the main public entry point in lean_dojo_v2.lean_dojo.data_extraction.trace.
from lean_dojo_v2.lean_dojo.data_extraction.lean import LeanGitRepo
from lean_dojo_v2.lean_dojo.data_extraction.trace import trace

repo = LeanGitRepo(
    url="https://github.com/durant42040/lean4-example",
    commit="3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2",
)
traced = trace(repo, dst_dir="my_traced_repo", build_deps=False)
Signature:
def trace(
    repo: LeanGitRepo,
    dst_dir: Optional[Union[str, Path]] = None,
    build_deps: bool = False,
) -> TracedRepo:
    ...
Parameters:
ParameterDefaultDescription
repoThe LeanGitRepo to trace.
dst_dirNoneIf provided, a copy of the traced directory is written here. Must not already exist. When None the result lives only in the cache.
build_depsFalseWhen True, also traces all dependency packages (Lean stdlib included). Required for LeanAgent and RetrievalProver because retrieval needs premises from imported libraries.

What Happens Internally

1

Cache check

get_traced_repo_path(repo, build_deps) checks the local cache at ~/.cache/lean_dojo. If a matching entry exists (keyed by a directory name like durant42040-lean4-example-3e23ab0b...), the step is skipped and the cached path is returned immediately.
2

Clone and build

repo.clone_and_checkout() clones the repo into a temporary working directory and checks out the exact commit. lake build is then run; if build_deps=False, lake exe cache get is attempted first to reuse precompiled .olean files.
3

Copy Lean stdlib

The output of lean --print-prefix locates the active Lean toolchain. The stdlib is copied into the packages directory so ExtractData.lean can resolve all imports.The packages path depends on the Lean version:
  • Lean ≥ 4.3.0-rc2 → .lake/packages/
  • Lean < 4.3.0-rc2 → lake-packages/
This version check is performed by is_new_version(v).
4

Run ExtractData.lean

ExtractData.lean is copied into the working directory and executed via:
lake env lean --threads <N> --run ExtractData.lean [noDeps]
The noDeps flag is passed when build_deps=False. This step populates build/ir/**/*.ast.json and build/ir/**/*.dep_paths for every compiled Lean file.
5

Build TracedRepo and cache

TracedRepo.from_traced_files(src_dir, build_deps) parses all .ast.json files into Python objects. The result is serialised to *.trace.xml files via save_to_disk(), and the whole directory is stored in the cache.

Loading from Cache

Two class-methods on TracedRepo cover the two loading paths:
from lean_dojo_v2.lean_dojo.data_extraction.traced_data import TracedRepo

# Parse raw *.ast.json files produced by ExtractData.lean
traced = TracedRepo.from_traced_files(root_dir="path/to/repo", build_deps=False)

# Load from already-serialised *.trace.xml files (faster)
traced = TracedRepo.load_from_disk(root_dir="path/to/cached-repo", build_deps=False)
You can also check or obtain the cache path directly:
from lean_dojo_v2.lean_dojo.data_extraction.trace import (
    is_available_in_cache,
    get_traced_repo_path,
)

if is_available_in_cache(repo):
    path = get_traced_repo_path(repo, build_deps=False)
    print(f"Cached at: {path}")

Key Data Structures

TracedRepo

Holds the list of TracedFile objects and the inter-file dependency DiGraph (when build_deps=True).
theorems = traced.get_traced_theorems()   # List[TracedTheorem]

TracedFile

Wraps a single .lean source file together with its parsed AST (FileNode) and all extracted comments. Provides get_traced_theorems() and get_premise_definitions().

TracedTheorem

Represents one theorem or lemma definition. Useful properties:
t = theorems[0]
print(t.theorem.full_name)          # "MyNamespace.myLemma"
print(t.get_theorem_statement())    # the statement string (comments stripped)
print(t.get_tactic_proof())         # full tactic block as a string
tactics = t.get_traced_tactics()    # List[TracedTactic]

TracedTactic

One tactic step annotated with proof states and premise provenance:
tac = tactics[0]
print(tac.tactic)         # "simp [Nat.add_comm]"
print(tac.state_before)   # "⊢ n + m = m + n"
print(tac.state_after)    # "no goals"

annotated, provenances = tac.get_annotated_tactic()
# annotated → "simp [<a>Nat.add_comm</a>]"
# provenances → [{"full_name": "Nat.add_comm", "def_path": ..., ...}]

Version Compatibility

is_new_version(v) determines whether the active Lean version is at least 4.3.0-rc2. This affects the path layout inside the traced directory:
from lean_dojo_v2.lean_dojo.data_extraction.trace import is_new_version

is_new_version("4.3.0-rc2")   # True  → .lake/packages/
is_new_version("4.2.0")       # False → lake-packages/
is_new_version("4.8.0")       # True  → .lake/packages/
Supported versions range from 4.3.0-rc2 to 4.8.0-rc1. Tracing a repo that requires a version outside this range will emit a warning but may still succeed if the Lean toolchain is installed locally.
To disable the remote cache and force a local build (useful for offline environments or custom Lean forks), set the environment variable DISABLE_REMOTE_CACHE=1 before running.

Build docs developers (and LLMs) love