Generating Training Data from Lean 4 Repositories

LeanDojo v2 can trace any Lean 4 repository — hosted on GitHub or stored locally — and extract a rich dataset of theorem proof states and tactics. The tracing pipeline instruments Lean’s compilation process to capture every tactic invocation, its goal state before and after execution, and the premises it references. The resulting data flows into DynamicDatabase, which manages repository metadata, curriculum ordering, and deduplication across multiple tracing runs.

Overview

Trace GitHub Repo

Point at a GitHub URL and commit hash to download, build, and trace a remote repository.

Trace Local Repo

Trace a Lean project you already have checked out on disk without any network calls.

Discover GitHub Repos

Let LeanDojo search GitHub for popular Lean 4 repos and queue them for tracing automatically.

The raid/ directory is LeanDojo’s working root for datasets, checkpoints, and cached traces. It can grow to tens of gigabytes when tracing large repositories like mathlib4. Point it to a high-throughput storage volume or create a symlink before starting:

ln -s /mnt/fast-storage/lean-raid ./raid

Workflow 1 — Trace a GitHub Repository

Use DynamicDatabase.trace_repository with a GitHub URL and a specific commit SHA:

from lean_dojo_v2.database import DynamicDatabase

database = DynamicDatabase()

database.trace_repository(
    url="https://github.com/durant42040/lean4-example",
    commit="3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2",
    build_deps=False,
)

Or run the bundled example script directly:

python examples/data_generation/trace_github.py

"""
Example script for generating dataset from a Lean 4 GitHub repository.
The data is saved at <RAID_DIR>/<DATA_DIR>/<repo.name>_<repo.commit>.
e.g. LeanDojo-v2/raid/data/lean4-example_3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2
"""

from lean_dojo_v2.database import DynamicDatabase


def main() -> None:
    url = "https://github.com/durant42040/lean4-example"
    commit = "3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2"

    database = DynamicDatabase()

    database.trace_repository(
        url=url,
        commit=commit,
        build_deps=False,
    )


if __name__ == "__main__":
    main()

The traced artifacts are saved under:

raid/data/lean4-example_3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2/

Workflow 2 — Trace a Local Repository

If you already have a Lean 4 project on disk, pass its filesystem path instead of a URL:

from lean_dojo_v2.database import DynamicDatabase

database = DynamicDatabase()

database.trace_repository(
    url="path/to/lean4-example",
    commit="3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2",
    build_deps=False,
)

"""
Example script for generating dataset from a local Lean 4 repository.
The data is saved at <RAID_DIR>/<DATA_DIR>/<repo.name>_<repo.commit>.
"""

from lean_dojo_v2.database import DynamicDatabase


def main() -> None:
    path = "path/to/lean4-example"
    commit = "3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2"

    database = DynamicDatabase()

    database.trace_repository(
        url=path,
        commit=commit,
        build_deps=False,
    )


if __name__ == "__main__":
    main()

For local repos, commit must still be a valid git commit SHA that exists in the repository’s history. LeanDojo uses the commit hash to version-stamp the output directory, ensuring that re-tracing a later commit does not overwrite earlier data.

Workflow 3 — Discover GitHub Repositories

DynamicDatabase.discover_repositories searches GitHub for popular Lean 4 repositories, filters out known-problematic ones, and queues them for tracing. When curriculum_learning=True, repositories are sorted by ascending difficulty before tracing so that the easiest repos are processed first:

from lean_dojo_v2.database import DynamicDatabase

database = DynamicDatabase()

database.discover_repositories(
    num_repos=5,
    curriculum_learning=True,
    build_deps=False,
)

"""
Example script for discovering popular Lean 4 repositories on GitHub.
The repositories are saved at <RAID_DIR>/<DATA_DIR>/repo_info_compatible.json
"""

from lean_dojo_v2.database import DynamicDatabase


def main() -> None:
    database = DynamicDatabase()

    database.discover_repositories(
        num_repos=5,
        curriculum_learning=True,
        build_deps=False,
    )


if __name__ == "__main__":
    main()

The discovered repository list is saved to raid/data/repo_info_compatible.json and excludes repos in the built-in KNOWN_REPOSITORIES blocklist (repos with trace problems, no theorems, or irrelevant content).

The `build_deps` Parameter

The build_deps flag controls whether LeanDojo also traces all of the repository’s transitive Lean dependencies:

build_deps=False (default)

Only the target repository is traced. Dependencies (e.g., Mathlib) are loaded from cache or compiled but not instrumented. Tracing is faster and the output dataset covers only the target repo’s theorems.

build_deps=True

All dependency packages under .lake/packages/ are also traced. This is required for LeanAgent, whose retrieval-augmented prover needs premise embeddings from the full dependency tree to generate accurate tactic suggestions.

# For HFAgent / ExternalAgent — fast, no dependency tracing
database.trace_repository(url=url, commit=commit, build_deps=False)

# For LeanAgent — full premise coverage required
database.trace_repository(url=url, commit=commit, build_deps=True)

Setting build_deps=True on a repository that depends on mathlib4 can take several hours and requires substantial disk space, as the entire Mathlib corpus is traced and indexed.

Dataset Format

After tracing, the data under raid/data/<repo_name>_<commit>/ follows this layout:

raid/data/lean4-example_3e23ab0.../
├── random/
│   ├── train.json
│   ├── val.json
│   └── test.json
├── novel_premises/
│   ├── train.json
│   ├── val.json
│   └── test.json
├── corpus.jsonl
└── metadata.json

The dataset is split into two strategies: random (theorems split randomly) and novel_premises (val/test theorems use premises not seen in training). Each strategy produces its own train.json, val.json, and test.json.

Tactic split files (`train.json`, `val.json`, `test.json`)

Each file contains a list of theorem objects. Every theorem object carries the repository URL, commit, file location, and the full proof state sequence for each tactic that was executed:

{
  "url": "https://github.com/durant42040/lean4-example",
  "commit": "3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2",
  "file_path": "MyTheorem/Basic.lean",
  "full_name": "MyTheorem.add_comm",
  "theorem_statement": "theorem MyTheorem.add_comm (a b : ℕ) : a + b = b + a := by",
  "start": [10, 0],
  "end": [12, 0],
  "traced_tactics": [
    {
      "tactic": "ring",
      "annotated_tactic": ["ring", []],
      "state_before": "a b : ℕ\n⊢ a + b = b + a",
      "state_after": "goals accomplished ✓"
    }
  ]
}

Field	Description
`url`	GitHub URL of the traced repository.
`commit`	Commit SHA that was traced.
`file_path`	Source file relative to the repo root (or `.lake/packages/<dep>/...` for dependency theorems).
`full_name`	Fully qualified Lean name of the theorem.
`theorem_statement`	The theorem’s statement string, if available.
`start`	`[line, column]` of the theorem’s first character in the source file.
`end`	`[line, column]` of the theorem’s last character in the source file.
`traced_tactics`	Ordered list of tactic steps with before/after proof states.
`tactic`	The Lean tactic string as written in the source.
`annotated_tactic`	The tactic paired with a list of annotated premise references.
`state_before`	The proof state (hypotheses + goal) before this tactic ran.
`state_after`	The proof state after this tactic ran, or `"goals accomplished ✓"` for closing tactics.

Tactics where state_before == "no goals" and tactics containing · (focus dot) are excluded from traced_tactics during export to avoid noisy training examples.

Premise corpus (`corpus.jsonl`)

A JSONL file containing one entry per traced file, with all premise definitions exported in topological import order. Used by LeanAgent’s retriever:

{"path": "MyTheorem/Basic.lean", "imports": ["MyTheorem.Defs"], "premises": [...]}

Metadata (`metadata.json`)

Repository metadata including the URL, commit, LeanDojo version, and tracing timestamp.

Sorry Filtering

SFTDataset and GRPODataset automatically skip any tactic record where tactic == "sorry" during training. Sorrys are kept in the raw traced data so that the database can identify which theorems still need proofs, but they never appear in the fine-tuning batches.

Tracing Performance Tips

Estimating tracing time

A good rule of thumb from trace.py: tracing takes approximately 1.5× the time required to compile the repository with lake build. For lean4-example this is a few seconds; for mathlib4 it can be several hours.Traces are cached in CACHE_DIR (default ~/.cache/lean_dojo) after the first run. Subsequent imports of the same (url, commit) pair are nearly instantaneous. A pre-traced cache for many popular repositories is also available at https://dl.fbaipublicfiles.com/lean-dojo and is downloaded automatically unless DISABLE_REMOTE_CACHE is set.

Parallelism

The number of parallel processes used during tracing is controlled by NUM_PROCS (default: min(cpu_count, 32)). Override it via the environment variable:

export NUM_PROCS=16

Memory and CPU limits for tactic execution

When interacting with Lean during tracing, each tactic execution is bounded by:

TACTIC_CPU_LIMIT — number of CPUs (default 1)
TACTIC_MEMORY_LIMIT — memory cap in gigabytes, e.g. "32g" (default)

Increase these if you see OOM errors on large tactic steps.

Get Started

Concepts

Guides

Reference

Generating Training Data from Lean 4 Repositories

Overview

Trace GitHub Repo

Trace Local Repo

Discover GitHub Repos

Workflow 1 — Trace a GitHub Repository

Workflow 2 — Trace a Local Repository

Workflow 3 — Discover GitHub Repositories

The `build_deps` Parameter

build_deps=False (default)

build_deps=True

Dataset Format

Tactic split files (`train.json`, `val.json`, `test.json`)

Premise corpus (`corpus.jsonl`)

Metadata (`metadata.json`)

Sorry Filtering

Tracing Performance Tips

Build docs developers (and LLMs) love

Get Started

Concepts

Guides

Reference

Documentation Index

​Overview

Trace GitHub Repo

Trace Local Repo

Discover GitHub Repos

​Workflow 1 — Trace a GitHub Repository

​Workflow 2 — Trace a Local Repository

​Workflow 3 — Discover GitHub Repositories

​The build_deps Parameter

build_deps=False (default)

build_deps=True

​Dataset Format

​Tactic split files (train.json, val.json, test.json)

​Premise corpus (corpus.jsonl)

​Metadata (metadata.json)

​Sorry Filtering

​Tracing Performance Tips

Build docs developers (and LLMs) love

Overview

Workflow 1 — Trace a GitHub Repository

Workflow 2 — Trace a Local Repository

Workflow 3 — Discover GitHub Repositories

The `build_deps` Parameter

Dataset Format

Tactic split files (`train.json`, `val.json`, `test.json`)

Premise corpus (`corpus.jsonl`)

Metadata (`metadata.json`)

Sorry Filtering

Tracing Performance Tips