Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/lean-dojo/LeanDojo-v2/llms.txt

Use this file to discover all available pages before exploring further.

DynamicDatabase is the persistent backbone of the LeanDojo v2 pipeline. It records every traced Lean repository as a structured Repository object, categorises every theorem by proof status (proven, sorry-proved, sorry-unproved), stores premise file metadata, tracks difficulty ratings for curriculum learning, and exports merge-ready JSONL datasets for trainer consumption. The database is backed by a single JSON file that can be committed to disk between runs, shared across machines, and incrementally updated without re-tracing anything.

Initialisation

from lean_dojo_v2.database import DynamicDatabase

# Create a new database (or load an existing one by path)
db = DynamicDatabase(json_path="dynamic_database.json")
DynamicDatabase.__init__ accepts one argument: the path to the backing JSON file. If the directory does not exist it is created automatically. A fresh empty JSON file is written on construction. Pass a path to an already-existing file and then call from_json() to reload a saved state:
db = DynamicDatabase.from_json("dynamic_database.json")

Core Operations

Tracing a Repository

trace_repository() is the all-in-one entry point that clones, traces, converts, and packages a repository into a Repository object ready to be added to the database.
repo = db.trace_repository(
    url="https://github.com/durant42040/lean4-example",
    commit="3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2",
    build_deps=False,
)
Internally it:
  1. Constructs a LeanGitRepo and calls generate_benchmark() from lean_dojo_v2.lean_dojo.
  2. Checks that the repository contains at least 3 × BATCH_SIZE theorems; repos below this threshold are skipped with a log warning and None is returned.
  3. Reads the lean-toolchain to record the Lean version.
  4. Parses the exported random/ dataset folder, corpus.jsonl, and traced_files.jsonl into the Repository dataclass.
  5. Writes the updated database to json_path.
Set build_deps=True when the repository will be used with LeanAgent or RetrievalProver. This also traces all dependency packages so that premise retrieval has access to theorems defined in imported libraries.

Managing Repositories

# Add a traced Repository to the in-memory list
db.add_repository(repo)

# Look up a repository by URL and commit SHA
existing = db.get_repository(
    url="https://github.com/durant42040/lean4-example",
    commit="3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2",
)

# Replace a repository entry (matched by url + name + commit + lean_version + lean_dojo_version)
db.update_repository(updated_repo)

# Remove a repository entry
db.delete_repository(
    url="https://github.com/durant42040/lean4-example",
    commit="3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2",
)

Exporting Merged Datasets

export_merged_data() combines theorems and premises from a list of Repository objects into a single dataset directory.
db.export_merged_data(
    repositories=[repo_a, repo_b],
    output_path=Path("raid/data/merged"),
)
The call produces the following layout:
raid/data/merged/
├── random/
│   ├── train.json        # 60 % of theorems (shuffled)
│   ├── val.json          # 20 %
│   └── test.json         # 20 %
├── novel_premises/
│   ├── train.json        # split by least-frequent premise usage
│   ├── val.json
│   └── test.json
├── corpus.jsonl          # deduplicated premise files (keyed by path)
├── traced_files.jsonl    # list of traced file paths
└── metadata.json         # aggregate counts and per-repo metadata
Split strategies:
StrategyDescription
randomShuffled 60/20/20 split. Used by SFTTrainer and GRPOTrainer.
novel_premisesTheorems using the least-frequently-seen premises go to val/test. Measures out-of-distribution generalisation.
Deduplication: when multiple repositories define the same theorem (same file_path, full_name, and source positions), only the most recently processed version is kept.

Full Workflow Example

from pathlib import Path
from lean_dojo_v2.database import DynamicDatabase

db = DynamicDatabase()

repo = db.trace_repository(
    url="https://github.com/durant42040/lean4-example",
    commit="3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2",
    build_deps=False,
)

if repo:
    db.add_repository(repo)
    db.export_merged_data([repo], output_path=Path("raid/data/merged"))

The Repository Model

A Repository object records all metadata and theorem lists for a single repo-commit pair.
print(repo.url)               # "https://github.com/durant42040/lean4-example"
print(repo.commit)            # "3e23ab0b..."
print(repo.lean_version)      # "leanprover/lean4:v4.8.0"
print(repo.total_theorems)    # proven + sorry_proved + sorry_unproved
print(repo.num_sorry_theorems)
print(repo.num_premises)
Theorem lists are split by proof status:
AttributeContents
proven_theoremsTheorems with a complete tactic proof (no sorry).
sorry_theorems_provedOriginally sorry; now proved by LeanAgent.
sorry_theorems_unprovedContain sorry and not yet proved.
When LeanAgent succeeds on a sorry theorem, call change_sorry_to_proven(theorem, log_file) to move it from sorry_theorems_unproved to sorry_theorems_proved and append a timestamped log entry.

The Theorem Model

Each theorem carries the metadata needed for training and deduplication:
thm = repo.proven_theorems[0]
print(thm.full_name)          # "MyNS.myLemma"
print(thm.theorem_statement)  # the statement string
print(thm.file_path)          # Path("MyLib/Basic.lean")
print(thm.start)              # Pos(line_nb=12, column_nb=1)
print(thm.difficulty_rating)  # float or None (set by sort_repositories_by_difficulty)
traced_tactics is a list of AnnotatedTactic objects, each holding tactic, state_before, state_after, and a list of premise Annotation objects (full name, def path, def position).

Persistence

# Write current state to disk
db.to_json()                          # writes to db.json_path
db.to_json("backup_database.json")    # write to an explicit path

# Load from disk into a new instance
db2 = DynamicDatabase.from_json("dynamic_database.json")

# Merge updates from self into an existing file without overwriting other entries
db.update_json("dynamic_database.json")

Discovering Repositories Automatically

discover_repositories() searches GitHub for Lean 4 repositories, filters out roughly 200+ known-unsuitable repos (benchmarks, tutorial-only repos, repos with trace problems), finds compatible commits, traces each one, and optionally sorts by difficulty.
lean_git_repos = db.discover_repositories(
    num_repos=20,
    curriculum_learning=True,
    build_deps=False,
)
The KNOWN_REPOSITORIES exclusion list in lean_dojo_v2/utils/constants.py contains roughly 200+ entries, including leanprover-community/mathlib4 (ReProver’s training set), tutorial repos with excessive sorries, and repos that fail to trace due to Windows-style line endings or local dependencies.
discover_repositories() calls the GitHub Search API, which has strict secondary rate limits. Always ensure GITHUB_ACCESS_TOKEN is set with repo and read:org scopes before running automated discovery.

Build docs developers (and LLMs) love