Documentation Index
Fetch the complete documentation index at: https://mintlify.com/lean-dojo/LeanDojo-v2/llms.txt
Use this file to discover all available pages before exploring further.
DynamicDatabase is the persistent backbone of the LeanDojo v2 pipeline. It records every traced Lean repository as a structured Repository object, categorises every theorem by proof status (proven, sorry-proved, sorry-unproved), stores premise file metadata, tracks difficulty ratings for curriculum learning, and exports merge-ready JSONL datasets for trainer consumption. The database is backed by a single JSON file that can be committed to disk between runs, shared across machines, and incrementally updated without re-tracing anything.
Initialisation
DynamicDatabase.__init__ accepts one argument: the path to the backing JSON file. If the directory does not exist it is created automatically. A fresh empty JSON file is written on construction. Pass a path to an already-existing file and then call from_json() to reload a saved state:
Core Operations
Tracing a Repository
trace_repository() is the all-in-one entry point that clones, traces, converts, and packages a repository into a Repository object ready to be added to the database.
- Constructs a
LeanGitRepoand callsgenerate_benchmark()fromlean_dojo_v2.lean_dojo. - Checks that the repository contains at least
3 × BATCH_SIZEtheorems; repos below this threshold are skipped with a log warning andNoneis returned. - Reads the
lean-toolchainto record the Lean version. - Parses the exported
random/dataset folder,corpus.jsonl, andtraced_files.jsonlinto theRepositorydataclass. - Writes the updated database to
json_path.
Set
build_deps=True when the repository will be used with LeanAgent or RetrievalProver. This also traces all dependency packages so that premise retrieval has access to theorems defined in imported libraries.Managing Repositories
Exporting Merged Datasets
export_merged_data() combines theorems and premises from a list of Repository objects into a single dataset directory.
| Strategy | Description |
|---|---|
random | Shuffled 60/20/20 split. Used by SFTTrainer and GRPOTrainer. |
novel_premises | Theorems using the least-frequently-seen premises go to val/test. Measures out-of-distribution generalisation. |
file_path, full_name, and source positions), only the most recently processed version is kept.
Full Workflow Example
The Repository Model
A Repository object records all metadata and theorem lists for a single repo-commit pair.
| Attribute | Contents |
|---|---|
proven_theorems | Theorems with a complete tactic proof (no sorry). |
sorry_theorems_proved | Originally sorry; now proved by LeanAgent. |
sorry_theorems_unproved | Contain sorry and not yet proved. |
change_sorry_to_proven(theorem, log_file) to move it from sorry_theorems_unproved to sorry_theorems_proved and append a timestamped log entry.
The Theorem Model
Each theorem carries the metadata needed for training and deduplication:
traced_tactics is a list of AnnotatedTactic objects, each holding tactic, state_before, state_after, and a list of premise Annotation objects (full name, def path, def position).
Persistence
Discovering Repositories Automatically
discover_repositories() searches GitHub for Lean 4 repositories, filters out roughly 200+ known-unsuitable repos (benchmarks, tutorial-only repos, repos with trace problems), finds compatible commits, traces each one, and optionally sorts by difficulty.
KNOWN_REPOSITORIES exclusion list in lean_dojo_v2/utils/constants.py contains roughly 200+ entries, including leanprover-community/mathlib4 (ReProver’s training set), tutorial repos with excessive sorries, and repos that fail to trace due to Windows-style line endings or local dependencies.