DynamicDatabase: Lifelong Theorem Repository Management

DynamicDatabase is the persistent backbone of the LeanDojo v2 pipeline. It records every traced Lean repository as a structured Repository object, categorises every theorem by proof status (proven, sorry-proved, sorry-unproved), stores premise file metadata, tracks difficulty ratings for curriculum learning, and exports merge-ready JSONL datasets for trainer consumption. The database is backed by a single JSON file that can be committed to disk between runs, shared across machines, and incrementally updated without re-tracing anything.

Initialisation

from lean_dojo_v2.database import DynamicDatabase

# Create a new database (or load an existing one by path)
db = DynamicDatabase(json_path="dynamic_database.json")

DynamicDatabase.__init__ accepts one argument: the path to the backing JSON file. If the directory does not exist it is created automatically. A fresh empty JSON file is written on construction. Pass a path to an already-existing file and then call from_json() to reload a saved state:

db = DynamicDatabase.from_json("dynamic_database.json")

Core Operations

Tracing a Repository

trace_repository() is the all-in-one entry point that clones, traces, converts, and packages a repository into a Repository object ready to be added to the database.

repo = db.trace_repository(
    url="https://github.com/durant42040/lean4-example",
    commit="3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2",
    build_deps=False,
)

Internally it:

Constructs a LeanGitRepo and calls generate_benchmark() from lean_dojo_v2.lean_dojo.
Checks that the repository contains at least 3 × BATCH_SIZE theorems; repos below this threshold are skipped with a log warning and None is returned.
Reads the lean-toolchain to record the Lean version.
Parses the exported random/ dataset folder, corpus.jsonl, and traced_files.jsonl into the Repository dataclass.
Writes the updated database to json_path.

Set build_deps=True when the repository will be used with LeanAgent or RetrievalProver. This also traces all dependency packages so that premise retrieval has access to theorems defined in imported libraries.

Managing Repositories

# Add a traced Repository to the in-memory list
db.add_repository(repo)

# Look up a repository by URL and commit SHA
existing = db.get_repository(
    url="https://github.com/durant42040/lean4-example",
    commit="3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2",
)

# Replace a repository entry (matched by url + name + commit + lean_version + lean_dojo_version)
db.update_repository(updated_repo)

# Remove a repository entry
db.delete_repository(
    url="https://github.com/durant42040/lean4-example",
    commit="3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2",
)

Exporting Merged Datasets

export_merged_data() combines theorems and premises from a list of Repository objects into a single dataset directory.

db.export_merged_data(
    repositories=[repo_a, repo_b],
    output_path=Path("raid/data/merged"),
)

The call produces the following layout:

raid/data/merged/
├── random/
│   ├── train.json        # 60 % of theorems (shuffled)
│   ├── val.json          # 20 %
│   └── test.json         # 20 %
├── novel_premises/
│   ├── train.json        # split by least-frequent premise usage
│   ├── val.json
│   └── test.json
├── corpus.jsonl          # deduplicated premise files (keyed by path)
├── traced_files.jsonl    # list of traced file paths
└── metadata.json         # aggregate counts and per-repo metadata

Split strategies:

Strategy	Description
`random`	Shuffled 60/20/20 split. Used by `SFTTrainer` and `GRPOTrainer`.
`novel_premises`	Theorems using the least-frequently-seen premises go to val/test. Measures out-of-distribution generalisation.

Deduplication: when multiple repositories define the same theorem (same file_path, full_name, and source positions), only the most recently processed version is kept.

Full Workflow Example

from pathlib import Path
from lean_dojo_v2.database import DynamicDatabase

db = DynamicDatabase()

repo = db.trace_repository(
    url="https://github.com/durant42040/lean4-example",
    commit="3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2",
    build_deps=False,
)

if repo:
    db.add_repository(repo)
    db.export_merged_data([repo], output_path=Path("raid/data/merged"))

The `Repository` Model

A Repository object records all metadata and theorem lists for a single repo-commit pair.

print(repo.url)               # "https://github.com/durant42040/lean4-example"
print(repo.commit)            # "3e23ab0b..."
print(repo.lean_version)      # "leanprover/lean4:v4.8.0"
print(repo.total_theorems)    # proven + sorry_proved + sorry_unproved
print(repo.num_sorry_theorems)
print(repo.num_premises)

Theorem lists are split by proof status:

Attribute	Contents
`proven_theorems`	Theorems with a complete tactic proof (no `sorry`).
`sorry_theorems_proved`	Originally sorry; now proved by LeanAgent.
`sorry_theorems_unproved`	Contain `sorry` and not yet proved.

When LeanAgent succeeds on a sorry theorem, call change_sorry_to_proven(theorem, log_file) to move it from sorry_theorems_unproved to sorry_theorems_proved and append a timestamped log entry.

The `Theorem` Model

Each theorem carries the metadata needed for training and deduplication:

thm = repo.proven_theorems[0]
print(thm.full_name)          # "MyNS.myLemma"
print(thm.theorem_statement)  # the statement string
print(thm.file_path)          # Path("MyLib/Basic.lean")
print(thm.start)              # Pos(line_nb=12, column_nb=1)
print(thm.difficulty_rating)  # float or None (set by sort_repositories_by_difficulty)

traced_tactics is a list of AnnotatedTactic objects, each holding tactic, state_before, state_after, and a list of premise Annotation objects (full name, def path, def position).

Persistence

# Write current state to disk
db.to_json()                          # writes to db.json_path
db.to_json("backup_database.json")    # write to an explicit path

# Load from disk into a new instance
db2 = DynamicDatabase.from_json("dynamic_database.json")

# Merge updates from self into an existing file without overwriting other entries
db.update_json("dynamic_database.json")

Discovering Repositories Automatically

discover_repositories() searches GitHub for Lean 4 repositories, filters out roughly 200+ known-unsuitable repos (benchmarks, tutorial-only repos, repos with trace problems), finds compatible commits, traces each one, and optionally sorts by difficulty.

lean_git_repos = db.discover_repositories(
    num_repos=20,
    curriculum_learning=True,
    build_deps=False,
)

The KNOWN_REPOSITORIES exclusion list in lean_dojo_v2/utils/constants.py contains roughly 200+ entries, including leanprover-community/mathlib4 (ReProver’s training set), tutorial repos with excessive sorries, and repos that fail to trace due to Windows-style line endings or local dependencies.

discover_repositories() calls the GitHub Search API, which has strict secondary rate limits. Always ensure GITHUB_ACCESS_TOKEN is set with repo and read:org scopes before running automated discovery.

Get Started

Concepts

Guides

Reference

DynamicDatabase: Lifelong Theorem Repository Management

Initialisation

Core Operations

Tracing a Repository

Managing Repositories

Exporting Merged Datasets

Full Workflow Example

The `Repository` Model

The `Theorem` Model

Persistence

Discovering Repositories Automatically

Build docs developers (and LLMs) love

Get Started

Concepts

Guides

Reference

Documentation Index

​Initialisation

​Core Operations

​Tracing a Repository

​Managing Repositories

​Exporting Merged Datasets

​Full Workflow Example

​The Repository Model

​The Theorem Model

​Persistence

​Discovering Repositories Automatically

Build docs developers (and LLMs) love

Initialisation

Core Operations

Tracing a Repository

Managing Repositories

Exporting Merged Datasets

Full Workflow Example

The `Repository` Model

The `Theorem` Model

Persistence

Discovering Repositories Automatically