Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/lean-dojo/LeanDojo-v2/llms.txt

Use this file to discover all available pages before exploring further.

DynamicDatabase is the central data-management layer in LeanDojo v2. It holds a list of Repository objects — each fully loaded with proven theorems, sorry theorems, premise files, and traced-file metadata — and persists everything to a JSON file on disk. Beyond basic CRUD operations, DynamicDatabase can autonomously search GitHub for new Lean 4 projects, trace them end-to-end, assign each theorem a numeric difficulty rating, and export a training-ready dataset split 60 / 20 / 20 across random and novel_premises strategies.
from lean_dojo_v2.database import DynamicDatabase

Constructor

json_path
str
default:"\"dynamic_database.json\""
Path to the backing JSON file. The directory is created automatically if it does not exist. A fresh, empty JSON file is written at this path on every instantiation; use from_json() to reload a previously saved database.
from lean_dojo_v2.database import DynamicDatabase

# Create (or overwrite) a database backed by my_database.json
db = DynamicDatabase(json_path="my_database.json")
DynamicDatabase.__init__ always calls to_json() immediately, so it overwrites any file already at json_path. To load an existing database, use the from_json() classmethod instead.

Repository CRUD Methods

trace_repository()

Clones a Lean 4 repository, runs the full LeanDojo tracing pipeline, and returns a fully populated Repository object.
repo = db.trace_repository(
    url="https://github.com/durant42040/lean4-example",
    commit="3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2",
    build_deps=False,
)
url
str
required
The Git remote URL or local filesystem path of the Lean 4 project. Any trailing .git suffix is stripped automatically.
commit
str
required
Full commit SHA (40 hex characters) or a tag / branch name that will be resolved to a commit hash.
build_deps
bool
default:"False"
When True, dependencies listed in lakefile.lean / lakefile.toml are also traced. Required when using LeanAgent with dependency-aware premise retrieval.
return
Optional[Repository]
A Repository object populated with proven theorems, sorry theorems, premise files, and traced-file paths. Returns None if tracing fails or if the repository contains fewer than 3 × BATCH_SIZE theorems (not enough for a train / val / test split).
GITHUB_ACCESS_TOKEN is optional but recommended. Without it, unauthenticated GitHub API calls are subject to rate limiting. Set the variable to a personal access token to increase the limit. Tracing can take roughly 1.5× the normal Lake build time for the repository.

add_repository()

Appends a Repository to the in-memory list. Duplicate entries (matched by url, name, commit, and version fields) are silently ignored with an INFO log message.
repo
Repository
required
The Repository object to store.
if repo:
    db.add_repository(repo)

get_repository()

Looks up a stored repository by its remote URL and commit hash.
url
str
required
The repository remote URL (must match exactly as stored, without .git).
commit
str
required
The full 40-character commit SHA.
return
Optional[Repository]
The matching Repository, or None if not found.
existing = db.get_repository(
    url="https://github.com/durant42040/lean4-example",
    commit="3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2",
)

update_repository()

Replaces an existing Repository entry in-place (matched by equality). Raises ValueError if the repository is not found.
updated_repo
Repository
required
The replacement Repository object. It must already exist in the database (same url, name, commit, and version fields).
db.update_repository(updated_repo)

delete_repository()

Removes a repository from the in-memory list. Raises ValueError if the entry does not exist.
url
str
required
Remote URL of the repository to delete.
commit
str
required
Full commit SHA of the repository to delete.
db.delete_repository(
    url="https://github.com/durant42040/lean4-example",
    commit="3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2",
)

Logs a summary of all stored repositories (URL and commit) via the loguru logger at INFO level. Useful for quick inspection.
db.print_database_contents()
# INFO - Current database contents:
# INFO -   - https://github.com/durant42040/lean4-example (commit: 3e23ab0...)

Dataset Export Methods

export_merged_data()

Merges theorem and premise data from one or more repositories, deduplicates theorems by identity (keeping the most recently processed version), splits the combined pool into train / val / test sets, and writes all output files under output_path.
repositories
List[Repository]
required
One or more Repository objects whose data should be merged. All repositories contribute to a single shared corpus and theorem pool.
output_path
Path
required
Destination directory. Created automatically. Removed and recreated on each call, so any existing content is overwritten.
The export produces the following file layout:
output_path/
  random/
    train.json      # 60 % of theorems (random split)
    val.json        # 20 %
    test.json       # 20 %
  novel_premises/
    train.json      # theorems split so val/test use novel premises
    val.json
    test.json
  corpus.jsonl      # one JSON line per premise file (deduped across repos)
  traced_files.jsonl
  metadata.json
The novel_premises split ensures that the validation and test sets contain theorems whose proofs rely on premises not seen during training. This makes the benchmark significantly harder and more realistic for evaluating premise retrieval.
Split strategy detail:
StrategyDescription
randomTheorems are shuffled with seed 3407, then split 60 / 20 / 20.
novel_premisesTheorems are assigned to val/test if their proofs use premises that appear least frequently across the full set, ensuring training-set premises differ from evaluation-set premises.
from pathlib import Path

db.export_merged_data(
    repositories=[repo],
    output_path=Path("raid/data/merged"),
)

Persistence Methods

to_json()

Serializes the entire database (all repositories and their theorems) to JSON.
file_path
Optional[str]
default:"None"
Target file path. Defaults to self.json_path set at construction time.

from_json()

Class method — creates a new DynamicDatabase instance and loads repositories from a JSON file.
file_path
str
required
Path to a previously saved DynamicDatabase JSON file.
return
DynamicDatabase
A fully populated DynamicDatabase instance.
db = DynamicDatabase.from_json("my_database.json")

from_dict()

Class method — reconstructs a DynamicDatabase from a plain Python dictionary (e.g., after json.load()).
data
Dict
required
Dictionary with a "repositories" key containing a list of serialized Repository dicts.
return
DynamicDatabase
A fully populated DynamicDatabase instance.

update_json()

Loads an existing JSON file, applies update_repository() for every repository currently in self.repositories, and writes the result back. Useful for patching individual fields without discarding unrelated entries.
file_path
str
required
Path of the JSON file to update. A new database is created at this path if the file does not exist.

Curriculum Methods

sort_repositories_by_difficulty()

Assigns a numeric difficulty rating to every theorem in the database, categorises them as Easy, Medium, or Hard using the 33rd and 67th percentile thresholds of the full difficulty distribution, and returns repositories sorted by their Easy theorem count in descending order (most approachable first).
return
List[Repository]
Repositories ordered most-easy-first (descending count of Easy-rated theorems). Internally the function also updates each theorem’s difficulty_rating field and persists the updated database via to_json().
sorted_repos = db.sort_repositories_by_difficulty()
# sorted_repos[0] is the repository with the most Easy theorems
Difficulty is computed by lean_dojo_v2.utils.difficulty.calculate_theorem_difficulty. It considers tactic count, proof depth, and state complexity. A To_Distribute bucket handles theorems whose difficulty cannot be determined; these are split evenly across Easy / Medium / Hard.

discover_repositories()

Searches GitHub for popular Lean 4 repositories, filters known noisy or incompatible projects via the KNOWN_REPOSITORIES exclusion list, traces each candidate, and optionally sorts the result by difficulty for curriculum learning.
num_repos
int
required
Number of repositories to search for and attempt to trace.
curriculum_learning
bool
required
When True, calls sort_repositories_by_difficulty() before returning so the output list is curriculum-ordered (easy-first).
build_deps
bool
default:"False"
Passed through to trace_repository() for each discovered project.
return
List[LeanGitRepo]
A list of LeanGitRepo objects representing successfully traced repositories, filtered to compatible Lean 4 versions and ordered by difficulty when curriculum_learning=True.
from lean_dojo_v2.database import DynamicDatabase

db = DynamicDatabase()

# Discover 5 repos, enable curriculum ordering
repos = db.discover_repositories(
    num_repos=5,
    curriculum_learning=True,
    build_deps=False,
)

Complete Example

The following example mirrors examples/data_generation/trace_github.py and examples/data_generation/discover_github.py:
from pathlib import Path
from lean_dojo_v2.database import DynamicDatabase

# ── Trace a single known repository ────────────────────────────────────────
db = DynamicDatabase(json_path="my_database.json")

repo = db.trace_repository(
    url="https://github.com/durant42040/lean4-example",
    commit="3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2",
    build_deps=False,
)

if repo:
    db.add_repository(repo)
    print(f"Added: {repo.name}{repo.total_theorems} theorems")

# ── Export training data ────────────────────────────────────────────────────
db.export_merged_data(
    repositories=db.repositories,
    output_path=Path("raid/data/merged"),
)

# ── Curriculum ordering ─────────────────────────────────────────────────────
sorted_repos = db.sort_repositories_by_difficulty()
for r in sorted_repos:
    print(r.url)

# ── Auto-discover 5 repos from GitHub ──────────────────────────────────────
db2 = DynamicDatabase(json_path="discovered.json")
db2.discover_repositories(num_repos=5, curriculum_learning=True)

The Repository Object

Repository is a dataclass returned by trace_repository() and stored inside DynamicDatabase.repositories. Its key read-only computed properties are:
PropertyTypeDescription
total_theoremsintnum_proven_theorems + num_sorry_theorems
num_proven_theoremsintCount of theorems with complete proofs
num_sorry_theoremsintCount of theorems using sorry (proved + unproved)
num_premise_filesintNumber of files contributing premises
num_premisesintTotal premise definitions across all files
num_files_tracedintNumber of .lean files that were traced
get_all_theoremsList[Theorem]Concatenation of all three theorem lists
Individual theorems can be retrieved with repo.get_theorem(full_name, file_path) and updated with repo.update_theorem(theorem).

Build docs developers (and LLMs) love