DynamicDatabase: Managing Lean Theorem Repositories

DynamicDatabase is the central data-management layer in LeanDojo v2. It holds a list of Repository objects — each fully loaded with proven theorems, sorry theorems, premise files, and traced-file metadata — and persists everything to a JSON file on disk. Beyond basic CRUD operations, DynamicDatabase can autonomously search GitHub for new Lean 4 projects, trace them end-to-end, assign each theorem a numeric difficulty rating, and export a training-ready dataset split 60 / 20 / 20 across random and novel_premises strategies.

from lean_dojo_v2.database import DynamicDatabase

Constructor

json_path

str

default:"\"dynamic_database.json\""

Path to the backing JSON file. The directory is created automatically if it does not exist. A fresh, empty JSON file is written at this path on every instantiation; use from_json() to reload a previously saved database.

from lean_dojo_v2.database import DynamicDatabase

# Create (or overwrite) a database backed by my_database.json
db = DynamicDatabase(json_path="my_database.json")

DynamicDatabase.__init__ always calls to_json() immediately, so it overwrites any file already at json_path. To load an existing database, use the from_json() classmethod instead.

Repository CRUD Methods

`trace_repository()`

Clones a Lean 4 repository, runs the full LeanDojo tracing pipeline, and returns a fully populated Repository object.

repo = db.trace_repository(
    url="https://github.com/durant42040/lean4-example",
    commit="3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2",
    build_deps=False,
)

url

str

required

The Git remote URL or local filesystem path of the Lean 4 project. Any trailing .git suffix is stripped automatically.

commit

str

required

Full commit SHA (40 hex characters) or a tag / branch name that will be resolved to a commit hash.

build_deps

bool

default:"False"

When True, dependencies listed in lakefile.lean / lakefile.toml are also traced. Required when using LeanAgent with dependency-aware premise retrieval.

return

Optional[Repository]

A Repository object populated with proven theorems, sorry theorems, premise files, and traced-file paths. Returns None if tracing fails or if the repository contains fewer than 3 × BATCH_SIZE theorems (not enough for a train / val / test split).

GITHUB_ACCESS_TOKEN is optional but recommended. Without it, unauthenticated GitHub API calls are subject to rate limiting. Set the variable to a personal access token to increase the limit. Tracing can take roughly 1.5× the normal Lake build time for the repository.

`add_repository()`

Appends a Repository to the in-memory list. Duplicate entries (matched by url, name, commit, and version fields) are silently ignored with an INFO log message.

repo

Repository

required

The Repository object to store.

if repo:
    db.add_repository(repo)

`get_repository()`

Looks up a stored repository by its remote URL and commit hash.

url

str

required

The repository remote URL (must match exactly as stored, without .git).

commit

str

required

The full 40-character commit SHA.

return

Optional[Repository]

The matching Repository, or None if not found.

existing = db.get_repository(
    url="https://github.com/durant42040/lean4-example",
    commit="3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2",
)

`update_repository()`

Replaces an existing Repository entry in-place (matched by equality). Raises ValueError if the repository is not found.

updated_repo

Repository

required

The replacement Repository object. It must already exist in the database (same url, name, commit, and version fields).

db.update_repository(updated_repo)

`delete_repository()`

Removes a repository from the in-memory list. Raises ValueError if the entry does not exist.

url

str

required

Remote URL of the repository to delete.

commit

str

required

Full commit SHA of the repository to delete.

db.delete_repository(
    url="https://github.com/durant42040/lean4-example",
    commit="3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2",
)

`print_database_contents()`

Logs a summary of all stored repositories (URL and commit) via the loguru logger at INFO level. Useful for quick inspection.

db.print_database_contents()
# INFO - Current database contents:
# INFO -   - https://github.com/durant42040/lean4-example (commit: 3e23ab0...)

Dataset Export Methods

`export_merged_data()`

Merges theorem and premise data from one or more repositories, deduplicates theorems by identity (keeping the most recently processed version), splits the combined pool into train / val / test sets, and writes all output files under output_path.

repositories

List[Repository]

required

One or more Repository objects whose data should be merged. All repositories contribute to a single shared corpus and theorem pool.

output_path

Path

required

Destination directory. Created automatically. Removed and recreated on each call, so any existing content is overwritten.

The export produces the following file layout:

output_path/
  random/
    train.json      # 60 % of theorems (random split)
    val.json        # 20 %
    test.json       # 20 %
  novel_premises/
    train.json      # theorems split so val/test use novel premises
    val.json
    test.json
  corpus.jsonl      # one JSON line per premise file (deduped across repos)
  traced_files.jsonl
  metadata.json

The novel_premises split ensures that the validation and test sets contain theorems whose proofs rely on premises not seen during training. This makes the benchmark significantly harder and more realistic for evaluating premise retrieval.

Split strategy detail:

Strategy	Description
`random`	Theorems are shuffled with seed `3407`, then split 60 / 20 / 20.
`novel_premises`	Theorems are assigned to val/test if their proofs use premises that appear least frequently across the full set, ensuring training-set premises differ from evaluation-set premises.

from pathlib import Path

db.export_merged_data(
    repositories=[repo],
    output_path=Path("raid/data/merged"),
)

Persistence Methods

`to_json()`

Serializes the entire database (all repositories and their theorems) to JSON.

file_path

Optional[str]

default:"None"

Target file path. Defaults to self.json_path set at construction time.

`from_json()`

Class method — creates a new DynamicDatabase instance and loads repositories from a JSON file.

file_path

str

required

Path to a previously saved DynamicDatabase JSON file.

return

DynamicDatabase

A fully populated DynamicDatabase instance.

db = DynamicDatabase.from_json("my_database.json")

`from_dict()`

Class method — reconstructs a DynamicDatabase from a plain Python dictionary (e.g., after json.load()).

data

Dict

required

Dictionary with a "repositories" key containing a list of serialized Repository dicts.

return

DynamicDatabase

A fully populated DynamicDatabase instance.

`update_json()`

Loads an existing JSON file, applies update_repository() for every repository currently in self.repositories, and writes the result back. Useful for patching individual fields without discarding unrelated entries.

file_path

str

required

Path of the JSON file to update. A new database is created at this path if the file does not exist.

Curriculum Methods

`sort_repositories_by_difficulty()`

Assigns a numeric difficulty rating to every theorem in the database, categorises them as Easy, Medium, or Hard using the 33rd and 67th percentile thresholds of the full difficulty distribution, and returns repositories sorted by their Easy theorem count in descending order (most approachable first).

return

List[Repository]

Repositories ordered most-easy-first (descending count of Easy-rated theorems). Internally the function also updates each theorem’s difficulty_rating field and persists the updated database via to_json().

sorted_repos = db.sort_repositories_by_difficulty()
# sorted_repos[0] is the repository with the most Easy theorems

Difficulty is computed by lean_dojo_v2.utils.difficulty.calculate_theorem_difficulty. It considers tactic count, proof depth, and state complexity. A To_Distribute bucket handles theorems whose difficulty cannot be determined; these are split evenly across Easy / Medium / Hard.

`discover_repositories()`

Searches GitHub for popular Lean 4 repositories, filters known noisy or incompatible projects via the KNOWN_REPOSITORIES exclusion list, traces each candidate, and optionally sorts the result by difficulty for curriculum learning.

num_repos

int

required

Number of repositories to search for and attempt to trace.

curriculum_learning

bool

required

When True, calls sort_repositories_by_difficulty() before returning so the output list is curriculum-ordered (easy-first).

build_deps

bool

default:"False"

Passed through to trace_repository() for each discovered project.

return

List[LeanGitRepo]

A list of LeanGitRepo objects representing successfully traced repositories, filtered to compatible Lean 4 versions and ordered by difficulty when curriculum_learning=True.

from lean_dojo_v2.database import DynamicDatabase

db = DynamicDatabase()

# Discover 5 repos, enable curriculum ordering
repos = db.discover_repositories(
    num_repos=5,
    curriculum_learning=True,
    build_deps=False,
)

Complete Example

The following example mirrors examples/data_generation/trace_github.py and examples/data_generation/discover_github.py:

from pathlib import Path
from lean_dojo_v2.database import DynamicDatabase

# ── Trace a single known repository ────────────────────────────────────────
db = DynamicDatabase(json_path="my_database.json")

repo = db.trace_repository(
    url="https://github.com/durant42040/lean4-example",
    commit="3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2",
    build_deps=False,
)

if repo:
    db.add_repository(repo)
    print(f"Added: {repo.name} — {repo.total_theorems} theorems")

# ── Export training data ────────────────────────────────────────────────────
db.export_merged_data(
    repositories=db.repositories,
    output_path=Path("raid/data/merged"),
)

# ── Curriculum ordering ─────────────────────────────────────────────────────
sorted_repos = db.sort_repositories_by_difficulty()
for r in sorted_repos:
    print(r.url)

# ── Auto-discover 5 repos from GitHub ──────────────────────────────────────
db2 = DynamicDatabase(json_path="discovered.json")
db2.discover_repositories(num_repos=5, curriculum_learning=True)

The `Repository` Object

Repository is a dataclass returned by trace_repository() and stored inside DynamicDatabase.repositories. Its key read-only computed properties are:

Property	Type	Description
`total_theorems`	`int`	`num_proven_theorems + num_sorry_theorems`
`num_proven_theorems`	`int`	Count of theorems with complete proofs
`num_sorry_theorems`	`int`	Count of theorems using `sorry` (proved + unproved)
`num_premise_files`	`int`	Number of files contributing premises
`num_premises`	`int`	Total premise definitions across all files
`num_files_traced`	`int`	Number of `.lean` files that were traced
`get_all_theorems`	`List[Theorem]`	Concatenation of all three theorem lists

Individual theorems can be retrieved with repo.get_theorem(full_name, file_path) and updated with repo.update_theorem(theorem).

Agents

Trainers

Provers

Data & Database

DynamicDatabase: Managing Lean Theorem Repositories

Constructor

Repository CRUD Methods

`trace_repository()`

`add_repository()`

`get_repository()`

`update_repository()`

`delete_repository()`

`print_database_contents()`

Dataset Export Methods

`export_merged_data()`

Persistence Methods

`to_json()`

`from_json()`

`from_dict()`

`update_json()`

Curriculum Methods

`sort_repositories_by_difficulty()`

`discover_repositories()`

Complete Example

The `Repository` Object

Build docs developers (and LLMs) love

Agents

Trainers

Provers

Data & Database

Documentation Index

​Constructor

​Repository CRUD Methods

​trace_repository()

​add_repository()

​get_repository()

​update_repository()

​delete_repository()

​print_database_contents()

​Dataset Export Methods

​export_merged_data()

​Persistence Methods

​to_json()

​from_json()

​from_dict()

​update_json()

​Curriculum Methods

​sort_repositories_by_difficulty()

​discover_repositories()

​Complete Example

​The Repository Object

Build docs developers (and LLMs) love

Constructor

Repository CRUD Methods

`trace_repository()`

`add_repository()`

`get_repository()`

`update_repository()`

`delete_repository()`

`print_database_contents()`

Dataset Export Methods

`export_merged_data()`

Persistence Methods

`to_json()`

`from_json()`

`from_dict()`

`update_json()`

Curriculum Methods

`sort_repositories_by_difficulty()`

`discover_repositories()`

Complete Example

The `Repository` Object