Documentation Index
Fetch the complete documentation index at: https://mintlify.com/lean-dojo/LeanDojo-v2/llms.txt
Use this file to discover all available pages before exploring further.
DynamicDatabase is the central data-management layer in LeanDojo v2. It holds a list of Repository objects — each fully loaded with proven theorems, sorry theorems, premise files, and traced-file metadata — and persists everything to a JSON file on disk. Beyond basic CRUD operations, DynamicDatabase can autonomously search GitHub for new Lean 4 projects, trace them end-to-end, assign each theorem a numeric difficulty rating, and export a training-ready dataset split 60 / 20 / 20 across random and novel_premises strategies.
Constructor
Path to the backing JSON file. The directory is created automatically if it does
not exist. A fresh, empty JSON file is written at this path on every
instantiation; use
from_json() to reload a previously saved database.DynamicDatabase.__init__ always calls to_json() immediately, so it overwrites any file already at json_path. To load an existing database, use the from_json() classmethod instead.Repository CRUD Methods
trace_repository()
Clones a Lean 4 repository, runs the full LeanDojo tracing pipeline, and returns a fully populated Repository object.
The Git remote URL or local filesystem path of the Lean 4 project. Any trailing
.git suffix is stripped automatically.Full commit SHA (40 hex characters) or a tag / branch name that will be
resolved to a commit hash.
When
True, dependencies listed in lakefile.lean / lakefile.toml are also
traced. Required when using LeanAgent with dependency-aware premise retrieval.A
Repository object populated with proven theorems, sorry theorems, premise
files, and traced-file paths. Returns None if tracing fails or if the
repository contains fewer than 3 × BATCH_SIZE theorems (not enough for a
train / val / test split).GITHUB_ACCESS_TOKEN is optional but recommended. Without it, unauthenticated
GitHub API calls are subject to rate limiting. Set the variable to a personal
access token to increase the limit. Tracing can take roughly 1.5× the normal
Lake build time for the repository.add_repository()
Appends a Repository to the in-memory list. Duplicate entries (matched by url, name, commit, and version fields) are silently ignored with an INFO log message.
The
Repository object to store.get_repository()
Looks up a stored repository by its remote URL and commit hash.
The repository remote URL (must match exactly as stored, without
.git).The full 40-character commit SHA.
The matching
Repository, or None if not found.update_repository()
Replaces an existing Repository entry in-place (matched by equality). Raises ValueError if the repository is not found.
The replacement
Repository object. It must already exist in the database
(same url, name, commit, and version fields).delete_repository()
Removes a repository from the in-memory list. Raises ValueError if the entry does not exist.
Remote URL of the repository to delete.
Full commit SHA of the repository to delete.
print_database_contents()
Logs a summary of all stored repositories (URL and commit) via the loguru logger at INFO level. Useful for quick inspection.
Dataset Export Methods
export_merged_data()
Merges theorem and premise data from one or more repositories, deduplicates theorems by identity (keeping the most recently processed version), splits the combined pool into train / val / test sets, and writes all output files under output_path.
One or more
Repository objects whose data should be merged. All repositories
contribute to a single shared corpus and theorem pool.Destination directory. Created automatically. Removed and recreated on each
call, so any existing content is overwritten.
| Strategy | Description |
|---|---|
random | Theorems are shuffled with seed 3407, then split 60 / 20 / 20. |
novel_premises | Theorems are assigned to val/test if their proofs use premises that appear least frequently across the full set, ensuring training-set premises differ from evaluation-set premises. |
Persistence Methods
to_json()
Serializes the entire database (all repositories and their theorems) to JSON.
Target file path. Defaults to
self.json_path set at construction time.from_json()
Class method — creates a new DynamicDatabase instance and loads repositories from a JSON file.
Path to a previously saved
DynamicDatabase JSON file.A fully populated
DynamicDatabase instance.from_dict()
Class method — reconstructs a DynamicDatabase from a plain Python dictionary (e.g., after json.load()).
Dictionary with a
"repositories" key containing a list of serialized
Repository dicts.A fully populated
DynamicDatabase instance.update_json()
Loads an existing JSON file, applies update_repository() for every repository currently in self.repositories, and writes the result back. Useful for patching individual fields without discarding unrelated entries.
Path of the JSON file to update. A new database is created at this path if the
file does not exist.
Curriculum Methods
sort_repositories_by_difficulty()
Assigns a numeric difficulty rating to every theorem in the database, categorises them as Easy, Medium, or Hard using the 33rd and 67th percentile thresholds of the full difficulty distribution, and returns repositories sorted by their Easy theorem count in descending order (most approachable first).
Repositories ordered most-easy-first (descending count of Easy-rated theorems).
Internally the function also updates each theorem’s
difficulty_rating field
and persists the updated database via to_json().discover_repositories()
Searches GitHub for popular Lean 4 repositories, filters known noisy or incompatible projects via the KNOWN_REPOSITORIES exclusion list, traces each candidate, and optionally sorts the result by difficulty for curriculum learning.
Number of repositories to search for and attempt to trace.
When
True, calls sort_repositories_by_difficulty() before returning so
the output list is curriculum-ordered (easy-first).Passed through to
trace_repository() for each discovered project.A list of
LeanGitRepo objects representing successfully traced repositories,
filtered to compatible Lean 4 versions and ordered by difficulty when
curriculum_learning=True.Complete Example
The following example mirrorsexamples/data_generation/trace_github.py and examples/data_generation/discover_github.py:
The Repository Object
Repository is a dataclass returned by trace_repository() and stored inside DynamicDatabase.repositories. Its key read-only computed properties are:
| Property | Type | Description |
|---|---|---|
total_theorems | int | num_proven_theorems + num_sorry_theorems |
num_proven_theorems | int | Count of theorems with complete proofs |
num_sorry_theorems | int | Count of theorems using sorry (proved + unproved) |
num_premise_files | int | Number of files contributing premises |
num_premises | int | Total premise definitions across all files |
num_files_traced | int | Number of .lean files that were traced |
get_all_theorems | List[Theorem] | Concatenation of all three theorem lists |
repo.get_theorem(full_name, file_path) and updated with repo.update_theorem(theorem).