Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/lean-dojo/LeanDojo-v2/llms.txt

Use this file to discover all available pages before exploring further.

LeanAgent is the production-grade, lifelong-learning agent introduced in the ICLR 2025 paper. Rather than fine-tuning a generative model with SFT, it trains a retrieval model — kaiyuy/leandojo-lean4-retriever-byt5-small by default — that learns to surface relevant premises from an ever-growing indexed corpus. Repositories are added incrementally, the retriever is updated with Elastic Weight Consolidation (EWC) to avoid catastrophic forgetting, and a separate generation model uses the retrieved premises to construct proofs at evaluation time.

Class Definition

lean_dojo_v2.agent.lean_agent.LeanAgent(BaseAgent)
LeanAgent extends BaseAgent. It builds Lean dependencies during tracing (_get_build_deps() returns True) for full premise coverage, and it initialises a RetrievalProver using the latest checkpoint found by find_latest_checkpoint() when prove() is called.

Constructor

LeanAgent(
    database_path: str = "dynamic_database.json",
    training_config: Optional[TrainingConfig] = None,
    prover_config: Optional[ProverConfig] = None,
)
database_path
str
default:"dynamic_database.json"
Path to the JSON file used for DynamicDatabase persistence. Shared with the base class.
training_config
Optional[TrainingConfig]
default:"TrainingConfig()"
Training hyperparameters for the retrieval model. Key fields include model_name (defaults to "kaiyuy/leandojo-lean4-retriever-byt5-small"), lambda_value for EWC regularisation strength, max_epochs, lr, and run_progressive_training. Load from YAML with TrainingConfig.from_yaml().
prover_config
Optional[ProverConfig]
default:"ProverConfig()"
Prover runtime settings including timeout, num_sampled_tactics, num_workers, and use_vllm. Defaults to a ProverConfig with sensible out-of-the-box values.

Behaviour

Method / AttributeDetail
_get_build_deps()Always returns True. Lean dependency trees are traced to maximise available premises.
self.trainerA RetrievalTrainer(config=self.config) created at construction time.
_setup_prover()Calls find_latest_checkpoint() for the retriever path, then constructs RetrievalProver(ret_ckpt_path, gen_ckpt_path, indexed_corpus_path).
gen_ckpt_pathResolved as os.path.join(RAID_DIR, "model_lightning.ckpt").
indexed_corpus_pathResolved as os.path.join(self.data_path, "corpus.jsonl").
build_deps=True substantially increases repository tracing time and disk usage. Ensure sufficient free space under RAID_DIR before registering large repositories such as leanprover/lean4.
LeanAgent is the pipeline described in the ICLR 2025 paper LeanDojo: Theorem Proving with Retrieval-Augmented Language Models. It is the recommended agent for research experiments that require lifelong, progressive learning across many repositories.

Example

The following example is adapted from examples/lean_agent.py. It registers two repositories, trains the retrieval model progressively, then attempts to prove all sorry theorems:
from lean_dojo_v2.agent.lean_agent import LeanAgent

agent = LeanAgent()

agent.setup_github_repository(
    url="https://github.com/durant42040/lean4-example",
    commit="b14fef0ceca29a65bc3122bf730406b33c7effe5",
)
agent.setup_github_repository(
    url="https://github.com/leanprover/lean4",
    commit="995161396a25153797460473262555e1792f4823",
)

agent.train()
agent.prove()
Repositories are sorted by difficulty before training, so the retriever is progressively exposed to harder material, reducing catastrophic forgetting of earlier premises.

Custom Training Configuration

Override specific TrainingConfig fields to tune the EWC regularisation strength, learning rate, or number of training epochs:
from lean_dojo_v2.agent.lean_agent import LeanAgent
from lean_dojo_v2.lean_agent.config import TrainingConfig

config = TrainingConfig(
    lambda_value=0.5,       # stronger EWC regularisation
    max_epochs=3,
    lr=5e-4,
    num_retrieved=200,      # retrieve more premises per theorem
)

agent = LeanAgent(training_config=config)
agent.setup_github_repository(
    url="https://github.com/durant42040/lean4-example",
    commit="b14fef0ceca29a65bc3122bf730406b33c7effe5",
)
agent.train()
agent.prove()
Use TrainingConfig.from_yaml("config.yaml") or TrainingConfig.from_json("config.json") to manage experiment configurations as files rather than inline code, making hyperparameter sweeps easier to track.

Build docs developers (and LLMs) love