Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/lean-dojo/LeanDojo-v2/llms.txt

Use this file to discover all available pages before exploring further.

LeanDojo v2 is built around a four-layer pipeline that takes a raw Lean 4 repository all the way through structured theorem data, model training, and automated proof search. Each layer has clean boundaries so you can reuse any slice — run the tracer without an agent, swap in a custom trainer, or point an existing model at a Pantograph server without touching the database. Understanding how the layers connect is the fastest way to orient yourself in the codebase.

The Four-Layer Model

Lean 4 repo  ──►  Trace  ──►  Store  ──►  Train  ──►  Prove
               (lean_dojo)  (database)  (trainer)   (prover)
1

Trace — instrument and extract

The lean_dojo_v2.lean_dojo sub-package clones a repository, runs lake build, then injects the custom ExtractData.lean module and re-runs lake env lean --run ExtractData.lean to capture .ast.json and .dep_paths artifacts for every compiled Lean file. The result is a TracedRepo object that holds typed wrappers around every theorem, tactic, premise, and AST node.Key classes: LeanGitRepo (describes a repo by URL + commit), trace() (the public entry-point function), TracedRepo, TracedFile, TracedTheorem, TracedTactic.
2

Store — manage the dynamic database

DynamicDatabase is a JSON-backed persistent store. After tracing, it records every Repository object along with its theorems (categorised as proven, sorry-proved, or sorry-unproved), premise files, traced file paths, and difficulty ratings. It also exposes export_merged_data() to produce the JSONL datasets that trainers consume.Key classes: DynamicDatabase, Repository, Theorem, PremiseFile.
3

Train — fine-tune models on theorem data

The trainer layer reads the exported datasets and fine-tunes language models or retrieval encoders. All trainers integrate with Hugging Face Transformers, and the supervised and RL trainers support LoRA via PEFT and DeepSpeed for multi-GPU runs.Key classes: SFTTrainer (supervised fine-tuning), GRPOTrainer (group-relative policy optimisation), RetrievalTrainer (ByT5-based dense retrieval), ProgressTrainer (step-count regression head).
4

Prove — search for proofs via Pantograph RPC

The prover layer opens a Pantograph Lean RPC server and iteratively applies tactics until the goal list is empty. All provers share a common search interface: the difference is how they generate candidate tactics — from a local HF checkpoint, from the HF Inference API, or from a retrieval-augmented tactic generator.Key classes: HFProver, ExternalProver, RetrievalProver.

Agent Abstractions

BaseAgent orchestrates the entire pipeline. It wraps DynamicDatabase, calls trace_repository() during setup, delegates to a trainer during train(), and delegates to a prover during prove(). Subclasses override _get_build_deps() to control whether dependency packages are traced, and _setup_prover() to wire in the right prover implementation.

Agent-Trainer-Prover Pairings

AgentTrainerProverNotes
HFAgentSFTTrainer / GRPOTrainerHFProverLocal GPU inference; LoRA-ready
LeanAgentRetrievalTrainerRetrievalProverLifelong learning; build_deps=True required
ExternalAgent(none)ExternalProverHF Inference API; no local training
ExternalAgent skips the training step entirely and delegates proof search directly to the HF Inference API (default: DeepSeek-Prover-V2-671B). It is the fastest way to prove theorems without local GPU resources.

HFAgent Example

from lean_dojo_v2.agent.hf_agent import HFAgent
from lean_dojo_v2.trainer.sft_trainer import SFTTrainer

trainer = SFTTrainer(
    model_name="deepseek-ai/DeepSeek-Prover-V2-7B",
    output_dir="outputs-deepseek",
    epochs_per_repo=1,
    batch_size=2,
    lr=2e-5,
)

agent = HFAgent(trainer=trainer)
agent.setup_github_repository(
    url="https://github.com/durant42040/lean4-example",
    commit="3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2",
)
agent.train()
agent.prove()

LeanAgent Example

from lean_dojo_v2.agent.lean_agent import LeanAgent

agent = LeanAgent()
agent.setup_github_repository(
    url="https://github.com/durant42040/lean4-example",
    commit="3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2",
)
agent.train()   # trains the retriever
agent.prove()   # retrieval-augmented best-first search

External API Layer

The lean_dojo_v2/external_api/ folder exposes a FastAPI + uvicorn HTTP server alongside Lean frontend snippets that allow Lean editor plugins (via LeanCopilot) to query large language models for tactic suggestions. This layer is independent of the agent pipeline — you can run it as a standalone service pointed at any OpenAI-compatible inference backend.
The external API server is configured via environment variables such as LEANPROGRESS_MODEL (model path), LEANPROGRESS_DEVICE (inference device), and LEANPROGRESS_TEMPLATE (prompt template). Once the server is running, Lean editor plugins can route tactic suggestion requests to it.

Component Map

Tracing

How LeanGitRepo, trace(), and ExtractData.lean instrument a Lean 4 repository to extract proof states, ASTs, and premises.

Database

How DynamicDatabase stores repositories, theorems, premise files, and sorry status across lifelong training runs.

Curriculum Learning

How difficulty scores and percentile thresholds sort repositories and theorems into progressive training schedules.

Quickstart Guide

End-to-end example: trace a repo, build a dataset, fine-tune a model, and prove sorry theorems in minutes.

Build docs developers (and LLMs) love