LeanDojo v2 is an end-to-end Python framework that automates the full loop of AI-assisted theorem proving in Lean 4 — from tracing repositories and building structured datasets to fine-tuning language models and running proof search via a Pantograph-backed RPC server. Whether you want to fine-tune a compact 7B model on your own Lean codebase or delegate proving to a 671B parameter cloud model, LeanDojo v2 provides a single, cohesive toolkit to do it.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/lean-dojo/LeanDojo-v2/llms.txt
Use this file to discover all available pages before exploring further.
The Problem It Solves
Writing formal proofs in Lean 4 is powerful but painstaking. A theorem markedsorry compiles, yet remains unproven — and large codebases can accumulate hundreds of them. Closing those gaps manually requires deep expertise in both the subject matter and Lean’s tactic language. LeanDojo v2 automates every stage of this process:
- Clone & Trace — Clone any Lean 4 repository from GitHub (or locally), then instrument it with the custom
ExtractData.leanmodule to capture theorem states, tactic sequences, and sorry positions. - Build Datasets — Convert traced artifacts to JSONL training data and load them into a
DynamicDatabasethat tracks curriculum difficulty and sorry status. - Train — Fine-tune any Hugging Face causal LM with
SFTTrainerorGRPOTrainer, or train a dense retriever withRetrievalTrainer. - Prove — Drive Pantograph’s Lean RPC server to search for tactics or generate whole proofs with
HFProver,ExternalProver, orRetrievalProver.
Key Components
Agents
Agents are the top-level orchestrators. Each agent pairs a trainer with a compatible prover and exposes a three-method interface:setup_github_repository(), train(), and prove().
| Agent | Best For |
|---|---|
HFAgent | Fine-tuning a local Hugging Face model and running HFProver |
LeanAgent | Lifelong retrieval-augmented learning across many repositories |
ExternalAgent | Cloud inference (e.g. DeepSeek-Prover-V2-671B) without a local GPU |
Trainers
| Trainer | Method |
|---|---|
SFTTrainer | LoRA-enabled supervised fine-tuning |
GRPOTrainer | Group Relative Policy Optimization |
RetrievalTrainer | Dense retriever curriculum training |
ProgressTrainer | Regression head for step-remaining prediction |
Provers
| Prover | Backing |
|---|---|
HFProver | Local Hugging Face checkpoint (full or LoRA) |
ExternalProver | Hugging Face Inference API |
RetrievalProver | Retrieval-augmented tactic generation |
Supporting Infrastructure
DynamicDatabase— Tracks repositories, theorems, curriculum difficulty, and sorry status to enable lifelong training schedules without re-tracing.lean_dojotracing pipeline —lean.py,trace.py,dataset.py, andcache.pyhandle cloning, AST normalization, JSONL export, and metadata caching.- External API — A FastAPI + uvicorn server (
external_api/) exposes HTTP endpoints so Lean editors can query LLMs directly.
System Requirements
LeanDojo v2 requires Python ≥ 3.11, a CUDA-capable GPU (tested with CUDA 12.6), and the elan Lean toolchain manager. A
GITHUB_ACCESS_TOKEN with repo and read:org scopes is required at import time — the framework raises a ValueError if it is missing.Explore LeanDojo v2
Quickstart
Trace a repository, fine-tune a model, and prove sorry theorems in under 15 lines of Python.
Installation
Install from PyPI or source, configure environment variables, and set up elan.
Agents Guide
Learn when to use HFAgent, LeanAgent, or ExternalAgent and how to extend BaseAgent.
API Reference
Complete method signatures and parameter documentation for every public class.
Key Features
- Unified agent abstraction —
BaseAgentdefines a single interface; all three concrete agents implement the samesetup_github_repository/train/provecontract. - LoRA-enabled fine-tuning — Pass a
peft.LoraConfigtoSFTTrainerorGRPOTrainerto train only adapter weights and keep GPU memory manageable. - Pantograph RPC integration — All provers communicate with Lean 4 through Pantograph’s server, enabling stateful goal tracking and tactic validation without re-compiling.
- Lifelong learning —
LeanAgentmaintains a curriculum sorted by difficulty and extends theDynamicDatabaseas new repositories are added, without re-tracing existing work. - No-GPU path —
ExternalAgentdelegates every inference call to the Hugging Face Inference API, making it possible to run proof searches from a CPU-only machine. - Whole-proof or step-by-step — Choose between generating a complete proof in one forward pass or searching tactically through a goal tree.
- MIT Licensed — Free to use, modify, and distribute.