LeanDojo v2: AI Theorem Proving Framework for Lean 4

LeanDojo v2 is an end-to-end Python framework that automates the full loop of AI-assisted theorem proving in Lean 4 — from tracing repositories and building structured datasets to fine-tuning language models and running proof search via a Pantograph-backed RPC server. Whether you want to fine-tune a compact 7B model on your own Lean codebase or delegate proving to a 671B parameter cloud model, LeanDojo v2 provides a single, cohesive toolkit to do it.

The Problem It Solves

Writing formal proofs in Lean 4 is powerful but painstaking. A theorem marked sorry compiles, yet remains unproven — and large codebases can accumulate hundreds of them. Closing those gaps manually requires deep expertise in both the subject matter and Lean’s tactic language. LeanDojo v2 automates every stage of this process:

Clone & Trace — Clone any Lean 4 repository from GitHub (or locally), then instrument it with the custom ExtractData.lean module to capture theorem states, tactic sequences, and sorry positions.
Build Datasets — Convert traced artifacts to JSONL training data and load them into a DynamicDatabase that tracks curriculum difficulty and sorry status.
Train — Fine-tune any Hugging Face causal LM with SFTTrainer or GRPOTrainer, or train a dense retriever with RetrievalTrainer.
Prove — Drive Pantograph’s Lean RPC server to search for tactics or generate whole proofs with HFProver, ExternalProver, or RetrievalProver.

The framework is deliberately modular: you can use the tracing pipeline without any agent, swap in a custom trainer, or expose your own inference service through the built-in external API layer.

Key Components

Agents

Agents are the top-level orchestrators. Each agent pairs a trainer with a compatible prover and exposes a three-method interface: setup_github_repository(), train(), and prove().

Agent	Best For
`HFAgent`	Fine-tuning a local Hugging Face model and running `HFProver`
`LeanAgent`	Lifelong retrieval-augmented learning across many repositories
`ExternalAgent`	Cloud inference (e.g. DeepSeek-Prover-V2-671B) without a local GPU

Trainers

Trainer	Method
`SFTTrainer`	LoRA-enabled supervised fine-tuning
`GRPOTrainer`	Group Relative Policy Optimization
`RetrievalTrainer`	Dense retriever curriculum training
`ProgressTrainer`	Regression head for step-remaining prediction

Provers

Prover	Backing
`HFProver`	Local Hugging Face checkpoint (full or LoRA)
`ExternalProver`	Hugging Face Inference API
`RetrievalProver`	Retrieval-augmented tactic generation

Supporting Infrastructure

DynamicDatabase — Tracks repositories, theorems, curriculum difficulty, and sorry status to enable lifelong training schedules without re-tracing.
lean_dojo tracing pipeline — lean.py, trace.py, dataset.py, and cache.py handle cloning, AST normalization, JSONL export, and metadata caching.
External API — A FastAPI + uvicorn server (external_api/) exposes HTTP endpoints so Lean editors can query LLMs directly.

System Requirements

LeanDojo v2 requires Python ≥ 3.11, a CUDA-capable GPU (tested with CUDA 12.6), and the elan Lean toolchain manager. A GITHUB_ACCESS_TOKEN with repo and read:org scopes is required at import time — the framework raises a ValueError if it is missing.

Full installation instructions, including environment variable configuration, are on the Installation page.

Explore LeanDojo v2

Quickstart

Trace a repository, fine-tune a model, and prove sorry theorems in under 15 lines of Python.

Installation

Install from PyPI or source, configure environment variables, and set up elan.

Agents Guide

Learn when to use HFAgent, LeanAgent, or ExternalAgent and how to extend BaseAgent.

API Reference

Complete method signatures and parameter documentation for every public class.

Key Features

Unified agent abstraction — BaseAgent defines a single interface; all three concrete agents implement the same setup_github_repository / train / prove contract.
LoRA-enabled fine-tuning — Pass a peft.LoraConfig to SFTTrainer or GRPOTrainer to train only adapter weights and keep GPU memory manageable.
Pantograph RPC integration — All provers communicate with Lean 4 through Pantograph’s server, enabling stateful goal tracking and tactic validation without re-compiling.
Lifelong learning — LeanAgent maintains a curriculum sorted by difficulty and extends the DynamicDatabase as new repositories are added, without re-tracing existing work.
No-GPU path — ExternalAgent delegates every inference call to the Hugging Face Inference API, making it possible to run proof searches from a CPU-only machine.
Whole-proof or step-by-step — Choose between generating a complete proof in one forward pass or searching tactically through a goal tree.
MIT Licensed — Free to use, modify, and distribute.

Get Started

Concepts

Guides

Reference

LeanDojo v2: AI Theorem Proving Framework for Lean 4

The Problem It Solves

Key Components

Agents

Trainers

Provers

Supporting Infrastructure

System Requirements

Explore LeanDojo v2

Quickstart

Installation

Agents Guide

API Reference

Key Features

Build docs developers (and LLMs) love

Get Started

Concepts

Guides

Reference

Documentation Index

​The Problem It Solves

​Key Components

​Agents

​Trainers

​Provers

​Supporting Infrastructure

​System Requirements

​Explore LeanDojo v2

Quickstart

Installation

Agents Guide

API Reference

​Key Features

Build docs developers (and LLMs) love

The Problem It Solves

Key Components

Agents

Trainers

Provers

Supporting Infrastructure

System Requirements

Explore LeanDojo v2

Key Features