This guide takes you from a fresh LeanDojo v2 installation to a working proof search in the shortest possible path. You will trace a real Lean 4 GitHub repository, build a supervised fine-tuning dataset from its sorry theorems, train a DeepSeek-Prover model on that dataset, and watch the prover close those sorrys automatically. If you do not have a GPU available, the final step shows how to do the same thing using the Hugging Face Inference API with no local model loading.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/lean-dojo/LeanDojo-v2/llms.txt
Use this file to discover all available pages before exploring further.
This guide assumes LeanDojo v2, Pantograph, and PyTorch are already installed and that
GITHUB_ACCESS_TOKEN is exported. If not, complete the Installation page first.Export your GitHub access token
The tracing pipeline calls the GitHub API to clone repositories and resolve commits. Export your token before doing anything else:
Train and prove with HFAgent + SFTTrainer
This is the core workflow. Save this as
HFAgent orchestrates three operations in sequence: tracing the repository, fine-tuning the model, and running proof search.run_sft.py and execute it with python run_sft.py. The first run downloads and traces the repository, which may take several minutes depending on your internet connection and CPU count.Try the ExternalAgent (no GPU required)
ExternalAgent delegates all inference to the Hugging Face Inference API, so no local model weights are loaded. This is ideal for quick experiments or when you need the raw capacity of a 671B parameter model.whole_proof=True flag instructs ExternalProver to generate a complete proof in one forward pass rather than searching through a tactic tree step by step — this is faster and works well with very capable models.Set
HF_TOKEN in your environment if your chosen model requires authentication: export HF_TOKEN=hf_your_token_here.Understand what happened
Whether you used
HFAgent or ExternalAgent, the framework executed the same pipeline under the hood:- Clone & trace —
setup_github_repository()cloned the target commit and ran Lean 4 with the instrumentedExtractData.leanmodule, capturing every theorem’s proof state, tactic sequence, and sorry location. - Build dataset — Traced artifacts were converted to JSONL records and loaded into a
DynamicDatabase, sorted by curriculum difficulty. - Fine-tune (
HFAgentonly) —SFTTrainerran supervised fine-tuning on the sorry theorems, writing a checkpoint tooutputs-deepseek/. - Proof search — The prover (either
HFProverloading the local checkpoint, orExternalProvercalling the API) communicated with Pantograph’s Lean RPC server, generating tactics or whole proofs and validating each against the live Lean kernel until every sorry was closed or the budget was exhausted.
Exploring Further Examples
Theexamples/ directory in the repository contains ready-to-run scripts for every major workflow:
Next Steps
Agents Guide
Deep dive into HFAgent, LeanAgent, and ExternalAgent. Learn how to extend BaseAgent with your own trainer/prover pair.
Trainers Guide
Explore SFTTrainer, GRPOTrainer, and RetrievalTrainer — including LoRA configuration, DeepSpeed integration, and reward functions.
Architecture
Understand how the tracing pipeline, DynamicDatabase, agents, trainers, and provers fit together end to end.
Troubleshooting
Solutions for common issues: GitHub 401 errors, Lean tracing failures, missing CUDA libraries, and Pantograph errors.