LeanDojo v2 ships four trainer classes, each built on HuggingFace Transformers and TRL, covering the main training objectives for AI-assisted theorem proving.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/lean-dojo/LeanDojo-v2/llms.txt
Use this file to discover all available pages before exploring further.
SFTTrainer performs supervised fine-tuning on tactic prediction; GRPOTrainer applies Group Relative Policy Optimization for RL-style refinement; RetrievalTrainer trains a dense premise retriever for RAG-based agents; and ProgressTrainer fine-tunes a regression head that predicts the number of proof steps remaining. All trainers save their checkpoints to a configurable output_dir, which the corresponding provers load at inference time.
SFTTrainer
SFTTrainer runs supervised fine-tuning on traced tactic data extracted from your Lean repositories. It wraps TRL’s SFTTrainer and an internal SFTDataset that reads random/train.json, filters out any sorry tactics, and formats each remaining tactic as a three-turn chat message (system → user → assistant).
How data is formatted
Each tactic in the traced JSON becomes one training example:| Role | Content |
|---|---|
system | Lean 4 tactic generation instructions (no sorry, single line only) |
user | The goal state before the tactic (state_before), marks stripped |
assistant | The tactic text, first line only, marks stripped |
Constructor
Example: SFT with LoRA (examples/sft.py)
LoRA configuration
Passing apeft.LoraConfig to SFTTrainer enables parameter-efficient
fine-tuning. The trainer calls get_peft_model() internally and prints
trainable parameter counts. Only the adapter weights are saved to output_dir.
When LoRA is active,
SFTTrainer saves only the adapter weights via
model.save_pretrained(output_dir). HFAgent detects lora_config is not None
and passes use_lora=True to HFProver, which loads the adapter with
AutoPeftModelForCausalLM. Keep the base model accessible on disk or via the
HuggingFace Hub so the prover can reconstruct the full model at load time.GRPOTrainer
GRPOTrainer implements Group Relative Policy Optimization, an RL-style training method that refines tactic generation using a reward signal you define. Like SFTTrainer, it reads from random/train.json and filters sorry tactics, but it formats each example as a prompt-only sequence (no assistant turn) and relies on your reward_func to score completions.
Constructor
Reward function signature
The reward function receives a list of model completions and any keyword arguments forwarded by the TRLGRPOTrainer. It must return a
torch.Tensor of scalar rewards with one element per completion.
Example: GRPO training (examples/grpo.py)
RetrievalTrainer
RetrievalTrainer trains the dense premise retriever used by the LeanAgent
lifelong-learning pipeline. It is built on PyTorch Lightning and wraps the
PremiseRetriever model, using a RetrievalDataModule that loads tactic/premise
pairs from the merged corpus produced by DynamicDatabase.
Unlike SFTTrainer and GRPOTrainer, you do not instantiate RetrievalTrainer
directly in most workflows — LeanAgent creates and manages it for you.
Constructor
TrainingConfig carries all hyperparameters: learning rate, batch sizes,
number of GPUs, seed, Lambda weighting, and logging directories. Pass a
customised TrainingConfig when you need to override defaults.
How it trains
Export merged data
For each repository in the difficulty-sorted curriculum,
RetrievalTrainer
calls database.export_merged_data() to produce a merged corpus.jsonl
and split JSON files under random/.Load the retriever checkpoint
Calls
find_latest_checkpoint() to locate the most recent
PremiseRetriever Lightning checkpoint. The checkpoint is loaded in
training mode (unfrozen).Evaluation metrics
After training, calltrainer.evaluate() to compute retrieval metrics on the
held-out test split. The method reports:
| Metric | Description |
|---|---|
| R@1 | Recall at rank 1 (%) |
| R@10 | Recall at rank 10 (%) |
| MRR | Mean Reciprocal Rank |
ProgressTrainer
ProgressTrainer fine-tunes an encoder model (default bert-base-uncased)
with a single-output regression head to predict the number of proof steps
remaining given a goal state, a tactic prefix, and a candidate tactic. This
is the LeanProgress step-prediction component.
Constructor
Input format
The JSONL dataset must have one JSON object per line, each containing:goal, tactic, and steps_remaining are required; prefix is optional and
concatenated with goal and tactic to form the model input.
Example: LeanProgress training
Checkpoint Output
SFTTrainer and GRPOTrainer both save checkpoints to output_dir after
every curriculum step. HFProver loads from this same directory at prove
time — make sure you do not change output_dir between training and proving
calls. With LoRA, only adapter weights are stored; the base model must remain
separately accessible.Agents
See how agents wire trainers and provers together into a complete
experiment pipeline.
Provers
Understand how
HFProver, RetrievalProver, and ExternalProver consume
trainer checkpoints for inference.