LeanDojo v2 Trainers: SFT, GRPO, Retrieval and Progress

LeanDojo v2 ships four trainer classes, each built on HuggingFace Transformers and TRL, covering the main training objectives for AI-assisted theorem proving. SFTTrainer performs supervised fine-tuning on tactic prediction; GRPOTrainer applies Group Relative Policy Optimization for RL-style refinement; RetrievalTrainer trains a dense premise retriever for RAG-based agents; and ProgressTrainer fine-tunes a regression head that predicts the number of proof steps remaining. All trainers save their checkpoints to a configurable output_dir, which the corresponding provers load at inference time.

Training with SFTTrainer, GRPOTrainer, and RetrievalTrainer requires a CUDA-capable GPU. CPU-only execution is supported but is extremely slow and not recommended for anything beyond debugging. Ensure your environment matches the CUDA version used when installing PyTorch (tested with CUDA 12.6).

SFTTrainer

SFTTrainer runs supervised fine-tuning on traced tactic data extracted from your Lean repositories. It wraps TRL’s SFTTrainer and an internal SFTDataset that reads random/train.json, filters out any sorry tactics, and formats each remaining tactic as a three-turn chat message (system → user → assistant).

How data is formatted

Each tactic in the traced JSON becomes one training example:

Role	Content
`system`	Lean 4 tactic generation instructions (no `sorry`, single line only)
`user`	The goal state before the tactic (`state_before`), marks stripped
`assistant`	The tactic text, first line only, marks stripped

Constructor

SFTTrainer(
    model_name: str,                      # HuggingFace model ID or local path
    output_dir: str = "outputs",          # checkpoint destination
    epochs_per_repo: int = 1,             # training epochs per curriculum step
    batch_size: int = 1,
    lr: float = 2e-5,
    lora_config: Optional[LoraConfig] = None,  # pass to enable LoRA
)

Example: SFT with LoRA (`examples/sft.py`)

from lean_dojo_v2.agent.hf_agent import HFAgent
from lean_dojo_v2.trainer.sft_trainer import SFTTrainer

def main() -> None:
    url = "https://github.com/durant42040/lean4-example"
    commit = "b14fef0ceca29a65bc3122bf730406b33c7effe5"

    trainer = SFTTrainer(
        model_name="deepseek-ai/DeepSeek-Prover-V2-7B",
        output_dir="outputs-deepseek",
        epochs_per_repo=1,
        batch_size=2,
        lr=2e-5,
    )

    agent = HFAgent(trainer=trainer)
    agent.setup_github_repository(url=url, commit=commit)
    agent.train()
    agent.prove()

if __name__ == "__main__":
    main()

LoRA configuration

Passing a peft.LoraConfig to SFTTrainer enables parameter-efficient fine-tuning. The trainer calls get_peft_model() internally and prints trainable parameter counts. Only the adapter weights are saved to output_dir.

from peft import LoraConfig
from lean_dojo_v2.trainer.sft_trainer import SFTTrainer

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.1,
)

trainer = SFTTrainer(
    model_name="deepseek-ai/DeepSeek-Prover-V2-7B",
    output_dir="outputs-deepseek-lora",
    epochs_per_repo=1,
    batch_size=2,
    lr=2e-5,
    lora_config=lora_config,
)

When LoRA is active, SFTTrainer saves only the adapter weights via model.save_pretrained(output_dir). HFAgent detects lora_config is not None and passes use_lora=True to HFProver, which loads the adapter with AutoPeftModelForCausalLM. Keep the base model accessible on disk or via the HuggingFace Hub so the prover can reconstruct the full model at load time.

GRPOTrainer

GRPOTrainer implements Group Relative Policy Optimization, an RL-style training method that refines tactic generation using a reward signal you define. Like SFTTrainer, it reads from random/train.json and filters sorry tactics, but it formats each example as a prompt-only sequence (no assistant turn) and relies on your reward_func to score completions.

Constructor

GRPOTrainer(
    model_name: str,
    reward_func,                          # required: callable(completions, **kwargs) -> torch.Tensor
    output_dir: str = "outputs",
    epochs_per_repo: int = 1,
    batch_size: int = 1,
    lr: float = 2e-5,
    lora_config: Optional[LoraConfig] = None,
)

Reward function signature

The reward function receives a list of model completions and any keyword arguments forwarded by the TRL GRPOTrainer. It must return a torch.Tensor of scalar rewards with one element per completion.

import torch

def reward_func(completions, **kwargs) -> torch.Tensor:
    # completions: list[str]  — one decoded string per sample in the batch
    return torch.tensor([1.0] * len(completions))

Example: GRPO training (`examples/grpo.py`)

import torch

from lean_dojo_v2.agent.hf_agent import HFAgent
from lean_dojo_v2.trainer.grpo_trainer import GRPOTrainer


def reward_func(completions, **kwargs):
    return torch.tensor([1.0] * len(completions))


def main() -> None:
    url = "https://github.com/durant42040/lean4-example"
    commit = "b14fef0ceca29a65bc3122bf730406b33c7effe5"

    trainer = GRPOTrainer(
        model_name="deepseek-ai/DeepSeek-Prover-V2-7B",
        output_dir="outputs-deepseek",
        reward_func=reward_func,
        epochs_per_repo=1,
        batch_size=8,
        lr=2e-5,
    )

    agent = HFAgent(trainer=trainer)
    agent.setup_github_repository(url=url, commit=commit)
    agent.train()
    agent.prove()


if __name__ == "__main__":
    main()

GRPOTrainer is paired with HFAgent, not a dedicated GRPOAgent. Simply pass your GRPOTrainer instance to HFAgent(trainer=...) just as you would an SFTTrainer. The agent uses trainer.output_dir and trainer.lora_config to configure HFProver automatically.

RetrievalTrainer

RetrievalTrainer trains the dense premise retriever used by the LeanAgent lifelong-learning pipeline. It is built on PyTorch Lightning and wraps the PremiseRetriever model, using a RetrievalDataModule that loads tactic/premise pairs from the merged corpus produced by DynamicDatabase. Unlike SFTTrainer and GRPOTrainer, you do not instantiate RetrievalTrainer directly in most workflows — LeanAgent creates and manages it for you.

Constructor

RetrievalTrainer(
    config: Optional[TrainingConfig] = None,
)

TrainingConfig carries all hyperparameters: learning rate, batch sizes, number of GPUs, seed, Lambda weighting, and logging directories. Pass a customised TrainingConfig when you need to override defaults.

How it trains

Export merged data

For each repository in the difficulty-sorted curriculum, RetrievalTrainer calls database.export_merged_data() to produce a merged corpus.jsonl and split JSON files under random/.

Load the retriever checkpoint

Calls find_latest_checkpoint() to locate the most recent PremiseRetriever Lightning checkpoint. The checkpoint is loaded in training mode (unfrozen).

Fit the model

A Lightning Trainer is created via config.create_trainer() and trainer.fit() is called with the RetrievalDataModule. The trainer logs to a per-repository subdirectory.

Evaluation metrics

After training, call trainer.evaluate() to compute retrieval metrics on the held-out test split. The method reports:

Metric	Description
R@1	Recall at rank 1 (%)
R@10	Recall at rank 10 (%)
MRR	Mean Reciprocal Rank

ProgressTrainer

ProgressTrainer fine-tunes an encoder model (default bert-base-uncased) with a single-output regression head to predict the number of proof steps remaining given a goal state, a tactic prefix, and a candidate tactic. This is the LeanProgress step-prediction component.

Constructor

ProgressTrainer(
    data_path: str,                         # path to a JSONL dataset
    model_name: str = "bert-base-uncased",
    output_dir: str = "outputs",
    max_length: int = 512,
    batch_size: int = 8,
    epochs: float = 3.0,
    learning_rate: float = 1e-5,
    eval_ratio: float = 0.2,
    seed: int = 42,
)

Input format

The JSONL dataset must have one JSON object per line, each containing:

{
  "goal": "⊢ n + 0 = n",
  "prefix": "induction n with",
  "tactic": "rfl",
  "steps_remaining": 2
}

goal, tactic, and steps_remaining are required; prefix is optional and concatenated with goal and tactic to form the model input.

Example: LeanProgress training

Generate the dataset

python -m lean_dojo_v2.lean_progress.create_sample_dataset \
  --output raid/data/sample_leanprogress_dataset.jsonl

Train the regression model

from pathlib import Path
from lean_dojo_v2.trainer.progress_trainer import ProgressTrainer

sample_dataset_path = Path("raid/data/sample_leanprogress_dataset.jsonl")

trainer = ProgressTrainer(
    model_name="bert-base-uncased",
    data_path=str(sample_dataset_path),
    output_dir="outputs-progress",
)

trainer.train()

Set the environment variable

export LEANPROGRESS_MODEL=outputs-progress

Once this variable is set, the LeanProgress component uses your trained model to score candidate tactics during proof search.

Checkpoint Output

SFTTrainer and GRPOTrainer both save checkpoints to output_dir after every curriculum step. HFProver loads from this same directory at prove time — make sure you do not change output_dir between training and proving calls. With LoRA, only adapter weights are stored; the base model must remain separately accessible.

Agents

See how agents wire trainers and provers together into a complete experiment pipeline.

Provers

Understand how HFProver, RetrievalProver, and ExternalProver consume trainer checkpoints for inference.

Get Started

Concepts

Guides

Reference

LeanDojo v2 Trainers: SFT, GRPO, Retrieval and Progress

SFTTrainer

How data is formatted

Constructor

Example: SFT with LoRA (`examples/sft.py`)

LoRA configuration

GRPOTrainer

Constructor

Reward function signature

Example: GRPO training (`examples/grpo.py`)

RetrievalTrainer

Constructor

How it trains

Evaluation metrics

ProgressTrainer

Constructor

Input format

Example: LeanProgress training

Checkpoint Output

Agents

Provers

Build docs developers (and LLMs) love

Get Started

Concepts

Guides

Reference

Documentation Index

​SFTTrainer

​How data is formatted

​Constructor

​Example: SFT with LoRA (examples/sft.py)

​LoRA configuration

​GRPOTrainer

​Constructor

​Reward function signature

​Example: GRPO training (examples/grpo.py)

​RetrievalTrainer

​Constructor

​How it trains

​Evaluation metrics

​ProgressTrainer

​Constructor

​Input format

​Example: LeanProgress training

​Checkpoint Output

Agents

Provers

Build docs developers (and LLMs) love

SFTTrainer

How data is formatted

Constructor

Example: SFT with LoRA (`examples/sft.py`)

LoRA configuration

GRPOTrainer

Constructor

Reward function signature

Example: GRPO training (`examples/grpo.py`)

RetrievalTrainer

Constructor

How it trains

Evaluation metrics

ProgressTrainer

Constructor

Input format

Example: LeanProgress training

Checkpoint Output