Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/lean-dojo/LeanDojo-v2/llms.txt

Use this file to discover all available pages before exploring further.

GRPOTrainer brings reinforcement-learning-style fine-tuning to Lean 4 tactic generation via TRL’s GRPOTrainer (Group Relative Policy Optimization). Instead of learning from ground-truth labels alone, the model is rewarded according to a caller-supplied reward_func — making it straightforward to inject domain knowledge such as proof-length penalties, tactic novelty bonuses, or verified-proof signals. The training loop mirrors SFTTrainer: repos are processed in order, the merged dataset grows with each iteration, and a checkpoint is written after every repository.

Class

lean_dojo_v2.trainer.grpo_trainer.GRPOTrainer

Constructor Parameters

model_name
str
required
HuggingFace model identifier or local path loaded with AutoModelForCausalLM.from_pretrained in bfloat16. Example: "deepseek-ai/DeepSeek-Prover-V2-7B".
reward_func
Callable
required
A callable with signature:
def reward_func(completions: list, **kwargs) -> torch.Tensor: ...
completions is the list of model-generated tactic strings for the current batch. The function must return a 1-D torch.Tensor of float reward scalars, one per completion. TRL passes this callable as the sole entry in reward_funcs=[reward_func].
output_dir
str
default:"outputs"
Directory for checkpoints and the tokenizer. Created automatically if absent.
epochs_per_repo
int
default:"1"
Training epochs on the merged dataset after each new repository is added. Maps to trl.GRPOConfig.num_train_epochs.
batch_size
int
default:"1"
Per-device training batch size (per_device_train_batch_size). GRPO generates multiple completions per prompt, so larger batch sizes have a pronounced effect on GPU memory; batch_size=8 is a good starting point for 7B models.
lr
float
default:"2e-5"
Peak learning rate for the GRPO optimiser.
lora_config
Optional[peft.LoraConfig]
default:"None"
When provided, wraps the model with peft.get_peft_model for parameter- efficient fine-tuning. When None, all model weights are updated.

train Method

GRPOTrainer.train(
    repos: List[LeanGitRepo],
    database: DynamicDatabase,
    data_path: Path,
) -> None
The training loop:
  1. Appends the current repo and calls database.export_merged_data(repos_so_far, data_path).
  2. Loads <data_path>/random/train.json into a GRPODataset.
  3. Constructs a trl.GRPOTrainer with reward_funcs=[self.reward_func] and calls .train().
  4. Updates self.model from the trainer and saves the checkpoint (or LoRA adapter) to output_dir. The tokenizer is saved once at the end.
HFAgent.train() calls this method automatically. Use it directly only when you need a custom orchestration loop.

GRPODataset — Internal Dataset Class

GRPODataset reads the same traced-tactic JSON as SFTDataset but produces a dataset shaped for GRPO: each row has a problem field (the raw state_before string) and a prompt field (a two-turn list with a system instruction and a user message containing the goal). sorry tactics are skipped. The .to_hf() method returns a HuggingFace Dataset.

Full Example

This mirrors examples/grpo.py exactly:
import torch
from lean_dojo_v2.agent.hf_agent import HFAgent
from lean_dojo_v2.trainer.grpo_trainer import GRPOTrainer


def reward_func(completions, **kwargs):
    """Trivial placeholder — replace with domain-specific logic."""
    return torch.tensor([1.0] * len(completions))


trainer = GRPOTrainer(
    model_name="deepseek-ai/DeepSeek-Prover-V2-7B",
    output_dir="outputs-deepseek",
    reward_func=reward_func,
    epochs_per_repo=1,
    batch_size=8,
    lr=2e-5,
)

agent = HFAgent(trainer=trainer)
agent.setup_github_repository(
    url="https://github.com/durant42040/lean4-example",
    commit="b14fef0ceca29a65bc3122bf730406b33c7effe5",
)
agent.train()
agent.prove()
The reward_func above returns a constant 1.0 for every completion, which provides no learning signal beyond exploration noise. For real experiments, replace it with a function that rewards shorter valid proofs, penalises repeated sorry usage, or — ideally — verifies the generated tactic against Pantograph and awards +1.0 only for successful proof steps.

Writing a Useful Reward Function

A reward function has full access to the **kwargs that TRL forwards, which includes the original prompt. Here is a sketch of a length-penalised reward:
import torch

def reward_func(completions, **kwargs):
    rewards = []
    for tactic in completions:
        # Penalise long tactics; reward short ones
        length_penalty = max(0.0, 1.0 - len(tactic) / 200)
        # Hard block on sorry
        if "sorry" in tactic or "admit" in tactic:
            rewards.append(-1.0)
        else:
            rewards.append(length_penalty)
    return torch.tensor(rewards)
For the strongest signal, wire reward_func to a Pantograph server and award +1.0 only when the generated tactic compiles and advances the proof state. This turns GRPO into a genuine policy-gradient loop on verified Lean feedback.

GRPO with LoRA

from peft import LoraConfig
from lean_dojo_v2.trainer.grpo_trainer import GRPOTrainer

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.1,
)

trainer = GRPOTrainer(
    model_name="deepseek-ai/DeepSeek-Prover-V2-7B",
    output_dir="outputs-grpo-lora",
    reward_func=reward_func,
    epochs_per_repo=1,
    batch_size=4,
    lr=2e-5,
    lora_config=lora_config,
)
GRPO generates multiple completions per prompt in each training step, which multiplies GPU memory pressure compared to SFT. For 7B-parameter models, start with batch_size=48 and monitor VRAM usage carefully. LoRA (lora_config) significantly reduces the active parameter count and is strongly recommended for single-GPU setups.

Build docs developers (and LLMs) love