GRPOTrainer: Reinforcement Learning for Lean Provers

GRPOTrainer brings reinforcement-learning-style fine-tuning to Lean 4 tactic generation via TRL’s GRPOTrainer (Group Relative Policy Optimization). Instead of learning from ground-truth labels alone, the model is rewarded according to a caller-supplied reward_func — making it straightforward to inject domain knowledge such as proof-length penalties, tactic novelty bonuses, or verified-proof signals. The training loop mirrors SFTTrainer: repos are processed in order, the merged dataset grows with each iteration, and a checkpoint is written after every repository.

Class

lean_dojo_v2.trainer.grpo_trainer.GRPOTrainer

Constructor Parameters

model_name

str

required

HuggingFace model identifier or local path loaded with AutoModelForCausalLM.from_pretrained in bfloat16. Example: "deepseek-ai/DeepSeek-Prover-V2-7B".

reward_func

Callable

required

A callable with signature:

def reward_func(completions: list, **kwargs) -> torch.Tensor: ...

completions is the list of model-generated tactic strings for the current batch. The function must return a 1-D torch.Tensor of float reward scalars, one per completion. TRL passes this callable as the sole entry in reward_funcs=[reward_func].

output_dir

str

default:"outputs"

Directory for checkpoints and the tokenizer. Created automatically if absent.

epochs_per_repo

int

default:"1"

Training epochs on the merged dataset after each new repository is added. Maps to trl.GRPOConfig.num_train_epochs.

batch_size

int

default:"1"

Per-device training batch size (per_device_train_batch_size). GRPO generates multiple completions per prompt, so larger batch sizes have a pronounced effect on GPU memory; batch_size=8 is a good starting point for 7B models.

float

default:"2e-5"

Peak learning rate for the GRPO optimiser.

lora_config

Optional[peft.LoraConfig]

default:"None"

When provided, wraps the model with peft.get_peft_model for parameter- efficient fine-tuning. When None, all model weights are updated.

`train` Method

GRPOTrainer.train(
    repos: List[LeanGitRepo],
    database: DynamicDatabase,
    data_path: Path,
) -> None

The training loop:

Appends the current repo and calls database.export_merged_data(repos_so_far, data_path).
Loads <data_path>/random/train.json into a GRPODataset.
Constructs a trl.GRPOTrainer with reward_funcs=[self.reward_func] and calls .train().
Updates self.model from the trainer and saves the checkpoint (or LoRA adapter) to output_dir. The tokenizer is saved once at the end.

HFAgent.train() calls this method automatically. Use it directly only when you need a custom orchestration loop.

`GRPODataset` — Internal Dataset Class

GRPODataset reads the same traced-tactic JSON as SFTDataset but produces a dataset shaped for GRPO: each row has a problem field (the raw state_before string) and a prompt field (a two-turn list with a system instruction and a user message containing the goal). sorry tactics are skipped. The .to_hf() method returns a HuggingFace Dataset.

Full Example

This mirrors examples/grpo.py exactly:

import torch
from lean_dojo_v2.agent.hf_agent import HFAgent
from lean_dojo_v2.trainer.grpo_trainer import GRPOTrainer


def reward_func(completions, **kwargs):
    """Trivial placeholder — replace with domain-specific logic."""
    return torch.tensor([1.0] * len(completions))


trainer = GRPOTrainer(
    model_name="deepseek-ai/DeepSeek-Prover-V2-7B",
    output_dir="outputs-deepseek",
    reward_func=reward_func,
    epochs_per_repo=1,
    batch_size=8,
    lr=2e-5,
)

agent = HFAgent(trainer=trainer)
agent.setup_github_repository(
    url="https://github.com/durant42040/lean4-example",
    commit="b14fef0ceca29a65bc3122bf730406b33c7effe5",
)
agent.train()
agent.prove()

The reward_func above returns a constant 1.0 for every completion, which provides no learning signal beyond exploration noise. For real experiments, replace it with a function that rewards shorter valid proofs, penalises repeated sorry usage, or — ideally — verifies the generated tactic against Pantograph and awards +1.0 only for successful proof steps.

Writing a Useful Reward Function

A reward function has full access to the **kwargs that TRL forwards, which includes the original prompt. Here is a sketch of a length-penalised reward:

import torch

def reward_func(completions, **kwargs):
    rewards = []
    for tactic in completions:
        # Penalise long tactics; reward short ones
        length_penalty = max(0.0, 1.0 - len(tactic) / 200)
        # Hard block on sorry
        if "sorry" in tactic or "admit" in tactic:
            rewards.append(-1.0)
        else:
            rewards.append(length_penalty)
    return torch.tensor(rewards)

For the strongest signal, wire reward_func to a Pantograph server and award +1.0 only when the generated tactic compiles and advances the proof state. This turns GRPO into a genuine policy-gradient loop on verified Lean feedback.

GRPO with LoRA

from peft import LoraConfig
from lean_dojo_v2.trainer.grpo_trainer import GRPOTrainer

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.1,
)

trainer = GRPOTrainer(
    model_name="deepseek-ai/DeepSeek-Prover-V2-7B",
    output_dir="outputs-grpo-lora",
    reward_func=reward_func,
    epochs_per_repo=1,
    batch_size=4,
    lr=2e-5,
    lora_config=lora_config,
)

GRPO generates multiple completions per prompt in each training step, which multiplies GPU memory pressure compared to SFT. For 7B-parameter models, start with batch_size=4–8 and monitor VRAM usage carefully. LoRA (lora_config) significantly reduces the active parameter count and is strongly recommended for single-GPU setups.

Agents

Trainers

Provers

Data & Database

GRPOTrainer: Reinforcement Learning for Lean Provers

Class

Constructor Parameters

`train` Method

`GRPODataset` — Internal Dataset Class

Full Example

Writing a Useful Reward Function

GRPO with LoRA

Build docs developers (and LLMs) love

Agents

Trainers

Provers

Data & Database

Documentation Index

​Class

​Constructor Parameters

​train Method

​GRPODataset — Internal Dataset Class

​Full Example

​Writing a Useful Reward Function

​GRPO with LoRA

Build docs developers (and LLMs) love

Class

Constructor Parameters

`train` Method

`GRPODataset` — Internal Dataset Class

Full Example

Writing a Useful Reward Function

GRPO with LoRA