Documentation Index
Fetch the complete documentation index at: https://mintlify.com/lean-dojo/LeanDojo-v2/llms.txt
Use this file to discover all available pages before exploring further.
GRPOTrainer brings reinforcement-learning-style fine-tuning to Lean 4 tactic generation via TRL’s GRPOTrainer (Group Relative Policy Optimization). Instead of learning from ground-truth labels alone, the model is rewarded according to a caller-supplied reward_func — making it straightforward to inject domain knowledge such as proof-length penalties, tactic novelty bonuses, or verified-proof signals. The training loop mirrors SFTTrainer: repos are processed in order, the merged dataset grows with each iteration, and a checkpoint is written after every repository.
Class
Constructor Parameters
HuggingFace model identifier or local path loaded with
AutoModelForCausalLM.from_pretrained in bfloat16.
Example: "deepseek-ai/DeepSeek-Prover-V2-7B".A callable with signature:
completions is the list of model-generated tactic strings for the current
batch. The function must return a 1-D torch.Tensor of float reward scalars,
one per completion. TRL passes this callable as the sole entry in
reward_funcs=[reward_func].Directory for checkpoints and the tokenizer. Created automatically if absent.
Training epochs on the merged dataset after each new repository is added.
Maps to
trl.GRPOConfig.num_train_epochs.Per-device training batch size (
per_device_train_batch_size). GRPO generates
multiple completions per prompt, so larger batch sizes have a pronounced effect
on GPU memory; batch_size=8 is a good starting point for 7B models.Peak learning rate for the GRPO optimiser.
When provided, wraps the model with
peft.get_peft_model for parameter-
efficient fine-tuning. When None, all model weights are updated.train Method
- Appends the current repo and calls
database.export_merged_data(repos_so_far, data_path). - Loads
<data_path>/random/train.jsoninto aGRPODataset. - Constructs a
trl.GRPOTrainerwithreward_funcs=[self.reward_func]and calls.train(). - Updates
self.modelfrom the trainer and saves the checkpoint (or LoRA adapter) tooutput_dir. The tokenizer is saved once at the end.
HFAgent.train() calls this method automatically. Use it directly only when
you need a custom orchestration loop.GRPODataset — Internal Dataset Class
GRPODataset reads the same traced-tactic JSON as SFTDataset but produces a
dataset shaped for GRPO: each row has a problem field (the raw state_before
string) and a prompt field (a two-turn list with a system instruction and a
user message containing the goal). sorry tactics are skipped. The .to_hf()
method returns a HuggingFace Dataset.
Full Example
This mirrorsexamples/grpo.py exactly:
The
reward_func above returns a constant 1.0 for every completion, which
provides no learning signal beyond exploration noise. For real experiments,
replace it with a function that rewards shorter valid proofs, penalises
repeated sorry usage, or — ideally — verifies the generated tactic against
Pantograph and awards +1.0 only for successful proof steps.Writing a Useful Reward Function
A reward function has full access to the**kwargs that TRL forwards, which
includes the original prompt. Here is a sketch of a length-penalised reward: