Group Relative Policy Optimization (GRPO) is a critic-free reinforcement learning algorithm introduced in the DeepSeekMath paper. Instead of training a separate value network to estimate per-state baselines, GRPO generates a group of responses for each prompt and normalises rewards within that group. The group mean acts as the baseline, removing the need for a critic model entirely and significantly reducing memory consumption compared to PPO.
GRPO vs PPO
GRPO
No critic model. Memory footprint is roughly half that of PPO for the same base model size. Advantage is computed by group-relative reward normalisation — fast and simple. Best suited for tasks with verifiable, scalar rewards (math, coding, structured output).
PPO
Requires a critic model. Higher memory, but the learned value function often produces more stable advantage estimates, especially on tasks with long horizons or sparse rewards. More configurable bias–variance trade-off via GAE.
How GRPO Works
Group Sampling
For each prompt in the batch, the rollout engine samples n independent responses (controlled by actor_rollout_ref.rollout.n). This set of n responses is called a group.
Reward Assignment
Each response is scored by the reward model or rule-based function, producing a scalar reward per response.
Group-Relative Normalisation
Within each group the rewards are standardised: subtract the group mean and divide by the group standard deviation. Each response's standardised value is then assigned to every token of that response, so the resulting values serve as token-level advantages.
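The three steps above can be sketched in a few lines of plain Python. This is an illustration rather than verl's implementation: the function names, the `EPS` constant, and the choice of population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev

EPS = 1e-6  # assumed small constant that guards against a zero standard deviation


def group_relative_advantages(rewards):
    """Standardise the scalar rewards of one group (all responses to one prompt)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + EPS) for r in rewards]


def broadcast_to_tokens(advantages, response_lengths):
    """Give every token of a response that response's advantage."""
    return [[adv] * length for adv, length in zip(advantages, response_lengths)]


# One prompt, n = 4 sampled responses scored by a rule-based 0/1 reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
token_advantages = broadcast_to_tokens(advantages, response_lengths=[3, 5, 2, 4])
```

When every response in a group receives the same reward, all advantages come out as zero, so that group contributes no learning signal.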
Configuration
Note that parameters containing micro_batch_size control the maximum sample count per GPU forward/backward pass to prevent OOMs; they do not affect algorithmic behaviour. Despite the ppo_ prefix on several parameters, they apply to GRPO as well, because the GRPO training loop mirrors PPO's, minus the critic.
Core Parameters
| Parameter | Description | Recommended |
|---|---|---|
| actor_rollout_ref.rollout.n | Responses sampled per prompt (group size). Must be > 1 for GRPO. | 5–8 |
| data.train_batch_size | Global prompt batch size per iteration. Total trajectories = train_batch_size × n. | - |
| actor_rollout_ref.actor.ppo_mini_batch_size | Global mini-batch for actor gradient updates | - |
| actor_rollout_ref.actor.ppo_epochs | Inner-loop epochs over sampled trajectories | - |
| actor_rollout_ref.actor.clip_ratio | PPO clip range ε | 0.2 |
| algorithm.adv_estimator | Must be set to grpo | grpo |
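To see how the batch parameters interact, the arithmetic can be sketched as below. The helper name and the example values are hypothetical. The sketch also assumes that ppo_mini_batch_size is counted in prompts (like train_batch_size) and that the batch divides evenly into mini-batches.

```python
def rollout_budget(train_batch_size, n, ppo_mini_batch_size, ppo_epochs):
    """Return (total trajectories, actor gradient updates) for one training iteration."""
    if n <= 1:
        raise ValueError("GRPO needs a group size n > 1")
    if train_batch_size % ppo_mini_batch_size != 0:
        # Assumption of this sketch: the batch splits evenly into mini-batches.
        raise ValueError("train_batch_size should be divisible by ppo_mini_batch_size")
    total_trajectories = train_batch_size * n
    # Assumption: ppo_mini_batch_size is counted in prompts, like train_batch_size.
    updates = (train_batch_size // ppo_mini_batch_size) * ppo_epochs
    return total_trajectories, updates


# Hypothetical values chosen only for illustration.
trajectories, updates = rollout_budget(
    train_batch_size=1024, n=5, ppo_mini_batch_size=256, ppo_epochs=1
)
```

With these values, one iteration samples 5120 trajectories and performs 4 actor updates.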
KL Regularisation
GRPO adds KL divergence between the actor and the reference policy directly to the training loss (rather than the reward). This is the canonical GRPO setup.
| Parameter | Description | Default |
|---|---|---|
| actor_rollout_ref.actor.use_kl_loss | Enable KL loss. Set to True for GRPO. | False |
| actor_rollout_ref.actor.kl_loss_coef | Weight of the KL loss term | 0.001 |
| actor_rollout_ref.actor.kl_loss_type | KL estimator: kl (k1), abs, mse (k2), low_var_kl (k3), full. Append + (e.g. k3+) to apply a straight-through estimator that uses k2 for unbiased gradient estimation. | low_var_kl |
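The per-token estimators named in the table can be sketched as follows, using the standard k1, k2, and k3 forms. This is not verl's code: the function name is hypothetical, and the full option and the + variants are left out because they need the full token distribution or gradient handling.

```python
import math


def kl_penalty(logprob, ref_logprob, kind="low_var_kl"):
    """Per-token KL estimate from actor and reference log-probs of the sampled token."""
    log_ratio = logprob - ref_logprob  # log(pi_actor / pi_ref)
    if kind == "kl":  # k1: unbiased, but individual values can be negative
        return log_ratio
    if kind == "abs":
        return abs(log_ratio)
    if kind == "mse":  # k2
        return 0.5 * log_ratio ** 2
    if kind == "low_var_kl":  # k3: always non-negative, lower variance than k1
        return math.exp(-log_ratio) - 1.0 + log_ratio
    raise ValueError(f"unsupported estimator: {kind}")


# Actor assigns a higher log-probability than the reference to the sampled token.
k1 = kl_penalty(-0.5, -1.0, "kl")
k2 = kl_penalty(-0.5, -1.0, "mse")
k3 = kl_penalty(-0.5, -1.0, "low_var_kl")
```

All four estimators return zero when the actor and reference agree on the token's log-probability.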
Loss Aggregation Mode
How individual token losses are aggregated into a scalar for the gradient update matters for training stability, especially in long-CoT settings.
| loss_agg_mode | Behaviour | Notes |
|---|---|---|
| token-mean | Mean over all tokens in all sequences in a mini-batch | Default and recommended for most use cases |
| seq-mean-token-sum | Sum tokens per sequence, then mean across sequences | Each sequence's loss scales with its length |
| seq-mean-token-mean | Mean tokens per sequence, then mean across sequences | Original GRPO paper formulation; can be unstable with variable-length long CoT |
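The difference between the three modes is easiest to see on a small ragged batch. The sketch below uses plain Python lists with padding already removed; the function name is hypothetical and the code illustrates the definitions in the table, not verl's tensor implementation.

```python
def aggregate_loss(token_losses, mode="token-mean"):
    """token_losses: one list of per-token losses per sequence (no padding)."""
    if mode == "token-mean":
        flat = [x for seq in token_losses for x in seq]
        return sum(flat) / len(flat)
    if mode == "seq-mean-token-sum":
        return sum(sum(seq) for seq in token_losses) / len(token_losses)
    if mode == "seq-mean-token-mean":
        return sum(sum(seq) / len(seq) for seq in token_losses) / len(token_losses)
    raise ValueError(f"unsupported mode: {mode}")


# A short sequence with low loss and a long sequence with high loss.
batch = [[1.0, 1.0], [4.0, 4.0, 4.0, 4.0]]

token_mean = aggregate_loss(batch, "token-mean")                    # 18 / 6
seq_mean_token_sum = aggregate_loss(batch, "seq-mean-token-sum")    # (2 + 16) / 2
seq_mean_token_mean = aggregate_loss(batch, "seq-mean-token-mean")  # (1 + 4) / 2
```

In token-mean every token counts equally, so the long sequence dominates in proportion to its length. In seq-mean-token-mean every sequence counts equally regardless of length.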
Running GRPO
Minimal Example
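As a rough sketch, the GRPO-specific settings described above can be collected into key=value overrides and rendered as a launch command. The values are illustrative, and the entry point name `verl.trainer.main_ppo` and the Hydra-style override syntax are assumptions of this example. A real run also needs data, model, and trainer settings that this section does not cover.

```python
import shlex

# Illustrative values only; the keys are the parameters described above.
GRPO_OVERRIDES = {
    "algorithm.adv_estimator": "grpo",
    "actor_rollout_ref.rollout.n": 5,
    "actor_rollout_ref.actor.use_kl_loss": True,
    "actor_rollout_ref.actor.kl_loss_coef": 0.001,
    "actor_rollout_ref.actor.kl_loss_type": "low_var_kl",
    "actor_rollout_ref.actor.clip_ratio": 0.2,
}


def build_command(overrides, entry_point="verl.trainer.main_ppo"):
    """Render key=value overrides as an argument list.

    Assumption: the trainer is launched as a Python module that accepts
    Hydra-style key=value overrides; the entry point name is not confirmed here.
    """
    args = [f"{key}={value}" for key, value in overrides.items()]
    return ["python3", "-m", entry_point, *args]


command = build_command(GRPO_OVERRIDES)
print(shlex.join(command))
```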
Using the Canonical Script
verl ships ready-to-run scripts for a wide range of models. Each script exposes its tunable knobs as environment variables:
- FSDP (NVIDIA / NPU)
- Megatron-LM
- Large-Scale (MoE / 671B)
Advanced Extensions
DrGRPO: Eliminating Length Bias
Standard GRPO divides each response's summed token loss by that response's own length, which can create an implicit incentive for the model to generate longer responses, particularly for incorrect outputs. DrGRPO (from Understanding R1-Zero-Like Training: A Critical Perspective) corrects this by normalising token-level losses with a global constant instead of per-sequence lengths.
Set loss aggregation to seq-mean-token-sum-norm
This disables sequence-dimension averaging, preventing length from influencing the loss scale.
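The effect of swapping the divisor can be illustrated on a single response. In this sketch the function name and the `MAX_LEN` constant are hypothetical, and only the within-sequence step is shown, not the averaging across responses.

```python
def sequence_loss(token_losses, norm_constant=None):
    """Aggregate one response's token losses.

    norm_constant=None  -> divide by the response's own length (per-sequence mean).
    norm_constant=value -> divide by a fixed constant shared by all responses.
    """
    divisor = len(token_losses) if norm_constant is None else norm_constant
    return sum(token_losses) / divisor


MAX_LEN = 8  # hypothetical global constant, e.g. a maximum response length

short_wrong = [1.0] * 2  # incorrect response, 2 tokens
long_wrong = [1.0] * 8   # incorrect response, 8 tokens, same per-token loss

per_sequence = (sequence_loss(short_wrong), sequence_loss(long_wrong))
fixed_constant = (
    sequence_loss(short_wrong, MAX_LEN),
    sequence_loss(long_wrong, MAX_LEN),
)
```

With per-sequence division, each token of the long response carries weight 1/8 against 1/2 for the short one, so a long incorrect answer is penalised less per token. With a fixed constant, every token carries the same weight whatever the response length.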
DAPO: Decoupled Clip and Dynamic Sampling
DAPO extends GRPO with two key innovations: decoupled clip ratios (separate ε values for positive and negative advantages) and dynamic sampling (filtering out groups where all responses succeed or all fail before the gradient update). Applied to Qwen2.5-32B base, DAPO achieves 50% accuracy on AIME 2024. For full details and configuration, see the DAPO entry under More Algorithms.
Reference Performance
| Hardware | Model | Method | GSM8K Score |
|---|---|---|---|
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO | 89 |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO (FSDP2) | 89.8 |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO (Megatron) | 89.6 |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | GRPO-LoRA | 93.4 |
| AMD MI300 | deepseek-ai/deepseek-llm-7b-chat | GRPO | 71.4 |