Group Relative Policy Optimization (GRPO) is a critic-free reinforcement learning algorithm introduced in the DeepSeekMath paper. Instead of training a separate value network to estimate per-state baselines, GRPO generates a group of responses for each prompt and normalises rewards within that group. The group mean acts as the baseline, removing the need for a critic model entirely and significantly reducing memory consumption compared to PPO.

GRPO vs PPO

GRPO

No critic model. Memory footprint is roughly half that of PPO for the same base model size. Advantage is computed by group-relative reward normalisation — fast and simple. Best suited for tasks with verifiable, scalar rewards (math, coding, structured output).

PPO

Requires a critic model. Higher memory, but the learned value function often produces more stable advantage estimates, especially on tasks with long horizons or sparse rewards. More configurable bias–variance trade-off via GAE.

How GRPO Works

1. Group Sampling
   For each prompt in the batch, the rollout engine samples n independent responses (controlled by actor_rollout_ref.rollout.n). This set of n responses is called a group.

2. Reward Assignment
   Each response is scored by the reward model or a rule-based function, producing one scalar reward per response.

3. Group-Relative Normalisation
   Within each group the rewards are standardised: subtract the group mean and divide by the group standard deviation. The resulting score is shared by every token of the response and serves as its token-level advantage.

4. Policy Update
   The actor is updated using the clipped surrogate objective (identical to PPO) against the normalised advantages. KL regularisation is applied as a loss term rather than a reward penalty.
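
The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not verl's implementation: the function names are hypothetical, the population standard deviation and the `eps` guard are choices made for the sketch, and the loss is shown for a single token.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardise the scalar rewards of one group (one prompt, n responses)."""
    baseline = mean(rewards)          # the group mean acts as the baseline
    scale = pstdev(rewards) + eps     # eps (assumed) guards against a zero std
    return [(r - baseline) / scale for r in rewards]


def broadcast_to_tokens(advantage, response_length):
    """Every token of a response shares that response's advantage."""
    return [advantage] * response_length


def clipped_surrogate_loss(logp_new, logp_old, advantage, clip_ratio=0.2):
    """PPO-style clipped objective for one token, returned as a loss."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip_ratio), 1.0 + clip_ratio)
    return -min(ratio * advantage, clipped * advantage)


# One group of n = 4 responses with binary rewards (hypothetical values).
rewards = [1.0, 0.0, 0.0, 1.0]
advs = group_relative_advantages(rewards)
token_advs = [broadcast_to_tokens(a, n) for a, n in zip(advs, [3, 5, 2, 4])]
```

Correct responses end up with advantages close to +1 and incorrect ones close to -1. A group in which every response receives the same reward yields all-zero advantages and therefore no policy gradient.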

Configuration

Note that parameters containing micro_batch_size control the maximum sample count per GPU forward/backward pass to prevent OOMs; they do not affect algorithmic behaviour. Despite the ppo_ prefix on several parameters, they apply to GRPO as well — the GRPO training loop mirrors PPO’s, minus the critic.

Core Parameters

| Parameter | Description | Recommended |
| --- | --- | --- |
| actor_rollout_ref.rollout.n | Responses sampled per prompt (group size). Must be > 1 for GRPO. | 58 |
| data.train_batch_size | Global prompt batch size per iteration. Total trajectories = train_batch_size × n | |
| actor_rollout_ref.actor.ppo_mini_batch_size | Global mini-batch for actor gradient updates | |
| actor_rollout_ref.actor.ppo_epochs | Inner-loop epochs over sampled trajectories | |
| actor_rollout_ref.actor.clip_ratio | PPO clip range ε | 0.2 |
| algorithm.adv_estimator | Must be set to grpo | grpo |
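
The batch arithmetic implied by these parameters can be checked with a small helper. This is a sketch only: `rollout_budget` is a hypothetical function, the mini-batch size of 128 is an illustrative value, and it assumes `ppo_mini_batch_size` is counted in prompts in the same way as `train_batch_size`.

```python
def rollout_budget(train_batch_size, n, ppo_mini_batch_size, ppo_epochs):
    """Count trajectories and actor updates for one training iteration."""
    # Total trajectories = train_batch_size x n, as stated in the table.
    trajectories = train_batch_size * n
    # Assumption: the mini-batch size is measured in prompts, so each
    # inner-loop epoch splits the prompt batch into this many mini-batches.
    minibatches_per_epoch = train_batch_size // ppo_mini_batch_size
    return {
        "trajectories": trajectories,
        "actor_updates": minibatches_per_epoch * ppo_epochs,
    }


budget = rollout_budget(train_batch_size=512, n=8,
                        ppo_mini_batch_size=128, ppo_epochs=1)
```

With a prompt batch of 512 and a group size of 8, each iteration generates 4096 trajectories.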

KL Regularisation

GRPO adds KL divergence between the actor and the reference policy directly to the training loss (rather than the reward). This is the canonical GRPO setup.
| Parameter | Description | Default |
| --- | --- | --- |
| actor_rollout_ref.actor.use_kl_loss | Enable KL loss. Set to True for GRPO. | False |
| actor_rollout_ref.actor.kl_loss_coef | Weight of the KL loss term | 0.001 |
| actor_rollout_ref.actor.kl_loss_type | KL estimator: kl (k1), abs, mse (k2), low_var_kl (k3), full. Append + (e.g. k3+) for straight-through k2 unbiased gradient estimation | low_var_kl |
For a detailed comparison of KL approximation methods, see Approximating KL Divergence by John Schulman.
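
The per-token estimators named in the table can be sketched as follows, using the k1/k2/k3 definitions from Schulman's note. This is an illustration rather than verl's code: `kl_estimate` is a hypothetical helper, it assumes the token was sampled from the actor, and it leaves out the full option and the + variants.

```python
import math


def kl_estimate(logp_actor, logp_ref, kind="low_var_kl"):
    """Per-token estimate of KL(actor || reference) from two log-probabilities."""
    log_ratio = logp_actor - logp_ref  # log(pi_actor / pi_ref) for the sampled token
    if kind == "kl":          # k1: unbiased, high variance, can be negative
        return log_ratio
    if kind == "abs":         # absolute value of k1
        return abs(log_ratio)
    if kind == "mse":         # k2: half the squared log-ratio
        return 0.5 * log_ratio ** 2
    if kind == "low_var_kl":  # k3: (r - 1) - log r with r = pi_ref / pi_actor
        return math.exp(-log_ratio) - 1.0 + log_ratio
    raise ValueError(f"unknown estimator: {kind}")


kinds = ["kl", "abs", "mse", "low_var_kl"]
# Hypothetical log-probabilities for one token under the actor and the reference.
estimates = {k: kl_estimate(-2.0, -1.0, k) for k in kinds}
```

Only k1 can go negative for an individual token; the other three are non-negative by construction, which is one reason k3 is a common low-variance choice.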

Loss Aggregation Mode

How individual token losses are aggregated into a scalar for the gradient update matters for training stability, especially in long-CoT settings.
| loss_agg_mode | Behaviour | Notes |
| --- | --- | --- |
| token-mean | Mean over all tokens in all sequences in a mini-batch | Default and recommended for most use cases |
| seq-mean-token-sum | Sum tokens per sequence, then mean across sequences | |
| seq-mean-token-mean | Mean tokens per sequence, then mean across sequences | Original GRPO paper formulation. Can be unstable with variable-length long CoT |
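
To make the difference between the three modes concrete, here is a minimal sketch over ragged Python lists. The `aggregate` helper is hypothetical; a real trainer would operate on padded tensors with a response mask.

```python
def aggregate(token_losses, mode="token-mean"):
    """Reduce per-token losses to a scalar.

    token_losses holds one list of valid-token losses per response.
    """
    if mode == "token-mean":
        flat = [x for seq in token_losses for x in seq]
        return sum(flat) / len(flat)
    if mode == "seq-mean-token-sum":
        return sum(sum(seq) for seq in token_losses) / len(token_losses)
    if mode == "seq-mean-token-mean":
        return sum(sum(seq) / len(seq) for seq in token_losses) / len(token_losses)
    raise ValueError(f"unknown loss_agg_mode: {mode}")


# A short response (2 tokens) and a long one (4 tokens), hypothetical values.
losses = [[1.0, 1.0], [4.0, 4.0, 4.0, 4.0]]
results = {m: aggregate(losses, m)
           for m in ("token-mean", "seq-mean-token-sum", "seq-mean-token-mean")}
```

On this input the modes give 3.0, 9.0, and 2.5 respectively: token-mean weights every token equally, seq-mean-token-sum lets long responses dominate, and seq-mean-token-mean down-weights each token of a long response.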

Running GRPO

Minimal Example

python3 -m verl.trainer.main_ppo \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    data.train_batch_size=512 \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
    actor_rollout_ref.rollout.n=8 \
    algorithm.adv_estimator=grpo \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    trainer.total_epochs=15

Using the Canonical Script

verl ships ready-to-run scripts for a wide range of models. Each script exposes its tunable knobs as environment variables:
bash examples/grpo_trainer/run_qwen3_8b_fsdp.sh
Override knobs inline:
MODEL_PATH=Qwen/Qwen3-8B \
ROLLOUT_N=8 \
TRAIN_BATCH_SIZE=1024 \
INFER_BACKEND=sglang \
bash examples/grpo_trainer/run_qwen3_8b_fsdp.sh

Advanced Extensions

DrGRPO: Eliminating Length Bias

Standard GRPO averages each response's token losses over that response's own length, which can create an implicit incentive for the model to generate longer responses, particularly for incorrect outputs. DrGRPO (from Understanding R1-Zero-Like Training: A Critical Perspective) corrects this by normalising token-level losses with a global constant instead of per-sequence lengths.
1. Set loss aggregation to seq-mean-token-sum-norm
   This turns off per-sequence length averaging, so a response's length no longer rescales its loss.
   actor_rollout_ref.actor.loss_agg_mode=seq-mean-token-sum-norm

2. Disable standard deviation normalisation
   algorithm.norm_adv_by_std_in_grpo=False

3. Disable KL loss
   DrGRPO does not use the KL loss term:
   actor_rollout_ref.actor.use_kl_loss=False

4. (Optional) Set a fixed normalisation constant
   If not set, the current batch's response length is used. Setting an explicit constant ensures consistent normalisation across batches:
   actor_rollout_ref.actor.loss_scale_factor=2048   # e.g. max response length
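
The effect of these settings can be sketched as follows. Both helpers are hypothetical, and the exact reduction used by seq-mean-token-sum-norm in verl may differ; the sketch assumes token losses are summed per response, divided by a fixed constant, and then averaged across responses.

```python
def drgrpo_advantages(rewards):
    """Mean-centred advantages with no division by the group std."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def seq_mean_token_sum_norm(token_losses, scale):
    """Sum tokens per response, divide by a fixed constant, mean over responses."""
    per_seq = [sum(seq) / scale for seq in token_losses]
    return sum(per_seq) / len(per_seq)


advs = drgrpo_advantages([1.0, 0.0, 0.0, 1.0])

# A 2-token and an 8-token response with identical per-token losses.
# With a fixed scale (8 here, an illustrative constant), every token
# carries the same weight regardless of which response it belongs to.
token_losses = [[1.0] * 2, [1.0] * 8]
loss = seq_mean_token_sum_norm(token_losses, scale=8)
```

Because the divisor is the same for every response, a long incorrect response is no longer penalised less per token than a short one.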

DAPO: Decoupled Clip and Dynamic Sampling

DAPO extends GRPO with two key innovations: decoupled clip ratios (separate ε values for positive and negative advantages) and dynamic sampling (filtering out groups where all responses succeed or all fail before the gradient update). Applied to Qwen2.5-32B base, DAPO achieves 50% accuracy on AIME 2024. For full details and configuration, see More Algorithms — DAPO.
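
Both ideas are small changes to the GRPO loop and can be sketched as follows. The function and parameter names are hypothetical and are not verl configuration keys; the clip values 0.2 and 0.28 are illustrative defaults.

```python
import math


def keep_group(rewards):
    """Dynamic sampling: keep a group only if its rewards are not all identical.

    Groups where every response succeeds or every response fails have zero
    group-relative advantage and contribute no gradient.
    """
    return max(rewards) > min(rewards)


def decoupled_clip_loss(logp_new, logp_old, advantage,
                        clip_low=0.2, clip_high=0.28):
    """Clipped surrogate with separate lower and upper clip ranges (one token)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    return -min(ratio * advantage, clipped * advantage)


# Three groups of binary rewards; only the mixed group survives filtering.
groups = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
kept = [g for g in groups if keep_group(g)]
```

A wider upper range lets the probability of positively rewarded tokens grow further before clipping takes effect, while the lower range still limits how far a token's probability can be pushed down.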

Reference Performance

| Hardware | Model | Method | GSM8K Score |
| --- | --- | --- | --- |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO | 89 |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO (FSDP2) | 89.8 |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO (Megatron) | 89.6 |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | GRPO-LoRA | 93.4 |
| AMD MI300 | deepseek-ai/deepseek-llm-7b-chat | GRPO | 71.4 |
For a full comparison across models, methods, and datasets, see the Baselines page.
