GRPO in verl: Critic-Free Group Relative Policy Optimization

Group Relative Policy Optimization (GRPO) is a critic-free reinforcement learning algorithm introduced in the DeepSeekMath paper. Instead of training a separate value network to estimate per-state baselines, GRPO generates a group of responses for each prompt and normalises rewards within that group. The group mean acts as the baseline, removing the need for a critic model entirely and significantly reducing memory consumption compared to PPO.

GRPO vs PPO

GRPO

No critic model. Memory footprint is roughly half that of PPO for the same base model size. Advantage is computed by group-relative reward normalisation — fast and simple. Best suited for tasks with verifiable, scalar rewards (math, coding, structured output).

PPO

Requires a critic model. Higher memory, but the learned value function often produces more stable advantage estimates, especially on tasks with long horizons or sparse rewards. More configurable bias–variance trade-off via GAE.

How GRPO Works

Group Sampling

For each prompt in the batch, the rollout engine samples n independent responses (controlled by actor_rollout_ref.rollout.n). This set of n responses is called a group.

Reward Assignment

Each response is scored by the reward model or rule-based function, producing a scalar reward per response.

Group-Relative Normalisation

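Within each group the rewards are standardised: subtract the group mean and divide by the group standard deviation. The result is a single scalar per response, which is then assigned to every token of that response. The advantages are therefore token-level in shape but constant within a response.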
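As a rough illustration of the normalisation step, the sketch below computes group-relative advantages in plain Python. The function name `grpo_advantages`, the `eps` stabiliser, and the choice of the population standard deviation are assumptions made for this example; verl's own implementation may differ in these details.

```python
from collections import defaultdict
from statistics import mean, pstdev


def grpo_advantages(rewards, group_ids, response_lengths, eps=1e-6):
    """Illustrative sketch (not verl's code): one list of token advantages per response.

    rewards[i]          scalar reward of response i
    group_ids[i]        identifier of the prompt that response i answers
    response_lengths[i] number of generated tokens in response i
    """
    # Collect the rewards of each group (all responses to the same prompt).
    groups = defaultdict(list)
    for reward, gid in zip(rewards, group_ids):
        groups[gid].append(reward)

    # Group mean is the baseline; group std rescales the centred reward.
    # Population std is an assumption of this sketch.
    stats = {gid: (mean(rs), pstdev(rs)) for gid, rs in groups.items()}

    advantages = []
    for reward, gid, length in zip(rewards, group_ids, response_lengths):
        mu, sigma = stats[gid]
        scalar = (reward - mu) / (sigma + eps)
        # Broadcast the per-response scalar to every token of the response.
        advantages.append([scalar] * length)
    return advantages


# Two prompts, two responses each; the second group has identical rewards.
adv = grpo_advantages(
    rewards=[1.0, 0.0, 1.0, 1.0],
    group_ids=["p0", "p0", "p1", "p1"],
    response_lengths=[3, 5, 2, 2],
)
```

A group whose responses all receive the same reward yields zero advantages, so it contributes no policy-gradient signal.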

Policy Update

The actor is updated using the clipped surrogate objective (identical to PPO) against the normalised advantages. KL regularisation is applied as a loss term rather than a reward penalty.
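The update can be pictured per token as a clipped surrogate term plus a weighted KL term. The sketch below is a simplified scalar version under stated assumptions: the function name `grpo_token_loss` is hypothetical, the KL term uses the k3 (low-variance) estimator, and the default coefficients mirror the values quoted elsewhere on this page rather than any verified internal code path.

```python
import math


def grpo_token_loss(logp_new, logp_old, logp_ref, advantage,
                    clip_ratio=0.2, kl_coef=0.001):
    """Illustrative per-token loss: clipped surrogate + KL loss (sketch only)."""
    # Importance ratio between the current policy and the rollout policy.
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip_ratio), 1.0 + clip_ratio)

    # PPO-style pessimistic objective; negated because we minimise a loss.
    pg_loss = -min(ratio * advantage, clipped * advantage)

    # k3 estimator of KL(actor || reference); always non-negative.
    log_r = logp_ref - logp_new
    kl = math.exp(log_r) - log_r - 1.0

    # KL enters the loss directly instead of being subtracted from the reward.
    return pg_loss + kl_coef * kl
```

When the new, old, and reference log-probabilities coincide, the ratio is 1 and the KL term is 0, so the loss reduces to the negated advantage.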

Configuration

Note that parameters containing micro_batch_size control the maximum sample count per GPU forward/backward pass to prevent OOMs; they do not affect algorithmic behaviour. Despite the ppo_ prefix on several parameters, they apply to GRPO as well — the GRPO training loop mirrors PPO’s, minus the critic.

Core Parameters

Parameter	Description	Recommended
`actor_rollout_ref.rollout.n`	Responses sampled per prompt (group size). Must be > 1 for GRPO.	`5`–`8`
`data.train_batch_size`	Global prompt batch size per iteration. Total trajectories = `train_batch_size × n`	—
`actor_rollout_ref.actor.ppo_mini_batch_size`	Global mini-batch for actor gradient updates	—
`actor_rollout_ref.actor.ppo_epochs`	Inner-loop epochs over sampled trajectories	—
`actor_rollout_ref.actor.clip_ratio`	PPO clip range ε	`0.2`
`algorithm.adv_estimator`	Must be set to `grpo`	`grpo`
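The relationship between these batch parameters can be checked with a little arithmetic. The helper below is a hypothetical sketch, not part of verl; it assumes that `ppo_mini_batch_size` is counted in prompts (like `train_batch_size`) and that it divides the prompt batch evenly. Verify both assumptions against your verl version.

```python
def rollout_budget(train_batch_size, n, ppo_mini_batch_size, ppo_epochs):
    """Hypothetical helper: per-iteration trajectory and update counts."""
    if train_batch_size % ppo_mini_batch_size != 0:
        raise ValueError("train_batch_size should be divisible by ppo_mini_batch_size")

    # Every prompt is answered n times, so trajectories scale with the group size.
    trajectories = train_batch_size * n

    # Assumption: mini-batches are counted in prompts, each carrying its n responses.
    mini_batches_per_epoch = train_batch_size // ppo_mini_batch_size

    return {
        "trajectories": trajectories,
        "mini_batches_per_epoch": mini_batches_per_epoch,
        "updates_per_iteration": mini_batches_per_epoch * ppo_epochs,
    }


budget = rollout_budget(train_batch_size=512, n=8, ppo_mini_batch_size=128, ppo_epochs=1)
```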

KL Regularisation

GRPO adds KL divergence between the actor and the reference policy directly to the training loss (rather than the reward). This is the canonical GRPO setup.

Parameter	Description	Default
`actor_rollout_ref.actor.use_kl_loss`	Enable KL loss. Set to `True` for GRPO.	`False`
`actor_rollout_ref.actor.kl_loss_coef`	Weight of the KL loss term	`0.001`
`actor_rollout_ref.actor.kl_loss_type`	KL estimator: `kl` (k1), `abs`, `mse` (k2), `low_var_kl` (k3), or `full`. Appending `+` (e.g. `k1+`, `k3+`) keeps the chosen estimator for the KL value but applies a straight-through trick so that the gradient is that of k2, which is unbiased	`low_var_kl`

For a detailed comparison of KL approximation methods, see Approximating KL Divergence by John Schulman.
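To make the estimator names concrete, here is a minimal per-token sketch of the sample-based options, following the k1/k2/k3 definitions from Schulman's note. The function name `kl_penalty_sketch` is made up for this example, the `full` option is omitted because it needs the complete token distributions rather than sampled log-probabilities, and the `+` straight-through variants are not modelled.

```python
import math


def kl_penalty_sketch(logp, ref_logp, kind="low_var_kl"):
    """Per-token KL estimate between actor (logp) and reference (ref_logp).

    Illustrative only; mirrors the estimator names used in the table above.
    """
    diff = logp - ref_logp
    if kind == "kl":          # k1: unbiased, can be negative, high variance
        return diff
    if kind == "abs":         # absolute log-ratio
        return abs(diff)
    if kind == "mse":         # k2: half the squared log-ratio
        return 0.5 * diff ** 2
    if kind == "low_var_kl":  # k3: unbiased, non-negative, lower variance
        log_r = ref_logp - logp
        return math.exp(log_r) - log_r - 1.0
    raise ValueError(f"unsupported kind: {kind}")


estimates = {k: kl_penalty_sketch(-1.0, -1.5, k) for k in ("kl", "abs", "mse", "low_var_kl")}
```

All four estimators return zero when the actor and reference assign the same log-probability to the sampled token.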

Loss Aggregation Mode

How individual token losses are aggregated into a scalar for the gradient update matters for training stability, especially in long-CoT settings.

`loss_agg_mode`	Behaviour	Notes
`token-mean`	Mean over all tokens in all sequences in a mini-batch	Default and recommended for most use cases
`seq-mean-token-sum`	Sum tokens per sequence, then mean across sequences	Longer sequences contribute proportionally more to the loss
`seq-mean-token-mean`	Mean tokens per sequence, then mean across sequences	Original GRPO paper formulation (sample-level loss). Can be unstable with variable-length long CoT
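The difference between the three modes is easiest to see on a toy mini-batch. The sketch below is a plain-Python illustration, not verl's implementation; `aggregate_loss` is a hypothetical name, and each inner list is assumed to hold only the valid (unmasked) token losses of one sequence.

```python
def aggregate_loss(token_losses, mode):
    """Reduce per-token losses (one inner list per sequence) to a scalar."""
    if mode == "token-mean":
        # Every token in the mini-batch has equal weight.
        total = sum(sum(seq) for seq in token_losses)
        count = sum(len(seq) for seq in token_losses)
        return total / count
    if mode == "seq-mean-token-sum":
        # Sum inside each sequence, then average over sequences.
        return sum(sum(seq) for seq in token_losses) / len(token_losses)
    if mode == "seq-mean-token-mean":
        # Average inside each sequence, then average over sequences.
        return sum(sum(seq) / len(seq) for seq in token_losses) / len(token_losses)
    raise ValueError(f"unsupported mode: {mode}")


# A short sequence with low loss and a long sequence with high loss.
batch = [[1.0, 1.0], [4.0, 4.0, 4.0, 4.0]]
results = {m: aggregate_loss(batch, m)
           for m in ("token-mean", "seq-mean-token-sum", "seq-mean-token-mean")}
# token-mean -> 3.0, seq-mean-token-sum -> 9.0, seq-mean-token-mean -> 2.5
```

In `seq-mean-token-mean` each token of the long sequence carries half the weight of a token in the short one, which is the length sensitivity that matters in long-CoT training.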

Running GRPO

Minimal Example

python3 -m verl.trainer.main_ppo \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    data.train_batch_size=512 \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
    actor_rollout_ref.rollout.n=8 \
    algorithm.adv_estimator=grpo \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    trainer.total_epochs=15

Using the Canonical Script

verl ships ready-to-run scripts for a wide range of models. Each script exposes its tunable knobs as environment variables:

FSDP (NVIDIA / NPU)

bash examples/grpo_trainer/run_qwen3_8b_fsdp.sh

Override knobs inline:

MODEL_PATH=Qwen/Qwen3-8B \
ROLLOUT_N=8 \
TRAIN_BATCH_SIZE=1024 \
INFER_BACKEND=sglang \
bash examples/grpo_trainer/run_qwen3_8b_fsdp.sh

Megatron-LM

bash examples/grpo_trainer/run_qwen3_8b_megatron.sh

Megatron scripts are preferred for models above ~30B where tensor parallelism is required.

Large-Scale (MoE / 671B)

For MoE and 671B-class models:

# Qwen3-30B-A3B MoE on FSDP
bash examples/grpo_trainer/run_qwen3_30b_a3b_fsdp.sh

# DeepSeek-V3 671B on Megatron
bash examples/grpo_trainer/run_deepseek_v3_671b_megatron.sh

Advanced Extensions

DrGRPO: Eliminating Length Bias

Standard GRPO averages each response's token losses over that response's own length and divides advantages by the group standard deviation. The length averaging shrinks the per-token penalty on long incorrect responses, which creates an implicit incentive for the model to generate longer outputs when it is wrong. DrGRPO (from Understanding R1-Zero-Like Training: A Critical Perspective) corrects this by normalising token-level losses with a global constant instead of per-sequence lengths.

Set loss aggregation to seq-mean-token-sum-norm

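This disables averaging along the sequence (token) dimension: token losses are summed per response and scaled by a shared constant, so response length no longer changes the weight of an individual token.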

actor_rollout_ref.actor.loss_agg_mode=seq-mean-token-sum-norm

Disable standard deviation normalisation

algorithm.norm_adv_by_std_in_grpo=False

Disable KL loss

DrGRPO does not use the KL loss term:

actor_rollout_ref.actor.use_kl_loss=False

(Optional) Set a fixed normalisation constant

If not set, the current batch’s response length is used. Setting an explicit constant ensures consistent normalisation across batches:

actor_rollout_ref.actor.loss_scale_factor=2048   # e.g. max response length
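The two DrGRPO changes (mean-centred advantages without std division, and a constant loss normaliser) can be sketched as follows. Both function names are hypothetical, and the final reduction across sequences (a mean here) is an assumption of this sketch; how verl combines sequences in `seq-mean-token-sum-norm` may differ, so treat this as an illustration of the idea rather than the library's behaviour.

```python
from statistics import mean


def dr_grpo_advantages(group_rewards):
    """Centre rewards on the group mean; no division by the std (sketch)."""
    baseline = mean(group_rewards)
    return [r - baseline for r in group_rewards]


def token_sum_norm(token_losses, loss_scale_factor):
    """Sum token losses per sequence and divide by one shared constant.

    Assumption: sequences are then averaged; the constant plays the role of
    loss_scale_factor (e.g. the maximum response length).
    """
    per_sequence = [sum(seq) / loss_scale_factor for seq in token_losses]
    return sum(per_sequence) / len(per_sequence)


advs = dr_grpo_advantages([1.0, 0.0, 0.0, 0.0])
loss = token_sum_norm([[1.0, 1.0], [4.0, 4.0, 4.0, 4.0]], loss_scale_factor=4)
```

Because the divisor is the same for every response, a token's weight is 1 / `loss_scale_factor` regardless of how long its response is.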

DAPO: Decoupled Clip and Dynamic Sampling

DAPO extends GRPO with two key innovations: decoupled clip ratios (separate lower and upper clip bounds, with a larger upper bound so that low-probability tokens with positive advantages have more room to grow) and dynamic sampling (filtering out groups where all responses succeed or all fail before the gradient update, since such groups have zero advantage everywhere and contribute no gradient). Applied to Qwen2.5-32B base, DAPO achieves 50% accuracy on AIME 2024. For full details and configuration, see More Algorithms — DAPO.
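Both ideas fit in a few lines. The sketch below is illustrative only: the function names are hypothetical, and the default bounds of 0.2 and 0.28 are the values reported in the DAPO paper, not settings read from verl's configuration.

```python
import math


def decoupled_clip_objective(logp_new, logp_old, advantage,
                             clip_low=0.2, clip_high=0.28):
    """Per-token surrogate with separate lower and upper clip bounds (sketch)."""
    ratio = math.exp(logp_new - logp_old)
    # The upper bound is looser than the lower bound.
    clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    return min(ratio * advantage, clipped * advantage)


def keep_informative_groups(group_rewards):
    """Dynamic-sampling filter: drop groups whose rewards are all identical."""
    return {gid: rewards for gid, rewards in group_rewards.items()
            if max(rewards) != min(rewards)}


kept = keep_informative_groups({
    "all_correct": [1.0, 1.0, 1.0],
    "mixed": [0.0, 1.0, 0.0],
    "all_wrong": [0.0, 0.0],
})
```

In a full training loop the filtered-out groups would typically be replaced by sampling further prompts until the batch is full; that step is left out here.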

Reference Performance

Hardware	Model	Method	GSM8K Score
NVIDIA GPU	Qwen/Qwen2-7B-Instruct	GRPO	89
NVIDIA GPU	Qwen/Qwen2-7B-Instruct	GRPO (FSDP2)	89.8
NVIDIA GPU	Qwen/Qwen2-7B-Instruct	GRPO (Megatron)	89.6
NVIDIA GPU	Qwen/Qwen2.5-7B-Instruct	GRPO-LoRA	93.4
AMD MI300	deepseek-ai/deepseek-llm-7b-chat	GRPO	71.4

For a full comparison across models, methods, and datasets, see the Baselines page.
