PPO in verl: Configuration and Training Guide

Proximal Policy Optimization (PPO) is one of the most widely used policy gradient algorithms in modern reinforcement learning, including large-scale LLM fine-tuning. verl provides a production-ready PPO implementation backed by either FSDP or Megatron-LM, supporting GAE-based advantage estimation, adaptive KL divergence control, and Dual-Clip extensions—all configurable through a composable Hydra config tree.

How PPO Works

PPO is an actor-critic algorithm: it trains two models simultaneously, an actor (the policy being optimized) and a critic (a value function that estimates expected returns). The critic's predictions feed into Generalized Advantage Estimation (GAE), which produces a per-token advantage with a tunable bias-variance trade-off. The actor is then updated using a clipped surrogate objective that limits how far the new policy can deviate from the old one in a single update, reducing the instability that plagues vanilla policy gradient methods.
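The GAE recursion described above can be sketched in a few lines of plain Python. This is a minimal illustration for a single trajectory, not verl's implementation: the function name `compute_gae`, the default values of `gamma` and `lam`, and the convention that the value after the final step is zero are all assumptions made for the example.

```python
def compute_gae(rewards, values, gamma=1.0, lam=1.0):
    """Return (advantages, returns) for one trajectory.

    rewards[t] is the reward at step t and values[t] is the critic's
    estimate V(s_t). The value after the final step is assumed to be 0.
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0  # bootstrap value beyond the last step (assumed 0)
    running = 0.0     # discounted sum of future TD errors
    for t in reversed(range(len(rewards))):
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    # Value targets for the critic are advantage + current value estimate.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# Sparse reward: only the final token is rewarded.
adv, ret = compute_gae(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.5, 0.5])
adv_td, _ = compute_gae(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.5, 0.5], lam=0.0)
```

With `lam=1.0` every token receives the full Monte Carlo advantage; with `lam=0.0` only the one-step TD error remains, so the earlier tokens get zero advantage in this example.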

When to use PPO

PPO is the right choice when training stability is paramount, when you have sufficient GPU memory to host both actor and critic models, or when your task benefits from the bias–variance trade-off that GAE provides over simpler advantage estimators.

PPO vs GRPO

GRPO is a critic-free alternative that uses group-relative reward normalization. It uses less memory because there is no critic model to train, but PPO’s critic typically delivers more stable advantage estimates, especially on tasks with sparse or delayed rewards.
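To make the contrast concrete, group-relative normalization can be sketched as follows. This is an illustrative sketch only: the function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` constant are assumptions, and verl's own estimator may differ in these details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of several responses to the same prompt.

    Each response's advantage is its reward relative to the group mean,
    scaled by the group standard deviation. No critic is involved.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an illustrative choice
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one prompt, two correct and two incorrect.
scores = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Every token in a response shares the same advantage under this scheme, which is why a learned critic can help when rewards are sparse or delayed.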

Key Configuration Parameters

The table below summarises the most important knobs. Parameters whose names contain `micro_batch_size` only cap the number of samples per GPU in each forward/backward pass to avoid out-of-memory errors; they do not affect algorithmic behaviour.

Parameter	Description	Default
`data.train_batch_size`	Global batch size of prompts per iteration. Total trajectories = `train_batch_size × rollout.n`	—
`actor_rollout_ref.actor.ppo_mini_batch_size`	Global mini-batch size for actor gradient updates	—
`critic.ppo_mini_batch_size`	Global mini-batch size for critic gradient updates	—
`actor_rollout_ref.actor.clip_ratio`	PPO clip range ε	`0.2`
`actor_rollout_ref.actor.ppo_epochs`	Epochs of actor updates per rollout	—
`critic.ppo_epochs`	Epochs of critic updates per rollout (defaults to actor value)	—
`algorithm.gamma`	Discount factor γ	—
`algorithm.lam`	GAE λ — trades off bias vs. variance in the advantage estimator	—
`algorithm.adv_estimator`	Advantage estimator: `gae`, `grpo`, `reinforce_plus_plus`, `reinforce_plus_plus_baseline`, `rloo`	`gae`
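The batch-size arithmetic in the table can be checked with a small helper. This sketch assumes that the mini-batch size is counted in the same units as `data.train_batch_size` (prompts) and that it divides the batch evenly; the function name `rollout_budget` and the example values for `rollout_n` and `ppo_epochs` are hypothetical.

```python
def rollout_budget(train_batch_size, rollout_n, ppo_mini_batch_size, ppo_epochs):
    """Estimate trajectories and gradient updates per training iteration."""
    # Total trajectories = train_batch_size x rollout.n (from the table above).
    total_trajectories = train_batch_size * rollout_n
    # Assumption: the mini-batch size is measured in prompts, like the batch size.
    updates_per_epoch = train_batch_size // ppo_mini_batch_size
    return {
        "total_trajectories": total_trajectories,
        "updates_per_epoch": updates_per_epoch,
        "actor_updates_per_iteration": updates_per_epoch * ppo_epochs,
    }


budget = rollout_budget(
    train_batch_size=1024, rollout_n=1, ppo_mini_batch_size=256, ppo_epochs=1
)
```

Under these assumptions, a batch of 1024 prompts with a mini-batch of 256 yields four actor updates per epoch.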

KL Divergence Control

Without regularisation, the policy can drift far from the reference (SFT) model during RL training. verl provides two complementary mechanisms to prevent this.

KL Reward Penalty

A KL penalty is subtracted from the task reward at every step, keeping the policy close to the reference model throughout training. This mirrors the approach used in InstructGPT.

Parameter	Description	Default
`algorithm.use_kl_in_reward`	Enable in-reward KL penalty	`False`
`algorithm.kl_penalty`	KL estimator type: `kl` (k1), `abs`, `mse` (k2), `low_var_kl` (k3), `full`	—
`algorithm.kl_ctrl.kl_coef`	Initial KL penalty coefficient	`0.001`
`algorithm.kl_ctrl.type`	Controller type: `fixed` or `adaptive`	—
`algorithm.kl_ctrl.horizon`	Horizon for the adaptive controller	—
`algorithm.kl_ctrl.target_kl`	Target KL for the adaptive controller	—
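How an adaptive controller and an in-reward penalty fit together can be sketched as below. The update rule shown is the proportional controller commonly used in RLHF work (clipping the relative error to ±0.2); whether verl's `adaptive` controller matches it exactly is an assumption, and the class name `AdaptiveKLSketch`, the helper `penalized_rewards`, and the example values for `target_kl` and `horizon` are hypothetical.

```python
class AdaptiveKLSketch:
    """Proportional controller that nudges the KL coefficient toward a target."""

    def __init__(self, init_kl_coef, target_kl, horizon):
        self.value = init_kl_coef
        self.target = target_kl
        self.horizon = horizon

    def update(self, current_kl, n_steps):
        # Relative error between observed and target KL, clipped to +/- 0.2
        # so that one noisy batch cannot change the coefficient drastically.
        error = max(-0.2, min(0.2, current_kl / self.target - 1.0))
        self.value *= 1.0 + error * n_steps / self.horizon


def penalized_rewards(task_rewards, kl_per_token, kl_coef):
    """Subtract the scaled per-token KL estimate from the task reward."""
    return [r - kl_coef * kl for r, kl in zip(task_rewards, kl_per_token)]


ctrl = AdaptiveKLSketch(init_kl_coef=0.001, target_kl=0.1, horizon=10000)
ctrl.update(current_kl=0.2, n_steps=1000)  # observed KL is above target
shaped = penalized_rewards([0.0, 1.0], kl_per_token=[0.5, 0.5], kl_coef=0.1)
```

When the observed KL exceeds the target, the coefficient grows and the penalty tightens; when it falls below, the coefficient shrinks. A `fixed` controller simply keeps the initial coefficient.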

KL Loss

The KL divergence between the actor and the reference policy is added directly to the training loss. When this is enabled, the in-reward KL penalty should be disabled.

Parameter	Description	Default
`actor_rollout_ref.actor.use_kl_loss`	Add KL divergence to the actor loss	`False`
`actor_rollout_ref.actor.kl_loss_coef`	Coefficient of the KL loss term	`0.001`
`actor_rollout_ref.actor.kl_loss_type`	KL estimator: `kl` (k1), `abs`, `mse` (k2), `low_var_kl` (k3), `full`. Append `+` (e.g. `k1+`, `k3+`) to enable straight-through k2 for unbiased gradient estimation	—

For a detailed analysis of the different KL approximation methods (k1, k2, k3), see Approximating KL Divergence by John Schulman.
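As a rough illustration of the three estimators, the sketch below evaluates k1, k2, and k3 for a single sampled token, following the definitions in Schulman's note with the sample drawn from the actor. The function name `kl_estimates` is hypothetical, and the mapping of these formulas onto verl's option names is taken from the tables above rather than from the library code.

```python
import math


def kl_estimates(actor_logprob, ref_logprob):
    """Per-sample estimators of KL(actor || reference) for one sampled token."""
    log_ratio = ref_logprob - actor_logprob  # log r, with r = p_ref / p_actor
    return {
        "k1": -log_ratio,                         # unbiased, can be negative
        "k2": 0.5 * log_ratio ** 2,               # biased, always non-negative
        "k3": math.expm1(log_ratio) - log_ratio,  # (r - 1) - log r, non-negative
    }


# The actor assigns probability 0.5 to the token, the reference assigns 0.25.
est = kl_estimates(math.log(0.5), math.log(0.25))
```

k3 is never negative and usually has lower variance than k1, which is the motivation for a low-variance option such as `low_var_kl`.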

Running PPO

Minimal Example

The following command fine-tunes Qwen2.5-0.5B-Instruct on GSM8K with PPO using GAE advantages:

python3 -m verl.trainer.main_ppo \
    algorithm=ppo \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    data.train_batch_size=1024 \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
    critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
    algorithm.adv_estimator=gae \
    trainer.total_epochs=15

Using the Canonical Script

For a fully featured run with sensible defaults for Qwen3-8B on GSM8K + MATH:

bash examples/ppo_trainer/run_qwen3_8b_fsdp.sh

The script exposes the most commonly tuned knobs as environment variables:

MODEL_PATH=Qwen/Qwen3-8B \
TRAIN_BATCH_SIZE=1024 \
PPO_MINI_BATCH_SIZE=256 \
TOTAL_EPOCHS=15 \
bash examples/ppo_trainer/run_qwen3_8b_fsdp.sh

Megatron-LM Backend

For large models that require tensor parallelism, use the Megatron-LM training script:

bash examples/ppo_trainer/run_qwen3_8b_megatron.sh

Advanced Options

FSDP2 Training Strategy

Switch to the FSDP2 sharding strategy by setting:

actor_rollout_ref.actor.strategy=fsdp2

FSDP2 provides improved memory efficiency and supports per-parameter sharding, which can be beneficial for very large models.

CPU Offload for Gradient Accumulation

When GPU memory is constrained, you can offload parameters and optimiser states to CPU:

actor_rollout_ref.actor.fsdp_config.param_offload=True \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True

Dual-Clip PPO

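Standard PPO clips the probability ratio to the range [1 - ε, 1 + ε]. When the advantage is negative, however, the clipped objective is still unbounded as the ratio grows, so a single token with a very large ratio can dominate the update. Dual-Clip PPO caps the per-token loss at clip_ratio_c × |advantage| in the negative-advantage case, which keeps these updates bounded.

actor_rollout_ref.actor.clip_ratio_c=3.0   # dual-clip constant c (> 1), default 3.0

The objective becomes:

pg_losses1 = -advantages * ratio
pg_losses2 = -advantages * clamp(ratio, 1 - ε, 1 + ε)
clipped    = max(pg_losses1, pg_losses2)
loss       = min(clipped, -clip_ratio_c * advantages)  # when adv < 0; otherwise loss = clipped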
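A scalar, per-token version of the dual-clip objective can be written as a short runnable sketch. It is meant only to show how the extra bound behaves; the function name `dual_clip_loss` is hypothetical, and a real implementation would operate on tensors and average over response tokens.

```python
def dual_clip_loss(advantage, ratio, clip_ratio=0.2, clip_ratio_c=3.0):
    """Per-token PPO loss with the dual-clip bound for negative advantages."""
    clipped_ratio = max(1.0 - clip_ratio, min(ratio, 1.0 + clip_ratio))
    pg_loss1 = -advantage * ratio
    pg_loss2 = -advantage * clipped_ratio
    # Standard PPO: take the more pessimistic (larger) of the two losses.
    clipped = max(pg_loss1, pg_loss2)
    if advantage < 0:
        # Dual clip: cap the loss so a huge ratio cannot dominate the update.
        return min(clipped, -clip_ratio_c * advantage)
    return clipped


capped = dual_clip_loss(advantage=-1.0, ratio=10.0)    # bound is active
uncapped = dual_clip_loss(advantage=-1.0, ratio=2.0)   # bound is inactive
positive = dual_clip_loss(advantage=1.0, ratio=10.0)   # standard upper clip
```

With an advantage of -1 and a ratio of 10, standard PPO would return a loss of 10, whereas the dual-clip bound limits it to 3. For positive advantages the result is identical to standard PPO.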

Entropy Regularisation

A small entropy bonus encourages exploration and can prevent premature convergence:

actor_rollout_ref.actor.entropy_coeff=0.01

The default value of `actor_rollout_ref.actor.entropy_coeff` was changed from a non-zero value to 0.0 in verl 0.3.x (2025-05-30). If you are comparing with results from older checkpoints, verify the entropy coefficient used.
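The effect of the entropy bonus can be sketched for a single token position. This example assumes the common formulation in which the scaled entropy is subtracted from the policy loss; the function names `token_entropy` and `actor_loss_with_entropy` are hypothetical, and how verl aggregates entropy across tokens is not shown here.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over logits."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def actor_loss_with_entropy(pg_loss, entropy, entropy_coeff=0.01):
    """Subtract the entropy bonus so that higher entropy lowers the loss."""
    return pg_loss - entropy_coeff * entropy


uniform_entropy = token_entropy([0.0, 0.0, 0.0, 0.0])  # maximal for 4 tokens
peaked_entropy = token_entropy([10.0, 0.0, 0.0, 0.0])  # close to zero
loss = actor_loss_with_entropy(pg_loss=1.0, entropy=2.0, entropy_coeff=0.01)
```

A uniform distribution over four tokens has entropy log 4, the maximum for that vocabulary size, so the bonus rewards the policy for keeping probability mass spread out.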

Reference Performance

The table below shows validated GSM8K test scores using verl v0.2.

Model	Method	GSM8K Score	Notes
Qwen/Qwen2.5-0.5B-Instruct	Pretrained	36.4	Baseline
Qwen/Qwen2.5-0.5B-Instruct	PPO	56.7	Training log
deepseek-ai/deepseek-llm-7b-chat	PPO (Megatron)	69.5	Training log
deepseek-ai/deepseek-llm-7b-chat	PPO (AMD MI300)	70.5	Training log

For comprehensive baselines across models, datasets, and algorithms, see the Baselines page.
