Proximal Policy Optimization (PPO) is one of the most widely used policy gradient algorithms in modern reinforcement learning, including large-scale LLM fine-tuning. verl provides a production-ready PPO implementation backed by either FSDP or Megatron-LM, supporting GAE-based advantage estimation, adaptive KL divergence control, and Dual-Clip extensions, all configurable through a composable Hydra config tree.
How PPO Works
PPO is an actor-critic algorithm: it trains two models simultaneously, an actor (the policy being optimized) and a critic (a value function that estimates expected returns). The critic’s predictions feed into Generalized Advantage Estimation (GAE), which produces low-variance advantage values for each token. The actor is then updated using a clipped surrogate objective that limits how far the new policy can deviate from the old one, preventing the instability that plagues vanilla policy gradient methods.
When to use PPO
PPO is the right choice when training stability is paramount, when you have sufficient GPU memory to host both actor and critic models, or when your task benefits from the bias–variance trade-off that GAE provides over simpler advantage estimators.
PPO vs GRPO
GRPO is a critic-free alternative that uses group-relative reward normalization. It uses less memory because there is no critic model to train, but PPO’s critic typically delivers more stable advantage estimates, especially on tasks with sparse or delayed rewards.
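
To illustrate what "group-relative reward normalization" means in the comparison above, the sketch below standardises the scalar rewards of several responses sampled for the same prompt. It is a minimal illustration rather than verl's implementation: the function name `group_relative_advantages`, the `eps` stabiliser, and the use of the population standard deviation are all assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardise rewards within one group of responses to the same prompt.

    No critic is involved: the group mean acts as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, two of them rewarded.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Rewarded responses end up with positive advantages and unrewarded ones with negative advantages, and the values sum to zero within the group. PPO replaces this group baseline with a learned per-token value estimate, which is where the extra memory cost comes from.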
Key Configuration Parameters
The table below summarises the most important knobs. Note that parameters containing micro_batch_size control the maximum number of samples per GPU forward/backward pass to avoid OOMs; they do not affect algorithmic behaviour.
| Parameter | Description | Default |
|---|---|---|
| data.train_batch_size | Global batch size of prompts per iteration. Total trajectories = train_batch_size × rollout.n | - |
| actor_rollout_ref.actor.ppo_mini_batch_size | Global mini-batch size for actor gradient updates | - |
| critic.ppo_mini_batch_size | Global mini-batch size for critic gradient updates | - |
| actor_rollout_ref.actor.clip_ratio | PPO clip range ε | 0.2 |
| actor_rollout_ref.actor.ppo_epochs | Epochs of actor updates per rollout | - |
| critic.ppo_epochs | Epochs of critic updates per rollout (defaults to actor value) | - |
| algorithm.gamma | Discount factor γ | - |
| algorithm.lam | GAE λ, which trades off bias vs. variance in the advantage estimator | - |
| algorithm.adv_estimator | Advantage estimator: gae, grpo, reinforce_plus_plus, reinforce_plus_plus_baseline, rloo | gae |
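
To make the roles of algorithm.gamma and algorithm.lam concrete, here is a minimal sketch of the textbook GAE recursion for a single trajectory. It is not verl's implementation: the function name `compute_gae`, the plain-list inputs, and the convention that the value after the final step is zero are assumptions chosen to keep the example self-contained.

```python
def compute_gae(rewards, values, gamma, lam):
    """Return (advantages, returns) for one trajectory.

    rewards[t] is the reward at step t and values[t] is the critic's
    estimate at step t. The value after the final step is taken to be 0.
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0
    running = 0.0
    # Walk backwards so each step can reuse the advantage of the next one.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    # The critic's regression targets are advantage plus value estimate.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# Sparse reward: only the last token is rewarded.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5]
adv_mc, ret_mc = compute_gae(rewards, values, gamma=1.0, lam=1.0)
adv_td, ret_td = compute_gae(rewards, values, gamma=1.0, lam=0.0)
```

With lam=1.0 every token receives the full return minus the critic's estimate (lower bias, higher variance); with lam=0.0 only the one-step TD residual is used, so the earlier tokens in this example get zero advantage (higher bias, lower variance).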
KL Divergence Control
Without regularisation, the policy can drift far from the reference (SFT) model during RL training. verl provides two complementary mechanisms to prevent this:
- KL Reward Penalty
- KL Loss
KL Reward Penalty: a KL penalty is subtracted from the task reward at every step, keeping the policy close to the reference model throughout training. This mirrors the approach used in InstructGPT.
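
As a rough illustration of subtracting a KL penalty from the reward, the sketch below computes per-token KL estimates from the log-probabilities of the sampled tokens and shapes the rewards with them. The estimator names follow the k1/abs/k2/k3 labels used for algorithm.kl_penalty on this page, but the formulas are the generic sample-based estimators, so verl's own code may differ in details such as clipping. The function names and the example numbers (including the exaggerated coefficient of 0.1) are hypothetical, and the full estimator is omitted because it needs whole distributions rather than sampled-token log-probs.

```python
import math


def kl_estimate(logp, ref_logp, kind="kl"):
    """Per-token KL estimate from log-probs of the sampled token."""
    diff = logp - ref_logp
    if kind == "kl":          # k1: unbiased, but can be negative
        return diff
    if kind == "abs":
        return abs(diff)
    if kind == "mse":         # k2
        return 0.5 * diff ** 2
    if kind == "low_var_kl":  # k3: always non-negative, lower variance
        return math.exp(-diff) + diff - 1.0
    raise ValueError(f"unsupported estimator: {kind}")


def penalised_rewards(task_rewards, logps, ref_logps, kl_coef, kind="kl"):
    """Subtract kl_coef * KL estimate from the task reward at every token."""
    return [
        r - kl_coef * kl_estimate(lp, rlp, kind)
        for r, lp, rlp in zip(task_rewards, logps, ref_logps)
    ]


# Hypothetical three-token response with a reward on the final token.
shaped = penalised_rewards(
    task_rewards=[0.0, 0.0, 1.0],
    logps=[-1.0, -0.5, -2.0],
    ref_logps=[-1.0, -1.0, -1.5],
    kl_coef=0.1,
)
```

Tokens where the policy assigns more probability than the reference are penalised, and with the k1 estimator tokens where it assigns less receive a small bonus, which is why the non-negative variants are sometimes preferred.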
| Parameter | Description | Default |
|---|---|---|
| algorithm.use_kl_in_reward | Enable in-reward KL penalty | False |
| algorithm.kl_penalty | KL estimator type: kl (k1), abs, mse (k2), low_var_kl (k3), full | - |
| algorithm.kl_ctrl.kl_coef | Initial KL penalty coefficient | 0.001 |
| algorithm.kl_ctrl.type | Controller type: fixed or adaptive | - |
| algorithm.kl_ctrl.horizon | Horizon for the adaptive controller | - |
| algorithm.kl_ctrl.target_kl | Target KL for the adaptive controller | - |
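
To show how an initial coefficient, a target KL, and a horizon can interact, here is a sketch of a proportional controller in the style commonly used for RLHF. It is an assumption that verl's adaptive controller behaves this way; the class name, the clipping range of ±0.2, and the target_kl and horizon values in the example are illustrative, with only the initial coefficient 0.001 taken from the table above.

```python
class AdaptiveKLController:
    """Proportional controller that nudges the KL coefficient toward a target."""

    def __init__(self, init_kl_coef, target_kl, horizon):
        self.value = init_kl_coef
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, current_kl, n_steps):
        # Relative error, clipped so one noisy batch cannot swing the coefficient.
        error = max(-0.2, min(0.2, current_kl / self.target_kl - 1.0))
        # A larger horizon means slower adaptation.
        self.value *= 1.0 + error * n_steps / self.horizon
        return self.value


# Hypothetical target and horizon; 0.001 matches the documented initial coefficient.
ctrl = AdaptiveKLController(init_kl_coef=0.001, target_kl=0.1, horizon=10000)
after_high_kl = ctrl.update(current_kl=0.2, n_steps=256)   # observed KL above target
after_low_kl = ctrl.update(current_kl=0.05, n_steps=256)   # observed KL below target
```

The coefficient grows when the measured KL exceeds the target and shrinks when it falls below, whereas a fixed controller would simply keep the initial value.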
Running PPO
Minimal Example
The following command fine-tunes Qwen2.5-0.5B-Instruct on GSM8K with PPO using GAE advantages:
Using the Canonical Script
For a fully-featured run with sensible defaults for Qwen3-8B on GSM8K + MATH:
Megatron-LM Backend
For large models that require tensor parallelism, use the Megatron-LM training script:
Advanced Options
FSDP2 Training Strategy
Switch to the FSDP2 sharding strategy by setting:
CPU Offload for Gradient Accumulation
When GPU memory is constrained, you can offload parameters and optimiser states to CPU:
Dual-Clip PPO
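
For intuition, the sketch below computes the per-token PPO loss with and without a dual-clip bound. It is a minimal illustration, not verl's code: the function name `ppo_token_loss` and the parameter name `dual_clip_c` are hypothetical, and only the clip range 0.2 comes from the configuration table on this page.

```python
def ppo_token_loss(ratio, advantage, clip_ratio=0.2, dual_clip_c=None):
    """Loss (negated surrogate objective) for a single token.

    ratio is pi_new / pi_old for the sampled token. Passing dual_clip_c > 1
    enables the extra bound that applies only when the advantage is negative.
    """
    clipped = max(1.0 - clip_ratio, min(1.0 + clip_ratio, ratio))
    # Standard PPO: take the more pessimistic of the two surrogate terms.
    loss = max(-ratio * advantage, -clipped * advantage)
    if dual_clip_c is not None and advantage < 0:
        # Dual-clip: cap the loss at c * |advantage| for negative advantages.
        loss = min(loss, -dual_clip_c * advantage)
    return loss


# A token whose probability rose sharply despite a negative advantage.
standard = ppo_token_loss(ratio=10.0, advantage=-1.0)
dual = ppo_token_loss(ratio=10.0, advantage=-1.0, dual_clip_c=3.0)
```

In this example the standard loss scales with the ratio, while the dual-clip variant is capped at 3.0, so a single off-policy token cannot dominate the update. For positive advantages the two variants are identical.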
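Standard PPO clips the probability ratio to [1 - ε, 1 + ε], but when the advantage is negative and the ratio is very large, the unclipped term dominates and the loss is unbounded. Dual-Clip PPO adds a lower bound of c·A (with c > 1) on the surrogate objective in the negative-advantage case, which caps the size of such updates.
Entropy Regularisation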
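
As a small illustration of what an entropy bonus does to the loss, the sketch below computes the entropy of a token distribution and subtracts it, scaled by a coefficient, from a policy loss. The function names and the example distributions are hypothetical; only the coefficient's role mirrors actor_rollout_ref.actor.entropy_coeff.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def loss_with_entropy_bonus(policy_loss, probs, entropy_coeff):
    """Higher entropy lowers the loss, so the optimiser is nudged to keep exploring."""
    return policy_loss - entropy_coeff * entropy(probs)


uniform = [0.25, 0.25, 0.25, 0.25]   # maximally uncertain over four tokens
peaked = [0.97, 0.01, 0.01, 0.01]    # nearly deterministic

loss_uniform = loss_with_entropy_bonus(1.0, uniform, entropy_coeff=0.01)
loss_peaked = loss_with_entropy_bonus(1.0, peaked, entropy_coeff=0.01)
```

With a coefficient of 0.0 the bonus vanishes and the loss reduces to the policy loss alone.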
A small entropy bonus encourages exploration and can prevent premature convergence:
Note: The default value of actor_rollout_ref.actor.entropy_coeff was changed from a non-zero value to 0.0 in verl 0.3.x (2025-05-30). If you are comparing with results from older checkpoints, verify the entropy coefficient used.
Reference Performance
The table below shows validated GSM8K test scores using verl v0.2.

| Model | Method | GSM8K Score | Notes |
|---|---|---|---|
| Qwen/Qwen2.5-0.5B-Instruct | Pretrained | 36.4 | Baseline |
| Qwen/Qwen2.5-0.5B-Instruct | PPO | 56.7 | Training log |
| deepseek-ai/deepseek-llm-7b-chat | PPO (Megatron) | 69.5 | Training log |
| deepseek-ai/deepseek-llm-7b-chat | PPO (AMD MI300) | 70.5 | Training log |