Proximal Policy Optimization (PPO) is one of the most widely used policy gradient algorithms in modern reinforcement learning, including large-scale LLM fine-tuning. verl provides a production-ready PPO implementation backed by either FSDP or Megatron-LM, supporting GAE-based advantage estimation, adaptive KL divergence control, and Dual-Clip extensions—all configurable through a composable Hydra config tree.

How PPO Works

PPO is an actor-critic algorithm: it trains two models simultaneously—an actor (the policy being optimized) and a critic (a value function that estimates expected returns). The critic’s predictions feed into Generalized Advantage Estimation (GAE), which produces low-variance advantage values for each token. The actor is then updated using a clipped surrogate objective that limits how far the new policy can deviate from the old one, preventing the instability that plagues vanilla policy gradient methods.
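The GAE step described above can be sketched in a few lines of plain Python. This is a minimal single-trajectory illustration, not verl's implementation: the function name `compute_gae` is hypothetical, the value after the final step is assumed to be 0, and `gamma` and `lam` correspond to the discount factor γ and GAE λ.

```python
def compute_gae(rewards, values, gamma, lam):
    """Return (advantages, returns) for one trajectory.

    rewards[t] is the reward at step t and values[t] is the critic's
    estimate V(s_t). The value after the last step is assumed to be 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # One-step TD residual: how much better step t went than predicted.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD residuals.
        running = delta + gamma * lam * running
        advantages[t] = running
    # The critic's regression targets are advantage + value.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# Sparse reward at the final token, constant critic prediction.
adv, ret = compute_gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=1.0)
print(adv, ret)
```

With `lam=1.0` every token receives the full return minus its value estimate; with `lam=0.0` only the one-step TD residual is used, which lowers variance at the cost of more bias from the critic.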

When to use PPO

PPO is the right choice when training stability is paramount, when you have sufficient GPU memory to host both actor and critic models, or when your task benefits from the bias–variance trade-off that GAE provides over simpler advantage estimators.

PPO vs GRPO

GRPO is a critic-free alternative that uses group-relative reward normalization. It uses less memory because there is no critic model to train, but PPO’s critic typically delivers more stable advantage estimates, especially on tasks with sparse or delayed rewards.
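To make the contrast concrete, here is a rough sketch of group-relative reward normalization for the responses sampled from one prompt. The function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration; verl's own GRPO code may differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's sampled responses.

    Each response's advantage is its reward relative to the group mean,
    scaled by the group standard deviation. No critic is involved.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses to the same prompt, two correct and two incorrect.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

If every response in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason a learned critic can help on sparse-reward tasks.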

Key Configuration Parameters

The table below summarises the most important knobs. Note that parameters containing micro_batch_size control the maximum number of samples per GPU forward/backward pass to avoid OOMs; they do not affect algorithmic behaviour.
| Parameter | Description | Default |
| --- | --- | --- |
| data.train_batch_size | Global batch size of prompts per iteration. Total trajectories = train_batch_size × rollout.n | |
| actor_rollout_ref.actor.ppo_mini_batch_size | Global mini-batch size for actor gradient updates | |
| critic.ppo_mini_batch_size | Global mini-batch size for critic gradient updates | |
| actor_rollout_ref.actor.clip_ratio | PPO clip range ε | 0.2 |
| actor_rollout_ref.actor.ppo_epochs | Epochs of actor updates per rollout | |
| critic.ppo_epochs | Epochs of critic updates per rollout (defaults to actor value) | |
| algorithm.gamma | Discount factor γ | |
| algorithm.lam | GAE λ; trades off bias vs. variance in the advantage estimator | |
| algorithm.adv_estimator | Advantage estimator: gae, grpo, reinforce_plus_plus, reinforce_plus_plus_baseline, rloo | gae |
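The batch-size arithmetic can be worked through with a small helper. This is a sketch under one stated assumption: that the mini-batch size is counted in prompts, the same unit as train_batch_size. The function name `rollout_budget` and the value `rollout_n=4` are hypothetical and chosen only for the example.

```python
def rollout_budget(train_batch_size, rollout_n, ppo_mini_batch_size, ppo_epochs):
    """Work out how much data one PPO iteration produces and consumes."""
    # Each prompt is sampled rollout_n times.
    total_trajectories = train_batch_size * rollout_n
    # Assumption: mini-batches are counted in prompts, so one pass over the
    # rollout data takes train_batch_size / ppo_mini_batch_size updates.
    updates_per_epoch = train_batch_size // ppo_mini_batch_size
    return {
        "total_trajectories": total_trajectories,
        "optimizer_steps": updates_per_epoch * ppo_epochs,
    }


print(rollout_budget(train_batch_size=1024, rollout_n=4,
                     ppo_mini_batch_size=256, ppo_epochs=1))
# → {'total_trajectories': 4096, 'optimizer_steps': 4}
```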

KL Divergence Control

Without regularisation, the policy can drift far from the reference (SFT) model during RL training. verl provides two complementary mechanisms to prevent this.
A KL penalty is subtracted from the task reward at every step, keeping the policy close to the reference model throughout training. This mirrors the approach used in InstructGPT.
| Parameter | Description | Default |
| --- | --- | --- |
| algorithm.use_kl_in_reward | Enable in-reward KL penalty | False |
| algorithm.kl_penalty | KL estimator type: kl (k1), abs, mse (k2), low_var_kl (k3), full | |
| algorithm.kl_ctrl.kl_coef | Initial KL penalty coefficient | 0.001 |
| algorithm.kl_ctrl.type | Controller type: fixed or adaptive | |
| algorithm.kl_ctrl.horizon | Horizon for the adaptive controller | |
| algorithm.kl_ctrl.target_kl | Target KL for the adaptive controller | |
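To show how the coefficient, horizon, and target KL might interact, the sketch below implements the proportional controller from Ziegler et al. (2019), "Fine-Tuning Language Models from Human Preferences". Whether verl's adaptive controller matches it exactly is an assumption; the class name and the numbers in the example are hypothetical.

```python
class AdaptiveKLController:
    """Proportional controller that nudges the KL coefficient toward a target."""

    def __init__(self, init_kl_coef, target_kl, horizon):
        self.value = init_kl_coef
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, current_kl, n_steps):
        # Relative error, clipped so one noisy batch cannot swing the coefficient.
        error = max(-0.2, min(0.2, current_kl / self.target_kl - 1.0))
        # A larger horizon means slower adaptation.
        self.value *= 1.0 + error * n_steps / self.horizon
        return self.value


ctrl = AdaptiveKLController(init_kl_coef=0.001, target_kl=0.1, horizon=10000)
print(ctrl.update(current_kl=0.2, n_steps=1000))   # KL above target: coefficient grows
print(ctrl.update(current_kl=0.05, n_steps=1000))  # KL below target: coefficient shrinks
```

A fixed controller, by contrast, would simply keep the coefficient at its initial value for the whole run.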
For a detailed analysis of the different KL approximation methods (k1, k2, k3), see Approximating KL Divergence by John Schulman.
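The k1, k2, and k3 estimators can be written out for a single sampled token as follows. This is a sketch based on the formulas in Schulman's note, with the sample assumed to come from the current policy; the function name `kl_estimates` is hypothetical, and verl may add details such as clamping that are not shown here. The full option is omitted because it needs the whole token distribution rather than one log-probability.

```python
import math


def kl_estimates(logprob, ref_logprob):
    """Per-token estimates of KL(policy || reference) from one sampled token.

    logprob is the current policy's log-probability of the sampled token;
    ref_logprob is the reference model's log-probability of the same token.
    """
    log_ratio = logprob - ref_logprob  # log(pi / pi_ref)
    return {
        "k1": log_ratio,                               # unbiased, may be negative
        "abs": abs(log_ratio),                         # absolute value of k1
        "k2": 0.5 * log_ratio ** 2,                    # biased, low variance
        "k3": math.exp(-log_ratio) - 1.0 + log_ratio,  # unbiased, never negative
    }


print(kl_estimates(logprob=-1.0, ref_logprob=-2.0))
```

Because k3 is always non-negative for each individual sample, it tends to give a smoother penalty signal than k1 while remaining unbiased.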

Running PPO

Minimal Example

The following command fine-tunes Qwen2.5-0.5B-Instruct on GSM8K with PPO using GAE advantages:
python3 -m verl.trainer.main_ppo \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    data.train_batch_size=1024 \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
    critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
    algorithm.adv_estimator=gae \
    trainer.total_epochs=15

Using the Canonical Script

For a fully-featured run with sensible defaults for Qwen3-8B on GSM8K + MATH:
bash examples/ppo_trainer/run_qwen3_8b_fsdp.sh
The script exposes the most commonly tuned knobs as environment variables:
MODEL_PATH=Qwen/Qwen3-8B \
TRAIN_BATCH_SIZE=1024 \
PPO_MINI_BATCH_SIZE=256 \
TOTAL_EPOCHS=15 \
bash examples/ppo_trainer/run_qwen3_8b_fsdp.sh

Megatron-LM Backend

For large models that require tensor parallelism, use the Megatron-LM training script:
bash examples/ppo_trainer/run_qwen3_8b_megatron.sh

Advanced Options

FSDP2 Training Strategy

Switch to the FSDP2 sharding strategy by setting:
actor_rollout_ref.actor.strategy=fsdp2
FSDP2 provides improved memory efficiency and supports per-parameter sharding, which can be beneficial for very large models.

CPU Offload for Parameters and Optimiser States

When GPU memory is constrained, you can offload parameters and optimiser states to CPU:
actor_rollout_ref.actor.fsdp_config.param_offload=True \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True

Dual-Clip PPO

Standard PPO clips the probability ratio to [1 - ε, 1 + ε], but when the advantage is negative the clipped objective still lets the loss grow without bound as the ratio becomes large. Dual-Clip PPO caps the loss at clip_ratio_c × |advantage| in the negative-advantage case, so a single sample with a very large ratio cannot produce an oversized update.
actor_rollout_ref.actor.clip_ratio_c=3.0   # Dual-Clip constant c, default 3.0
The objective becomes:
pg_losses1 = -advantages * ratio
pg_losses2 = -advantages * clamp(ratio, 1 - ε, 1 + ε)
loss       = max(pg_losses1, pg_losses2)                                   # when adv >= 0
loss       = min(max(pg_losses1, pg_losses2), -clip_ratio_c * advantages)  # when adv < 0
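As a runnable illustration of the Dual-Clip formulation, the sketch below computes the per-token loss in plain Python: the usual clipped PPO loss, with an extra cap of clip_ratio_c × |advantage| applied only when the advantage is negative. The function name `dual_clip_pg_loss` is hypothetical, and a real implementation would operate on tensors and average over tokens.

```python
def dual_clip_pg_loss(advantage, ratio, eps=0.2, clip_ratio_c=3.0):
    """Per-token policy-gradient loss with PPO clipping and the Dual-Clip cap."""
    pg_loss_unclipped = -advantage * ratio
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    pg_loss_clipped = -advantage * clipped_ratio
    # Standard PPO: take the more pessimistic (larger) of the two losses.
    loss = max(pg_loss_unclipped, pg_loss_clipped)
    if advantage < 0:
        # Dual-Clip: bound the loss so a huge ratio cannot dominate the update.
        loss = min(loss, -clip_ratio_c * advantage)
    return loss


print(dual_clip_pg_loss(advantage=-1.0, ratio=10.0))  # capped by clip_ratio_c
print(dual_clip_pg_loss(advantage=-1.0, ratio=2.0))   # below the cap, unchanged
print(dual_clip_pg_loss(advantage=1.0, ratio=2.0))    # standard clipping applies
```

In the first call the uncapped loss would be 10.0; the cap reduces it to 3.0, and because the capped value does not depend on the ratio, that token contributes no gradient.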

Entropy Regularisation

A small entropy bonus encourages exploration and can prevent premature convergence:
actor_rollout_ref.actor.entropy_coeff=0.01
The default value of actor_rollout_ref.actor.entropy_coeff was changed from a non-zero value to 0.0 in verl 0.3.x (2025-05-30). If you are comparing with results from older checkpoints, verify the entropy coefficient used.
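The effect of the coefficient can be illustrated with a small sketch that computes the entropy of one token distribution and subtracts the weighted bonus from the policy loss. The function names are hypothetical, and combining the terms as `pg_loss - entropy_coeff * entropy` is the conventional form of an entropy bonus, assumed here rather than taken from verl's source.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def loss_with_entropy_bonus(pg_loss, probs, entropy_coeff):
    """Subtract the weighted entropy so that higher entropy lowers the loss."""
    return pg_loss - entropy_coeff * categorical_entropy(probs)


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [1.0, 0.0, 0.0, 0.0]
print(categorical_entropy(uniform), categorical_entropy(peaked))
print(loss_with_entropy_bonus(1.0, uniform, entropy_coeff=0.01))
```

With `entropy_coeff=0.0` the loss reduces to the policy-gradient term alone, which matches the current default described above.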

Reference Performance

The table below shows validated GSM8K test scores using verl v0.2.
| Model | Method | GSM8K Score | Notes |
| --- | --- | --- | --- |
| Qwen/Qwen2.5-0.5B-Instruct | Pretrained | 36.4 | Baseline |
| Qwen/Qwen2.5-0.5B-Instruct | PPO | 56.7 | Training log |
| deepseek-ai/deepseek-llm-7b-chat | PPO (Megatron) | 69.5 | Training log |
| deepseek-ai/deepseek-llm-7b-chat | PPO (AMD MI300) | 70.5 | Training log |
For comprehensive baselines across models, datasets, and algorithms, see the Baselines page.
