verl is not limited to PPO and GRPO. The framework ships supervised fine-tuning, several critic-free RL variants, and community-contributed self-play recipes — all built on the same infrastructure and configurable through the same Hydra config system. This page covers each algorithm, its key configuration, and how to run it.

Algorithm Comparison

The table below gives a quick overview of every algorithm available in verl to help you choose the right one for your task.
| Algorithm | Needs Critic | Sample Efficiency | Memory Use | Typical Use Case |
| --- | --- | --- | --- | --- |
| SFT | No | High (supervised) | Low | Warm-up before RL, instruction tuning |
| PPO | Yes | High | High (actor + critic) | Stable RL fine-tuning with learned value function |
| GRPO | No | Medium | Medium | Math / code RL without critic overhead |
| DAPO | No | High | Medium | SOTA open RL, long-CoT reasoning |
| RLOO | No | Medium | Low | Simple critic-free baseline |
| ReMax | No | Medium | Low | Variance reduction without critic |
| REINFORCE++ | No | Medium | Low | Improved REINFORCE with variance reduction |
| SPIN | No | Medium | Medium | Self-play alignment, no human preference data |
| SPPO | No | Medium | Medium | Nash equilibrium alignment via self-play |

Supervised Fine-Tuning (SFT)

Supervised fine-tuning in verl uses a dedicated sft_trainer that is launched with torchrun for SPMD (Single-Program Multiple-Data) distributed training. SFT is commonly used as a warm-up step before RL training, or on its own for instruction following and domain adaptation.

Entry point: verl.trainer.sft_trainer

Key configuration parameters:
| Parameter | Description |
| --- | --- |
| data.train_files | Path(s) to training parquet files |
| data.val_files | Path(s) to validation parquet files |
| data.prompt_key | Column name for the prompt in the parquet file |
| data.response_key | Column name for the response in the parquet file |
| model.partial_pretrain | HuggingFace model path or local checkpoint |
| trainer.total_epochs | Number of training epochs |
Example:
torchrun --standalone --nnodes=1 --nproc_per_node=8 \
    -m verl.trainer.sft_trainer \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    data.prompt_key=prompt \
    data.response_key=response \
    model.partial_pretrain=Qwen/Qwen2.5-0.5B-Instruct \
    trainer.total_epochs=5
Reference performance (GSM8K):
| Model | Method | Score |
| --- | --- | --- |
| google/gemma-2-2b-it | HF checkpoint | 23.9 |
| google/gemma-2-2b-it | SFT | 52.06 |
| google/gemma-2-2b-it | SFT + PPO | 64.02 |

DAPO

DAPO is a state-of-the-art open-source RL algorithm developed by ByteDance and Tsinghua University. Applying DAPO to the Qwen2.5-32B base model achieves 50% accuracy on AIME 2024, outperforming DeepSeek-R1-Zero-Qwen-32B with 50% fewer training steps. verl is the reference framework for DAPO.

DAPO makes four key contributions on top of GRPO:
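  1. Decoupled clip ratios (Clip-Higher): The lower and upper clipping bounds on the importance ratio are set separately (clip_ratio_low and clip_ratio_high). The lower bound limits updates on negative-advantage tokens and the larger upper bound leaves more room to raise the probability of positive-advantage tokens.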
  2. Dynamic sampling (group filtering): Groups where all responses succeed or all fail are discarded and resampled, ensuring every gradient step trains on informative signal.
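  3. Token-level loss (flexible aggregation): The loss is averaged over all response tokens in the batch (token-mean) rather than averaged per sequence first, so long responses contribute in proportion to their token count.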
  4. Overlong reward shaping: A linear penalty is applied to outputs that exceed a configurable soft length limit, encouraging concise reasoning.
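Items 1 and 3 above can be illustrated with a small pure-Python sketch. It is not verl's implementation: the function names are hypothetical, the default bounds simply mirror the clip_ratio_low and clip_ratio_high values used in this page's configuration, and real training code works on batched tensors with response masks.

```python
import math


def clipped_token_objective(logp_new, logp_old, advantage,
                            clip_low=0.2, clip_high=0.28):
    """PPO-style clipped surrogate for one token with asymmetric bounds."""
    ratio = math.exp(logp_new - logp_old)
    # Clip-Higher: the upper bound (1 + clip_high) is looser than the lower one.
    clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    return min(ratio * advantage, clipped * advantage)


def token_mean_loss(batch, clip_low=0.2, clip_high=0.28):
    """Average the negated objective over every token in the batch.

    `batch` is a list of sequences; each sequence is a list of
    (logp_new, logp_old, advantage) tuples, one per response token.
    """
    terms = [
        -clipped_token_objective(n, o, a, clip_low, clip_high)
        for seq in batch
        for (n, o, a) in seq
    ]
    return sum(terms) / len(terms)


def seq_mean_loss(batch, clip_low=0.2, clip_high=0.28):
    """Average per sequence first, then across sequences (for contrast)."""
    per_seq = []
    for seq in batch:
        terms = [-clipped_token_objective(n, o, a, clip_low, clip_high)
                 for (n, o, a) in seq]
        per_seq.append(sum(terms) / len(terms))
    return sum(per_seq) / len(per_seq)


if __name__ == "__main__":
    long_good = [(0.0, 0.0, 1.0)] * 4   # 4 tokens, positive advantage
    short_bad = [(0.0, 0.0, -1.0)]      # 1 token, negative advantage
    print(token_mean_loss([long_good, short_bad]))
    print(seq_mean_loss([long_good, short_bad]))
```

With token-mean aggregation the four-token response outweighs the one-token response; with per-sequence averaging the two cancel out.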
Quickstart:
# Step 1 — prepare data on the Ray cluster
bash prepare_dapo_data.sh   # downloads to ${HOME}/verl/data by default

# Step 2 — submit job from any machine
cd verl
export RAY_ADDRESS="http://${RAY_IP:-localhost}:8265"
export WORKING_DIR="${PWD}"
export RUNTIME_ENV="./recipe/dapo/runtime_env.yaml"
bash recipe/dapo/run_dapo_qwen2.5_32b.sh
Key configuration snippets:
# Decoupled clip ratios
actor_rollout_ref:
  actor:
    clip_ratio_low: 0.2
    clip_ratio_high: 0.28

# Dynamic sampling
data:
  gen_batch_size: 1536
  train_batch_size: 512
algorithm:
  filter_groups:
    enable: True
    metric: acc          # filter on accuracy
    max_num_gen_batches: 10

# Overlong reward shaping
data:
  max_response_length: 20480   # 16384 + 4096 buffer
reward_model:
  overlong_buffer:
    enable: True
    len: 4096
    penalty_factor: 1.0
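The dynamic-sampling and overlong-shaping settings above can be read as the following sketch. It is a simplified illustration under stated assumptions, not verl's code: the helper names are hypothetical, groups are assumed to be filtered when every response has the same metric value, and the penalty is assumed to grow linearly from 0 at the soft limit (max_response_length minus the buffer) to -penalty_factor at max_response_length.

```python
def keep_group(accuracies):
    """Keep a prompt group only if its responses disagree on the metric."""
    return len(set(accuracies)) > 1


def filter_groups(groups):
    """Drop groups where all responses succeed or all fail.

    `groups` maps a prompt id to the list of per-response accuracy values.
    """
    return {pid: accs for pid, accs in groups.items() if keep_group(accs)}


def overlong_penalty(response_len, max_response_length=20480,
                     buffer_len=4096, penalty_factor=1.0):
    """Linear length penalty inside the overlong buffer (assumed form)."""
    soft_limit = max_response_length - buffer_len   # 16384 with the defaults
    exceed = response_len - soft_limit
    # Non-positive by construction: 0 up to the soft limit, then linear.
    return min(-exceed / buffer_len * penalty_factor, 0.0)


if __name__ == "__main__":
    groups = {"p1": [1, 1, 1], "p2": [0, 1, 0], "p3": [0, 0, 0]}
    print(filter_groups(groups))      # only the mixed group survives
    print(overlong_penalty(18432))    # halfway through the buffer
```

In this reading, filtered prompts are replaced by sampling further generation batches, which is why gen_batch_size is larger than train_batch_size and max_num_gen_batches caps the number of attempts.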
Reproduction results (AIME 2024):
| Setup | AIME 2024 Acc. | Hardware |
| --- | --- | --- |
| DAPO | 52% | 16×8×H800 |
| DAPO w/o Dynamic Sampling | 50% | 16×8×H800 |
| DAPO w/o Token-level Loss & Dynamic Sampling | 44% | 16×8×H20 |
The recipe/dapo/ directory in the main branch is the actively maintained version that tracks new verl features. The recipe/dapo branch is frozen for as-is reproduction of the original paper results.

RLOO (REINFORCE Leave-One-Out)

RLOO (REINFORCE Leave-One-Out) is a critic-free algorithm that computes a per-sample baseline by averaging the rewards of all other responses in the same group, rather than training a dedicated value network. The baseline is a leave-one-out (jackknife-style) estimate of the group mean reward; it typically yields lower variance than plain REINFORCE while remaining simpler than PPO.

Enable RLOO by setting:
algorithm.adv_estimator=rloo
Because RLOO is critic-free, you do not need to provide critic.model.path or any critic configuration.

Example:
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=rloo \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    data.train_batch_size=512 \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
    actor_rollout_ref.rollout.n=8 \
    trainer.total_epochs=15
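The leave-one-out baseline described in this section fits in a few lines of Python. This is a minimal sketch for a single prompt group with scalar rewards; the function name is hypothetical and verl's estimator operates on batched tensors.

```python
def rloo_advantages(rewards):
    """Advantage of each response = its reward minus the mean of the others."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two responses per prompt")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


if __name__ == "__main__":
    # Four rollouts for one prompt, two correct and two incorrect.
    print(rloo_advantages([1.0, 0.0, 0.0, 1.0]))
```

The group size corresponds to actor_rollout_ref.rollout.n in the command above; a group of one has no "other" responses, so n must be at least 2.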
Reference scripts: examples/rloo_trainer/

Reference performance (GSM8K):
| Model | Method | Score |
| --- | --- | --- |
| Mixtral-8x22B-Instruct-v0.1 | RLOO (Megatron) | 92.3 |

ReMax

ReMax replaces the critic with a greedy rollout baseline: for each prompt, one deterministic (greedy) response is generated alongside the stochastic rollouts. The greedy response reward serves as the per-prompt baseline, reducing variance without requiring a trained value function.

Enable ReMax by setting:
algorithm.adv_estimator=remax
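As a rough illustration of the greedy baseline, the sketch below subtracts the greedy response's reward from each sampled response's reward for one prompt. The function name is hypothetical and the sketch ignores token-level details.

```python
def remax_advantages(sampled_rewards, greedy_reward):
    """Subtract the greedy-decoding reward from every sampled reward."""
    return [r - greedy_reward for r in sampled_rewards]


if __name__ == "__main__":
    # The greedy answer was correct (reward 1.0); one sample was wrong.
    print(remax_advantages([1.0, 0.0, 1.0], greedy_reward=1.0))
```

The cost of this baseline is one extra generation per prompt.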
Reference scripts: examples/remax_trainer/

Reference performance (GSM8K):
| Model | Method | Score |
| --- | --- | --- |
| Qwen/Qwen2.5-7B-Instruct | ReMax | 97 |

REINFORCE++

REINFORCE++ improves on plain REINFORCE by applying several variance-reduction techniques, including token-level reward normalisation and a whitened advantage estimate, without introducing a critic model. It is a strong critic-free baseline when you want something between plain REINFORCE and full PPO.

Enable REINFORCE++ by setting:
algorithm.adv_estimator=reinforce_plus_plus
A baseline variant that subtracts a running mean of rewards is also available:
algorithm.adv_estimator=reinforce_plus_plus_baseline
Both estimators are compatible with the standard verl.trainer.main_ppo entry point and accept all the same actor configuration parameters as PPO.
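The whitening step can be sketched as follows. This is a simplified illustration, not verl's implementation: it assumes per-token returns-to-go are computed with a discount gamma and then normalised to zero mean and unit variance across all tokens in the batch, and all names are hypothetical.

```python
import math


def whiten(values, eps=1e-8):
    """Shift to zero mean and scale to (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def reinforce_pp_advantages(token_rewards, gamma=1.0):
    """Per-token returns-to-go, whitened across the whole batch.

    `token_rewards` is a list of sequences of per-token rewards.
    """
    returns = []
    for seq in token_rewards:
        running, seq_returns = 0.0, []
        for r in reversed(seq):
            running = r + gamma * running
            seq_returns.append(running)
        seq_returns.reverse()
        returns.append(seq_returns)

    flat = whiten([g for seq in returns for g in seq])

    # Restore the original per-sequence shape.
    out, start = [], 0
    for seq in returns:
        out.append(flat[start:start + len(seq)])
        start += len(seq)
    return out


if __name__ == "__main__":
    # Outcome reward placed on the last token of each response.
    print(reinforce_pp_advantages([[0.0, 0.0, 1.0], [0.0, 0.0]]))
```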

SPIN (Self-Play Fine-Tuning)

SPIN (Self-Play Fine-Tuning) enables iterative self-improvement through an online DPO training loop, without requiring external preference datasets or a stronger teacher model.

Core idea: In each iteration, the current model generates responses and the training objective is to distinguish the current model’s outputs from the previous iteration’s outputs. This is a two-player game in which the LLM plays both roles.

Key implementation details in verl:
  • No critic model is used.
  • An explicit reference policy (ref_policy_wg) provides the DPO baseline, with weights updated from the actor at a configurable frequency (trainer.ref_update_freq).
  • Online preference pairs are generated dynamically using rule-based reward ranking (e.g., selecting the better answer for math problems).
  • The DPO loss (compute_online_dpo_loss) replaces the PPO surrogate in the actor update.
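The pair construction and loss in the list above can be sketched in plain Python. This uses the standard sigmoid DPO loss on sequence log-probabilities; it is not the body of compute_online_dpo_loss, whose signature is not shown on this page, and the helper names and the beta default are assumptions for illustration.

```python
import math


def make_pair(responses, rewards):
    """Pick (chosen, rejected) by rule-based reward; None if all rewards tie."""
    best = max(range(len(rewards)), key=rewards.__getitem__)
    worst = min(range(len(rewards)), key=rewards.__getitem__)
    if rewards[best] == rewards[worst]:
        return None
    return responses[best], responses[worst]


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss: -log(sigmoid(beta * (policy margin - reference margin)))."""
    margin = beta * ((pi_chosen_logp - ref_chosen_logp)
                     - (pi_rejected_logp - ref_rejected_logp))
    return softplus(-margin)


if __name__ == "__main__":
    pair = make_pair(["wrong answer", "right answer"], [0.0, 1.0])
    print(pair)
    print(dpo_loss(-0.5, -2.0, -1.0, -1.0))
```

When the policy and reference assign the same log-probabilities, the loss equals log 2; it falls below that as the policy prefers the chosen response more strongly than the reference does.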
Running SPIN:
# Prepare data and model
python3 examples/data_preprocess/gsm8k.py --local_save_dir ~/data/gsm8k
hf download Qwen/Qwen2.5-3B-Instruct --local-dir $HOME/models/Qwen2.5-3B-Instruct

# Launch training
export CUDA_VISIBLE_DEVICES=0,1,2,3
bash recipe/spin/run_spin.sh
Key configuration parameters:
| Parameter | Description |
| --- | --- |
| algorithm.dpo_beta | DPO regularisation coefficient |
| algorithm.dpo_loss_type | DPO loss variant |
| trainer.ref_update_freq | How often (in steps) the reference model weights are updated from the actor. Set to 0 to disable. |
Reference performance (GSM8K):
| Model | Method | Score |
| --- | --- | --- |
| Qwen/Qwen2.5-7B-Instruct | SPIN | 92 |
Recipe location: recipe/spin/

SPPO (Self-Play Preference Optimization)

SPPO (Self-Play Preference Optimization) frames LLM alignment as finding the Nash equilibrium of a two-player game, allowing the model to improve without strong external signals such as GPT-4 responses. SPPO is theoretically grounded, converges to the von Neumann winner under general (potentially intransitive) preference relations, and empirically outperforms iterative DPO on several benchmarks.

Running SPPO:
# Install verl
python3 -m uv pip install -e ".[sglang]"

# Prepare data and model
python3 examples/data_preprocess/math_dataset.py --local_dir ~/data/math
hf download Qwen/Qwen2.5-7B-Instruct --local-dir $HOME/models/Qwen2.5-7B-Instruct

# Launch training
export CUDA_VISIBLE_DEVICES=0,1,2,3
bash recipe/sppo/run_qwen2.5-7b_rm.sh
Reference performance (MATH dataset):
| Model | Method | MATH Score |
| --- | --- | --- |
| Qwen/Qwen2.5-7B-Instruct | HF checkpoint | 46.6 |
| Qwen/Qwen2.5-7B-Instruct | SPPO (20 epochs) | 65.6 |
Recipe location: recipe/sppo/
verl’s internal evaluation metrics may not perfectly align with the official Qwen2.5-7B-Instruct evaluation methodology. The scores above are reported under verl’s evaluation framework for consistency.
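For intuition about the SPPO update itself, the sketch below follows the squared-error objective from the SPPO paper: the log-ratio between the new and previous policy for a response is regressed toward eta times (estimated win probability minus 1/2). The function name, argument names, and the eta default are assumptions for illustration, and this is not taken from the verl recipe.

```python
def sppo_loss(logp_new, logp_prev, win_prob, eta=1.0):
    """Squared distance between the policy log-ratio and its SPPO target.

    logp_new / logp_prev: sequence log-probabilities of the same response
    under the policy being trained and the previous-iteration policy.
    win_prob: estimated probability that the response beats the previous
    policy's responses (between 0 and 1).
    """
    target = eta * (win_prob - 0.5)
    return (logp_new - logp_prev - target) ** 2


if __name__ == "__main__":
    # A response that wins every comparison should gain probability mass.
    print(sppo_loss(logp_new=-1.0, logp_prev=-1.0, win_prob=1.0, eta=2.0))
```

A response that wins exactly half the time has a target of zero, so the loss leaves its probability unchanged.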

Entropy Mechanism (Clip-Cov / KL-Cov)

Policy entropy drops sharply during RL training, causing overconfidence and performance saturation, a phenomenon known as entropy collapse. The Entropy Mechanism paper (arXiv:2505.22617) establishes an empirical relationship between entropy H and task performance R: R = −a·exp(H) + b, showing that performance is directly bottlenecked by entropy exhaustion.

Two strategies are proposed to mitigate this:
  • Clip-Cov: Restricts gradient updates for tokens with high covariance between action probability and logit updates.
  • KL-Cov: Applies a KL penalty weighted by the same covariance term.
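To make the relationship R = −a·exp(H) + b concrete, the sketch below evaluates it for a few entropy values. The coefficients are hypothetical placeholders chosen only for illustration; in practice a and b are fitted per model and dataset.

```python
import math


def predicted_performance(entropy, a, b):
    """Evaluate R = -a * exp(H) + b for a given policy entropy H."""
    return -a * math.exp(entropy) + b


def performance_ceiling(a, b):
    """Performance predicted when entropy is fully exhausted (H = 0)."""
    return predicted_performance(0.0, a, b)


if __name__ == "__main__":
    a, b = 0.2, 0.7   # hypothetical fitted coefficients
    for h in (1.0, 0.5, 0.1, 0.0):
        print(h, round(predicted_performance(h, a, b), 3))
    print("ceiling:", round(performance_ceiling(a, b), 3))
```

Under this fit, performance rises only as entropy is spent and is capped at b − a once H reaches zero, which is the motivation for slowing the entropy decay.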
Both methods are implemented as extensions of GRPO and can be activated via the recipe/dapo/ scripts:
# KL-Cov on Qwen2.5-7B (single node)
bash recipe/dapo/7b_kl_cov.sh

# KL-Cov on Qwen2.5-32B (multi-node)
bash recipe/dapo/32b_kl_cov.sh
Benchmark results (Qwen2.5-7B):
| Method | AIME24 | AIME25 | MATH-500 | Avg |
| --- | --- | --- | --- | --- |
| GRPO | 21.2 | 9.6 | 78.8 | 38.6 |
| + Clip-Cov | 22.1 | 15.8 | 80.4 | 40.4 |
| + KL-Cov | 22.6 | 12.9 | 80.8 | 40.6 |