verl is not limited to PPO and GRPO. The framework ships supervised fine-tuning, several critic-free RL variants, and community-contributed self-play recipes, all built on the same infrastructure and configurable through the same Hydra config system. This page covers each algorithm, its key configuration, and how to run it.
Algorithm Comparison
The table below gives a quick overview of every algorithm available in verl to help you choose the right one for your task.

| Algorithm | Needs Critic | Sample Efficiency | Memory Use | Typical Use Case |
|---|---|---|---|---|
| SFT | No | High (supervised) | Low | Warm-up before RL, instruction tuning |
| PPO | Yes | High | High (actor + critic) | Stable RL fine-tuning with learned value function |
| GRPO | No | Medium | Medium | Math / code RL without critic overhead |
| DAPO | No | High | Medium | SOTA open RL, long-CoT reasoning |
| RLOO | No | Medium | Low | Simple critic-free baseline |
| ReMax | No | Medium | Low | Variance reduction without critic |
| REINFORCE++ | No | Medium | Low | Improved REINFORCE with variance reduction |
| SPIN | No | Medium | Medium | Self-play alignment, no human preference data |
| SPPO | No | Medium | Medium | Nash equilibrium alignment via self-play |
Supervised Fine-Tuning (SFT)
SFT: Warm-Up and Instruction Tuning
Supervised fine-tuning in verl uses a dedicated `sft_trainer` that is launched with `torchrun` for SPMD (Single-Program Multiple-Data) distributed training. SFT is commonly used as a warm-up step before RL training, or on its own for instruction following and domain adaptation.

Entry point: `verl.trainer.sft_trainer`

Key configuration parameters:

| Parameter | Description |
|---|---|
| `data.train_files` | Path(s) to training parquet files |
| `data.val_files` | Path(s) to validation parquet files |
| `data.prompt_key` | Column name for the prompt in the parquet file |
| `data.response_key` | Column name for the response in the parquet file |
| `model.partial_pretrain` | HuggingFace model path or local checkpoint |
| `trainer.total_epochs` | Number of training epochs |

Reference performance (GSM8K):

| Model | Method | Score |
|---|---|---|
| google/gemma-2-2b-it | HF checkpoint | 23.9 |
| google/gemma-2-2b-it | SFT | 52.06 |
| google/gemma-2-2b-it | SFT + PPO | 64.02 |
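
A launch command for the SFT trainer described above can be assembled as in the sketch below. It is not an official script: the `torchrun` flags are standard PyTorch launcher flags, the override keys are the ones from the configuration table, and every path, column name, epoch count, and GPU count is a placeholder assumption. The helper `build_sft_command` is hypothetical and only builds the argument list; it does not start training.

```python
import shlex


def build_sft_command(train_files, val_files, model_path,
                      prompt_key, response_key,
                      total_epochs=1, nproc_per_node=8):
    """Assemble a single-node torchrun command for verl.trainer.sft_trainer."""
    # Override keys come from the configuration table; the values are
    # caller-supplied placeholders.
    overrides = {
        "data.train_files": train_files,
        "data.val_files": val_files,
        "data.prompt_key": prompt_key,
        "data.response_key": response_key,
        "model.partial_pretrain": model_path,
        "trainer.total_epochs": total_epochs,
    }
    command = [
        "torchrun",
        "--standalone",                        # single-node rendezvous
        "--nnodes=1",
        f"--nproc_per_node={nproc_per_node}",  # one process per GPU
        "-m", "verl.trainer.sft_trainer",
    ]
    command += [f"{key}={value}" for key, value in overrides.items()]
    return command


# Hypothetical paths and column names, for illustration only.
command = build_sft_command(
    train_files="data/gsm8k/train.parquet",
    val_files="data/gsm8k/test.parquet",
    model_path="google/gemma-2-2b-it",
    prompt_key="question",
    response_key="answer",
    total_epochs=3,
)
print(shlex.join(command))
```

Keeping the overrides in a dictionary makes it easy to diff two runs or to log the exact configuration next to the resulting checkpoint.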
DAPO
DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization
DAPO is a state-of-the-art open-source RL algorithm developed by ByteDance and Tsinghua University. Applying DAPO to the Qwen2.5-32B base model achieves 50% accuracy on AIME 2024, outperforming DeepSeek-R1-Zero-Qwen-32B with 50% fewer training steps. verl is the reference framework for DAPO.

DAPO makes four key contributions on top of GRPO:

- Decoupled clip ratios (Clip-Higher): Separate ε values for positive and negative advantages, allowing more aggressive updates when the policy improves.
- Dynamic sampling (group filtering): Groups where all responses succeed or all fail are discarded and resampled, ensuring every gradient step trains on informative signal.
- Token-level loss (flexible aggregation): Uses `token-mean` loss aggregation rather than sequence-level averaging.
- Overlong reward shaping: A linear penalty is applied to outputs that exceed a configurable soft length limit, encouraging concise reasoning.

Reproduction results (AIME 2024):
| Setup | AIME 2024 Acc. | Hardware |
|---|---|---|
| DAPO | 52% | 16×8×H800 |
| DAPO w/o Dynamic Sampling | 50% | 16×8×H800 |
| DAPO w/o Token-level Loss & Dynamic Sampling | 44% | 16×8×H20 |
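
The four techniques listed above are each small enough to sketch in a few lines. The code below is a minimal illustration in plain Python, not verl's implementation: all function names are hypothetical, the clip values 0.2 and 0.28 and the penalty shape are illustrative assumptions, and real training code operates on batched tensors.

```python
import math


def clip_higher_token_loss(log_ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Clipped surrogate loss for one token with separate lower/upper bounds."""
    ratio = math.exp(log_ratio)  # pi_new(token) / pi_old(token)
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic (minimum) objective, negated to obtain a loss.
    return -min(ratio * advantage, clipped * advantage)


def keep_group(rewards):
    """Dynamic sampling: keep a prompt's group only if its rewards differ."""
    return max(rewards) > min(rewards)


def token_mean(per_sequence_token_losses):
    """Average over all tokens in the batch, not over per-sequence means."""
    flat = [loss for seq in per_sequence_token_losses for loss in seq]
    return sum(flat) / len(flat)


def overlong_penalty(length, max_len, buffer_len, penalty_factor=1.0):
    """Linear penalty once a response passes the soft limit max_len - buffer_len."""
    soft_limit = max_len - buffer_len
    if length <= soft_limit:
        return 0.0
    overflow = min(length - soft_limit, buffer_len)  # cap at the full buffer
    return -penalty_factor * overflow / buffer_len


# A group where every response succeeded carries no signal and is dropped.
groups = {"prompt_a": [1.0, 1.0, 1.0], "prompt_b": [1.0, 0.0, 1.0]}
kept = [name for name, rewards in groups.items() if keep_group(rewards)]
print(kept)
```

With `token_mean`, a batch containing one three-token sequence and one single-token sequence is averaged over four tokens, so long responses contribute in proportion to their length instead of being down-weighted by a per-sequence mean.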

The `recipe/dapo/` directory in the main branch is the actively maintained version that tracks new verl features. The `recipe/dapo` branch is frozen for as-is reproduction of the original paper results.

RLOO (REINFORCE Leave-One-Out)
RLOO: Critic-Free Baseline via Leave-One-Out
RLOO (REINFORCE Leave-One-Out) is a critic-free algorithm that computes a per-sample baseline by averaging the rewards of all other responses in the same group, rather than training a dedicated value network. This is mathematically equivalent to a jackknife estimate of the group mean and typically achieves lower variance than plain REINFORCE while remaining simpler than PPO.

Enable RLOO by setting `algorithm.adv_estimator=rloo`. Because RLOO is critic-free, you do not need to provide `critic.model.path` or any critic configuration.

Reference scripts: `examples/rloo_trainer/`

Reference performance (GSM8K):

| Model | Method | Score |
|---|---|---|
| Mixtral-8x22B-Instruct-v0.1 | RLOO (Megatron) | 92.3 |
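
The leave-one-out baseline described above reduces to one line of arithmetic per response. The sketch below is a plain-Python illustration for a single prompt's group; the function name `rloo_advantages` is hypothetical and this is not verl's batched implementation.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for one prompt's group of sampled responses."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two responses per prompt")
    total = sum(rewards)
    # Baseline for response i is the mean reward of the other k - 1 responses.
    return [r - (total - r) / (k - 1) for r in rewards]


# Four responses to the same prompt, scored 1.0 (correct) or 0.0 (wrong).
advantages = rloo_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Because each response is excluded from its own baseline, the baseline does not depend on the response it is applied to, which is what lets it reduce variance without a learned critic.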
ReMax
ReMax: Greedy Rollout as Variance-Reduction Baseline
ReMax replaces the critic with a greedy rollout baseline: for each prompt, one deterministic (greedy) response is generated alongside the stochastic rollouts. The greedy response's reward serves as the per-prompt baseline, reducing variance without requiring a trained value function.

Enable ReMax by setting `algorithm.adv_estimator=remax`.

Reference scripts: `examples/remax_trainer/`

Reference performance (GSM8K):

| Model | Method | Score |
|---|---|---|
| Qwen/Qwen2.5-7B-Instruct | ReMax | 97 |
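
The greedy baseline described above can be illustrated for a single prompt as follows. This is a minimal sketch with a hypothetical function name and made-up rewards, not verl's implementation.

```python
def remax_advantages(sampled_rewards, greedy_reward):
    """Subtract the greedy rollout's reward from each sampled rollout's reward."""
    return [reward - greedy_reward for reward in sampled_rewards]


# One prompt: four stochastic rollouts plus one greedy rollout that scored 1.0.
sampled = [1.0, 0.0, 1.0, 0.0]
greedy = 1.0
advantages = remax_advantages(sampled, greedy)
print(advantages)
```

A sampled response only receives a positive advantage when it beats what the policy would have produced deterministically, at the cost of one extra generation per prompt.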
REINFORCE++
REINFORCE++: Improved REINFORCE with Variance Reduction
REINFORCE++ improves on plain REINFORCE by applying several variance-reduction techniques, including token-level reward normalisation and a whitened advantage estimate, without introducing a critic model. It is a strong critic-free baseline when you want something between plain REINFORCE and full PPO.

Enable REINFORCE++ by setting `algorithm.adv_estimator=reinforce_plus_plus`. A baseline variant that subtracts a running mean of rewards is also available as `algorithm.adv_estimator=reinforce_plus_plus_baseline`. Both estimators are compatible with the standard `verl.trainer.main_ppo` entry point and accept all the same actor configuration parameters as PPO.

SPIN (Self-Play Fine-Tuning)
SPIN: Iterative Self-Play Without Human Preferences
SPIN (Self-Play Fine-Tuning) enables iterative self-improvement through an online DPO training loop, without requiring external preference datasets or a stronger teacher model.

Core idea: In each iteration, the current model generates responses and the training objective is to distinguish the current model’s outputs from the previous iteration’s outputs, a two-player game where the LLM plays both roles.

Key implementation details in verl:

- No critic model is used.
- An explicit reference policy (`ref_policy_wg`) provides the DPO baseline, with weights updated from the actor at a configurable frequency (`trainer.ref_update_freq`).
- Online preference pairs are generated dynamically using rule-based reward ranking (e.g., selecting the better answer for math problems).
- The DPO loss (`compute_online_dpo_loss`) replaces the PPO surrogate in the actor update.

Key configuration parameters:

| Parameter | Description |
|---|---|
| `algorithm.dpo_beta` | DPO regularisation coefficient |
| `algorithm.dpo_loss_type` | DPO loss variant |
| `trainer.ref_update_freq` | How often (in steps) the reference model weights are updated from the actor. Set to 0 to disable. |

Reference performance (GSM8K):

| Model | Method | Score |
|---|---|---|
| Qwen/Qwen2.5-7B-Instruct | SPIN | 92 |
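
The online DPO loop described above has three moving parts: picking a preference pair by rule-based reward, scoring it with a DPO loss, and periodically refreshing the reference policy. The sketch below illustrates each in plain Python. It uses the standard sigmoid DPO loss and is not the body of verl's `compute_online_dpo_loss`; all three function names are hypothetical, and the `step % freq == 0` refresh rule is an assumption about how `trainer.ref_update_freq` is interpreted.

```python
import math


def pick_preference_pair(responses, rewards):
    """Return (chosen, rejected) by reward, or None if all rewards tie."""
    best = max(range(len(rewards)), key=rewards.__getitem__)
    worst = min(range(len(rewards)), key=rewards.__getitem__)
    if rewards[best] == rewards[worst]:
        return None  # no preference signal in this group
    return responses[best], responses[worst]


def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss: -log(sigmoid(beta * (chosen margin - rejected margin)))."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def should_update_reference(step, ref_update_freq):
    """Assumed rule: refresh every ref_update_freq steps; 0 disables updates."""
    return ref_update_freq > 0 and step % ref_update_freq == 0


pair = pick_preference_pair(["A", "B", "C"], [0.0, 1.0, 0.0])
loss_at_start = dpo_sigmoid_loss(-2.0, -2.0, -2.0, -2.0)  # policy equals reference
print(pair, loss_at_start)
```

When the policy and the reference agree, the margin is zero and the loss equals log 2. It falls below that value as the policy raises the chosen response's log-probability relative to the rejected one.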

Recipe location: `recipe/spin/`

SPPO (Self-Play Preference Optimization)
SPPO: Nash Equilibrium Alignment via Self-Play
SPPO (Self-Play Preference Optimization) frames LLM alignment as finding the Nash equilibrium of a two-player game, allowing the model to improve without strong external signals such as GPT-4 responses. SPPO is theoretically grounded, converges to the von Neumann winner under general (potentially intransitive) preference relations, and empirically outperforms iterative DPO on several benchmarks.

Reference performance (MATH dataset):

| Model | Method | MATH Score |
|---|---|---|
| Qwen/Qwen2.5-7B-Instruct | HF checkpoint | 46.6 |
| Qwen/Qwen2.5-7B-Instruct | SPPO (20 epochs) | 65.6 |

Recipe location: `recipe/sppo/`

verl’s internal evaluation metrics may not perfectly align with the official Qwen2.5-7B-Instruct evaluation methodology. The scores above are reported under verl’s evaluation framework for consistency.
Entropy Mechanism (Clip-Cov / KL-Cov)
Entropy Mechanism: Preventing Entropy Collapse in RL
Policy entropy drops sharply during RL training, causing overconfidence and performance saturation, a phenomenon known as entropy collapse. The Entropy Mechanism paper (arXiv:2505.22617) establishes an empirical relationship between entropy H and task performance R: R = −a·exp(H) + b, showing that performance is directly bottlenecked by entropy exhaustion.

Two strategies are proposed to mitigate this:

- Clip-Cov: Restricts gradient updates for tokens with high covariance between action probability and logit updates.
- KL-Cov: Applies a KL penalty weighted by the same covariance term.

`recipe/dapo/` scripts:

Benchmark results (Qwen2.5-7B):

| Method | AIME24 | AIME25 | MATH-500 | Avg |
|---|---|---|---|---|
| GRPO | 21.2 | 9.6 | 78.8 | 38.6 |
| + Clip-Cov | 22.1 | 15.8 | 80.4 | 40.4 |
| + KL-Cov | 22.6 | 12.9 | 80.8 | 40.6 |
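
To see why the relation R = −a·exp(H) + b implies a performance ceiling, it helps to evaluate it as entropy shrinks. The sketch below uses made-up coefficients (a = 0.2, b = 0.8 are assumptions, not fitted values from the paper) and a hypothetical helper name.

```python
import math


def predicted_performance(entropy, a, b):
    """Evaluate the empirical fit R = -a * exp(H) + b."""
    return -a * math.exp(entropy) + b


a, b = 0.2, 0.8  # hypothetical coefficients, for illustration only
entropies = [1.0, 0.5, 0.1, 0.0]
curve = [predicted_performance(h, a, b) for h in entropies]
for h, r in zip(entropies, curve):
    print(f"H={h:.1f} -> R={r:.3f}")

# Once entropy is exhausted (H = 0), the fit predicts R = b - a.
ceiling = predicted_performance(0.0, a, b)
```

With a > 0, the predicted performance rises as entropy is spent and stops at b − a when H reaches zero, which is why slowing entropy decay with Clip-Cov or KL-Cov leaves more room for improvement.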