SFT, DAPO, RLOO, ReMax and Other verl Algorithms

verl is not limited to PPO and GRPO. The framework ships supervised fine-tuning, several critic-free RL variants, and community-contributed self-play recipes — all built on the same infrastructure and configurable through the same Hydra config system. This page covers each algorithm, its key configuration, and how to run it.

Algorithm Comparison

The table below gives a quick overview of every algorithm available in verl to help you choose the right one for your task.

Algorithm	Needs Critic	Sample Efficiency	Memory Use	Typical Use Case
SFT	No	High (supervised)	Low	Warm-up before RL, instruction tuning
PPO	Yes	High	High (actor + critic)	Stable RL fine-tuning with learned value function
GRPO	No	Medium	Medium	Math / code RL without critic overhead
DAPO	No	High	Medium	SOTA open RL, long-CoT reasoning
RLOO	No	Medium	Low	Simple critic-free baseline
ReMax	No	Medium	Low	Variance reduction without critic
REINFORCE++	No	Medium	Low	Improved REINFORCE with variance reduction
SPIN	No	Medium	Medium	Self-play alignment, no human preference data
SPPO	No	Medium	Medium	Nash equilibrium alignment via self-play

Supervised Fine-Tuning (SFT)

SFT: Warm-Up and Instruction Tuning

Supervised fine-tuning in verl uses a dedicated sft_trainer that is launched with torchrun for SPMD (Single-Program Multiple-Data) distributed training. SFT is commonly used as a warm-up step before RL training, or on its own for instruction following and domain adaptation.

Entry point: verl.trainer.sft_trainer

Key configuration parameters:

Parameter	Description
`data.train_files`	Path(s) to training parquet files
`data.val_files`	Path(s) to validation parquet files
`data.prompt_key`	Column name for the prompt in the parquet file
`data.response_key`	Column name for the response in the parquet file
`model.partial_pretrain`	HuggingFace model path or local checkpoint
`trainer.total_epochs`	Number of training epochs

Example:

torchrun --standalone --nnodes=1 --nproc_per_node=8 \
    -m verl.trainer.sft_trainer \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    data.prompt_key=prompt \
    data.response_key=response \
    model.partial_pretrain=Qwen/Qwen2.5-0.5B-Instruct \
    trainer.total_epochs=5
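
When SFT runs are launched from a Python script (for example in a sweep), the same command can be assembled programmatically. The sketch below only builds and prints the argument list shown above; the helper name build_sft_command is hypothetical and not part of verl.

```python
import os
import shlex


def build_sft_command(overrides, nproc_per_node=8):
    """Return the torchrun argv list for verl.trainer.sft_trainer (hypothetical helper)."""
    cmd = [
        "torchrun", "--standalone", "--nnodes=1",
        f"--nproc_per_node={nproc_per_node}",
        "-m", "verl.trainer.sft_trainer",
    ]
    # Hydra overrides are plain key=value tokens appended after the module name.
    cmd += [f"{key}={value}" for key, value in overrides.items()]
    return cmd


data_dir = os.path.expanduser("~/data/gsm8k")
overrides = {
    "data.train_files": f"{data_dir}/train.parquet",
    "data.val_files": f"{data_dir}/test.parquet",
    "data.prompt_key": "prompt",
    "data.response_key": "response",
    "model.partial_pretrain": "Qwen/Qwen2.5-0.5B-Instruct",
    "trainer.total_epochs": 5,
}
cmd = build_sft_command(overrides)
print(shlex.join(cmd))  # pass `cmd` to subprocess.run(cmd, check=True) to launch
```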

Reference performance (GSM8K):

Model	Method	Score
google/gemma-2-2b-it	HF checkpoint	23.9
google/gemma-2-2b-it	SFT	52.06
google/gemma-2-2b-it	SFT + PPO	64.02

DAPO

DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization

DAPO is a state-of-the-art open-source RL algorithm developed by ByteDance and Tsinghua University. Applying DAPO to the Qwen2.5-32B base model achieves 50% accuracy on AIME 2024, outperforming DeepSeek-R1-Zero-Qwen-32B with 50% fewer training steps. verl is the reference framework for DAPO.

DAPO makes four key contributions on top of GRPO:

Decoupled clip ratios (Clip-Higher): Separate ε values for positive and negative advantages, allowing more aggressive updates when the policy improves.
Dynamic sampling (group filtering): Groups where all responses succeed or all fail are discarded and resampled, ensuring every gradient step trains on informative signal.
Token-level loss (flexible aggregation): Uses token-mean loss aggregation rather than sequence-level averaging.
Overlong reward shaping: A linear penalty is applied to outputs that exceed a configurable soft length limit, encouraging concise reasoning.
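
The first and third items above can be illustrated with a minimal pure-Python sketch of a clipped surrogate loss that uses separate lower and upper clip bounds and averages over all tokens in the batch. This is an illustration under stated assumptions, not verl's implementation: the function name is hypothetical, and each response is assumed to carry a single scalar advantage shared by all of its tokens.

```python
import math


def clip_higher_token_mean_loss(logp_new, logp_old, advantages,
                                clip_low=0.2, clip_high=0.28):
    """Clipped policy-gradient loss with decoupled bounds and token-mean aggregation.

    logp_new, logp_old: list of per-response lists of token log-probabilities.
    advantages: one scalar advantage per response (assumed shared by its tokens).
    """
    total, n_tokens = 0.0, 0
    for new_seq, old_seq, adv in zip(logp_new, logp_old, advantages):
        for lp_new, lp_old in zip(new_seq, old_seq):
            ratio = math.exp(lp_new - lp_old)
            # Decoupled bounds: the ratio may rise to 1 + clip_high
            # but may only fall to 1 - clip_low.
            clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
            # Pessimistic (minimum) surrogate, negated to form a loss.
            total += -min(ratio * adv, clipped * adv)
            n_tokens += 1
    # Token-mean: divide by the total token count, so longer responses
    # contribute proportionally more than under per-sequence averaging.
    return total / n_tokens


# One token whose probability rose sharply under a positive advantage:
# the ratio exp(0.5) is clipped at 1 + clip_high = 1.28.
print(clip_higher_token_mean_loss([[0.5]], [[0.0]], [1.0]))
```

With clip_low equal to clip_high this reduces to the usual symmetric PPO clip; raising only clip_high leaves more room for low-probability tokens to grow.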

Quickstart:

# Step 1 — prepare data on the Ray cluster
bash prepare_dapo_data.sh   # downloads to ${HOME}/verl/data by default

# Step 2 — submit job from any machine
cd verl
export RAY_ADDRESS="http://${RAY_IP:-localhost}:8265"
export WORKING_DIR="${PWD}"
export RUNTIME_ENV="./recipe/dapo/runtime_env.yaml"
bash recipe/dapo/run_dapo_qwen2.5_32b.sh

Key configuration snippets:

# Decoupled clip ratios
actor_rollout_ref:
  actor:
    clip_ratio_low: 0.2
    clip_ratio_high: 0.28

# Dynamic sampling
data:
  gen_batch_size: 1536
  train_batch_size: 512
algorithm:
  filter_groups:
    enable: True
    metric: acc          # filter on accuracy
    max_num_gen_batches: 10

# Overlong reward shaping
data:
  max_response_length: 20480   # 16384 + 4096 buffer
reward_model:
  overlong_buffer:
    enable: True
    len: 4096
    penalty_factor: 1.0
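
To make the dynamic-sampling and overlong settings above concrete, here is a small sketch of the two rules they configure. Both function names are hypothetical, and the exact formula verl applies may differ; the sketch assumes the penalty grows linearly from 0 at the soft limit (max_response_length minus the buffer length) to -penalty_factor at max_response_length.

```python
def keep_group(accuracies):
    """Dynamic sampling: keep a prompt's group only if its outcomes are mixed."""
    return len(set(accuracies)) > 1


def overlong_penalty(response_len, max_response_length=20480,
                     buffer_len=4096, penalty_factor=1.0):
    """Linear length penalty inside the buffer zone (assumed form)."""
    soft_limit = max_response_length - buffer_len  # 16384 with the values above
    exceed = response_len - soft_limit
    if exceed <= 0:
        return 0.0
    # Capped at the buffer length, so the penalty never exceeds penalty_factor.
    return -min(exceed, buffer_len) / buffer_len * penalty_factor


print(keep_group([1, 1, 1]), keep_group([1, 0, 1]))
print(overlong_penalty(16384), overlong_penalty(18432), overlong_penalty(20480))
```

Because filtered groups are dropped, gen_batch_size is set larger than train_batch_size so that enough informative groups usually survive to fill a training batch.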

Reproduction results (AIME 2024):

Setup	AIME 2024 Acc.	Hardware
DAPO	52%	16×8×H800
DAPO w/o Dynamic Sampling	50%	16×8×H800
DAPO w/o Token-level Loss & Dynamic Sampling	44%	16×8×H20

The recipe/dapo/ directory in the main branch is the actively maintained version that tracks new verl features. The recipe/dapo branch is frozen for as-is reproduction of the original paper results.

RLOO (REINFORCE Leave-One-Out)

RLOO: Critic-Free Baseline via Leave-One-Out

RLOO (REINFORCE Leave-One-Out) is a critic-free algorithm that computes a per-sample baseline by averaging the rewards of all other responses in the same group, rather than training a dedicated value network. Each baseline is a leave-one-out (jackknife-style) mean of the group rewards, which typically yields lower variance than plain REINFORCE while remaining simpler than PPO.

Enable RLOO by setting:

algorithm.adv_estimator=rloo
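
The leave-one-out baseline described above can be sketched in a few lines. This is a standalone illustration for one prompt's group of scalar rewards, not verl's estimator code; the function name is hypothetical.

```python
def rloo_advantages(rewards):
    """Advantage of each response = its reward minus the mean of the others."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two responses per prompt")
    total = sum(rewards)
    # (total - r) / (n - 1) is the mean reward of the other n - 1 responses.
    return [r - (total - r) / (n - 1) for r in rewards]


group_rewards = [1.0, 0.0, 0.0, 1.0]  # e.g. rollout.n = 4 with binary rewards
print(rloo_advantages(group_rewards))
```

The advantages within a group always sum to zero, and a group with identical rewards yields all-zero advantages.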

Because RLOO is critic-free, you do not need to provide critic.model.path or any critic configuration.

Example:

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=rloo \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    data.train_batch_size=512 \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
    actor_rollout_ref.rollout.n=8 \
    trainer.total_epochs=15

Reference scripts: examples/rloo_trainer/

Reference performance (GSM8K):

Model	Method	Score
Mixtral-8x22B-Instruct-v0.1	RLOO (Megatron)	92.3

ReMax

ReMax: Greedy Rollout as Variance-Reduction Baseline

ReMax replaces the critic with a greedy rollout baseline: for each prompt, one deterministic (greedy) response is generated alongside the stochastic rollouts. The greedy response reward serves as the per-prompt baseline, reducing variance without requiring a trained value function.

Enable ReMax by setting:

algorithm.adv_estimator=remax
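
As a rough illustration of the greedy baseline, the sketch below subtracts the greedy response's reward from each sampled response's reward for one prompt. The function name is hypothetical and the code is not taken from verl.

```python
def remax_advantages(sampled_rewards, greedy_reward):
    """Advantage of each sampled response relative to the greedy rollout's reward."""
    return [r - greedy_reward for r in sampled_rewards]


# The greedy decode solved the problem (reward 1.0); one of two samples did not.
print(remax_advantages([1.0, 0.0], greedy_reward=1.0))
```

The trade-off is one extra generation per prompt in exchange for not training or storing a value network.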

Reference scripts: examples/remax_trainer/

Reference performance (GSM8K):

Model	Method	Score
Qwen/Qwen2.5-7B-Instruct	ReMax	97

REINFORCE++

REINFORCE++: Improved REINFORCE with Variance Reduction

REINFORCE++ improves on plain REINFORCE by applying several variance-reduction techniques, including token-level reward normalisation and a whitened advantage estimate, without introducing a critic model. It is a strong critic-free baseline when you want something between plain REINFORCE and full PPO.

Enable REINFORCE++ by setting:

algorithm.adv_estimator=reinforce_plus_plus

A baseline variant that subtracts a running mean of rewards is also available:

algorithm.adv_estimator=reinforce_plus_plus_baseline

Both estimators are compatible with the standard verl.trainer.main_ppo entry point and accept all the same actor configuration parameters as PPO.
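
A whitened advantage estimate of the kind mentioned above can be sketched as follows. The sketch assumes token-level rewards, a discounted return-to-go per token, and normalisation to zero mean and unit variance over all tokens in the batch; the function name, the gamma default, and the use of population variance are illustrative assumptions rather than verl's exact code.

```python
import math


def whitened_return_advantages(token_rewards, gamma=1.0, eps=1e-8):
    """Return-to-go per token, then whitening across every token in the batch."""
    returns = []
    for seq in token_rewards:
        running, seq_returns = 0.0, []
        for r in reversed(seq):
            running = r + gamma * running  # discounted sum of future rewards
            seq_returns.append(running)
        returns.append(seq_returns[::-1])

    flat = [g for seq in returns for g in seq]
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((g - mean) ** 2 for g in flat) / len(flat))
    # Whitening: subtract the batch mean and divide by the batch std.
    return [[(g - mean) / (std + eps) for g in seq] for seq in returns]


# Outcome rewards placed on the final token: one success, one failure.
print(whitened_return_advantages([[0.0, 0.0, 1.0], [0.0, 0.0]]))
```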

SPIN (Self-Play Fine-Tuning)

SPIN: Iterative Self-Play Without Human Preferences

SPIN (Self-Play Fine-Tuning) enables iterative self-improvement through an online DPO training loop, without requiring external preference datasets or a stronger teacher model.

Core idea: In each iteration, the current model generates responses and the training objective is to distinguish the current model’s outputs from the previous iteration’s outputs, a two-player game in which the LLM plays both roles.

Key implementation details in verl:

No critic model is used.
An explicit reference policy (ref_policy_wg) provides the DPO baseline, with weights updated from the actor at a configurable frequency (trainer.ref_update_freq).
Online preference pairs are generated dynamically using rule-based reward ranking (e.g., selecting the better answer for math problems).
The DPO loss (compute_online_dpo_loss) replaces the PPO surrogate in the actor update.
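
The pair construction and loss from the list above can be sketched as follows. This shows the standard sigmoid DPO loss on sequence-level log-probabilities together with a simple best-versus-worst pairing rule. It is not the body of compute_online_dpo_loss, and both function names and the beta default are hypothetical.

```python
import math


def make_preference_pair(responses, rewards):
    """Pick (chosen, rejected) by rule-based reward; None if all rewards tie."""
    best = max(range(len(rewards)), key=rewards.__getitem__)
    worst = min(range(len(rewards)), key=rewards.__getitem__)
    if rewards[best] == rewards[worst]:
        return None  # no preference signal in this group
    return responses[best], responses[worst]


def dpo_sigmoid_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * margin), where margin compares policy and reference."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    x = beta * margin
    # Numerically stable -log(sigmoid(x)), i.e. softplus(-x).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


print(make_preference_pair(["a", "b", "c"], [0, 1, 0]))
print(dpo_sigmoid_loss(-10.0, -12.0, -11.0, -11.0))
```

When the policy and reference agree (zero margin) the loss equals log 2, and it decreases as the policy raises the chosen response's likelihood relative to the reference.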

Running SPIN:

# Prepare data and model
python3 examples/data_preprocess/gsm8k.py --local_save_dir ~/data/gsm8k
hf download Qwen/Qwen2.5-3B-Instruct --local-dir $HOME/models/Qwen2.5-3B-Instruct

# Launch training
export CUDA_VISIBLE_DEVICES=0,1,2,3
bash recipe/spin/run_spin.sh

Key configuration parameters:

Parameter	Description
`algorithm.dpo_beta`	DPO regularisation coefficient
`algorithm.dpo_loss_type`	DPO loss variant
`trainer.ref_update_freq`	How often (in steps) the reference model weights are updated from the actor. Set to 0 to disable.

Reference performance (GSM8K):

Model	Method	Score
Qwen/Qwen2.5-7B-Instruct	SPIN	92

Recipe location: recipe/spin/

SPPO (Self-Play Preference Optimization)

SPPO: Nash Equilibrium Alignment via Self-Play

SPPO (Self-Play Preference Optimization) frames LLM alignment as finding the Nash equilibrium of a two-player game, allowing the model to improve without strong external signals such as GPT-4 responses. SPPO is theoretically grounded, converges to the von Neumann winner under general (potentially intransitive) preference relations, and empirically outperforms iterative DPO on several benchmarks.

Running SPPO:

# Install verl
python3 -m uv pip install -e ".[sglang]"

# Prepare data and model
python3 examples/data_preprocess/math_dataset.py --local_dir ~/data/math
hf download Qwen/Qwen2.5-7B-Instruct --local-dir $HOME/models/Qwen2.5-7B-Instruct

# Launch training
export CUDA_VISIBLE_DEVICES=0,1,2,3
bash recipe/sppo/run_qwen2.5-7b_rm.sh

Reference performance (MATH dataset):

Model	Method	MATH Score
Qwen/Qwen2.5-7B-Instruct	HF checkpoint	46.6
Qwen/Qwen2.5-7B-Instruct	SPPO (20 epochs)	65.6

Recipe location: recipe/sppo/

verl’s internal evaluation metrics may not perfectly align with the official Qwen2.5-7B-Instruct evaluation methodology. The scores above are reported under verl’s evaluation framework for consistency.

Entropy Mechanism (Clip-Cov / KL-Cov)

Entropy Mechanism: Preventing Entropy Collapse in RL

Policy entropy drops sharply during RL training, causing overconfidence and performance saturation, a phenomenon known as entropy collapse. The Entropy Mechanism paper (arXiv:2505.22617) establishes an empirical relationship between entropy H and task performance R: R = −a·exp(H) + b. Performance is therefore bottlenecked by entropy exhaustion: once H reaches 0, the fit predicts a ceiling of R = b − a.

Two strategies are proposed to mitigate this:

Clip-Cov: Restricts gradient updates for tokens with high covariance between action probability and logit updates.
KL-Cov: Applies a KL penalty weighted by the same covariance term.
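
Both strategies need a per-token covariance score. The sketch below computes one under a loud assumption: the token's advantage stands in for its logit update, and the score is the centred product of log-probability and advantage. The function names and the top-fraction selection rule are hypothetical illustrations, not the recipe's code.

```python
def token_covariances(logprobs, advantages):
    """Per-token centred product of log-probability and advantage (assumed proxy)."""
    n = len(logprobs)
    mean_lp = sum(logprobs) / n
    mean_adv = sum(advantages) / n
    return [(lp - mean_lp) * (a - mean_adv) for lp, a in zip(logprobs, advantages)]


def top_cov_indices(logprobs, advantages, fraction):
    """Indices of the highest-covariance tokens, e.g. the ones to clip or penalise."""
    covs = token_covariances(logprobs, advantages)
    k = max(1, int(len(covs) * fraction))
    return sorted(range(len(covs)), key=lambda i: covs[i], reverse=True)[:k]


logprobs = [-0.1, -2.0, -1.0, -0.5]
advantages = [1.0, -1.0, 0.0, 0.5]
print(top_cov_indices(logprobs, advantages, fraction=0.5))
```

Under this sketch, Clip-Cov would exclude the selected tokens from the gradient update, while KL-Cov would add a KL penalty on them.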

Both methods are implemented as extensions of GRPO and can be activated via the recipe/dapo/ scripts:

# KL-Cov on Qwen2.5-7B (single node)
bash recipe/dapo/7b_kl_cov.sh

# KL-Cov on Qwen2.5-32B (multi-node)
bash recipe/dapo/32b_kl_cov.sh

Benchmark results (Qwen2.5-7B):

Method	AIME24	AIME25	MATH-500	Avg
GRPO	21.2	9.6	78.8	38.6
+ Clip-Cov	22.1	15.8	80.4	40.4
+ KL-Cov	22.6	12.9	80.8	40.6
