verl Algorithm Performance Baselines and Benchmarks

verl maintains a set of reproducible baselines so that you can calibrate expected performance before committing to a full training run and verify that your environment is configured correctly. Results on this page were produced with verl, except for reference checkpoint scores cited from external sources, and each row's Reference column links to the script, log, or external source behind it.

The default value of actor_rollout_ref.actor.entropy_coeff was changed from a non-zero value to 0.0 in verl 0.3.x (2025-05-30). Results from earlier versions may differ if entropy regularisation was active by default.
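
To see why this default matters, the sketch below shows how an entropy bonus commonly enters a PPO-style actor loss. It is a minimal illustration rather than verl's implementation: the function name `actor_loss` and the scalar inputs are assumptions made for the example.

```python
def actor_loss(pg_loss: float, entropy: float, entropy_coeff: float = 0.0) -> float:
    """Combine a policy-gradient loss with an entropy bonus.

    Hypothetical helper for illustration only. A positive entropy_coeff
    lowers the loss for higher-entropy (more exploratory) policies.
    """
    return pg_loss - entropy_coeff * entropy


# With a coefficient of 0.0 the entropy term vanishes.
print(actor_loss(pg_loss=0.5, entropy=2.0))
# With a non-zero coefficient the same batch yields a different loss.
print(actor_loss(pg_loss=0.5, entropy=2.0, entropy_coeff=0.01))
```

When comparing against a result from an older version, setting actor_rollout_ref.actor.entropy_coeff explicitly to the value that run used removes this source of difference.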

Reproducing a Baseline

Preprocess the dataset

Download the dataset and convert it into the Parquet format expected by verl's data loaders:

# GSM8K
python3 examples/data_preprocess/gsm8k.py --local_save_dir ~/data/gsm8k

# MATH (for SFT or GRPO on MATH)
python3 examples/data_preprocess/math_dataset.py --local_dir ~/data/math

# DAPO-Math-17k (for DAPO / AIME experiments)
bash prepare_dapo_data.sh
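
Before launching a run, it can help to confirm that preprocessing produced the files the trainer will read. The sketch below assumes each script writes `train.parquet` and `test.parquet` into its output directory; those file names and the helper `missing_parquet_files` are assumptions, so adjust them to whatever your scripts actually emit.

```python
from pathlib import Path


def missing_parquet_files(data_dir, expected=("train.parquet", "test.parquet")):
    """Return the expected file names that are absent from data_dir.

    The default file names are an assumption for illustration.
    """
    root = Path(data_dir).expanduser()
    return [name for name in expected if not (root / name).is_file()]


# Report the status of the two output directories used in the commands above.
for directory in ("~/data/gsm8k", "~/data/math"):
    missing = missing_parquet_files(directory)
    status = "ok" if not missing else f"missing {missing}"
    print(f"{directory}: {status}")
```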

Run a canonical training script

Each algorithm ships a ready-to-use shell script. Pick the one that matches your model and backend:

# PPO — Qwen3-8B, FSDP
bash examples/ppo_trainer/run_qwen3_8b_fsdp.sh

# GRPO — Qwen3-8B, FSDP
bash examples/grpo_trainer/run_qwen3_8b_fsdp.sh

# GRPO — Qwen3-8B, Megatron-LM
bash examples/grpo_trainer/run_qwen3_8b_megatron.sh

You can override the most commonly tuned knobs via environment variables:

MODEL_PATH=Qwen/Qwen3-8B \
TRAIN_BATCH_SIZE=1024 \
ROLLOUT_N=8 \
TOTAL_EPOCHS=15 \
bash examples/grpo_trainer/run_qwen3_8b_fsdp.sh
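
The same overrides can be applied from Python, for example when scripting a small sweep. This is a minimal sketch: `build_env` and `launch` are hypothetical helpers, and the variable names are the ones shown in the command above.

```python
import os
import subprocess


def build_env(overrides, base=None):
    """Return a copy of the base environment with overrides applied as strings."""
    env = dict(os.environ if base is None else base)
    env.update({key: str(value) for key, value in overrides.items()})
    return env


def launch(script, overrides):
    """Run a training script with the given overrides (hypothetical helper)."""
    return subprocess.run(["bash", script], env=build_env(overrides), check=True)


# Show the resolved overrides without starting a run; call launch() to train.
overrides = {"MODEL_PATH": "Qwen/Qwen3-8B", "ROLLOUT_N": 8, "TOTAL_EPOCHS": 15}
resolved = build_env(overrides)
print({key: resolved[key] for key in overrides})
```

Environment values must be strings, which is why `build_env` converts integers such as `ROLLOUT_N` before passing them on.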

Monitor training

All canonical scripts log to both the console and Weights & Biases. Set your API key before launching:

export WANDB_API_KEY=<your_key>

The project name defaults to verl_ppo_gsm8k_math (PPO) or verl_grpo_gsm8k_math (GRPO).
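
A small pre-flight check can catch a missing key before a long job is queued. The sketch below is illustrative only: `wandb_ready` is a hypothetical helper that checks whether the environment variable named above is non-empty, and it does not validate the key with the W&B service.

```python
import os


def wandb_ready(env=None):
    """Return True if WANDB_API_KEY is set to a non-empty value."""
    env = os.environ if env is None else env
    return bool(env.get("WANDB_API_KEY", "").strip())


if wandb_ready():
    print("WANDB_API_KEY is set.")
else:
    print("WANDB_API_KEY is not set; export it before launching.")
```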

GSM8K Baselines

The tables below report results on the GSM8K test set, except where a score is labelled with a different benchmark (the SPPO row reports MATH). All runs use NVIDIA GPUs unless otherwise noted.

Results marked [1] used strict "####" answer extraction during evaluation. A more flexible extraction method, longer response length, or improved prompting may yield higher scores.
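
The difference between the two extraction styles can be sketched as follows. This is a minimal illustration, not verl's scoring code: the regular expressions and the function names `extract_strict` and `extract_flexible` are assumptions, with the strict variant requiring the GSM8K-style `#### <number>` marker and the flexible variant taking the last number anywhere in the response.

```python
import re

# Strict: a number that directly follows the "#### " marker.
STRICT = re.compile(r"#### (-?[0-9.,]+)")
# Flexible: any integer or decimal, optionally with thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def _normalise(num):
    """Drop thousands separators and any trailing full stop."""
    return num.replace(",", "").rstrip(".")


def extract_strict(response):
    """Return the number after the last '#### ' marker, or None if absent."""
    matches = STRICT.findall(response)
    return _normalise(matches[-1]) if matches else None


def extract_flexible(response):
    """Return the last number anywhere in the response, or None."""
    matches = NUMBER.findall(response)
    return _normalise(matches[-1]) if matches else None


response = "She earns 9 * 2 = 18 dollars. The answer is 18."
print(extract_strict(response), extract_flexible(response))
```

A response that states the right answer without the marker gets no credit under strict extraction but is recovered by the flexible variant, which is why the [1] results may understate model accuracy.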

Small Models (≤ 3B)

Hardware	Model	Method	Test Score	Reference
NVIDIA GPU	google/gemma-2-2b-it	HF checkpoint	23.9	HuggingFace
NVIDIA GPU	google/gemma-2-2b-it	SFT	52.06	Log
NVIDIA GPU	google/gemma-2-2b-it	SFT + PPO	64.02	Log
NVIDIA GPU	Qwen/Qwen2.5-0.5B-Instruct	HF checkpoint	49.6	Qwen Blog
NVIDIA GPU	Qwen/Qwen2.5-0.5B-Instruct	PPO	56.7	Log
NVIDIA GPU	Qwen/Qwen2.5-0.5B-Instruct	PRIME	58.7	Script
NVIDIA GPU	Qwen/Qwen2.5-0.5B-Instruct	GRPO-LoRA	54.3	Log
NVIDIA GPU	Qwen/Qwen2.5-1.5B-Instruct	GRPO-LoRA	77.9	Log
NVIDIA GPU	Qwen/Qwen2.5-3B-Instruct	GRPO-LoRA	86.1	Log
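
When checking a reproduction against the table, a simple tolerance test is often enough. The sketch below uses three scores copied from the table above; the helper `matches_baseline` and the 1.0-point tolerance are assumptions chosen for illustration, not a threshold that verl defines.

```python
# Scores copied from the small-model table, keyed by (model, method).
BASELINES = {
    ("google/gemma-2-2b-it", "SFT + PPO"): 64.02,
    ("Qwen/Qwen2.5-0.5B-Instruct", "PPO"): 56.7,
    ("Qwen/Qwen2.5-3B-Instruct", "GRPO-LoRA"): 86.1,
}


def matches_baseline(model, method, score, tolerance=1.0):
    """Return True if score is no more than `tolerance` points below the baseline."""
    return score >= BASELINES[(model, method)] - tolerance


print(matches_baseline("Qwen/Qwen2.5-0.5B-Instruct", "PPO", 56.0))
print(matches_baseline("Qwen/Qwen2.5-0.5B-Instruct", "PPO", 54.0))
```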

Mid-Size Models (7B)

Hardware	Model	Method	Test Score	Reference
NVIDIA GPU	deepseek-ai/deepseek-llm-7b-chat	PPO (Megatron)	69.5 [1]	Log
NVIDIA GPU	Qwen/Qwen2-7B-Instruct	GRPO	89	Script
NVIDIA GPU	Qwen/Qwen2-7B-Instruct	GRPO (FSDP2)	89.8	Log
NVIDIA GPU	Qwen/Qwen2-7B-Instruct	GRPO (Megatron)	89.6	Log
NVIDIA GPU	Qwen/Qwen2-7B-Instruct	GPG	88.0	Log
NVIDIA GPU	Qwen/Qwen2-7B-Instruct	GPG (Megatron)	88.0	Log
NVIDIA GPU	Qwen/Qwen2.5-7B-Instruct	ReMax	97	Script
NVIDIA GPU	Qwen/Qwen2.5-7B-Instruct	GRPO-LoRA	93.4	Log
NVIDIA GPU	Qwen/Qwen2.5-7B-Instruct	SPIN	92	Script
NVIDIA GPU	Qwen/Qwen2.5-7B-Instruct	SPPO	65.6 (MATH)	Script
AMD MI300	deepseek-ai/deepseek-llm-7b-chat	PPO	70.5 [1]	Log
AMD MI300	deepseek-ai/deepseek-llm-7b-chat	GRPO	71.4 [1]	Log

Large Models (14B – 72B)

Hardware	Model	Method	Test Score	Reference
NVIDIA GPU	Mixtral-8x22B-Instruct-v0.1	Instruct model	83.7	Qwen Blog
NVIDIA GPU	Mixtral-8x22B-Instruct-v0.1	RLOO (Megatron)	92.3	W&B
NVIDIA GPU	Qwen/Qwen2.5-14B-Instruct	GRPO-LoRA	94.6	Log
NVIDIA GPU	Qwen/Qwen2.5-32B-Instruct	GRPO-LoRA	95.8	Log
NVIDIA GPU	Qwen/Qwen2.5-72B-Instruct	GRPO-LoRA	96.0	Log

Reproducing results for 14B+ models requires multiple high-memory GPUs (A100 80GB or H100). The LoRA variants are more accessible: the GRPO-LoRA results above use LoRA rank 32 and can be run on 4–8× A100 40GB.
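
Part of the reason LoRA is cheaper is that it trains two small low-rank factors instead of each full weight matrix. The arithmetic below is a rough sketch for a single square projection at rank 32; the 4096 hidden size is an illustrative assumption, not a value taken from any model in the table, and which layers the adapters attach to is not specified here.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters in one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the full weight matrix the adapter is attached to."""
    return d_in * d_out


hidden = 4096  # illustrative assumption
rank = 32      # rank quoted for the GRPO-LoRA runs
adapter = lora_params(hidden, hidden, rank)
full = full_params(hidden, hidden)
print(f"adapter={adapter:,} full={full:,} ratio={adapter / full:.2%}")
```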

DAPO Baselines (AIME 2024)

DAPO experiments use the DAPO-Math-17k training set and the AIME-2024 test set.

For Qwen/Qwen2.5-Math-7B, max_position_embeddings was extended to 32768 to accommodate longer responses; no performance degradation was observed.
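
One way to apply such an extension to a local checkpoint is to edit its `config.json`. The sketch below assumes a Hugging Face-style checkpoint directory and a hypothetical helper `extend_context`; this page does not state how the extension was actually performed, so treat this as one possible approach.

```python
import json
from pathlib import Path


def extend_context(checkpoint_dir, max_positions=32768):
    """Set max_position_embeddings in a checkpoint's config.json.

    Hypothetical helper: rewrites the file in place and returns the previous value.
    """
    config_path = Path(checkpoint_dir).expanduser() / "config.json"
    config = json.loads(config_path.read_text())
    previous = config.get("max_position_embeddings")
    config["max_position_embeddings"] = max_positions
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return previous
```

Calling `extend_context("/path/to/local/checkpoint")` on a copy of the checkpoint keeps the original download untouched.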

Hardware	Model	Method	AIME 2024 Acc.	Reference
NVIDIA GPU	Qwen/Qwen2.5-Math-7B (32k)	DAPO	36.3	W&B
NVIDIA GPU	Qwen/Qwen2.5-7B-Instruct	DAPO + Code Interpreter	40.0	Script

The full DAPO-32B run requires 16 nodes with 8 H800 GPUs each (128 GPUs in total). See recipe/dapo/run_dapo_qwen2.5_32b.sh for the exact launch configuration.

Coding Baselines (LeetCode)

Hardware	Model	Method	LeetCode Score	Reference
NVIDIA GPU	PRIME-RL/Eurus-2-7B-SFT	PRIME	36.1	Script

Vision-Language Baselines

Hardware	Model	Method	Dataset	Score	Reference
NVIDIA GPU	Qwen/Qwen2.5-VL-7B-Instruct	GRPO (Megatron)	GEO3k	65.4	Script

More Comprehensive Benchmarks

The results on this page cover the most commonly reproduced configurations. For a broader set of benchmarks, including additional model families, multi-node scale runs, and ablation studies, see the recipe/ directory in the verl-recipe repository.

PPO Configuration Guide

Full reference for PPO hyperparameters, KL control, Dual-Clip, and training backends.

GRPO Configuration Guide

Configure group size, KL loss type, loss aggregation, and DrGRPO length-bias correction.
