verl maintains a set of reproducible baselines so that you can calibrate expected performance before committing to a full training run and verify that your environment is configured correctly. Every result on this page was produced with verl and links to the exact training command and log file used.
The default value of actor_rollout_ref.actor.entropy_coeff was changed from a non-zero value to 0.0 in verl 0.3.x (2025-05-30). Results produced with earlier versions may differ because entropy regularisation was then enabled by default.

Reproducing a Baseline

1. Preprocess the dataset

Download and tokenise the dataset into the parquet format expected by verl’s data loaders:
```shell
# GSM8K
python3 examples/data_preprocess/gsm8k.py --local_save_dir ~/data/gsm8k

# MATH (for SFT or GRPO on MATH)
python3 examples/data_preprocess/math_dataset.py --local_dir ~/data/math

# DAPO-Math-17k (for DAPO / AIME experiments)
bash prepare_dapo_data.sh
```
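
Before launching a run, it can help to confirm that preprocessing actually produced its output. The sketch below assumes each preprocessing script writes `train.parquet` and `test.parquet` into the target directory; those file names and the `check_dataset` helper are assumptions for illustration, so adjust them if your output differs.

```shell
# Sketch: confirm that a preprocessed dataset directory holds non-empty splits.
# ASSUMPTION: preprocessing writes train.parquet and test.parquet.
check_dataset() {
    local dir="$1"
    local missing=0
    local split
    for split in train test; do
        # -s is true only when the file exists and is larger than zero bytes.
        if [ -s "$dir/$split.parquet" ]; then
            echo "ok: $dir/$split.parquet"
        else
            echo "missing or empty: $dir/$split.parquet" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_dataset ~/data/gsm8k && check_dataset ~/data/math` returns a non-zero status if any split is absent.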

2. Run a canonical training script

Each algorithm ships a ready-to-use shell script. Pick the one that matches your model and backend:
```shell
# PPO: Qwen3-8B, FSDP
bash examples/ppo_trainer/run_qwen3_8b_fsdp.sh

# GRPO: Qwen3-8B, FSDP
bash examples/grpo_trainer/run_qwen3_8b_fsdp.sh

# GRPO: Qwen3-8B, Megatron-LM
bash examples/grpo_trainer/run_qwen3_8b_megatron.sh
```
You can override the most commonly tuned knobs via environment variables:
```shell
MODEL_PATH=Qwen/Qwen3-8B \
TRAIN_BATCH_SIZE=1024 \
ROLLOUT_N=8 \
TOTAL_EPOCHS=15 \
bash examples/grpo_trainer/run_qwen3_8b_fsdp.sh
```
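
When you launch several runs from one shell, it matters whether an override is set as a command prefix (as above) or exported. The sketch below illustrates the scope of each form using a hypothetical one-line stand-in for a launch script; the `launch` string and its fallback values of 8 and 15 are assumptions for illustration, not the shipped scripts' defaults.

```shell
# Start from a clean slate so the demo output is predictable.
unset ROLLOUT_N TOTAL_EPOCHS

# Hypothetical stand-in for a launch script: it reports the values it would use.
# ${VAR:-default} expands to $VAR when set and non-empty, otherwise to default.
launch='echo "rollout_n=${ROLLOUT_N:-8} epochs=${TOTAL_EPOCHS:-15}"'

# Prefix form: the override applies to this one command only.
ROLLOUT_N=16 sh -c "$launch"    # prints: rollout_n=16 epochs=15

# The next launch falls back to the placeholder defaults again.
sh -c "$launch"                 # prints: rollout_n=8 epochs=15

# Export form: the override persists for every later launch in this shell.
export TOTAL_EPOCHS=3
sh -c "$launch"                 # prints: rollout_n=8 epochs=3
```

The prefix form is usually the safer choice for one-off experiments, because a forgotten `export` can silently change a later baseline run.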

3. Monitor training

All canonical scripts log to both the console and Weights & Biases. Set your API key before launching:
```shell
export WANDB_API_KEY=<your_key>
```
The project name defaults to verl_ppo_gsm8k_math (PPO) or verl_grpo_gsm8k_math (GRPO).
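
A missing key is easier to catch before the job starts than partway through start-up. One way to guard a launch is sketched below; `require_wandb_key` is a hypothetical helper, not part of verl.

```shell
# Sketch: fail fast when WANDB_API_KEY is unset or empty.
require_wandb_key() {
    if [ -z "${WANDB_API_KEY:-}" ]; then
        echo "WANDB_API_KEY is not set; export it before launching." >&2
        return 1
    fi
}
```

Chaining it in front of the launch, as in `require_wandb_key && bash examples/grpo_trainer/run_qwen3_8b_fsdp.sh`, skips the run when the key is absent.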

GSM8K Baselines

The table below covers results on the GSM8K test set. All runs use NVIDIA GPUs unless otherwise noted.
Results marked [1] used strict "####" answer extraction during evaluation. A more flexible extraction method, longer response length, or improved prompting may yield higher scores.
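
To make the strict rule concrete, here is a minimal sketch of this style of extraction. It assumes the response is expected to contain a marker of the form `#### <number>`, as GSM8K reference solutions do; the `extract_strict` helper is hypothetical and is not verl's scoring code.

```shell
# Print the number that follows the last "#### " marker in a response.
# Returns 1 (and prints nothing) when the response has no such marker.
extract_strict() {
    local answer
    answer="$(printf '%s\n' "$1" \
        | grep -oE '#### -?[0-9.,]+' \
        | tail -n 1 \
        | sed -e 's/^#### //' -e 's/,//g')"
    [ -n "$answer" ] || return 1
    printf '%s\n' "$answer"
}

extract_strict "She sells 9 eggs at 2 dollars each. #### 18"   # prints: 18
extract_strict "The total comes to 1,250. #### 1,250"          # prints: 1250
extract_strict "The answer is 18." || echo "no strict match"   # prints: no strict match
```

The last call shows why strict scoring can understate accuracy: the response contains the right number but no marker, so a strict scorer would count it as wrong.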

Small Models (≤ 3B)

| Hardware | Model | Method | Test Score | Reference |
| --- | --- | --- | --- | --- |
| NVIDIA GPU | google/gemma-2-2b-it | HF checkpoint | 23.9 | HuggingFace |
| NVIDIA GPU | google/gemma-2-2b-it | SFT | 52.06 | Log |
| NVIDIA GPU | google/gemma-2-2b-it | SFT + PPO | 64.02 | Log |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | HF checkpoint | 49.6 | Qwen Blog |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | PPO | 56.7 | Log |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | PRIME | 58.7 | Script |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | GRPO-LoRA | 54.3 | Log |
| NVIDIA GPU | Qwen/Qwen2.5-1.5B-Instruct | GRPO-LoRA | 77.9 | Log |
| NVIDIA GPU | Qwen/Qwen2.5-3B-Instruct | GRPO-LoRA | 86.1 | Log |

Mid-Size Models (7B)

| Hardware | Model | Method | Test Score | Reference |
| --- | --- | --- | --- | --- |
| NVIDIA GPU | deepseek-ai/deepseek-llm-7b-chat | PPO (Megatron) | 69.5 [1] | Log |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO | 89 | Script |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO (FSDP2) | 89.8 | Log |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO (Megatron) | 89.6 | Log |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GPG | 88.0 | Log |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GPG (Megatron) | 88.0 | Log |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | ReMax | 97 | Script |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | GRPO-LoRA | 93.4 | Log |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | SPIN | 92 | Script |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | SPPO | 65.6 (MATH) | Script |
| AMD MI300 | deepseek-ai/deepseek-llm-7b-chat | PPO | 70.5 [1] | Log |
| AMD MI300 | deepseek-ai/deepseek-llm-7b-chat | GRPO | 71.4 [1] | Log |

Large Models (14B – 72B)

| Hardware | Model | Method | Test Score | Reference |
| --- | --- | --- | --- | --- |
| NVIDIA GPU | Mixtral-8x22B-Instruct-v0.1 | Instruct model | 83.7 | Qwen Blog |
| NVIDIA GPU | Mixtral-8x22B-Instruct-v0.1 | RLOO (Megatron) | 92.3 | W&B |
| NVIDIA GPU | Qwen/Qwen2.5-14B-Instruct | GRPO-LoRA | 94.6 | Log |
| NVIDIA GPU | Qwen/Qwen2.5-32B-Instruct | GRPO-LoRA | 95.8 | Log |
| NVIDIA GPU | Qwen/Qwen2.5-72B-Instruct | GRPO-LoRA | 96.0 | Log |

Reproducing results for 14B+ models requires multiple high-memory GPUs (A100 80GB or H100). The LoRA variants are more accessible: the GRPO-LoRA results above use LoRA rank 32 and can be run on four to eight A100 40GB GPUs.

DAPO Baselines (AIME 2024)

DAPO experiments use the DAPO-Math-17k training set and the AIME-2024 test set.
For Qwen/Qwen2.5-Math-7B, max_position_embeddings was extended to 32768 to accommodate longer responses; no performance degradation was observed.

| Hardware | Model | Method | AIME 2024 Acc. | Reference |
| --- | --- | --- | --- | --- |
| NVIDIA GPU | Qwen/Qwen2.5-Math-7B (32k) | DAPO | 36.3 | W&B |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | DAPO + Code Interpreter | 40.0 | Script |

The full DAPO-32B run requires 16 nodes with 8 H800 GPUs each (128 GPUs in total). See recipe/dapo/run_dapo_qwen2.5_32b.sh for the exact launch configuration.

Coding Baselines (LeetCode)

| Hardware | Model | Method | LeetCode Score | Reference |
| --- | --- | --- | --- | --- |
| NVIDIA GPU | PRIME-RL/Eurus-2-7B-SFT | PRIME | 36.1 | Script |

Vision-Language Baselines

| Hardware | Model | Method | Dataset | Score | Reference |
| --- | --- | --- | --- | --- | --- |
| NVIDIA GPU | Qwen/Qwen2.5-VL-7B-Instruct | GRPO (Megatron) | GEO3k | 65.4 | Script |

More Comprehensive Benchmarks

The results on this page cover the most commonly reproduced configurations. For a broader set of benchmarks — including additional model families, multi-node scale runs, and ablation studies — see the recipe/ directory in the verl-recipe repository.

PPO Configuration Guide

Full reference for PPO hyperparameters, KL control, Dual-Clip, and training backends.

GRPO Configuration Guide

Configure group size, KL loss type, loss aggregation, and DrGRPO length-bias correction.
