verl maintains a set of reproducible baselines to help you calibrate expected performance before committing to a full training run, and to verify that your environment is configured correctly. Every result in this page was produced with verl and links to the exact training command and log file used.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/verl-project/verl/llms.txt
Use this file to discover all available pages before exploring further.
The default value of
actor_rollout_ref.actor.entropy_coeff was changed from a non-zero value to 0.0 in verl 0.3.x (2025-05-30). Results from earlier versions may differ if entropy regularisation was active by default.Reproducing a Baseline
Preprocess the dataset
Download and tokenise the dataset into the parquet format expected by verl’s data loaders:
Run a canonical training script
Each algorithm ships a ready-to-use shell script. Pick the one that matches your model and backend:You can override the most commonly tuned knobs via environment variables:
GSM8K Baselines
The table below covers results on the GSM8K test set. All runs use NVIDIA GPUs unless otherwise noted.Results marked [1] used strict
"####" answer extraction during evaluation. A more flexible extraction method, longer response length, or improved prompting may yield higher scores.Small Models (≤ 3B)
| Hardware | Model | Method | Test Score | Reference |
|---|---|---|---|---|
| NVIDIA GPU | google/gemma-2-2b-it | HF checkpoint | 23.9 | HuggingFace |
| NVIDIA GPU | google/gemma-2-2b-it | SFT | 52.06 | Log |
| NVIDIA GPU | google/gemma-2-2b-it | SFT + PPO | 64.02 | Log |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | HF checkpoint | 49.6 | Qwen Blog |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | PPO | 56.7 | Log |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | PRIME | 58.7 | Script |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | GRPO-LoRA | 54.3 | Log |
| NVIDIA GPU | Qwen/Qwen2.5-1.5B-Instruct | GRPO-LoRA | 77.9 | Log |
| NVIDIA GPU | Qwen/Qwen2.5-3B-Instruct | GRPO-LoRA | 86.1 | Log |
Mid-Size Models (7B)
| Hardware | Model | Method | Test Score | Reference |
|---|---|---|---|---|
| NVIDIA GPU | deepseek-ai/deepseek-llm-7b-chat | PPO (Megatron) | 69.5 [1] | Log |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO | 89 | Script |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO (FSDP2) | 89.8 | Log |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO (Megatron) | 89.6 | Log |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GPG | 88.0 | Log |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GPG (Megatron) | 88.0 | Log |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | ReMax | 97 | Script |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | GRPO-LoRA | 93.4 | Log |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | SPIN | 92 | Script |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | SPPO | 65.6 (MATH) | Script |
| AMD MI300 | deepseek-ai/deepseek-llm-7b-chat | PPO | 70.5 [1] | Log |
| AMD MI300 | deepseek-ai/deepseek-llm-7b-chat | GRPO | 71.4 [1] | Log |
Large Models (14B – 72B)
| Hardware | Model | Method | Test Score | Reference |
|---|---|---|---|---|
| NVIDIA GPU | Mixtral-8x22B-Instruct-v0.1 | Instruct model | 83.7 | Qwen Blog |
| NVIDIA GPU | Mixtral-8x22B-Instruct-v0.1 | RLOO (Megatron) | 92.3 | W&B |
| NVIDIA GPU | Qwen/Qwen2.5-14B-Instruct | GRPO-LoRA | 94.6 | Log |
| NVIDIA GPU | Qwen/Qwen2.5-32B-Instruct | GRPO-LoRA | 95.8 | Log |
| NVIDIA GPU | Qwen/Qwen2.5-72B-Instruct | GRPO-LoRA | 96.0 | Log |
DAPO Baselines (AIME 2024)
DAPO experiments use the DAPO-Math-17k training set and the AIME-2024 test set.For Qwen/Qwen2.5-Math-7B, the
max_position_embeddings was extended to 32768 without observed performance degradation in order to accommodate longer response lengths.Coding Baselines (LeetCode)
| Hardware | Model | Method | LeetCode Score | Reference |
|---|---|---|---|---|
| NVIDIA GPU | PRIME-RL/Eurus-2-7B-SFT | PRIME | 36.1 | Script |
Vision-Language Baselines
| Hardware | Model | Method | Dataset | Score | Reference |
|---|---|---|---|---|---|
| NVIDIA GPU | Qwen/Qwen2.5-VL-7B-Instruct | GRPO (Megatron) | GEO3k | 65.4 | Script |
More Comprehensive Benchmarks
The results on this page cover the most commonly reproduced configurations. For a broader set of benchmarks — including additional model families, multi-node scale runs, and ablation studies — see therecipe/ directory in the verl-recipe repository.
PPO Configuration Guide
Full reference for PPO hyperparameters, KL control, Dual-Clip, and training backends.
GRPO Configuration Guide
Configure group size, KL loss type, loss aggregation, and DrGRPO length-bias correction.