This guide walks you through your first reinforcement learning post-training run with verl. You will trainDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/verl-project/verl/llms.txt
Use this file to discover all available pages before exploring further.
Qwen2.5-0.5B-Instruct on the GSM8K elementary-school math dataset using Proximal Policy Optimization (PPO) with a rule-based reward function — no separate reward model required. The whole pipeline, from data preprocessing to a trained checkpoint, runs on a single GPU.
Prerequisites: verl and its dependencies must be installed (Docker image recommended — see the Installation guide). Your GPU must have at least 24 GB of HBM (e.g., an A10G, A100, H100, or equivalent).
About GSM8K
GSM8K is a dataset of 8,500 grade-school math word problems. The model is asked to produce a step-by-step solution ending with a numerical answer marked by four# symbols. verl extracts the final answer using regular-expression matching and assigns a reward of 1.0 for a correct answer and 0.0 otherwise. This makes it a clean testbed for RL training without any learned reward model.
Example prompt:
Katy makes coffee using teaspoons of sugar and cups of water in the ratio of 7:13. If she used a total of 120 teaspoons of sugar and cups of water, calculate the number of teaspoonfuls of sugar she used.Expected solution ending:
… she used 7/20 × 120 = 42 #### 42
PPO Training Walkthrough
Prepare the GSM8K dataset
verl reads training data from Parquet files. The preprocessing script downloads the dataset from Hugging Face and converts it to the required format, adding the fields needed to compute RL rewards:This writes two files:
~/data/gsm8k/train.parquet— 7,473 training problems~/data/gsm8k/test.parquet— 1,319 test problems
Download the model
The example uses If you want to perform supervised fine-tuning (SFT) before RL, see the SFT & Other Algorithms guide and the SFT Trainer.
Qwen/Qwen2.5-0.5B-Instruct, a compact but capable instruction-tuned model. Download it via the Transformers pipeline (this caches the weights to ~/.cache/huggingface):Run PPO training
verl uses Hydra for configuration. All settings are passed as command-line overrides. Launch training with the command below, adjusting Key configuration fields explained:
Expected console output:You will see per-step logs like the following:The key validation metric Saving and merging checkpointsCheckpoints are saved by default to
data.train_files, data.val_files, actor_rollout_ref.model.path, and critic.model.path to match your local paths if they differ:| Config key | Purpose |
|---|---|
data.train_files / data.val_files | Paths to the Parquet files produced in Step 1 |
data.train_batch_size | Total number of prompts per training step (across all GPUs) |
data.max_prompt_length / data.max_response_length | Token length caps for prompts and generated responses |
actor_rollout_ref.model.path | Hugging Face model name or local path for the actor and reference policy |
actor_rollout_ref.actor.optim.lr | Learning rate for the actor (policy) |
actor_rollout_ref.actor.ppo_mini_batch_size | Mini-batch size for PPO gradient updates |
actor_rollout_ref.rollout.tensor_model_parallel_size | Tensor parallelism degree for the vLLM rollout engine |
actor_rollout_ref.rollout.gpu_memory_utilization | Fraction of GPU memory reserved for the vLLM KV cache |
critic.model.path | Hugging Face model name or local path for the critic (value model) |
algorithm.kl_ctrl.kl_coef | KL penalty coefficient — keeps the policy from drifting too far from the reference |
trainer.total_epochs | Number of complete passes over the training dataset |
val/test_score/openai/gsm8k is computed every trainer.test_freq steps.If you run out of GPU memory (HBM < 32 GB), reduce the micro-batch sizes:checkpoints/${trainer.project_name}/${trainer.experiment_name}. To export a checkpoint to a standard Hugging Face format:Monitor training with Weights & Biases
By default, verl logs metrics to the console only. To enable Weights & Biases experiment tracking, add the following overrides to your training command:Key metrics to watch:
See the Algorithm Baselines page for reference training curves and expected final accuracy values.
| Metric | Description |
|---|---|
critic/score/mean | Mean reward assigned by the rule-based reward function |
response_length/mean | Average number of tokens in model responses |
actor/pg_loss | Policy gradient loss — should decrease as training progresses |
critic/vf_loss | Value function loss — high initially, should stabilize |
val/test_score/openai/gsm8k | Held-out test accuracy (computed every test_freq steps) |
Running GRPO Instead of PPO
GRPO (Group Relative Policy Optimization) is a simpler RL algorithm that does not require a separate critic network. It estimates advantages by comparing rewards within a group of responses sampled for the same prompt. This reduces GPU memory usage and simplifies the config. Switch from PPO to GRPO by settingalgorithm.adv_estimator=grpo and dropping all critic.* overrides:
algorithm.adv_estimator=grpo, which switches the advantage estimator from GAE to GRPO’s group-relative baseline. Because there is no value network, all critic.* parameters are removed.
Next Steps
- Scale to multiple GPUs or nodes: see Multi-Node Training
- Explore more training scripts in
examples/ppo_trainer/ - Understand every config parameter: see the Config Reference page
- Try a full SFT + RL pipeline: see the Complete GSM8K Example