Quickstart: PPO Training on GSM8K with verl

This guide walks you through your first reinforcement learning post-training run with verl. You will train Qwen2.5-0.5B-Instruct on the GSM8K elementary-school math dataset using Proximal Policy Optimization (PPO) with a rule-based reward function — no separate reward model required. The whole pipeline, from data preprocessing to a trained checkpoint, runs on a single GPU.

Prerequisites: verl and its dependencies must be installed (Docker image recommended — see the Installation guide). Your GPU must have at least 24 GB of HBM (e.g., an A10G, A100, H100, or equivalent).

About GSM8K

GSM8K is a dataset of 8,500 grade-school math word problems. The model is asked to produce a step-by-step solution ending with a numerical answer marked by four # symbols. verl extracts the final answer using regular-expression matching and assigns a reward of 1.0 for a correct answer and 0.0 otherwise. This makes it a clean testbed for RL training without any learned reward model. Example prompt:

Katy makes coffee using teaspoons of sugar and cups of water in the ratio of 7:13. If she used a total of 120 teaspoons of sugar and cups of water, calculate the number of teaspoonfuls of sugar she used.

Expected solution ending:

… she used 7/20 × 120 = 42 #### 42

PPO Training Walkthrough

Prepare the GSM8K dataset

verl reads training data from Parquet files. The preprocessing script downloads the dataset from Hugging Face and converts it to the required format, adding the fields needed to compute RL rewards:

python3 examples/data_preprocess/gsm8k.py --local_save_dir ~/data/gsm8k

This writes two files:

~/data/gsm8k/train.parquet — 7,473 training problems
~/data/gsm8k/test.parquet — 1,319 test problems

Download the model

The example uses Qwen/Qwen2.5-0.5B-Instruct, a compact but capable instruction-tuned model. Download it via the Transformers pipeline (this caches the weights to ~/.cache/huggingface):

python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2.5-0.5B-Instruct')"

Set VERL_USE_MODELSCOPE=True to download models from ModelScope instead of Hugging Face — useful if you have faster access to ModelScope from your region.

If you want to perform supervised fine-tuning (SFT) before RL, see the SFT & Other Algorithms guide and the SFT Trainer.

Run PPO training

verl uses Hydra for configuration. All settings are passed as command-line overrides. Launch training with the command below, adjusting data.train_files, data.val_files, actor_rollout_ref.model.path, and critic.model.path to match your local paths if they differ:

PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    data.train_batch_size=256 \
    data.max_prompt_length=512 \
    data.max_response_length=512 \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=64 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
    critic.optim.lr=1e-5 \
    critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
    critic.ppo_micro_batch_size_per_gpu=4 \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.logger=console \
    trainer.val_before_train=False \
    trainer.n_gpus_per_node=1 \
    trainer.nnodes=1 \
    trainer.save_freq=10 \
    trainer.test_freq=10 \
    trainer.total_epochs=15 2>&1 | tee verl_demo.log

Key configuration fields explained:

Config key	Purpose
`data.train_files` / `data.val_files`	Paths to the Parquet files produced in Step 1
`data.train_batch_size`	Total number of prompts per training step (across all GPUs)
`data.max_prompt_length` / `data.max_response_length`	Token length caps for prompts and generated responses
`actor_rollout_ref.model.path`	Hugging Face model name or local path for the actor and reference policy
`actor_rollout_ref.actor.optim.lr`	Learning rate for the actor (policy)
`actor_rollout_ref.actor.ppo_mini_batch_size`	Mini-batch size for PPO gradient updates
`actor_rollout_ref.rollout.tensor_model_parallel_size`	Tensor parallelism degree for the vLLM rollout engine
`actor_rollout_ref.rollout.gpu_memory_utilization`	Fraction of GPU memory reserved for the vLLM KV cache
`critic.model.path`	Hugging Face model name or local path for the critic (value model)
`algorithm.kl_ctrl.kl_coef`	KL penalty coefficient — keeps the policy from drifting too far from the reference
`trainer.total_epochs`	Number of complete passes over the training dataset

Expected console output:You will see per-step logs like the following:

step:0 - timing/gen:21.470 - timing/ref:4.360 - timing/values:5.800 \
  - actor/reward_kl_penalty:0.000 - critic/vf_loss:14.947 \
  - actor/pg_loss:-0.005 - critic/score/mean:0.004 \
  - response_length/mean:239.133 - prompt_length/mean:104.883

The key validation metric val/test_score/openai/gsm8k is computed every trainer.test_freq steps.If you run out of GPU memory (HBM < 32 GB), reduce the micro-batch sizes:

actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
critic.ppo_micro_batch_size_per_gpu=1

Saving and merging checkpointsCheckpoints are saved by default to checkpoints/${trainer.project_name}/${trainer.experiment_name}. To export a checkpoint to a standard Hugging Face format:

python3 -m verl.model_merger merge \
    --backend fsdp \
    --local_dir checkpoints/${trainer.project_name}/${trainer.experiment_name}/global_step_1/actor \
    --target_dir checkpoints/${trainer.project_name}/${trainer.experiment_name}/global_step_1/actor/huggingface

Monitor training with Weights & Biases

By default, verl logs metrics to the console only. To enable Weights & Biases experiment tracking, add the following overrides to your training command:

trainer.logger='["console","wandb"]' \
trainer.project_name=my_verl_project \
trainer.experiment_name=gsm8k_ppo_qwen2.5_0.5b

Key metrics to watch:

Metric	Description
`critic/score/mean`	Mean reward assigned by the rule-based reward function
`response_length/mean`	Average number of tokens in model responses
`actor/pg_loss`	Policy gradient loss — should decrease as training progresses
`critic/vf_loss`	Value function loss — high initially, should stabilize
`val/test_score/openai/gsm8k`	Held-out test accuracy (computed every `test_freq` steps)

See the Algorithm Baselines page for reference training curves and expected final accuracy values.

Running GRPO Instead of PPO

GRPO (Group Relative Policy Optimization) is a simpler RL algorithm that does not require a separate critic network. It estimates advantages by comparing rewards within a group of responses sampled for the same prompt. This reduces GPU memory usage and simplifies the config. Switch from PPO to GRPO by setting algorithm.adv_estimator=grpo and dropping all critic.* overrides:

PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    data.train_batch_size=256 \
    data.max_prompt_length=512 \
    data.max_response_length=512 \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=64 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
    algorithm.adv_estimator=grpo \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.logger=console \
    trainer.n_gpus_per_node=1 \
    trainer.nnodes=1 \
    trainer.total_epochs=15 2>&1 | tee verl_grpo.log

The key difference is algorithm.adv_estimator=grpo, which switches the advantage estimator from GAE to GRPO’s group-relative baseline. Because there is no value network, all critic.* parameters are removed.

Next Steps

Scale to multiple GPUs or nodes: see Multi-Node Training
Explore more training scripts in examples/ppo_trainer/
Understand every config parameter: see the Config Reference page
Try a full SFT + RL pipeline: see the Complete GSM8K Example

Get Started

Core Concepts

Algorithms

Workers & Engines

Advanced Usage

Configuration & Reference

Quickstart: PPO Training on GSM8K with verl

About GSM8K

PPO Training Walkthrough

Running GRPO Instead of PPO

Next Steps

Build docs developers (and LLMs) love

Get Started

Core Concepts

Algorithms

Workers & Engines

Advanced Usage

Configuration & Reference

Documentation Index

​About GSM8K

​PPO Training Walkthrough

​Running GRPO Instead of PPO

​Next Steps

Build docs developers (and LLMs) love

About GSM8K

PPO Training Walkthrough

Running GRPO Instead of PPO

Next Steps