Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/verl-project/verl/llms.txt

Use this file to discover all available pages before exploring further.

This guide walks you through your first reinforcement learning post-training run with verl. You will train Qwen2.5-0.5B-Instruct on the GSM8K elementary-school math dataset using Proximal Policy Optimization (PPO) with a rule-based reward function — no separate reward model required. The whole pipeline, from data preprocessing to a trained checkpoint, runs on a single GPU.
Prerequisites: verl and its dependencies must be installed (Docker image recommended — see the Installation guide). Your GPU must have at least 24 GB of HBM (e.g., an A10G, A100, H100, or equivalent).

About GSM8K

GSM8K is a dataset of 8,500 grade-school math word problems. The model is asked to produce a step-by-step solution ending with a numerical answer marked by four # symbols. verl extracts the final answer using regular-expression matching and assigns a reward of 1.0 for a correct answer and 0.0 otherwise. This makes it a clean testbed for RL training without any learned reward model. Example prompt:
Katy makes coffee using teaspoons of sugar and cups of water in the ratio of 7:13. If she used a total of 120 teaspoons of sugar and cups of water, calculate the number of teaspoonfuls of sugar she used.
Expected solution ending:
… she used 7/20 × 120 = 42 #### 42

PPO Training Walkthrough

1

Prepare the GSM8K dataset

verl reads training data from Parquet files. The preprocessing script downloads the dataset from Hugging Face and converts it to the required format, adding the fields needed to compute RL rewards:
python3 examples/data_preprocess/gsm8k.py --local_save_dir ~/data/gsm8k
This writes two files:
  • ~/data/gsm8k/train.parquet — 7,473 training problems
  • ~/data/gsm8k/test.parquet — 1,319 test problems
2

Download the model

The example uses Qwen/Qwen2.5-0.5B-Instruct, a compact but capable instruction-tuned model. Download it via the Transformers pipeline (this caches the weights to ~/.cache/huggingface):
python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2.5-0.5B-Instruct')"
Set VERL_USE_MODELSCOPE=True to download models from ModelScope instead of Hugging Face — useful if you have faster access to ModelScope from your region.
If you want to perform supervised fine-tuning (SFT) before RL, see the SFT & Other Algorithms guide and the SFT Trainer.
3

Run PPO training

verl uses Hydra for configuration. All settings are passed as command-line overrides. Launch training with the command below, adjusting data.train_files, data.val_files, actor_rollout_ref.model.path, and critic.model.path to match your local paths if they differ:
PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    data.train_batch_size=256 \
    data.max_prompt_length=512 \
    data.max_response_length=512 \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=64 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
    critic.optim.lr=1e-5 \
    critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
    critic.ppo_micro_batch_size_per_gpu=4 \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.logger=console \
    trainer.val_before_train=False \
    trainer.n_gpus_per_node=1 \
    trainer.nnodes=1 \
    trainer.save_freq=10 \
    trainer.test_freq=10 \
    trainer.total_epochs=15 2>&1 | tee verl_demo.log
Key configuration fields explained:
Config keyPurpose
data.train_files / data.val_filesPaths to the Parquet files produced in Step 1
data.train_batch_sizeTotal number of prompts per training step (across all GPUs)
data.max_prompt_length / data.max_response_lengthToken length caps for prompts and generated responses
actor_rollout_ref.model.pathHugging Face model name or local path for the actor and reference policy
actor_rollout_ref.actor.optim.lrLearning rate for the actor (policy)
actor_rollout_ref.actor.ppo_mini_batch_sizeMini-batch size for PPO gradient updates
actor_rollout_ref.rollout.tensor_model_parallel_sizeTensor parallelism degree for the vLLM rollout engine
actor_rollout_ref.rollout.gpu_memory_utilizationFraction of GPU memory reserved for the vLLM KV cache
critic.model.pathHugging Face model name or local path for the critic (value model)
algorithm.kl_ctrl.kl_coefKL penalty coefficient — keeps the policy from drifting too far from the reference
trainer.total_epochsNumber of complete passes over the training dataset
Expected console output:You will see per-step logs like the following:
step:0 - timing/gen:21.470 - timing/ref:4.360 - timing/values:5.800 \
  - actor/reward_kl_penalty:0.000 - critic/vf_loss:14.947 \
  - actor/pg_loss:-0.005 - critic/score/mean:0.004 \
  - response_length/mean:239.133 - prompt_length/mean:104.883
The key validation metric val/test_score/openai/gsm8k is computed every trainer.test_freq steps.If you run out of GPU memory (HBM < 32 GB), reduce the micro-batch sizes:
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
critic.ppo_micro_batch_size_per_gpu=1
Saving and merging checkpointsCheckpoints are saved by default to checkpoints/${trainer.project_name}/${trainer.experiment_name}. To export a checkpoint to a standard Hugging Face format:
python3 -m verl.model_merger merge \
    --backend fsdp \
    --local_dir checkpoints/${trainer.project_name}/${trainer.experiment_name}/global_step_1/actor \
    --target_dir checkpoints/${trainer.project_name}/${trainer.experiment_name}/global_step_1/actor/huggingface
4

Monitor training with Weights & Biases

By default, verl logs metrics to the console only. To enable Weights & Biases experiment tracking, add the following overrides to your training command:
trainer.logger='["console","wandb"]' \
trainer.project_name=my_verl_project \
trainer.experiment_name=gsm8k_ppo_qwen2.5_0.5b
Key metrics to watch:
MetricDescription
critic/score/meanMean reward assigned by the rule-based reward function
response_length/meanAverage number of tokens in model responses
actor/pg_lossPolicy gradient loss — should decrease as training progresses
critic/vf_lossValue function loss — high initially, should stabilize
val/test_score/openai/gsm8kHeld-out test accuracy (computed every test_freq steps)
See the Algorithm Baselines page for reference training curves and expected final accuracy values.

Running GRPO Instead of PPO

GRPO (Group Relative Policy Optimization) is a simpler RL algorithm that does not require a separate critic network. It estimates advantages by comparing rewards within a group of responses sampled for the same prompt. This reduces GPU memory usage and simplifies the config. Switch from PPO to GRPO by setting algorithm.adv_estimator=grpo and dropping all critic.* overrides:
PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    data.train_batch_size=256 \
    data.max_prompt_length=512 \
    data.max_response_length=512 \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=64 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
    algorithm.adv_estimator=grpo \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.logger=console \
    trainer.n_gpus_per_node=1 \
    trainer.nnodes=1 \
    trainer.total_epochs=15 2>&1 | tee verl_grpo.log
The key difference is algorithm.adv_estimator=grpo, which switches the advantage estimator from GAE to GRPO’s group-relative baseline. Because there is no value network, all critic.* parameters are removed.

Next Steps

  • Scale to multiple GPUs or nodes: see Multi-Node Training
  • Explore more training scripts in examples/ppo_trainer/
  • Understand every config parameter: see the Config Reference page
  • Try a full SFT + RL pipeline: see the Complete GSM8K Example

Build docs developers (and LLMs) love