This page collects the most common questions from verl users, covering installation, distributed setup, training stability, memory management, algorithm selection, and reward function implementation. If your question is not listed here, check the GitHub Discussions or open an issue.
Installation & Setup
What CUDA and Python versions are required?
`verlai/verl`.
Which Docker image should I use?
`verlai/verl` images from DockerHub. Two main variants are published:
- vLLM variant: includes vLLM as the rollout backend. Recommended for most workloads.
- SGLang variant: includes SGLang as the rollout backend. Required for multi-turn tool-use rollouts (`rollout.multi_turn.enable=True`).
Can I run verl without Docker?
`requirements.txt` or the Dockerfile in the repo for the tested combination.
How do I set up verl on a Slurm cluster?
- Convert the verl Docker image to an Apptainer/Singularity image (see above).
- Start a Ray cluster using Slurm following Ray’s official Slurm guide.
- Modify `examples/tutorial/slurm/ray_on_slurm.slurm` with your cluster’s resource specifications.
- Submit with `sbatch`.
"Unable to register worker with raylet", Slurm’s CPU affinity settings may be restricting Ray’s worker processes from seeing the raylet. Fix this by setting:
Distributed Training
How do I run multi-node post-training with Ray?
`trainer.nnodes` and `trainer.n_gpus_per_node` config fields to match your allocation.
How do I generate a Ray timeline for performance analysis?
`ray_kwargs.timeline_json_file` to a path where the timeline JSON should be written. The file is generated at the end of the training job:
`chrome://tracing` to visualize the execution timeline across all Ray tasks and workers.
Training Stability
My reward is not increasing / training is unstable. What should I check?
- KL coefficient: if `kl_coef` is too high, the policy barely moves; if it is too low, the policy diverges. Start with `kl_coef=0.001` for in-reward KL and adjust by monitoring `actor/kl` in your experiment tracker.
- Reward function output range: ensure your reward function returns values in a consistent range (e.g., `[0, 1]` or `[-1, 1]`). Reward spikes (very high or very low outliers) destabilize training.
- Learning rate: actor learning rates for RL fine-tuning are typically much lower than for SFT; try `1e-7` to `1e-6`. A learning rate that is too high causes the policy to oscillate.
- Sample diversity: for GRPO and RLOO, increase `actor_rollout_ref.rollout.n` (number of responses per prompt) to improve advantage estimation quality. A minimum of `n=4` to `n=8` is common for math tasks.
- Advantage normalization: ensure your advantage estimator is normalizing correctly. For GRPO, `algorithm.norm_adv_by_std_in_grpo=True` (the default) helps stabilize updates.
- Reward function correctness: verify on a small batch that your reward function returns non-zero rewards for at least some responses. If all rewards are identical, the policy has no learning signal.
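The last two checks can be illustrated with a small sketch of group-relative advantages for the responses to a single prompt: subtract the group mean reward and, optionally, divide by the group standard deviation. This is a simplified stand-in rather than verl's implementation; the function name `group_relative_advantages`, the `eps` value, and the use of the population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, norm_by_std=True, eps=1e-6):
    """Return one advantage per response for a single prompt's group of rewards."""
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not norm_by_std:
        return centered
    scale = pstdev(rewards) + eps  # eps avoids division by zero
    return [c / scale for c in centered]


# Mixed rewards give a usable signal: positive for correct, negative for wrong.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Identical rewards give all-zero advantages, so the policy gets no gradient.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

If every group in a batch looks like `flat`, the reward function or the task difficulty is the first thing to inspect.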
Loss suddenly becomes NaN. What happened?
- Check for reward outliers: log `reward/max` and `reward/min`. A sudden spike in max reward (e.g., from a reward function returning `inf`) will cause NaN loss.
- Enable gradient clipping: ensure `actor_rollout_ref.actor.grad_clip=1.0` is set (the default). If `grad_clip` was disabled or set too high, large gradients can overflow FP16/BF16 precision.
- Critic precision: if using PPO with a mixed-precision setup, try using FP32 for the critic value function to avoid critic value overflow.
- Check for precision mismatch: enable `actor_rollout_ref.rollout.calculate_log_probs=True` and monitor `training/rollout_probs_diff_mean`. Values above `0.01` indicate a significant mismatch between rollout and training log-probabilities that can destabilize the policy gradient estimate.
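Reward outliers can also be caught before they reach the trainer. The sketch below is a hypothetical helper (`sanitize_rewards` is not part of verl) that replaces non-finite scores and clamps the rest to a fixed range, assuming rewards are plain Python floats and that `[0, 1]` is the intended range.

```python
import math


def sanitize_rewards(rewards, low=0.0, high=1.0, fallback=0.0):
    """Clamp rewards to [low, high] and replace NaN/inf with `fallback`.

    Returns the cleaned list and the number of non-finite values seen.
    """
    cleaned = []
    bad = 0
    for r in rewards:
        if not math.isfinite(r):
            bad += 1
            cleaned.append(fallback)
        else:
            cleaned.append(min(max(r, low), high))
    return cleaned, bad


cleaned, n_bad = sanitize_rewards([0.2, float("inf"), -3.0, float("nan"), 1.5])
```

Logging the returned count alongside the cleaned rewards makes it easier to find the faulty scoring path instead of silently masking it.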
actor/grad_norm keeps increasing throughout training. Is this normal?
`actor/grad_norm` is not normal and usually indicates a precision mismatch between the rollout engine and the training engine. To diagnose, enable rollout log-probability logging:
`training/rollout_probs_diff_mean` metric. Normal values are below `0.005`. If you observe values above `0.01`, this confirms a precision issue.
Known cause: This issue is known to occur with vLLM on non-Hopper GPUs (A100, L20, B200) when using long contexts (e.g., multi-turn reasoning models), due to a bug in Flash Attention’s KV-split LSE computation.
Workaround until a fixed vLLM release is available:
Memory & OOM
Getting CUDA OOM during rollout generation. How do I fix it?
- Reduce `gpu_memory_utilization`: lower it from the default `0.5` to `0.4` or less to leave more headroom for actor parameters and optimizer states.
- Offload actor parameters: enable `actor_rollout_ref.actor.fsdp_config.param_offload=True`. This moves actor model weights to CPU during the rollout stage. There is a speed cost, but it frees a large amount of GPU memory.
- Reduce response length: lower `data.max_response_length`. KV cache consumption scales linearly with sequence length.
- Tune tensor parallel size: a smaller `actor_rollout_ref.rollout.tensor_model_parallel_size` creates more vLLM replicas, each of which holds its own KV cache. If total GPU memory is the bottleneck, try a larger TP size instead (fewer KV caches).
- Use LoRA: reduce the actor parameter memory footprint by training with LoRA (`actor_rollout_ref.actor.lora_rank > 0`).
How do I enable CPU offloading for actor and optimizer states?
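A minimal sketch of the overrides involved, assuming the FSDP backend and `key=value` command-line overrides. The `param_offload` key is the one named in the rollout OOM answer; the `optimizer_offload` key is an assumption modeled on it, so check it against the config of your verl version. The helper `offload_overrides` is hypothetical and only builds the strings.

```python
def offload_overrides(param_offload=True, optimizer_offload=True):
    """Build key=value overrides for FSDP CPU offloading of the actor."""
    prefix = "actor_rollout_ref.actor.fsdp_config"
    return [
        f"{prefix}.param_offload={param_offload}",
        # Assumed key name; verify it exists in your verl version's config.
        f"{prefix}.optimizer_offload={optimizer_offload}",
    ]


overrides = offload_overrides()
print(" ".join(overrides))
```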
What is the relationship between train_batch_size, ppo_mini_batch_size, and ppo_micro_batch_size_per_gpu?
- `data.train_batch_size`: the algorithmic batch size, i.e. the number of prompts sampled from the dataset per training iteration. This determines the diversity of experience used per update.
- `actor_rollout_ref.actor.ppo_mini_batch_size`: the PPO update is performed by splitting the rollout batch into mini-batches of this size and taking multiple gradient steps. This is a global count across all GPUs.
- `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu`: the per-GPU micro-batch for a single forward/backward pass (gradient accumulation). This is a local, performance-tuning parameter.
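The arithmetic implied by these three settings can be sketched as follows. The helper `batch_plan` is hypothetical and makes simplifying assumptions: all sizes are counted in the same unit, the mini-batch is split evenly across GPUs, and one pass is made over the rollout batch. verl's internal normalization may differ, for example when several responses are sampled per prompt.

```python
def batch_plan(train_batch_size, ppo_mini_batch_size,
               ppo_micro_batch_size_per_gpu, num_gpus):
    """Derive update counts from the three batch-size settings."""
    if train_batch_size % ppo_mini_batch_size != 0:
        raise ValueError("train_batch_size should be a multiple of ppo_mini_batch_size")
    mini_per_gpu, rem = divmod(ppo_mini_batch_size, num_gpus)
    if rem or mini_per_gpu % ppo_micro_batch_size_per_gpu != 0:
        raise ValueError("mini-batch must split evenly into per-GPU micro-batches")
    return {
        # Gradient steps taken in one pass over the rollout batch.
        "optimizer_steps_per_pass": train_batch_size // ppo_mini_batch_size,
        # Share of each mini-batch handled by one GPU.
        "mini_batch_per_gpu": mini_per_gpu,
        # Forward/backward passes accumulated before each optimizer step.
        "grad_accum_steps": mini_per_gpu // ppo_micro_batch_size_per_gpu,
    }


plan = batch_plan(train_batch_size=1024, ppo_mini_batch_size=256,
                  ppo_micro_batch_size_per_gpu=4, num_gpus=8)
```

Under these assumptions, changing `ppo_micro_batch_size_per_gpu` only changes `grad_accum_steps`, which is why it can be tuned for throughput without altering the algorithm.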
`train_batch_size >= ppo_mini_batch_size >> ppo_micro_batch_size_per_gpu × num_gpus`.
See the configuration diagram for a visual illustration.
Algorithm Selection
Should I use PPO or GRPO for my task?
PPO (`algorithm.adv_estimator: gae`):
- Requires a critic model (doubles GPU memory and training time)
- Generally more stable and sample-efficient
- Works well for tasks with complex reward shaping
- Better suited to tasks where value function estimation is meaningful
GRPO (`algorithm.adv_estimator: grpo`):
- No critic model needed, so memory and compute are significantly lower
- Requires multiple samples per prompt (`rollout.n` of at least 4 to 8) for reliable advantage estimation
- Works well for math, reasoning, and code tasks with verifiable rewards
- Less stable than PPO on some tasks but faster per iteration
What is the difference between DAPO and GRPO?
- Decoupled clip ratios: separate clip thresholds are applied to positive-advantage samples and negative-advantage samples. This prevents the policy from being over-constrained when the sign of advantage varies across a mini-batch.
- Dynamic sampling: samples that would produce zero or near-zero gradients (due to reward collapse — all responses for a prompt receiving the same reward) are filtered out or upsampled. This avoids wasted compute and maintains a meaningful training signal.
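The filtering half of dynamic sampling can be sketched in a few lines. This is an illustration of the idea only, not DAPO's or verl's implementation; the function name `filter_informative_groups` and the dictionary layout are assumptions made for the example.

```python
def filter_informative_groups(reward_groups):
    """Keep only prompts whose sampled responses did not all get the same reward.

    `reward_groups` maps a prompt id to the list of rewards for its responses.
    """
    return {
        prompt_id: rewards
        for prompt_id, rewards in reward_groups.items()
        if len(set(rewards)) > 1
    }


groups = {
    "p1": [1.0, 1.0, 1.0, 1.0],  # all correct: zero advantage everywhere
    "p2": [0.0, 1.0, 0.0, 0.0],  # mixed: carries a learning signal
    "p3": [0.0, 0.0, 0.0, 0.0],  # all wrong: zero advantage everywhere
}
kept = filter_informative_groups(groups)
```

Only `p2` survives here; a real pipeline would then sample additional prompts to refill the batch.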
What advantage estimators are available in verl?
`algorithm.adv_estimator`:
| Value | Algorithm | Requires Critic |
|---|---|---|
| `gae` | Generalized Advantage Estimation (standard PPO) | ✅ Yes |
| `grpo` | Group Relative Policy Optimization | ❌ No |
| `grpo_vectorized` | Vectorized GRPO (faster) | ❌ No |
| `reinforce_plus_plus` | REINFORCE++ with improved baseline | ❌ No |
| `reinforce_plus_plus_baseline` | REINFORCE++ with explicit baseline | ❌ No |
| `rloo` | REINFORCE Leave-One-Out | ❌ No |
| `rloo_vectorized` | Vectorized RLOO (faster) | ❌ No |
Reward Functions
How do I implement a reward function for my custom task?
`compute_score` and register it in the config:
The `data_source`, `ground_truth`, and `extra_info` fields are read from the `reward_model` column of your parquet dataset. See the GSM8K data preprocessing scripts in `examples/data_preprocess/` for a complete end-to-end example.
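A minimal sketch of such a function for a numeric-answer task follows. The parameter name `solution_str` and the exact signature are assumptions based on the field names mentioned above, and the last-number matching rule is only an example; adapt both to your task and verl version.

```python
import re


def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Return 1.0 if the last number in the response equals the ground truth, else 0.0."""
    # Drop thousands separators so "1,000" is read as one number.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", solution_str.replace(",", ""))
    if not numbers:
        return 0.0
    try:
        return 1.0 if float(numbers[-1]) == float(ground_truth) else 0.0
    except ValueError:
        # Non-numeric ground truth: treat as a mismatch rather than crash.
        return 0.0
```

Returning a bounded float for every input, including malformed ones, keeps the reward range consistent across the batch.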
Can I use a trained reward model instead of a rule-based function?
`reward.reward_model` section to deploy a model-based reward model alongside the rollout engine:
`AutoModelForSequenceClassification` (discriminative RM). For generative reward models (LLM-as-judge), implement a custom reward function that calls the model via the API.
If the reward model uses a different chat template than the policy, set `data.return_raw_input_ids=True` so prompts can be re-encoded with the RM’s template.
Can I parallelize reward computation for faster training?
`reward.reward_manager.name: prime` to enable parallel reward computation across multiple workers:
`prime` requires that all verification functions in your reward pipeline are multiprocessing-safe (no shared mutable state, no GPU calls). Use `naive` (the default) if your reward function uses GPU models or has other restrictions.
Triton and Compilation Errors
I'm getting a Triton compile_module_from_src error. How do I fix it?
`nvcc`) matches the CUDA runtime version in your environment, and that `CUDA_HOME` or `CUDA_PATH` is set correctly.
I'm getting NotImplementedError about TensorDict membership checks on ARM Linux.
`tensordict` on linux-arm64: compatible wheel versions are not available for that platform.
Solution 1: Install from source:
`key in tensordict_var` with `key in tensordict_var.keys()` at the indicated line in the stack trace.
Checkpoints
How do I convert a verl checkpoint to HuggingFace format?
`verl.model_merger` tool included in the repo:
`target_dir` will contain a standard HuggingFace model loadable with `AutoModelForCausalLM.from_pretrained()`. For large models that do not fit in GPU memory during merging, add `--use_cpu_initialization`.
See the Checkpointing reference for full details including distributed merging across multiple nodes.
Training crashed mid-step. Will my checkpoint be corrupted?
`latest_checkpointed_iteration.txt` as the final step. If a crash occurs before this file is updated, the latest recorded checkpoint is the previous one, and the partially written step is ignored on resume.
For Megatron checkpoints, the `ckpt_contents.json` manifest serves the same role: its presence indicates a fully complete checkpoint. An incomplete checkpoint directory (no manifest) is automatically skipped.
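To see which step a resume would pick up, the tracker file can be read directly. The sketch below is a hypothetical helper, not a verl API; it assumes the tracker file sits at the root of the checkpoint directory and contains a single integer step.

```python
from pathlib import Path


def latest_complete_step(checkpoint_root):
    """Return the last fully written step recorded in the tracker file, or None."""
    tracker = Path(checkpoint_root) / "latest_checkpointed_iteration.txt"
    if not tracker.is_file():
        return None
    text = tracker.read_text().strip()
    # Anything other than a plain integer is treated as "no usable checkpoint".
    return int(text) if text.isdigit() else None
```

A return value of `None` means no step has been recorded as complete, so training would start from scratch.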