This page collects the most common questions from verl users, covering installation, distributed setup, training stability, memory management, algorithm selection, and reward function implementation. If your question is not listed here, check the GitHub Discussions or open an issue.

Installation & Setup

verl requires CUDA 12.1 or later (CUDA 12.8 is recommended for the best performance with modern inference backends) and Python 3.10 or later. For the full dependency matrix, including PyTorch, vLLM, SGLang, and Ray versions, refer to the Docker images published on DockerHub as verlai/verl.
Use the official verlai/verl images from DockerHub. Two main variants are published:
  • vLLM variant — includes vLLM as the rollout backend. Recommended for most workloads.
  • SGLang variant — includes SGLang as the rollout backend. Required for multi-turn tool-use rollouts (rollout.multi_turn.enable=True).
Choose based on your rollout backend. Example pull command:
# vLLM variant (check DockerHub for the latest tag)
docker pull verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3

# Convert to Apptainer/Singularity for Slurm clusters
apptainer pull /your/dest/dir/verl.sif \
    docker://verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3
verl can also be installed from source instead of using a Docker image:
git clone https://github.com/verl-project/verl.git
cd verl
pip install -e .
You will need to install the inference backend (vLLM or SGLang) and Ray separately. Match versions carefully — check requirements.txt or the Dockerfile in the repo for the tested combination.
verl runs on top of Ray. The recommended approach for Slurm is:
  1. Convert the verl Docker image to an Apptainer/Singularity image (see above).
  2. Start a Ray cluster using Slurm following Ray’s official Slurm guide.
  3. Modify examples/tutorial/slurm/ray_on_slurm.slurm with your cluster’s resource specifications.
  4. Submit with sbatch.
Common Slurm issue: If you see "Unable to register worker with raylet", Slurm’s CPU affinity settings may be restricting Ray’s worker processes from seeing the raylet. Fix this by setting:
ray_kwargs.ray_init.num_cpus=<number_allowed_by_your_cluster>

Distributed Training

Start a Ray cluster on your nodes, then set the trainer.nnodes and trainer.n_gpus_per_node config fields to match your allocation.
# On the head node
ray start --head --port=6379

# On each worker node
ray start --address=<head_node_ip>:6379

# Launch training
python -m verl.trainer.main_ppo \
    trainer.nnodes=4 \
    trainer.n_gpus_per_node=8 \
    ...
All nodes must be able to access the model weights and training data (via shared filesystem, NFS, or HDFS). Follow the Ray documentation for cluster startup details.
Set ray_kwargs.timeline_json_file to a path where the timeline JSON should be written. The file is generated at the end of the training job:
python -m verl.trainer.main_ppo \
    +ray_kwargs.timeline_json_file=/tmp/ray_timeline.json \
    ...
Load the output file in Perfetto UI or chrome://tracing to visualize the execution timeline across all Ray tasks and workers.
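Before opening a trace viewer, a short script can show which tasks dominate the timeline. The sketch below assumes the file is a JSON list of Chrome trace events, where complete events have "ph" set to "X" and carry a "dur" field in microseconds. That layout is an assumption about the output, and the helper names are hypothetical, not part of verl or Ray.

```python
import json
from collections import defaultdict


def load_timeline(path):
    """Load a timeline file that is assumed to hold a JSON list of trace events."""
    with open(path) as f:
        return json.load(f)


def summarize_timeline(events):
    """Sum durations (microseconds) per event name, longest total first."""
    totals = defaultdict(float)
    for event in events:
        # Only complete events ("ph" == "X") carry a duration.
        if event.get("ph") == "X" and "dur" in event:
            totals[event.get("name", "<unnamed>")] += event["dur"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Calling summarize_timeline(load_timeline("/tmp/ray_timeline.json")) returns (name, total duration) pairs, which is usually enough to tell whether rollout, reward computation, or the actor update is the slowest stage.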

Training Stability

Try the following diagnostics in order:
  1. KL coefficient: if kl_coef is too high, the policy barely moves; too low, and the policy diverges. Start with kl_coef=0.001 for in-reward KL and adjust by monitoring actor/kl in your experiment tracker.
  2. Reward function output range: ensure your reward function returns values in a consistent range (e.g., [0, 1] or [-1, 1]). Reward spikes (very high or very low outliers) destabilize training.
  3. Learning rate: actor learning rates for RL fine-tuning are typically much lower than SFT — try 1e-7 to 1e-6. A learning rate that is too high causes the policy to oscillate.
  4. Sample diversity: for GRPO and RLOO, increase actor_rollout_ref.rollout.n (number of responses per prompt) to improve advantage estimation quality. A minimum of n=4 to n=8 is common for math tasks.
  5. Advantage normalization: ensure your advantage estimator is normalizing correctly. For GRPO, algorithm.norm_adv_by_std_in_grpo=True (the default) helps stabilize updates.
  6. Reward function correctness: verify on a small batch that your reward function returns non-zero rewards for at least some responses. If all rewards are identical, the policy has no learning signal.
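Items 2 and 6 above can be checked offline on a small batch before launching a full run. The following is a minimal sketch, not part of verl. The helper name check_rewards and the default range [0, 1] are assumptions chosen for illustration.

```python
import math


def check_rewards(rewards, low=0.0, high=1.0):
    """Return a list of human-readable problems found in a batch of rewards."""
    if not rewards:
        return ["empty reward batch"]
    problems = []
    # Non-finite values (inf, NaN) are reported separately from range violations.
    if any(not math.isfinite(r) for r in rewards):
        problems.append("non-finite reward (inf or NaN)")
    finite = [r for r in rewards if math.isfinite(r)]
    if any(r < low or r > high for r in finite):
        problems.append(f"reward outside expected range [{low}, {high}]")
    # Identical rewards give every response the same advantage: no signal.
    if len(set(rewards)) == 1:
        problems.append("all rewards identical: no learning signal")
    return problems
```

An empty list means the batch passed all three checks. Run it on the rewards your compute_score function returns for a few dozen sampled responses.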
NaN loss is usually caused by gradient overflow or reward spikes. Try:
  1. Check for reward outliers: log reward/max and reward/min. A sudden spike in max reward (e.g., from a reward function returning inf) will cause NaN loss.
  2. Enable gradient clipping: ensure actor_rollout_ref.actor.grad_clip=1.0 is set (the default). If grad_clip was disabled or set too high, large gradients can overflow FP16/BF16 precision.
  3. Critic precision: if using PPO with a mixed-precision setup, try using FP32 for the critic value function to avoid critic value overflow.
  4. Check for precision mismatch: enable actor_rollout_ref.rollout.calculate_log_probs=True and monitor training/rollout_probs_diff_mean. Values above 0.01 indicate a significant mismatch between rollout and training log-probabilities that can destabilize the policy gradient estimate.
A continuously increasing actor/grad_norm is not normal and usually indicates a precision mismatch between the rollout engine and the training engine. To diagnose, enable rollout log-probability logging:
actor_rollout_ref.rollout.calculate_log_probs=True
This adds the training/rollout_probs_diff_mean metric. Normal values are below 0.005. Values above 0.01 confirm a precision issue.
Known cause: this issue is known to occur with vLLM on non-Hopper GPUs (A100, L20, B200) when using long contexts (e.g., multi-turn reasoning models), due to a bug in Flash Attention’s KV-split LSE computation.
Workaround until a fixed vLLM release is available:
+actor_rollout_ref.rollout.engine_kwargs.vllm.disable_cascade_attn=True
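To make the thresholds concrete, the sketch below computes a mismatch value from two sets of per-token log-probabilities and classifies it. It assumes the metric is the mean absolute difference between the probabilities the rollout engine and the training engine assign to the same sampled tokens. That definition is an assumption, not confirmed by this page, and both function names are hypothetical.

```python
import math


def probs_diff_mean(rollout_log_probs, train_log_probs):
    """Mean absolute difference between per-token probabilities (assumed definition)."""
    diffs = [
        abs(math.exp(r) - math.exp(t))
        for r, t in zip(rollout_log_probs, train_log_probs)
    ]
    return sum(diffs) / len(diffs)


def classify_mismatch(diff_mean):
    """Apply the thresholds quoted in this FAQ: < 0.005 normal, > 0.01 mismatch."""
    if diff_mean < 0.005:
        return "normal"
    if diff_mean > 0.01:
        return "mismatch"
    return "borderline"
```

Values between the two thresholds are labeled "borderline" here. The FAQ does not say how to treat that range, so watch the trend rather than a single step.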

Memory & OOM

Rollout OOM usually means the vLLM/SGLang KV cache and the actor model weights are competing for the same GPU memory. Try the following in order:
  1. Reduce gpu_memory_utilization: lower it from the default 0.5 to 0.4 or less to leave more headroom for actor parameters and optimizer states.
  2. Offload actor parameters: enable actor_rollout_ref.actor.fsdp_config.param_offload=True. This moves actor model weights to CPU during the rollout stage. There is a speed cost, but it frees a large amount of GPU memory.
  3. Reduce response length: lower data.max_response_length. KV cache consumption scales linearly with sequence length.
  4. Adjust tensor parallel size: a smaller actor_rollout_ref.rollout.tensor_model_parallel_size creates more vLLM replicas, each holding its own KV cache. If total GPU memory is the bottleneck, try a larger TP size instead, so that fewer replicas (and fewer KV caches) are created.
  5. Use LoRA: reduce the actor parameter memory footprint by training with LoRA (actor_rollout_ref.actor.lora_rank > 0).
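The linear scaling mentioned in item 3 follows from the KV cache layout: each token stores one key and one value vector per layer and per KV head. The sketch below estimates the cache size for a single sequence. The function name and the model shape in the usage note (28 layers, 4 KV heads, head dimension 128, 2 bytes per element) are illustrative assumptions, not taken from verl.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV cache size in bytes for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 corresponds to FP16/BF16 cache entries.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len
```

With the assumed shape, kv_cache_bytes(4096, 28, 4, 128) is 234,881,024 bytes (224 MiB) per sequence, and halving the sequence length halves that figure. Multiply by the number of concurrent sequences to get a rough total.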
FSDP (FSDP1):
actor_rollout_ref:
  actor:
    fsdp_config:
      param_offload: True       # offload model parameters
      optimizer_offload: True   # offload optimizer states (Adam moments)
FSDP2 offers the same capability with better composability:
actor_rollout_ref:
  actor:
    strategy: fsdp2
    fsdp_config:
      offload_policy: True
FSDP2 is compatible with gradient accumulation and generally recommended for new workloads (7% lower memory, 1.5% higher throughput vs FSDP1 per PyTorch TorchTitan benchmarks).
Reference model (recommended for 7B+ models):
actor_rollout_ref:
  ref:
    fsdp_config:
      param_offload: True
The three batch-size fields operate at different levels of the training hierarchy:
  • data.train_batch_size — the algorithmic batch size: the number of prompts sampled from the dataset per training iteration. This determines the diversity of experience used per update.
  • actor_rollout_ref.actor.ppo_mini_batch_size — the PPO update is performed by splitting the rollout batch into mini-batches of this size and taking multiple gradient steps. This is a global count across all GPUs.
  • actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu — the per-GPU micro-batch for a single forward/backward pass (gradient accumulation). This is a local, performance-tuning parameter.
A typical relationship: train_batch_size >= ppo_mini_batch_size >> ppo_micro_batch_size_per_gpu × num_gpus.
See the configuration diagram for a visual illustration.
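The arithmetic behind these fields can be sketched as follows. This is a simplified model: it ignores rollout.n (multiple responses per prompt), multiple PPO epochs, and any internal normalization verl applies, so treat the numbers as a rough guide. The function name batch_plan is hypothetical.

```python
def batch_plan(train_batch_size, ppo_mini_batch_size,
               ppo_micro_batch_size_per_gpu, num_gpus):
    """Derive update counts from the three batch-size fields (simplified)."""
    if train_batch_size % ppo_mini_batch_size != 0:
        raise ValueError("train_batch_size must be divisible by ppo_mini_batch_size")
    if ppo_mini_batch_size % num_gpus != 0:
        raise ValueError("ppo_mini_batch_size must be divisible by num_gpus")
    per_gpu_mini = ppo_mini_batch_size // num_gpus
    if per_gpu_mini % ppo_micro_batch_size_per_gpu != 0:
        raise ValueError("per-GPU mini-batch must be divisible by the micro-batch size")
    return {
        # One optimizer step per mini-batch.
        "optimizer_steps_per_iteration": train_batch_size // ppo_mini_batch_size,
        # Share of each global mini-batch that a single GPU processes.
        "samples_per_gpu_per_step": per_gpu_mini,
        # Forward/backward passes accumulated before each optimizer step.
        "grad_accum_steps": per_gpu_mini // ppo_micro_batch_size_per_gpu,
    }
```

For example, batch_plan(1024, 256, 4, 8) gives 4 optimizer steps per iteration, 32 samples per GPU per step, and 8 gradient accumulation passes. Lowering the micro-batch size reduces peak memory without changing the other two numbers.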

Algorithm Selection

PPO and GRPO are both supported and production-tested in verl. The choice depends on your resources and task:
PPO (algorithm.adv_estimator: gae):
  • Requires a critic model (doubles GPU memory and training time)
  • Generally more stable and sample-efficient
  • Works well for tasks with complex reward shaping
  • Better suited to tasks where value function estimation is meaningful
GRPO (algorithm.adv_estimator: grpo):
  • No critic model needed — significantly lower memory and compute
  • Requires multiple samples per prompt (rollout.n >= 4–8) for reliable advantage estimation
  • Works well for math, reasoning, and code tasks with verifiable rewards
  • Less stable than PPO on some tasks but faster per iteration
For a quick start on math/reasoning tasks, GRPO is the typical choice. For RLHF alignment with a reward model, PPO’s stability advantages often outweigh the extra compute cost.
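To illustrate why GRPO needs several samples per prompt, here is a minimal sketch of a group-relative advantage: each response's reward is compared with the mean of its own group, optionally divided by the group's standard deviation. It is a sketch of the idea, not verl's implementation. The function name, the use of the population standard deviation, and the eps value are assumptions.

```python
import statistics


def group_relative_advantages(rewards, norm_by_std=True, eps=1e-6):
    """Advantages for the responses sampled from a single prompt."""
    mean = statistics.fmean(rewards)
    centered = [r - mean for r in rewards]
    if not norm_by_std:
        return centered
    # Population std is used for simplicity; eps avoids division by zero.
    std = statistics.pstdev(rewards)
    return [c / (std + eps) for c in centered]
```

With rewards [1.0, 0.0, 0.0, 1.0] the correct responses get positive advantages and the incorrect ones negative. With identical rewards every advantage is zero, which is why a group of size one (or a collapsed group) contributes no gradient.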
DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) is an algorithm variant available in verl that extends GRPO with two key modifications:
  1. Decoupled clip ratios: separate clip thresholds are applied to positive-advantage samples and negative-advantage samples. This prevents the policy from being over-constrained when the sign of advantage varies across a mini-batch.
  2. Dynamic sampling: samples that would produce zero or near-zero gradients (due to reward collapse — all responses for a prompt receiving the same reward) are filtered out or upsampled. This avoids wasted compute and maintains a meaningful training signal.
DAPO is particularly useful when training on tasks where many prompts have reward collapse (e.g., prompts where the model answers correctly or incorrectly with very high consistency). Standard GRPO would plateau in these cases while DAPO can continue to improve.
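The two modifications can be sketched in a few lines. This is an illustration of the idea rather than verl's implementation: the function names are hypothetical, the clip values 0.2 and 0.28 are illustrative defaults, and only the filtering variant of dynamic sampling is shown.

```python
def clipped_objective(ratio, advantage, clip_low=0.2, clip_high=0.28):
    """Per-token surrogate objective with separate lower and upper clip bounds.

    The upper bound limits updates on positive-advantage samples and the lower
    bound limits updates on negative-advantage samples.
    """
    clipped_ratio = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    return min(ratio * advantage, clipped_ratio * advantage)


def filter_collapsed_groups(groups):
    """Drop prompts whose sampled responses all received the same reward."""
    return {
        prompt: rewards
        for prompt, rewards in groups.items()
        if len(set(rewards)) > 1
    }
```

For instance, filter_collapsed_groups({"p1": [1, 1, 1], "p2": [0, 1, 0]}) keeps only "p2", since every response to "p1" would receive a zero group-relative advantage.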
verl supports the following estimators via algorithm.adv_estimator:
Value | Algorithm | Requires Critic
gae | Generalized Advantage Estimation (standard PPO) | ✅ Yes
grpo | Group Relative Policy Optimization | ❌ No
grpo_vectorized | Vectorized GRPO (faster) | ❌ No
reinforce_plus_plus | REINFORCE++ with improved baseline | ❌ No
reinforce_plus_plus_baseline | REINFORCE++ with explicit baseline | ❌ No
rloo | REINFORCE Leave-One-Out | ❌ No
rloo_vectorized | Vectorized RLOO (faster) | ❌ No

Reward Functions

Implement a Python function with the signature compute_score and register it in the config:
# my_reward.py
def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """
    Args:
        data_source (str): dataset name/identifier
        solution_str (str): the model's generated response
        ground_truth (str): the expected answer or reference
        extra_info (dict | None): additional metadata from the dataset row

    Returns:
        float: reward score
    """
    # Example: exact match reward
    if solution_str.strip() == ground_truth.strip():
        return 1.0
    return 0.0
Point the config at your file:
reward:
  custom_reward_function:
    path: /path/to/my_reward.py
    name: compute_score  # default name; can be omitted
The data_source, ground_truth, and extra_info fields are read from the reward_model column of your parquet dataset. See the GSM8K data preprocessing scripts in examples/data_preprocess/ for a complete end-to-end example.
To score responses with a model-based reward model, configure the reward.reward_model section to deploy it alongside the rollout engine:
reward:
  reward_model:
    enable: True
    model_path: ~/models/my-reward-model
    rollout:
      name: vllm
      tensor_model_parallel_size: 2
      gpu_memory_utilization: 0.5
The reward model must be compatible with AutoModelForSequenceClassification (discriminative RM). For generative reward models (LLM-as-judge), implement a custom reward function that calls the model via its API.
If the reward model uses a different chat template than the policy, set data.return_raw_input_ids=True so prompts can be re-encoded with the RM’s template.
To parallelize reward computation across multiple workers, set reward.reward_manager.name: prime:
reward:
  num_workers: 8        # number of parallel reward manager processes
  reward_manager:
    name: prime          # parallel verification; use naive for sequential
prime requires that all verification functions in your reward pipeline are multiprocessing-safe (no shared mutable state, no GPU calls). Use naive (the default) if your reward function uses GPU models or has other restrictions.

Triton and Compilation Errors

This error occurs when Triton cannot compile its CUDA driver utilities, typically due to a missing or incompatible CUDA toolkit in the environment.
The quickest fix is to disable torch compile for fused kernels:
actor_rollout_ref.actor.use_torch_compile=False
This disables JIT compilation of fused kernels and falls back to standard PyTorch operations. There is a moderate throughput cost (typically 10–20%), but training will run correctly.
For a permanent fix, ensure the CUDA toolkit (headers, nvcc) matches the CUDA runtime version in your environment, and that CUDA_HOME or CUDA_PATH is set correctly.
This is a known compatibility issue with tensordict on linux-arm64: compatible wheel versions are not available for that platform.
Solution 1: install from source:
pip uninstall tensordict
git clone https://github.com/pytorch/tensordict.git
cd tensordict
git checkout v0.6.2
pip install -v -e .
Solution 2: patch the offending code locally by replacing key in tensordict_var with key in tensordict_var.keys() at the line indicated in the stack trace.

Checkpoints

Use the verl.model_merger tool included in the repo:
# FSDP checkpoint
python -m verl.model_merger merge \
    --backend fsdp \
    --local_dir checkpoints/my_project/my_run/global_steps_500/actor \
    --target_dir /path/to/output_hf_model

# Megatron checkpoint
python -m verl.model_merger merge \
    --backend megatron \
    --tie-word-embedding \
    --local_dir checkpoints/my_project/my_run/global_steps_500/actor \
    --target_dir /path/to/output_hf_model
The target_dir will contain a standard HuggingFace model loadable with AutoModelForCausalLM.from_pretrained(). For large models that do not fit in GPU memory during merging, add --use_cpu_initialization.
See the Checkpointing reference for full details, including distributed merging across multiple nodes.
verl writes checkpoints atomically by writing all shard files first, then updating latest_checkpointed_iteration.txt as the final step. If a crash occurs before this file is updated, the latest recorded checkpoint is the previous one, and the partially written step is ignored on resume.
For Megatron checkpoints, the ckpt_contents.json manifest serves the same role: its presence indicates a fully complete checkpoint. An incomplete checkpoint directory (no manifest) is automatically skipped.
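The write-shards-first, update-tracker-last pattern can be sketched as follows. This is not verl's code: the step directory naming (step_N), the function names, and the use of os.replace for the tracker update are assumptions made for illustration. Only the tracker file name comes from the text above.

```python
import os
import tempfile

TRACKER = "latest_checkpointed_iteration.txt"


def save_checkpoint(root, step, shards):
    """Write all shard files, then record the step in the tracker file last."""
    step_dir = os.path.join(root, f"step_{step}")  # hypothetical directory name
    os.makedirs(step_dir, exist_ok=True)
    for name, payload in shards.items():
        with open(os.path.join(step_dir, name), "wb") as f:
            f.write(payload)
    # Write to a temp file and rename it so the tracker is never half-written.
    tmp_path = os.path.join(root, TRACKER + ".tmp")
    with open(tmp_path, "w") as f:
        f.write(str(step))
    os.replace(tmp_path, os.path.join(root, TRACKER))


def latest_complete_step(root):
    """Return the last fully written step, or None if no checkpoint finished."""
    path = os.path.join(root, TRACKER)
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return int(f.read().strip())
```

A crash during the shard loop leaves the tracker pointing at the previous step, so a resume that relies on latest_complete_step never sees the partial directory.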
