RL training pipelines have multiple sequential stages — rollout generation, reference log-prob computation, critic forward pass, and actor update — so a bottleneck in any one stage slows the entire loop. This guide walks through the tuning levers available for each stage, from rollout engine configuration to kernel-level optimizations, and explains how to profile your training run to identify where time is actually being spent.
Recommended starting configuration for a new run: enable use_remove_padding=True, use_dynamic_bsz=True, and set actor_rollout_ref.actor.ppo_max_token_len_per_gpu to at least 2× (max_prompt_length + max_response_length). This combination gives consistent throughput improvements across most workloads without requiring model-specific tuning.
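The token budget rule above is simple arithmetic. As a minimal sketch (the helper name and the example lengths are illustrative, not part of verl), the lower bound can be computed like this:

```python
def min_ppo_max_token_len_per_gpu(
    max_prompt_length: int,
    max_response_length: int,
    factor: int = 2,
) -> int:
    """Lower bound from the recommendation: factor x (prompt + response)."""
    return factor * (max_prompt_length + max_response_length)


# Illustrative lengths only: 1024-token prompts, 4096-token responses.
print(min_ppo_max_token_len_per_gpu(1024, 4096))  # 10240
```

Any value at or above this bound satisfies the recommendation; larger budgets trade memory for fewer micro-batches.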

Rollout Generation Tuning

Rollout generation (the vLLM or SGLang inference step) is typically the longest stage in the RL loop. Before tuning, enable rollout statistics logging so you can see what the engine is actually doing:
actor_rollout_ref:
  rollout:
    disable_log_stats: false

GPU Memory Utilization

gpu_memory_utilization controls how much GPU memory the rollout engine allocates for its KV cache. Higher values mean more parallel decoding capacity:
actor_rollout_ref:
  rollout:
    gpu_memory_utilization: 0.6  # start here; tune between 0.5 and 0.7
Setting gpu_memory_utilization too high causes OOM when the actor update step runs (optimizer states and gradients must share the same GPU). A value between 0.5 and 0.7 usually strikes the right balance. Note that the definition of this parameter differs between vLLM and SGLang: for vLLM it is a fraction of total GPU memory; for SGLang it is a fraction of free memory for static allocations (model weights and KV cache), but the remaining fraction is still used during inference.
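To make the difference between the two definitions concrete, the sketch below applies each one to the same GPU. It is a simplified reading of the description above, not code from either engine: the function name and the example memory figures are assumptions.

```python
def rollout_budget_gib(
    engine: str,
    total_gib: float,
    free_gib: float,
    gpu_memory_utilization: float,
) -> float:
    """Approximate memory the setting refers to, per the definitions above."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    if engine == "vllm":
        # vLLM: fraction of *total* GPU memory.
        return total_gib * gpu_memory_utilization
    if engine == "sglang":
        # SGLang: fraction of *free* memory, used for static allocations.
        return free_gib * gpu_memory_utilization
    raise ValueError(f"unknown engine: {engine!r}")


# Hypothetical 80 GiB GPU with 60 GiB free when the engine starts.
for engine in ("vllm", "sglang"):
    print(engine, f"{rollout_budget_gib(engine, 80.0, 60.0, 0.6):.1f} GiB")
```

The same value of 0.6 therefore refers to different absolute amounts of memory depending on the engine and on how much memory is already in use.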

Request Batching

If the GPU cache utilization logged by the rollout engine is low, increase the effective batch size in the decoding stage:
actor_rollout_ref:
  rollout:
    max_num_seqs: 256            # maximum concurrent requests
    max_num_batched_tokens: 4096  # must be > 2048 for good throughput

Tensor Parallel vs. Data Parallel Trade-off

Smaller tensor parallel size spawns more DP replicas, which usually yields higher throughput — but also increases total KV cache memory consumption. The right balance depends on your model size and hardware:
actor_rollout_ref:
  rollout:
    tensor_model_parallel_size: 1  # try reducing from default; more DP replicas
For a detailed analysis of this trade-off see Section 8.4 of the HybridFlow paper.
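The replica count behind this trade-off can be tabulated with a few lines of bookkeeping. This sketch assumes weights shard evenly across the tensor-parallel group and does not model KV cache size or throughput; the function name and the 16 GiB weight figure are illustrative.

```python
from typing import Dict, Union


def rollout_layout(
    n_gpus: int,
    tensor_model_parallel_size: int,
    weight_gib: float,
) -> Dict[str, Union[int, float]]:
    """Replica count and per-GPU weight share for a given TP size."""
    if n_gpus % tensor_model_parallel_size != 0:
        raise ValueError("n_gpus must be divisible by the TP size")
    dp_replicas = n_gpus // tensor_model_parallel_size
    return {
        "dp_replicas": dp_replicas,
        # Assumes an even split of weights across the TP group.
        "weight_gib_per_gpu": weight_gib / tensor_model_parallel_size,
    }


# Hypothetical 8-GPU node and a model with 16 GiB of weights.
for tp in (1, 2, 4):
    print(tp, rollout_layout(8, tp, 16.0))
```

Halving the TP size doubles the number of independent replicas but also doubles the weight memory each GPU must hold.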

CUDA Graph Optimization

Enabling CUDA graphs can improve throughput by reducing kernel launch overhead. Specify the batch sizes to capture:
actor_rollout_ref:
  rollout:
    enforce_eager: false           # required for cuda graphs
    cudagraph_capture_sizes: [1, 2, 4, 8, 16, 32, 64, 128]
Note that CUDA graph memory cannot be offloaded to CPU, so it occupies GPU memory even during the actor update step. Use smaller capture sizes to reduce this overhead if OOM occurs. For additional vLLM-specific tuning (chunked prefill, preemption handling), refer to the vLLM performance guide. verl recommends vLLM v0.8.3 or later.
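One way to shrink the capture list when memory is tight is to generate it from an upper bound. The helper below is a hypothetical convenience, not a verl or vLLM function; it reproduces the power-of-two list shown above and lets you cap it lower.

```python
from typing import List


def power_of_two_capture_sizes(max_batch: int) -> List[int]:
    """Powers of two from 1 up to and including max_batch (if reached)."""
    if max_batch < 1:
        raise ValueError("max_batch must be at least 1")
    sizes = []
    size = 1
    while size <= max_batch:
        sizes.append(size)
        size *= 2
    return sizes


print(power_of_two_capture_sizes(128))  # [1, 2, 4, 8, 16, 32, 64, 128]
print(power_of_two_capture_sizes(32))   # [1, 2, 4, 8, 16, 32]
```

The resulting list can be pasted into cudagraph_capture_sizes; a lower cap means fewer captured graphs and less GPU memory held during the actor update.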

Sequence Packing

Padding tokens in variable-length batches waste compute. Sequence packing removes them by concatenating sequences into contiguous token streams:
actor_rollout_ref:
  model:
    use_remove_padding: true
critic:
  model:
    use_remove_padding: true
This is currently validated for Llama, Mistral, Gemma (v1), and Qwen-based models. To test a new model architecture:
pytest -s tests/models/test_transformer.py
If the test passes, add the model to verl/models/registry.py and open a PR.
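The idea behind sequence packing can be shown without any framework. The sketch below is not verl's implementation: it concatenates token lists into one stream and records cumulative boundaries (named cu_seqlens here, following the convention used by variable-length attention kernels), then compares the token count with a padded layout.

```python
from itertools import accumulate
from typing import List, Tuple


def pack(sequences: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Concatenate sequences and return (flat_tokens, cu_seqlens)."""
    flat = [tok for seq in sequences for tok in seq]
    # cu_seqlens[i] is the offset where sequence i starts in the flat stream.
    cu_seqlens = [0] + list(accumulate(len(seq) for seq in sequences))
    return flat, cu_seqlens


def padded_token_count(sequences: List[List[int]]) -> int:
    """Tokens processed if every sequence is padded to the longest one."""
    return len(sequences) * max(len(seq) for seq in sequences)


batch = [[1, 2, 3], [4], [5, 6]]
flat, cu_seqlens = pack(batch)
print(flat)                       # [1, 2, 3, 4, 5, 6]
print(cu_seqlens)                 # [0, 3, 4, 6]
print(padded_token_count(batch))  # 9
```

In this toy batch the padded layout processes 9 token slots while the packed stream holds only the 6 real tokens.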

Batch Size Tuning

verl distinguishes between algorithmic parameters (global, single-controller perspective) and performance parameters (local, per-GPU allocation):
  • Algorithmic: train_batch_size, ppo_mini_batch_size — set globally, normalized per worker
  • Performance: *micro_batch_size_per_gpu — set per GPU, control actual memory per step
Always use *micro_batch_size_per_gpu (not the deprecated *micro_batch_size). The _per_gpu suffix avoids normalization confusion when changing the number of GPUs.
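The relationship between the two kinds of parameters can be sketched as follows. This is a simplified model that assumes pure data parallelism and one sample per prompt; verl's actual normalization may include other factors, and the function name is hypothetical.

```python
from typing import Tuple


def per_gpu_schedule(
    ppo_mini_batch_size: int,
    n_gpus: int,
    micro_batch_size_per_gpu: int,
) -> Tuple[int, int]:
    """Return (mini-batch per GPU, gradient-accumulation steps)."""
    if ppo_mini_batch_size % n_gpus != 0:
        raise ValueError("mini-batch must divide evenly across GPUs")
    mini_per_gpu = ppo_mini_batch_size // n_gpus  # normalized per worker
    if mini_per_gpu % micro_batch_size_per_gpu != 0:
        raise ValueError("micro-batch must divide the per-GPU mini-batch")
    return mini_per_gpu, mini_per_gpu // micro_batch_size_per_gpu


print(per_gpu_schedule(256, 8, 4))   # (32, 8)
print(per_gpu_schedule(256, 16, 4))  # (16, 4)
```

Doubling the GPU count halves the per-GPU mini-batch and the accumulation steps, while the per-GPU micro-batch (and therefore per-step memory) stays the same.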

Static Batch Size Tips

  1. Enable gradient checkpointing first — it unlocks larger micro-batch sizes:
actor_rollout_ref:
  model:
    enable_gradient_checkpointing: true
critic:
  model:
    enable_gradient_checkpointing: true
  2. Increase *micro_batch_size_per_gpu as much as possible until it equals the normalized mini_batch_size.
  3. Forward-only operations (log-prob computation, value estimation) can use 2× the training micro-batch size:
actor_rollout_ref:
  ref:
    log_prob_micro_batch_size_per_gpu: 16   # can be 2× actor training batch
  rollout:
    log_prob_micro_batch_size_per_gpu: 16
critic:
  forward_micro_batch_size_per_gpu: 16
  4. Critic and reward model micro-batch sizes can be larger than the actor’s (smaller vocab size in the final layer reduces activation memory).

Dynamic Batch Size

Dynamic batch sizing packs a fixed token budget per step instead of a fixed sample count. This adapts automatically to variable sequence lengths:
actor_rollout_ref:
  actor:
    use_dynamic_bsz: true
    ppo_max_token_len_per_gpu: 32000   # at least 2× (max_prompt + max_response)
  ref:
    log_prob_max_token_len_per_gpu: 48000
  rollout:
    log_prob_max_token_len_per_gpu: 48000
critic:
  use_dynamic_bsz: true
  ppo_max_token_len_per_gpu: 64000     # critic can be 2× actor
reward_model:
  forward_micro_batch_size_per_gpu: 32
With use_dynamic_bsz=true you do not need to tune *micro_batch_size_per_gpu. For a complete working example see examples/ppo_trainer/run_qwen3_8b_fsdp.sh.
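To illustrate what a fixed token budget means, here is a simple greedy packer that fills each micro-batch until the next sequence would exceed the budget. It is a sketch of the general idea only and is not necessarily how verl partitions batches; the function name is hypothetical.

```python
from typing import List


def pack_by_token_budget(seq_lens: List[int], max_token_len: int) -> List[List[int]]:
    """Group sequence indices so each group's total length fits the budget."""
    batches: List[List[int]] = []
    current: List[int] = []
    current_tokens = 0
    for idx, length in enumerate(seq_lens):
        if length > max_token_len:
            raise ValueError(f"sequence {idx} alone exceeds the token budget")
        # Start a new micro-batch when adding this sequence would overflow.
        if current and current_tokens + length > max_token_len:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += length
    if current:
        batches.append(current)
    return batches


lens = [3000, 9000, 20000, 500, 31000, 1000]
print(pack_by_token_budget(lens, 32000))  # [[0, 1, 2], [3, 4], [5]]
```

Short sequences share a micro-batch while long ones get one mostly to themselves, so the sample count per step varies but the token count stays bounded.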

Ulysses Sequence Parallelism

For long-context training (sequences > 32k tokens), Ulysses sequence parallelism splits each sequence across multiple GPUs during attention computation:
actor_rollout_ref:
  actor:
    ulysses_sequence_parallel_size: 4   # split across 4 GPUs
  ref:
    ulysses_sequence_parallel_size: 4
critic:
  ulysses_sequence_parallel_size: 4
Different model roles can use different parallelism sizes. When enabling sequence parallelism for long sequences, also reduce *micro_batch_size_per_gpu and *max_token_len_per_gpu to avoid OOM.
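The per-GPU tensor shapes under Ulysses-style sequence parallelism can be worked out by hand. The sketch below follows the general DeepSpeed-Ulysses scheme, in which activations are sharded along the sequence outside attention and along heads inside it. It assumes equal query and key/value head counts and is not taken from verl's code; the function name and example sizes are illustrative.

```python
from typing import Tuple

Shape = Tuple[int, int, int]


def ulysses_shard_shapes(
    seq_len: int, num_heads: int, head_dim: int, sp_size: int
) -> Tuple[Shape, Shape]:
    """Per-GPU (seq, heads, dim) shapes outside and inside attention."""
    if seq_len % sp_size != 0:
        raise ValueError("seq_len must be divisible by sp_size")
    if num_heads % sp_size != 0:
        raise ValueError("num_heads must be divisible by sp_size")
    # Outside attention: each GPU holds a slice of the sequence, all heads.
    outside = (seq_len // sp_size, num_heads, head_dim)
    # Inside attention (after all-to-all): full sequence, a slice of heads.
    inside = (seq_len, num_heads // sp_size, head_dim)
    return outside, inside


print(ulysses_shard_shapes(65536, 32, 128, 4))
# ((16384, 32, 128), (65536, 8, 128))
```

The element count per GPU is the same in both layouts; only the sharded axis changes, which is why the head count needs to be divisible by the parallel size in this scheme.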

LigerKernel

LigerKernel provides fused Triton kernels for RMSNorm, SwiGLU, and RoPE that can improve training throughput for both SFT and RL (PPO/GRPO) training:
pip install liger-kernel
actor_rollout_ref:
  model:
    use_liger: true
The default is false. When enabled, verl applies Liger’s fused kernels to the model internals. Note that fused_linear_cross_entropy is disabled because verl computes log-probabilities through its own path. use_liger is compatible with use_fused_kernels — they operate at different levels (Liger optimizes model internals, fused kernels optimize the output head). Using both together gives the best speed-memory tradeoff.

Forward Prefetch in FSDP

During the forward pass, FSDP performs all-gather operations to reconstruct full parameter tensors. Enabling forward prefetch overlaps the all-gather for the next layer with the current layer’s forward computation:
actor_rollout_ref:
  actor:
    fsdp_config:
      forward_prefetch: true
Backward prefetch is intentionally not supported — the BACKWARD_POST policy can prefetch incorrectly in nested-module cases. See the FSDP documentation for details.
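An idealized timing model shows why overlapping helps in the forward pass. The sketch below assumes each layer's all-gather can run fully in parallel with the previous layer's compute, which real hardware only approximates; the function name and the millisecond figures are made up for illustration.

```python
from typing import Sequence


def forward_time(
    gather_ms: Sequence[float], compute_ms: Sequence[float], prefetch: bool
) -> float:
    """Total forward time for per-layer all-gather and compute costs."""
    if len(gather_ms) != len(compute_ms):
        raise ValueError("need one gather and one compute cost per layer")
    if not gather_ms:
        return 0.0
    if not prefetch:
        # Serial: gather layer i, compute layer i, then move on.
        return float(sum(gather_ms) + sum(compute_ms))
    # The first gather cannot be hidden behind any compute.
    total = float(gather_ms[0])
    for i, compute in enumerate(compute_ms):
        next_gather = gather_ms[i + 1] if i + 1 < len(gather_ms) else 0.0
        # Layer i compute overlaps with the gather for layer i + 1.
        total += max(compute, next_gather)
    return total


gather = [2.0, 2.0, 2.0]
compute = [3.0, 3.0, 3.0]
print(forward_time(gather, compute, prefetch=False))  # 15.0
print(forward_time(gather, compute, prefetch=True))   # 11.0
```

In this model the benefit is largest when communication time per layer is close to, but not larger than, compute time per layer.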

Memory Optimization for Entropy Computation

The logits tensor (shape [bsz×seq_len, vocab_size]) consumes significant memory. When compute_entropy_from_logits is active, peak memory reaches roughly bsz × seq_len × vocab_size × 7 bytes, that is, about 7 bytes per logit element. Two options reduce this.
Chunked entropy (reduces forward-pass memory peak):
actor_rollout_ref:
  ref:
    entropy_from_logits_with_chunking: true
Entropy checkpointing (recomputes during backward; reduces training memory):
actor_rollout_ref:
  actor:
    entropy_checkpointing: true
Standard enable_gradient_checkpointing does not cover entropy calculations, so entropy_checkpointing is needed separately.
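The 7-bytes-per-element figure above translates into concrete numbers quickly. This sketch only does the arithmetic; the batch shape, the vocabulary size, and the 2048-row chunk are illustrative assumptions, and the chunked figure assumes intermediates are materialized for one chunk of rows at a time.

```python
def entropy_intermediate_bytes(
    rows: int, vocab_size: int, bytes_per_element: int = 7
) -> int:
    """Approximate bytes for entropy intermediates over `rows` logit rows."""
    return rows * vocab_size * bytes_per_element


GIB = 1024 ** 3
bsz, seq_len, vocab_size = 4, 8192, 150_000  # illustrative values only

full = entropy_intermediate_bytes(bsz * seq_len, vocab_size)
chunk = entropy_intermediate_bytes(2048, vocab_size)  # hypothetical chunk size

print(full)   # 34406400000
print(chunk)  # 2150400000
print(f"{full / GIB:.1f} GiB vs {chunk / GIB:.1f} GiB per chunk")
```

Under these assumptions the unchunked computation needs tens of GiB, which is why chunking or entropy checkpointing is often required for large vocabularies and long sequences.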

Profiling

nsys Profiling

To capture an nsys profile at steps 5 and 10 (steps lists individual step numbers):
global_profiler:
  tool: nsys
  steps: [5, 10]

torch Profiling

global_profiler:
  tool: torch
  steps: [5, 10]

Cluster-Level Monitoring

verl supports Grafana + Prometheus for cluster-level GPU utilization, throughput, and memory monitoring. Configure your Ray cluster with the appropriate exporters and point Grafana at the Prometheus endpoint.

Migrating from FSDP1 to FSDP2

FSDP2 offers measurable improvements and is the recommended backend for new training runs:
Metric | FSDP2 vs FSDP1
GPU memory | ~7% lower
Throughput (BF16) | ~1.5% higher
Composability | Better with DTensor, torch.compile, per-parameter sharding
Enable FSDP2:
actor_rollout_ref:
  actor:
    strategy: fsdp2
  ref:
    strategy: fsdp2
critic:
  strategy: fsdp2
FSDP2 requires PyTorch 2.1+. CPU offloading in FSDP2 is compatible with gradient accumulation (unlike FSDP1).
