Performance Tuning Guide for verl RL Training

RL training pipelines have multiple sequential stages — rollout generation, reference log-prob computation, critic forward pass, and actor update — so a bottleneck in any one stage slows the entire loop. This guide walks through the tuning levers available for each stage, from rollout engine configuration to kernel-level optimizations, and explains how to profile your training run to identify where time is actually being spent.

Recommended starting configuration for a new run: enable use_remove_padding=True, use_dynamic_bsz=True, and set actor_rollout_ref.actor.ppo_max_token_len_per_gpu to at least 2× (max_prompt_length + max_response_length). This combination gives consistent throughput improvements across most workloads without requiring model-specific tuning.
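As a quick check of the rule of thumb above, the lower bound for the token budget can be computed from the two length limits. The helper name below is hypothetical and not part of verl; it only performs the arithmetic stated in this guide.

```python
def min_ppo_max_token_len(max_prompt_length: int,
                          max_response_length: int,
                          factor: int = 2) -> int:
    """Lower bound from this guide: factor x (prompt + response)."""
    if max_prompt_length < 0 or max_response_length < 0:
        raise ValueError("lengths must be non-negative")
    return factor * (max_prompt_length + max_response_length)


# Illustrative lengths only; substitute the values from your own config.
budget = min_ppo_max_token_len(max_prompt_length=1024, max_response_length=3072)
print(budget)
```

Any value at or above this bound satisfies the recommendation; larger budgets trade memory for fewer micro-batches.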

Rollout Generation Tuning

Rollout generation (the vLLM or SGLang inference step) is typically the longest stage in the RL loop. Before tuning, enable rollout statistics logging so you can see what the engine is actually doing:

actor_rollout_ref:
  rollout:
    disable_log_stats: false

GPU Memory Utilization

gpu_memory_utilization controls how much GPU memory the rollout engine allocates for its KV cache. Higher values mean more parallel decoding capacity:

actor_rollout_ref:
  rollout:
    gpu_memory_utilization: 0.6  # start here; tune between 0.5 and 0.7

Setting gpu_memory_utilization too high causes OOM when the actor update step runs, because optimizer states and gradients must share the same GPU. A value between 0.5 and 0.7 usually strikes the right balance. Note that the two engines define this parameter differently. For vLLM it is a fraction of total GPU memory. For SGLang it is the fraction of free GPU memory reserved for static allocations (model weights and KV cache), and the remaining (1 - gpu_memory_utilization) share is still used during inference.
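The difference between the two definitions is easiest to see with numbers. The sketch below is a simplified reading of the paragraph above, not code from either engine: the function name is hypothetical, and real engines account for additional overheads that are ignored here.

```python
def rollout_memory_budget_gib(engine: str, utilization: float,
                              total_gib: float, free_gib: float) -> float:
    """Approximate memory the rollout engine will claim, per this guide.

    vllm:   fraction of TOTAL GPU memory.
    sglang: fraction of FREE GPU memory, used for static allocations.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    if engine == "vllm":
        return utilization * total_gib
    if engine == "sglang":
        return utilization * free_gib
    raise ValueError(f"unknown engine: {engine}")


# Illustrative numbers: an 80 GiB GPU with 60 GiB free when the engine starts.
for name in ("vllm", "sglang"):
    print(name, rollout_memory_budget_gib(name, 0.6, total_gib=80, free_gib=60))
```

The same setting therefore reserves a different absolute amount depending on the engine and on how much memory is already in use when the engine starts.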

Request Batching

If the GPU cache utilization logged by the rollout engine is low, increase the effective batch size in the decoding stage:

actor_rollout_ref:
  rollout:
    max_num_seqs: 256            # maximum concurrent requests
    max_num_batched_tokens: 4096  # values above 2048 are recommended for good throughput

Tensor Parallel vs. Data Parallel Trade-off

A smaller tensor parallel (TP) size leaves room for more data parallel (DP) rollout replicas on the same GPUs, which usually yields higher throughput but also increases total KV cache memory consumption. The right balance depends on your model size and hardware:

actor_rollout_ref:
  rollout:
    tensor_model_parallel_size: 1  # try reducing from default; more DP replicas

For a detailed analysis of this trade-off see Section 8.4 of the HybridFlow paper.
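To make the trade-off concrete, the following sketch counts how many rollout replicas fit on a fixed set of GPUs for a given tensor parallel size, and how much of the model weights each GPU then has to hold. The function is hypothetical and assumes weights are split evenly across the TP group, which ignores any layers an engine may replicate.

```python
def rollout_layout(num_gpus: int, tensor_model_parallel_size: int,
                   weights_gib: float) -> dict:
    """Replica count and per-GPU weight share for a given TP size."""
    tp = tensor_model_parallel_size
    if tp < 1 or num_gpus % tp != 0:
        raise ValueError("num_gpus must be a positive multiple of the TP size")
    return {
        "dp_replicas": num_gpus // tp,           # independent engine copies
        "weights_per_gpu_gib": weights_gib / tp,  # even split across TP ranks
    }


# Illustrative: 8 GPUs and a model with 16 GiB of weights.
for tp in (1, 2, 4, 8):
    print(tp, rollout_layout(8, tp, weights_gib=16.0))
```

Lower TP sizes give more replicas but leave each GPU holding a larger share of the weights, so less memory per GPU remains for everything else.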

CUDA Graph Optimization

Enabling CUDA graphs can improve throughput by reducing kernel launch overhead. Specify the batch sizes to capture:

actor_rollout_ref:
  rollout:
    enforce_eager: false           # required for cuda graphs
    cudagraph_capture_sizes: [1, 2, 4, 8, 16, 32, 64, 128]

Note that CUDA graph memory cannot be offloaded to CPU, so it occupies GPU memory even during the actor update step. Use smaller capture sizes to reduce this overhead if OOM occurs. For additional vLLM-specific tuning (chunked prefill, preemption handling), refer to the vLLM performance guide. verl recommends vLLM v0.8.3 or later.
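When trimming the capture list to save memory, it can help to generate it rather than edit it by hand. The helper below is a hypothetical convenience, not a verl or vLLM API; it produces the power-of-two sizes used in the example config up to a chosen maximum.

```python
def power_of_two_capture_sizes(max_batch: int) -> list[int]:
    """Return [1, 2, 4, ...] up to and including max_batch if it is a power of two."""
    if max_batch < 1:
        raise ValueError("max_batch must be >= 1")
    sizes = []
    size = 1
    while size <= max_batch:
        sizes.append(size)
        size *= 2
    return sizes


# The list from the config above, and a shorter one to try if OOM occurs.
print(power_of_two_capture_sizes(128))
print(power_of_two_capture_sizes(32))
```

The resulting list can be pasted into cudagraph_capture_sizes; fewer and smaller entries mean less GPU memory pinned by captured graphs.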

Sequence Packing

Padding tokens in variable-length batches waste compute. Sequence packing removes them by concatenating sequences into contiguous token streams:

actor_rollout_ref:
  model:
    use_remove_padding: true
critic:
  model:
    use_remove_padding: true

This is currently validated for Llama, Mistral, Gemma (v1), and Qwen-based models. To test a new model architecture:

pytest -s tests/models/test_transformer.py

If the test passes, add the model to verl/models/registry.py and open a PR.
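The idea behind sequence packing can be sketched in a few lines of plain Python. This is an illustration under assumptions, not verl's implementation: the function names are hypothetical, and the cumulative-length array follows the cu_seqlens convention used by variable-length attention kernels to mark where each sequence starts and ends.

```python
from itertools import accumulate


def pack(sequences: list[list[int]]) -> tuple[list[int], list[int]]:
    """Concatenate sequences into one token stream plus boundary offsets."""
    packed = [tok for seq in sequences for tok in seq]
    cu_seqlens = [0] + list(accumulate(len(seq) for seq in sequences))
    return packed, cu_seqlens


def padded_token_count(sequences: list[list[int]]) -> int:
    """Tokens processed when every sequence is padded to the longest one."""
    return len(sequences) * max(len(seq) for seq in sequences)


batch = [[1, 2, 3], [4], [5, 6]]
packed, cu_seqlens = pack(batch)
print(len(packed), "packed tokens vs", padded_token_count(batch), "padded")
print(cu_seqlens)
```

Sequence i occupies packed[cu_seqlens[i]:cu_seqlens[i + 1]], so no compute is spent on padding positions.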

Batch Size Tuning

verl distinguishes between algorithmic parameters (global, single-controller perspective) and performance parameters (local, per-GPU allocation):

Algorithmic: train_batch_size, ppo_mini_batch_size — set globally, normalized per worker
Performance: *micro_batch_size_per_gpu — set per GPU, control actual memory per step

Always use *micro_batch_size_per_gpu (not the deprecated *micro_batch_size). The _per_gpu suffix avoids normalization confusion when changing the number of GPUs.
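The relationship between the global and per-GPU parameters can be sketched as follows. This is a simplified model with a hypothetical function name: it divides the global mini-batch evenly across GPUs and ignores other factors that verl may apply during normalization, such as multiple rollout samples per prompt or sequence parallel groups.

```python
def per_gpu_schedule(ppo_mini_batch_size: int, num_gpus: int,
                     micro_batch_size_per_gpu: int) -> tuple[int, int]:
    """Return (samples per GPU per mini-batch, gradient accumulation steps)."""
    if ppo_mini_batch_size % num_gpus != 0:
        raise ValueError("mini-batch must divide evenly across GPUs")
    local_mini_batch = ppo_mini_batch_size // num_gpus
    if local_mini_batch % micro_batch_size_per_gpu != 0:
        raise ValueError("micro-batch must divide the per-GPU mini-batch")
    accumulation_steps = local_mini_batch // micro_batch_size_per_gpu
    return local_mini_batch, accumulation_steps


# Illustrative: a global mini-batch of 256 on 8 GPUs with micro-batch 4.
print(per_gpu_schedule(256, 8, 4))
```

Raising the micro-batch size lowers the number of accumulation steps without changing the algorithmic mini-batch, which is why it is a pure performance knob.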

Static Batch Size Tips

Enable gradient checkpointing first — it unlocks larger micro-batch sizes:

actor_rollout_ref:
  model:
    enable_gradient_checkpointing: true
critic:
  model:
    enable_gradient_checkpointing: true

Increase *micro_batch_size_per_gpu as much as possible until it equals the normalized mini_batch_size.
Forward-only operations (log-prob computation, value estimation) can use 2× the training micro-batch size:

actor_rollout_ref:
  ref:
    log_prob_micro_batch_size_per_gpu: 16   # can be 2× actor training batch
  rollout:
    log_prob_micro_batch_size_per_gpu: 16
critic:
  forward_micro_batch_size_per_gpu: 16

Critic and reward model micro-batch sizes can be larger than the actor's, because the actor's final layer has a much larger vocabulary dimension and therefore needs more activation memory per token.

Dynamic Batch Size

Dynamic batch sizing packs a fixed token budget per step instead of a fixed sample count. This adapts automatically to variable sequence lengths:

actor_rollout_ref:
  actor:
    use_dynamic_bsz: true
    ppo_max_token_len_per_gpu: 32000   # at least 2× (max_prompt + max_response)
  ref:
    log_prob_max_token_len_per_gpu: 48000
  rollout:
    log_prob_max_token_len_per_gpu: 48000
critic:
  use_dynamic_bsz: true
  ppo_max_token_len_per_gpu: 64000     # critic can be 2× actor
reward_model:
  forward_micro_batch_size_per_gpu: 32

With use_dynamic_bsz=true you do not need to tune *micro_batch_size_per_gpu. For a complete working example see examples/ppo_trainer/run_qwen3_8b_fsdp.sh.
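A token budget per step can be pictured as grouping samples until the next one would overflow the limit. The greedy sketch below is only an illustration with a hypothetical function name; verl's own partitioning may balance micro-batches differently.

```python
def split_by_token_budget(seq_lens: list[int], max_token_len: int) -> list[list[int]]:
    """Group sample indices so each group's total tokens stays within budget."""
    batches: list[list[int]] = []
    current: list[int] = []
    current_tokens = 0
    for idx, length in enumerate(seq_lens):
        if length > max_token_len:
            raise ValueError(f"sample {idx} alone exceeds the token budget")
        # Close the current group if adding this sample would overflow it.
        if current and current_tokens + length > max_token_len:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += length
    if current:
        batches.append(current)
    return batches


# Illustrative sequence lengths with a budget of 8000 tokens.
groups = split_by_token_budget([3000, 5000, 1000, 7000, 200], max_token_len=8000)
print(groups)
```

Short samples end up sharing a micro-batch while long ones get fewer neighbours, so memory per step stays roughly constant regardless of the length distribution.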

Ulysses Sequence Parallelism

For long-context training (sequences > 32k tokens), Ulysses sequence parallelism splits each sequence across multiple GPUs during attention computation:

actor_rollout_ref:
  actor:
    ulysses_sequence_parallel_size: 4   # split across 4 GPUs
  ref:
    ulysses_sequence_parallel_size: 4
critic:
  ulysses_sequence_parallel_size: 4

Different model roles can use different parallelism sizes. When enabling sequence parallelism for long sequences, also reduce *micro_batch_size_per_gpu and *max_token_len_per_gpu to avoid OOM.
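The effect of the parallel size on per-GPU load can be estimated with simple arithmetic. The sketch below uses a hypothetical function name and assumes each sequence is split evenly across the sequence parallel group (rounding up when the length does not divide) and that the remaining GPUs form data parallel groups.

```python
def ulysses_layout(world_size: int, sp_size: int, seq_len: int) -> dict:
    """Estimate data parallel groups and tokens held per GPU for one sequence."""
    if sp_size < 1 or world_size % sp_size != 0:
        raise ValueError("world_size must be a positive multiple of sp_size")
    return {
        "data_parallel_groups": world_size // sp_size,
        # Ceiling division: a real implementation would pad the last shard.
        "tokens_per_gpu": -(-seq_len // sp_size),
    }


# Illustrative: 8 GPUs, sequence parallel size 4, a 32768-token sequence.
print(ulysses_layout(world_size=8, sp_size=4, seq_len=32768))
```

A larger parallel size shortens the shard each GPU holds but leaves fewer data parallel groups, so per-GPU batch settings usually need to be revisited together with it.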

LigerKernel

LigerKernel provides fused Triton kernels for RMSNorm, SwiGLU, and RoPE that can improve training throughput for both SFT and RL (PPO/GRPO) training:

pip install liger-kernel

actor_rollout_ref:
  model:
    use_liger: true

The default is false. When enabled, verl applies Liger’s fused kernels to the model internals. Note that fused_linear_cross_entropy is disabled because verl computes log-probabilities through its own path. use_liger is compatible with use_fused_kernels — they operate at different levels (Liger optimizes model internals, fused kernels optimize the output head). Using both together gives the best speed-memory tradeoff.

Forward Prefetch in FSDP

During the forward pass, FSDP performs all-gather operations to reconstruct full parameter tensors. Enabling forward prefetch overlaps the all-gather for the next layer with the current layer’s forward computation:

actor_rollout_ref:
  actor:
    fsdp_config:
      forward_prefetch: true

Backward prefetch is intentionally not supported — the BACKWARD_POST policy can prefetch incorrectly in nested-module cases. See the FSDP documentation for details.
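The benefit of overlapping communication with computation can be illustrated with a toy timeline. This is a deliberately simplified model, not a measurement of FSDP: the function name is hypothetical, all layers are assumed identical, and perfect overlap is assumed when prefetch is on.

```python
def forward_time(num_layers: int, gather: float, compute: float,
                 prefetch: bool) -> float:
    """Toy forward-pass duration for identical layers (arbitrary time units)."""
    if num_layers < 1:
        raise ValueError("num_layers must be >= 1")
    if not prefetch:
        # Every layer waits for its own all-gather, then computes.
        return num_layers * (gather + compute)
    # The first all-gather is exposed; afterwards each layer's compute
    # runs concurrently with the all-gather for the next layer.
    return gather + (num_layers - 1) * max(gather, compute) + compute


# Illustrative: 4 layers, all-gather costs 1 unit, compute costs 3 units.
print(forward_time(4, gather=1.0, compute=3.0, prefetch=False))
print(forward_time(4, gather=1.0, compute=3.0, prefetch=True))
```

In this model the saving grows with the number of layers and is largest when all-gather and compute times are comparable.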

Memory Optimization for Entropy Computation

The logits tensor (shape [bsz×seq_len, vocab_size]) consumes significant memory. When compute_entropy_from_logits is active, peak memory reaches roughly 7 bytes per logits element, that is bsz × seq_len × vocab_size × 7 bytes. Two options reduce this.

Chunked entropy (reduces forward-pass memory peak):

actor_rollout_ref:
  ref:
    entropy_from_logits_with_chunking: true

Entropy checkpointing (recomputes during backward; reduces training memory):

actor_rollout_ref:
  actor:
    entropy_checkpointing: true

Standard enable_gradient_checkpointing does not cover entropy calculations, so entropy_checkpointing is needed separately.
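The size of the entropy memory peak is easy to estimate from the figure given in this section. The helper names below are hypothetical, and the vocabulary size in the example is illustrative rather than tied to a specific model.

```python
def entropy_peak_bytes(bsz: int, seq_len: int, vocab_size: int,
                       bytes_per_element: int = 7) -> int:
    """Estimated peak memory for entropy computed from the full logits tensor."""
    return bsz * seq_len * vocab_size * bytes_per_element


def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB."""
    return num_bytes / 2**30


# Illustrative: 8192 tokens in a micro-batch and a 32000-entry vocabulary.
peak = entropy_peak_bytes(bsz=1, seq_len=8192, vocab_size=32000)
print(f"{to_gib(peak):.2f} GiB")
```

Because the estimate scales linearly with vocabulary size, models with large vocabularies benefit most from chunking or entropy checkpointing.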

Profiling

nsys Profiling

To capture an nsys profile at training steps 5 and 10 (steps lists individual step indices, not a range):

global_profiler:
  tool: nsys
  steps: [5, 10]

torch Profiling

global_profiler:
  tool: torch
  steps: [5, 10]

Cluster-Level Monitoring

verl supports Grafana + Prometheus for cluster-level GPU utilization, throughput, and memory monitoring. Configure your Ray cluster with the appropriate exporters and point Grafana at the Prometheus endpoint.

Migrating from FSDP1 to FSDP2

FSDP2 offers measurable improvements and is the recommended backend for new training runs:

Metric	FSDP2 vs FSDP1
GPU memory	~7% lower
Throughput (BF16)	~1.5% higher
Composability	Better with DTensor, `torch.compile`, per-parameter sharding

Enable FSDP2:

actor_rollout_ref:
  actor:
    strategy: fsdp2
  ref:
    strategy: fsdp2
critic:
  strategy: fsdp2

FSDP2 requires PyTorch 2.1+. CPU offloading in FSDP2 is compatible with gradient accumulation (unlike FSDP1).
