RL training pipelines have multiple sequential stages (rollout generation, reference log-prob computation, critic forward pass, and actor update), so a bottleneck in any one stage slows the entire loop. This guide walks through the tuning levers available for each stage, from rollout engine configuration to kernel-level optimizations, and explains how to profile your training run to identify where time is actually being spent.
Rollout Generation Tuning
Rollout generation (the vLLM or SGLang inference step) is typically the longest stage in the RL loop. Before tuning, enable rollout statistics logging so you can see what the engine is actually doing.
GPU Memory Utilization
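gpu_memory_utilization caps the fraction of GPU memory the rollout engine may claim; what is left after the model weights is largely available to the KV cache. Higher values mean more KV cache and therefore more parallel decoding capacity, at the cost of memory headroom for the other stages.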
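To get a feel for how this setting translates into decoding capacity, the following back-of-the-envelope sketch estimates how many tokens of KV cache fit on one GPU. The helper names and all numbers are hypothetical, and the model is deliberately crude: it treats everything inside the budget that is not model weights as KV cache, ignoring activations and engine overhead, and engines differ in exactly how they interpret the setting.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies (hypothetical helper).

    The factor 2 accounts for one key and one value vector per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_token_capacity(gpu_mem_gib: float, gpu_memory_utilization: float,
                            weights_gib: float, bytes_per_token: int) -> int:
    """Rough number of tokens the KV cache can hold on one GPU.

    Assumption: budget = total memory * utilization, minus model weights.
    """
    budget_gib = gpu_mem_gib * gpu_memory_utilization - weights_gib
    if budget_gib <= 0:
        return 0
    return int(budget_gib * 1024**3) // bytes_per_token


# Illustrative numbers only: 80 GiB GPU, 16 GiB of weights,
# 32 layers, 8 KV heads, head dimension 128, BF16 cache.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
low = kv_cache_token_capacity(80, 0.5, 16, per_token)
high = kv_cache_token_capacity(80, 0.75, 16, per_token)
print(per_token, low, high)
```

Dividing the resulting token capacity by a typical prompt-plus-response length gives a rough upper bound on how many sequences can be decoded concurrently.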
Request Batching
If the GPU cache utilization logged by the rollout engine is low, increase the effective batch size in the decoding stage (in vLLM, the max_num_seqs and max_num_batched_tokens engine arguments bound it).
Tensor Parallel vs. Data Parallel Trade-off
A smaller tensor parallel size yields more data-parallel (DP) replicas on the same GPUs, which usually gives higher throughput but also increases total KV cache memory consumption. The right balance depends on your model size and hardware.
CUDA Graph Optimization
Enabling CUDA graphs can improve throughput by reducing kernel launch overhead. Specify the batch sizes to capture.
Sequence Packing
Padding tokens in variable-length batches waste compute. Sequence packing removes them by concatenating sequences into contiguous token streams.
... verl/models/registry.py and open a PR.
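The packing idea can be illustrated with a small, framework-free sketch. It concatenates variable-length sequences into one token stream and records cumulative boundaries so each sequence can still be recovered. The function names are hypothetical and this is not verl's implementation; real kernels consume a similar boundary array to keep attention from crossing sequences.

```python
from typing import List, Sequence, Tuple


def pack_sequences(seqs: Sequence[Sequence[int]]) -> Tuple[List[int], List[int]]:
    """Concatenate sequences and return (tokens, cumulative sequence lengths)."""
    tokens: List[int] = []
    cu_seqlens: List[int] = [0]
    for seq in seqs:
        tokens.extend(seq)
        cu_seqlens.append(len(tokens))  # end offset of this sequence
    return tokens, cu_seqlens


def padded_token_count(seqs: Sequence[Sequence[int]]) -> int:
    """Tokens processed if every sequence is padded to the longest one."""
    return len(seqs) * max(len(s) for s in seqs)


batch = [[1, 2, 3], [4], [5, 6]]
tokens, cu_seqlens = pack_sequences(batch)
print(tokens, cu_seqlens)
print(padded_token_count(batch), "padded vs", len(tokens), "packed")
```

In this toy batch a third of the padded tokens are padding; the gap grows as sequence lengths become more uneven.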
Batch Size Tuning
verl distinguishes between algorithmic parameters (global, single-controller perspective) and performance parameters (local, per-GPU allocation):
- Algorithmic: train_batch_size, ppo_mini_batch_size. Set globally and normalized per worker.
- Performance: *micro_batch_size_per_gpu. Set per GPU; controls actual memory per step.
Always use *micro_batch_size_per_gpu (not the deprecated *micro_batch_size). The _per_gpu suffix avoids normalization confusion when changing the number of GPUs.
Static Batch Size Tips
- Enable gradient checkpointing first; it unlocks larger micro-batch sizes.
- Increase *micro_batch_size_per_gpu as much as possible until it equals the normalized mini_batch_size.
- Forward-only operations (log-prob computation, value estimation) can use 2× the training micro-batch size.
- Critic and reward model micro-batch sizes can be larger than the actor's, because their final layer does not produce vocabulary-sized logits, which reduces activation memory.
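The relationship between the global mini-batch size, its per-worker normalization, and the per-GPU micro-batch size can be sketched as simple arithmetic. The helpers below are hypothetical and assume the simplest case, where the global mini-batch is divided evenly across data-parallel workers; verl's actual normalization may fold in further factors, so treat this as an illustration rather than its exact rule.

```python
import math


def normalized_mini_batch(ppo_mini_batch_size: int, num_dp_workers: int) -> int:
    """Per-worker share of the global mini-batch (assumes an even split)."""
    if ppo_mini_batch_size % num_dp_workers != 0:
        raise ValueError("global mini-batch must divide evenly across workers")
    return ppo_mini_batch_size // num_dp_workers


def grad_accum_steps(local_mini_batch: int, micro_batch_size_per_gpu: int) -> int:
    """Forward/backward passes needed before one optimizer step."""
    return math.ceil(local_mini_batch / micro_batch_size_per_gpu)


local = normalized_mini_batch(ppo_mini_batch_size=256, num_dp_workers=8)
for micro in (8, 16, 32):
    print(micro, grad_accum_steps(local, micro))
```

Once the micro-batch size reaches the normalized mini-batch size, the accumulation count is 1 and raising it further has no effect, which is why the tip above stops there.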
Dynamic Batch Size
Dynamic batch sizing packs a fixed token budget per step instead of a fixed sample count. This adapts automatically to variable sequence lengths. With use_dynamic_bsz=true you do not need to tune *micro_batch_size_per_gpu. For a complete working example see examples/ppo_trainer/run_qwen3_8b_fsdp.sh.
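As a rough illustration of packing by token budget, the sketch below groups samples into micro-batches so that no group exceeds a maximum token count. It uses a simple greedy next-fit strategy and a hypothetical function name; verl's own partitioner may balance groups differently, so this only shows the principle.

```python
from typing import List, Sequence


def pack_by_token_budget(seq_lens: Sequence[int], max_token_len: int) -> List[List[int]]:
    """Group sample indices so each group's total tokens stay within the budget."""
    batches: List[List[int]] = []
    current: List[int] = []
    current_tokens = 0
    for idx, n in enumerate(seq_lens):
        if n > max_token_len:
            raise ValueError(f"sample {idx} alone exceeds the token budget")
        # Close the current group when the next sample would overflow it.
        if current and current_tokens + n > max_token_len:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += n
    if current:
        batches.append(current)
    return batches


lens = [1000, 3000, 500, 4000, 200]
print(pack_by_token_budget(lens, max_token_len=4096))
```

The number of samples per group varies with sequence length while the memory footprint per step stays roughly bounded by the budget.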
Ulysses Sequence Parallelism
For long-context training (sequences > 32k tokens), Ulysses sequence parallelism splits each sequence across multiple GPUs during attention computation.
... *micro_batch_size_per_gpu and *max_token_len_per_gpu to avoid OOM.
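The way Ulysses-style sequence parallelism redistributes work can be shown with a shape-only sketch. Outside attention each GPU holds a slice of the sequence with all heads; an all-to-all exchange then gives each GPU the full sequence for a subset of heads during attention. The function name is hypothetical, and the divisibility checks are a simplification (real implementations typically pad the sequence).

```python
from typing import Dict, Tuple


def ulysses_local_shapes(seq_len: int, num_heads: int, sp_size: int) -> Dict[str, Tuple[int, int]]:
    """Per-GPU (tokens, heads) held outside and inside attention."""
    if seq_len % sp_size != 0 or num_heads % sp_size != 0:
        raise ValueError("seq_len and num_heads must be divisible by sp_size")
    return {
        # Sequence is sharded; every GPU sees all heads for its tokens.
        "outside_attention": (seq_len // sp_size, num_heads),
        # After the all-to-all: full sequence, but only a slice of the heads.
        "inside_attention": (seq_len, num_heads // sp_size),
    }


print(ulysses_local_shapes(seq_len=65536, num_heads=32, sp_size=4))
```

The per-GPU activation volume (tokens × heads) is the same in both layouts, which is what lets long sequences fit without replicating them.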
LigerKernel
LigerKernel provides fused Triton kernels for RMSNorm, SwiGLU, and RoPE that can improve training throughput for both SFT and RL (PPO/GRPO) training. use_liger defaults to false. When enabled, verl applies Liger’s fused kernels to the model internals. Note that fused_linear_cross_entropy is disabled because verl computes log-probabilities through its own path.
use_liger is compatible with use_fused_kernels — they operate at different levels (Liger optimizes model internals, fused kernels optimize the output head). Using both together gives the best speed-memory tradeoff.
Forward Prefetch in FSDP
During the forward pass, FSDP performs all-gather operations to reconstruct full parameter tensors. Enabling forward prefetch overlaps the all-gather for the next layer with the current layer’s forward computation.
Backward prefetch is intentionally not supported, because the BACKWARD_POST policy can prefetch incorrectly in nested-module cases. See the FSDP documentation for details.
Memory Optimization for Entropy Computation
The logits tensor (shape [bsz×seq_len, vocab_size]) consumes significant memory. When compute_entropy_from_logits is active, peak memory reaches roughly bsz × seq_len × vocab_size × 7 bytes. Two options reduce this:
Chunked entropy (reduces the forward-pass memory peak).
Entropy checkpointing: enable_gradient_checkpointing does not cover entropy calculations, so entropy_checkpointing is needed separately.
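The chunking idea can be sketched without any tensor library: entropy is computed over a few rows of logits at a time, so the softmax intermediates only ever exist for one chunk. The function names are hypothetical and the pure-Python lists stand in for tensors; this is not verl's implementation, only the same arithmetic, using H = logsumexp(z) − Σ softmax(z)·z.

```python
import math
from typing import List, Sequence


def row_entropy(logits_row: Sequence[float]) -> float:
    """Entropy of softmax(logits_row), computed in a numerically stable way."""
    m = max(logits_row)
    exps = [math.exp(z - m) for z in logits_row]
    total = sum(exps)
    logsumexp = m + math.log(total)
    expected_logit = sum(e * z for e, z in zip(exps, logits_row)) / total
    return logsumexp - expected_logit


def chunked_entropy(logits: Sequence[Sequence[float]], chunk_size: int) -> List[float]:
    """Per-row entropy, materializing intermediates for one chunk at a time."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    out: List[float] = []
    for start in range(0, len(logits), chunk_size):
        chunk = logits[start:start + chunk_size]
        out.extend(row_entropy(row) for row in chunk)
    return out


logits = [[0.0, 0.0, 0.0, 0.0], [100.0, 0.0, 0.0], [1.0, 2.0, 3.0]]
print(chunked_entropy(logits, chunk_size=2))
```

The result does not depend on the chunk size; only the peak size of the temporary buffers does, which is the trade being made.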
Profiling
nsys Profiling
verl can capture an nsys profile for a selected range of training steps, such as steps 5–10.
torch Profiling
Cluster-Level Monitoring
verl supports Grafana + Prometheus for cluster-level GPU utilization, throughput, and memory monitoring. Configure your Ray cluster with the appropriate exporters and point Grafana at the Prometheus endpoint.
Migrating from FSDP1 to FSDP2
FSDP2 offers measurable improvements and is the recommended backend for new training runs:

| Metric | FSDP2 vs FSDP1 |
|---|---|
| GPU memory | ~7% lower |
| Throughput (BF16) | ~1.5% higher |
| Composability | Better with DTensor, torch.compile, per-parameter sharding |