During the rollout phase of RL training, the actor model must generate full response sequences for a batch of prompts. This generation step is compute-intensive and benefits significantly from optimized inference engines with features like continuous batching, CUDA graph capture, and efficient KV cache management. verl supports three inference backends for rollout: vLLM, SGLang, and TensorRT-LLM. All three expose the same BaseRollout interface, so switching between them requires only a single config change.
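To illustrate why a shared interface makes the backend a one-line choice, here is a minimal sketch. Only the names BaseRollout and generate_sequences() come from this page; the simplified string-based signature, the EchoRollout stand-in, and the registry are hypothetical and do not reflect verl's actual classes.

```python
from abc import ABC, abstractmethod


class BaseRollout(ABC):
    """Simplified stand-in for the shared rollout interface (signature is an assumption)."""

    @abstractmethod
    def generate_sequences(self, prompts: list[str]) -> list[str]:
        """Return one generated response per prompt."""


class EchoRollout(BaseRollout):
    """Hypothetical backend wrapper; a real one would call an inference engine."""

    def __init__(self, tag: str) -> None:
        self.tag = tag

    def generate_sequences(self, prompts: list[str]) -> list[str]:
        # Tag each prompt so the caller can see which backend produced it.
        return [f"[{self.tag}] {p}" for p in prompts]


# Hypothetical registry keyed by the backend name read from config.
ROLLOUT_REGISTRY = {
    "vllm": lambda: EchoRollout("vllm"),
    "sglang": lambda: EchoRollout("sglang"),
}


def build_rollout(name: str) -> BaseRollout:
    """Look up a backend by name; the training loop only sees BaseRollout."""
    try:
        return ROLLOUT_REGISTRY[name]()
    except KeyError:
        raise ValueError(f"unknown rollout backend: {name!r}") from None


rollout = build_rollout("sglang")
print(rollout.generate_sequences(["hello"]))
```

Because the caller depends only on the abstract method, changing the name passed to build_rollout is the only difference between backends in this sketch.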
vLLM Backend
vLLM is the default rollout backend for verl and the most broadly tested option across model families and hardware configurations.
Configuration
Tuning Tips
KV Cache Size
Increase gpu_memory_utilization to give vLLM a larger KV cache. Balance it against the memory needed by the training engine on the same GPUs: start around 0.5 and tune upward.
Throughput vs. Latency
A smaller tensor_model_parallel_size means more data-parallel replicas on the same node, which increases aggregate generation throughput at the cost of per-replica model capacity.
Decoding Throughput
Set max_num_batched_tokens above 2048 for better decoding throughput on long-response workloads.
CUDA Graphs
Tune cudagraph_capture_sizes to match your typical batch sizes. Pre-captured graphs cut kernel launch overhead for those batch sizes.
Single-Node Example (vLLM)
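As a rough sketch of how the tuning knobs above could be assembled for one node, the helper below builds Hydra-style command-line overrides and computes how many data-parallel replicas fit on the node. The actor_rollout_ref.rollout prefix, the 8-GPU node, and the specific values are assumptions for illustration, not a reproduction of the original example.

```python
def replicas_per_node(gpus_per_node: int, tp_size: int) -> int:
    """Number of data-parallel rollout replicas that fit on one node."""
    if tp_size <= 0 or gpus_per_node % tp_size != 0:
        raise ValueError("gpus_per_node must be a positive multiple of tp_size")
    return gpus_per_node // tp_size


def vllm_rollout_overrides(
    tp_size: int,
    gpu_mem_util: float = 0.5,       # starting point suggested above; tune upward
    max_batched_tokens: int = 4096,  # above 2048 for long-response workloads
) -> list[str]:
    """Build override strings; the key prefix is an assumed config path."""
    prefix = "actor_rollout_ref.rollout"
    return [
        f"{prefix}.name=vllm",
        f"{prefix}.gpu_memory_utilization={gpu_mem_util}",
        f"{prefix}.tensor_model_parallel_size={tp_size}",
        f"{prefix}.max_num_batched_tokens={max_batched_tokens}",
    ]


# Assumed hardware: one node with 8 GPUs.
for tp in (1, 2, 4):
    print(f"TP={tp}: {replicas_per_node(8, tp)} replicas")
print(" ".join(vllm_rollout_overrides(tp_size=2)))
```

Halving the tensor-parallel size doubles the replica count on the same node, which is the throughput trade-off described under Throughput vs. Latency.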
SGLang Backend
SGLang is a full-featured alternative rollout engine developed with RL workloads in mind. It covers the same basic feature set as vLLM (memory saving, multi-node rollout) and adds several capabilities aimed specifically at agentic and multi-turn RL scenarios.
SGLang’s RL integration is under active development. Features and configuration options may change between releases. Always refer to the SGLang RL tracking roadmap for the latest status.
Installation
Install verl with the SGLang extras to get the pinned-compatible version.
- PyTorch: 2.6.0+cu124
- CUDA: 12.4
- flashinfer-python: 0.2.5+cu124torch2.6
- SGLang: 0.4.6.post5 or the version pinned in setup.py
- sgl-kernel: 0.1.4
Configuration
Switching from vLLM to SGLang requires only changing the rollout.name field:
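A minimal sketch of that single-field switch, assuming the field is addressed as actor_rollout_ref.rollout.name in Hydra-style overrides (the full key path and the helper name are assumptions):

```python
ROLLOUT_NAME_KEY = "actor_rollout_ref.rollout.name"  # assumed full key path


def switch_rollout_backend(overrides: list[str], backend: str) -> list[str]:
    """Return a copy of the overrides with the rollout backend replaced."""
    # Drop any existing backend selection, then append the new one.
    kept = [o for o in overrides if not o.startswith(ROLLOUT_NAME_KEY + "=")]
    return kept + [f"{ROLLOUT_NAME_KEY}={backend}"]


vllm_args = [
    "actor_rollout_ref.rollout.name=vllm",
    "actor_rollout_ref.rollout.gpu_memory_utilization=0.5",
]
print(switch_rollout_backend(vllm_args, "sglang"))
```

Every other override is left untouched, which reflects the claim that the backend name is the only setting that has to change.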
Single-Node Example (SGLang)
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=True is required when using verl’s Ray-based multi-process training. Because different workers initialize the model at different times, GPU free memory levels diverge across ranks. SGLang’s DeviceMesh initialization checks for memory balance across all TP ranks and raises an error if the difference exceeds ~10%. Disabling this check allows training to proceed normally.
Multi-Node Example (SGLang, TP=16)
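The sketch below shows one way a TP=16 launch could be prepared in Python: it sets the environment variable named above and builds overrides for two nodes of 8 GPUs each. The node layout, the override key paths (actor_rollout_ref.rollout.*, trainer.nnodes, trainer.n_gpus_per_node), and the helper itself are assumptions for illustration, not the original example.

```python
import os


def sglang_multinode_launch(nnodes: int = 2, gpus_per_node: int = 8, tp_size: int = 16):
    """Return (env, overrides) for a hypothetical multi-node SGLang rollout."""
    world_size = nnodes * gpus_per_node
    if world_size % tp_size != 0:
        raise ValueError("tp_size must evenly divide the total GPU count")

    # Copy the current environment and disable the TP memory balance check.
    env = dict(os.environ)
    env["SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK"] = "True"

    # Assumed Hydra-style keys; adjust to the config actually in use.
    overrides = [
        "actor_rollout_ref.rollout.name=sglang",
        f"actor_rollout_ref.rollout.tensor_model_parallel_size={tp_size}",
        f"trainer.nnodes={nnodes}",
        f"trainer.n_gpus_per_node={gpus_per_node}",
    ]
    return env, overrides


env, overrides = sglang_multinode_launch()
print(env["SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK"])
print(" ".join(overrides))
```

With TP=16 on 16 GPUs, a single rollout replica spans both nodes, so the variable must be set in the environment of every worker process.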
SGLang Features for RL
The SGLang team is actively developing RL-specific extensions. Current and in-progress features include:
| Feature | Description |
|---|---|
| Multi-turn agentic RL | Generate multi-turn conversations with tool calls between turns |
| Partial rollout | Generate part of a response, invoke an external tool, then continue generation — all within a single rollout step |
| Server-based async rollout | Decouple rollout generation from parameter updates via an HTTP server interface, enabling asynchronous RL pipelines |
| VLM RLHF | Vision-language model rollout support |
TensorRT-LLM Backend
TensorRT-LLM is NVIDIA’s high-performance inference engine and provides state-of-the-art throughput on NVIDIA GPUs. It is particularly well-suited for FP8 quantized rollout and large-scale deployments.
Installation
Use the official verl Docker image with TensorRT-LLM pre-installed.
Before launching the Ray cluster with TensorRT-LLM, unset all SLURM/MPI/PMIx environment variables to avoid PMIx mismatch errors. All example scripts for TensorRT-LLM include this step automatically.
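For launchers written in Python, the same cleanup can be sketched as a filter over the environment. The prefix list below is an assumption about which variables count as SLURM/MPI/PMIx state; check it against the example scripts before relying on it.

```python
import os

# Assumed prefixes for scheduler and MPI launcher variables.
SCHEDULER_PREFIXES = ("SLURM_", "PMI_", "PMIX_", "OMPI_", "MPI_")


def scrub_scheduler_env(env: dict[str, str]) -> dict[str, str]:
    """Return a copy of env without SLURM/MPI/PMIx-style variables."""
    # str.startswith accepts a tuple, so one call checks every prefix.
    return {k: v for k, v in env.items() if not k.startswith(SCHEDULER_PREFIXES)}


clean_env = scrub_scheduler_env(dict(os.environ))
# clean_env can then be passed to the process that starts Ray,
# e.g. subprocess.run(["ray", "start", "--head"], env=clean_env)
print(sorted(k for k in os.environ if k not in clean_env))
```

Building a filtered copy rather than mutating os.environ keeps the cleanup scoped to the child process that starts the cluster.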
Key Features
TensorRT-LLM rollout support is primarily tested on Qwen3 dense and MoE model variants and includes:
- Synchronous training (GRPO, DAPO, etc.)
- Cross-node inference for multi-node rollout
- FP8 refit — quantize rollout to FP8 while keeping the trainer in BF16/FP16
- Asynchronous training — disaggregated trainer and rollout placement
- Preliminary VLM support
Usage
- GRPO with FSDP
- GRPO with Megatron
- DAPO with FP8 Rollout
- Fully Async GRPO
Choosing a Backend
| | vLLM | SGLang | TensorRT-LLM |
|---|---|---|---|
| Ease of setup | ✅ Easiest | ✅ Easy | ⚙️ Requires Docker |
| Multi-turn / agentic RL | ❌ | ✅ | ❌ |
| FP8 rollout | ❌ | ❌ | ✅ |
| Async disaggregated rollout | ❌ | ✅ | ✅ |
| MoE support | ✅ | ✅ | ✅ Tested on Qwen3-MoE |
| VLM support | ✅ | ✅ In progress | ✅ Preliminary |
| Recommended for | General use | Agentic / multi-turn RL | High-throughput production, FP8 |
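The three yes/no rows of the table can be encoded as a small lookup to shortlist backends for a given set of requirements. This is an illustrative sketch: the feature keys and helper name are hypothetical, and the values simply mirror the table above.

```python
# Feature flags copied from the comparison table; key names are hypothetical.
BACKEND_FEATURES = {
    "vLLM": {"multi_turn": False, "fp8_rollout": False, "async_disaggregated": False},
    "SGLang": {"multi_turn": True, "fp8_rollout": False, "async_disaggregated": True},
    "TensorRT-LLM": {"multi_turn": False, "fp8_rollout": True, "async_disaggregated": True},
}


def candidate_backends(required: list[str]) -> list[str]:
    """Return backends (in table order) that support every required feature."""
    return [
        name
        for name, features in BACKEND_FEATURES.items()
        if all(features.get(flag, False) for flag in required)
    ]


print(candidate_backends(["multi_turn"]))
print(candidate_backends(["async_disaggregated"]))
print(candidate_backends(["multi_turn", "fp8_rollout"]))
```

An empty result, as for the last query, means no single backend in the table covers that combination.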
Engine Workers
See how BaseRollout integrates with ActorRolloutRefWorker and the weight sync flow.
Ray Trainer
Understand how generate_sequences() fits into the full PPO training loop.