verl Rollout Backends: vLLM, SGLang, and TensorRT-LLM

During the rollout phase of RL training, the actor model must generate full response sequences for a batch of prompts. This generation step is compute-intensive and benefits significantly from optimized inference engines with features like continuous batching, CUDA graph capture, and efficient KV cache management. verl supports three inference backends for rollout: vLLM, SGLang, and TensorRT-LLM. All three expose the same BaseRollout interface, so switching between them requires only a single config change.
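To illustrate why a shared interface reduces backend selection to one config value, here is a minimal sketch. It is not verl's actual code: the method signature, the registry, the `build_rollout` helper, and the `EchoRollout` toy backend are all hypothetical. Only the names BaseRollout and generate_sequences() come from this page.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class BaseRollout(ABC):
    """Hypothetical stand-in for the shared rollout interface."""

    @abstractmethod
    def generate_sequences(self, prompts: list[str]) -> list[str]:
        """Return one response per prompt."""


# Hypothetical registry mapping the rollout.name config value to a class.
ROLLOUT_REGISTRY: dict[str, type[BaseRollout]] = {}


def register(name: str):
    """Class decorator that records a backend under the given name."""
    def decorator(cls):
        ROLLOUT_REGISTRY[name] = cls
        return cls
    return decorator


@register("echo")
class EchoRollout(BaseRollout):
    """Toy backend that exists only to make the sketch runnable."""

    def generate_sequences(self, prompts: list[str]) -> list[str]:
        return [p.upper() for p in prompts]


def build_rollout(config: dict) -> BaseRollout:
    """Pick the backend from config['name']; callers never see the concrete class."""
    name = config["name"]
    if name not in ROLLOUT_REGISTRY:
        raise ValueError(f"unknown rollout backend: {name!r}")
    return ROLLOUT_REGISTRY[name]()
```

Because the training loop only depends on the abstract method, swapping the value of `name` is the only change a caller would need in a design like this.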

vLLM Backend

vLLM is the default rollout backend for verl and the most broadly tested option across model families and hardware configurations.

Avoid vLLM 0.7.x — it contains known out-of-memory bugs during RL training. Use vLLM 0.8.3 or later.

Set VLLM_USE_V1=1 to enable vLLM’s V1 engine, which delivers meaningfully higher throughput through more aggressive CUDA graph capture and improved scheduling.

Configuration

actor_rollout_ref:
  rollout:
    name: vllm
    tensor_model_parallel_size: 1
    gpu_memory_utilization: 0.5      # share GPU memory with the training engine
    max_num_seqs: 1024
    enforce_eager: false
    load_format: dummy               # random init; weights synced from trainer at each step

Tuning Tips

KV Cache Size

Increase gpu_memory_utilization to give vLLM a larger KV cache. Balance it against the memory needed by the training engine on the same GPUs — start around 0.5 and tune upward.
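A rough back-of-the-envelope estimate can help pick a starting value. The sketch below assumes gpu_memory_utilization caps the engine's share of total device memory, that weights and a fixed overhead are subtracted from that share, and that the remainder goes to the KV cache. The function name, the `overhead_gib` default, and the model dimensions in the usage note are illustrative assumptions, not values taken from verl or vLLM.

```python
def kv_cache_token_capacity(
    total_gpu_mem_gib: float,
    gpu_memory_utilization: float,
    weights_gib: float,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,       # 2 bytes per element for BF16/FP16
    overhead_gib: float = 2.0,  # assumed allowance for activations and graphs
) -> int:
    """Rough number of tokens that fit in the KV cache on one GPU (TP=1)."""
    gib = 1024 ** 3
    budget = total_gpu_mem_gib * gpu_memory_utilization * gib
    kv_budget = budget - (weights_gib + overhead_gib) * gib
    if kv_budget <= 0:
        return 0
    # Keys and values (factor 2) are stored for every layer and KV head.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return int(kv_budget // bytes_per_token)
```

For example, `kv_cache_token_capacity(80, 0.5, 15, 28, 4, 128)` and the same call with `0.6` show how much token capacity a 0.1 increase buys on an 80 GiB device under these assumptions; measure actual usage before relying on the estimate.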

Throughput vs. Latency

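A smaller tensor_model_parallel_size yields more data-parallel replicas on the same node (replicas = GPUs / TP size), which increases aggregate generation throughput. The trade-off is that each replica spans fewer GPUs, so it has less memory for weights and KV cache and typically higher per-request latency.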
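The relationship between GPU count, TP size, and replica count is simple arithmetic. The helper below is a hypothetical illustration (the function name is not part of verl); its parameters mirror the trainer.n_gpus_per_node, trainer.nnodes, and tensor_model_parallel_size settings used on this page.

```python
def rollout_replicas(n_gpus_per_node: int, nnodes: int, tensor_model_parallel_size: int) -> int:
    """Number of data-parallel rollout replicas for a given TP size."""
    world_size = n_gpus_per_node * nnodes
    if world_size % tensor_model_parallel_size != 0:
        raise ValueError(
            f"TP size {tensor_model_parallel_size} does not divide {world_size} GPUs"
        )
    return world_size // tensor_model_parallel_size
```

With 4 GPUs on one node, TP=2 gives 2 replicas and TP=1 gives 4; with 16 GPUs across two nodes, TP=16 leaves a single replica.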

Decoding Throughput

Set max_num_batched_tokens above 2048 for better decoding throughput on long-response workloads.

CUDA Graphs

Tune cudagraph_capture_sizes to match your typical batch sizes. Pre-captured graphs eliminate kernel launch overhead for common batch sizes.
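One way to reason about a candidate list of capture sizes is to estimate how much padding it causes for the batch sizes you actually observe. The sketch below assumes the engine pads each batch up to the nearest captured size and runs batches larger than every captured size without a graph; both helpers are hypothetical and do not call vLLM.

```python
from bisect import bisect_left
from typing import List, Optional


def padded_capture_size(batch_size: int, capture_sizes: List[int]) -> Optional[int]:
    """Smallest captured size >= batch_size, or None if no captured size fits."""
    sizes = sorted(capture_sizes)
    idx = bisect_left(sizes, batch_size)
    return sizes[idx] if idx < len(sizes) else None


def padding_waste(batch_sizes: List[int], capture_sizes: List[int]) -> float:
    """Fraction of padded slots that carry no real sequence."""
    real = 0
    padded = 0
    for b in batch_sizes:
        size = padded_capture_size(b, capture_sizes)
        if size is None:
            # Assumed to run without a captured graph, so no padding applies.
            continue
        real += b
        padded += size
    return 0.0 if padded == 0 else 1.0 - real / padded
```

A capture list that closely tracks the observed batch sizes keeps the waste figure low; a list with large gaps pads many batches up to a much bigger graph.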

Single-Node Example (vLLM)

export VLLM_USE_V1=1

python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.model.path=Qwen/Qwen2-7B-Instruct \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1 \
    trainer.total_epochs=15
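Each command-line argument above is a dotted path into the nested config, so actor_rollout_ref.rollout.name=vllm sets the name key under actor_rollout_ref, then rollout. The sketch below shows that mapping by flattening a nested dictionary into override strings. The `to_overrides` helper is hypothetical and is not part of verl.

```python
def to_overrides(config: dict, prefix: str = "") -> list:
    """Flatten a nested config dict into dotted key=value override strings."""
    overrides = []
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested sections, extending the dotted path.
            overrides.extend(to_overrides(value, path))
        else:
            overrides.append(f"{path}={value}")
    return overrides


example = {
    "actor_rollout_ref": {
        "rollout": {"name": "vllm", "tensor_model_parallel_size": 2},
    },
    "trainer": {"nnodes": 1},
}
for line in to_overrides(example):
    print(line)
```

Generating overrides this way can be convenient when sweeping rollout settings from a script rather than editing a long shell command by hand.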

SGLang Backend

SGLang is a fully featured alternative rollout engine developed with RL workloads in mind. It covers the same basic feature set as vLLM (memory saving, multi-node rollout) and adds several capabilities aimed at agentic and multi-turn RL scenarios.

SGLang’s RL integration is under active development. Features and configuration options may change between releases. Always refer to the SGLang RL tracking roadmap for the latest status.

Installation

Install verl with the SGLang extras to get the compatible SGLang version pinned by verl:

pip install --upgrade pip
# Installs the SGLang version pinned in setup.py (currently 0.4.8, subject to updates)
pip install -e ".[sglang]"

Required environment versions:

PyTorch: 2.6.0+cu124
CUDA: 12.4
flashinfer-python: 0.2.5+cu124torch2.6
SGLang: 0.4.6.post5 or the version pinned in setup.py
sgl-kernel: 0.1.4

Configuration

Switching from vLLM to SGLang requires only changing the rollout.name field:

actor_rollout_ref:
  rollout:
    name: sglang                          # switch from vllm to sglang
    tensor_model_parallel_size: 2
    gpu_memory_utilization: 0.8
    free_cache_engine: true               # release KV cache during training steps

Single-Node Example (SGLang)

export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=True

PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.name=sglang \
    actor_rollout_ref.model.path=Qwen/Qwen2-7B-Instruct \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1 \
    trainer.total_epochs=15 2>&1 | tee verl_demo.log

SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=True is required when using verl’s Ray-based multi-process training. Because different workers initialize the model at different times, GPU free memory levels diverge across ranks. SGLang’s DeviceMesh initialization checks for memory balance across all TP ranks and raises an error if the difference exceeds ~10%. Disabling this check allows training to proceed normally.
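To make the failure mode concrete, the sketch below mimics a balance check of this kind. It is not SGLang's implementation: the function name, the exact comparison, and the 10% default are assumptions based only on the description above.

```python
def check_tp_memory_balance(free_mem_gib: list, tolerance: float = 0.10) -> None:
    """Raise if the emptiest and fullest TP ranks differ by more than `tolerance`."""
    lowest = min(free_mem_gib)
    highest = max(free_mem_gib)
    if lowest < highest * (1.0 - tolerance):
        raise RuntimeError(
            f"memory imbalance across TP ranks: min={lowest} GiB, max={highest} GiB"
        )


# Ranks that initialized at about the same time pass the check.
check_tp_memory_balance([40.0, 39.0])

# A rank whose neighbor already loaded its model would fail it.
try:
    check_tp_memory_balance([40.0, 30.0])
except RuntimeError as err:
    print(err)
```

Under staggered worker startup the second situation is expected rather than a sign of a real problem, which is why the check is disabled.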

Multi-Node Example (SGLang, TP=16)

# Node 0 — start the Ray head
ray start --head --dashboard-host=0.0.0.0

# Node 1 — join the cluster
ray start --address='<node0-ip>:6379'

# Launch training with TP=16 across 2 nodes
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.name=sglang \
    actor_rollout_ref.model.path=meta-llama/Llama-3.1-8B-Instruct \
    actor_rollout_ref.rollout.tensor_model_parallel_size=16 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
    actor_rollout_ref.rollout.free_cache_engine=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=2 \
    trainer.total_epochs=15 2>&1 | tee verl_demo.log

SGLang Features for RL

The SGLang team is actively developing RL-specific extensions. Current and in-progress features include:

Feature	Description
Multi-turn agentic RL	Generate multi-turn conversations with tool calls between turns
Partial rollout	Generate part of a response, invoke an external tool, then continue generation — all within a single rollout step
Server-based async rollout	Decouple rollout generation from parameter updates via an HTTP server interface, enabling asynchronous RL pipelines
VLM RLHF	Vision-language model rollout support

TensorRT-LLM Backend

TensorRT-LLM is NVIDIA’s high-performance inference engine and provides state-of-the-art throughput on NVIDIA GPUs. It is particularly well-suited for FP8 quantized rollout and large-scale deployments.

Installation

Use the official verl Docker image with TensorRT-LLM pre-installed:

# Dockerfile.stable.trtllm — base image: nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6
docker/Dockerfile.stable.trtllm

Or install the Python extras into a compatible TensorRT-LLM environment:

pip install --upgrade pip
pip install -e ".[trtllm]"

Before launching the Ray cluster with TensorRT-LLM, unset all SLURM/MPI/PMIx environment variables to avoid PMIx mismatch errors:

for v in $(env | awk -F= '/^(PMI|PMIX|MPI|OMPI|SLURM)_/{print $1}'); do
    unset "$v"
done

All example scripts for TensorRT-LLM include this step automatically.
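If you launch the cluster from Python rather than from a shell script, the same cleanup can be done by building a filtered copy of the environment. The sketch below applies the same prefix pattern as the shell loop; the `scrubbed_env` helper is hypothetical and is not part of verl.

```python
import os
import re

# Same prefixes as the shell loop: PMI_, PMIX_, MPI_, OMPI_, SLURM_.
_SCHEDULER_VAR = re.compile(r"^(PMI|PMIX|MPI|OMPI|SLURM)_")


def scrubbed_env(env=None) -> dict:
    """Return a copy of the environment without scheduler/MPI variables."""
    source = os.environ if env is None else env
    return {k: v for k, v in source.items() if not _SCHEDULER_VAR.match(k)}
```

The result can be passed as the `env` argument of `subprocess.run` when starting Ray, which leaves the parent process environment untouched.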

Key Features

TensorRT-LLM rollout support is primarily tested on Qwen3 dense and MoE model variants and includes:

Synchronous training (GRPO, DAPO, etc.)
Cross-node inference for multi-node rollout
FP8 refit — quantize rollout to FP8 while keeping the trainer in BF16/FP16
Asynchronous training — disaggregated trainer and rollout placement
Preliminary VLM support

Usage

GRPO with FSDP

INFER_BACKEND=trtllm bash examples/grpo_trainer/run_qwen3_8b_fsdp.sh

GRPO with Megatron

INFER_BACKEND=trtllm bash examples/grpo_trainer/run_qwen3_8b_megatron.sh

DAPO with FP8 Rollout

INFER_BACKEND=trtllm ROLLOUT_QUANTIZATION=fp8 bash examples/grpo_trainer/run_qwen3_30b_a3b_megatron.sh

Fully Async GRPO

bash verl/experimental/fully_async_policy/shell/grpo_30b_a3b_base_math_megatron_4_4_mis_trtllm.sh

Choosing a Backend

Criterion	vLLM	SGLang	TensorRT-LLM
Ease of setup	✅ Easiest	✅ Easy	⚙️ Requires Docker
Multi-turn / agentic RL	❌	✅	❌
FP8 rollout	❌	❌	✅
Async disaggregated rollout	❌	✅	✅
MoE support	✅	✅	✅ Tested on Qwen3-MoE
VLM support	✅	✅ In progress	✅ Preliminary
Recommended for	General use	Agentic / multi-turn RL	High-throughput production, FP8

Engine Workers

See how BaseRollout integrates with ActorRolloutRefWorker and the weight sync flow.

Ray Trainer

Understand how generate_sequences() fits into the full PPO training loop.
