During the rollout phase of RL training, the actor model must generate full response sequences for a batch of prompts. This generation step is compute-intensive and benefits significantly from optimized inference engines with features like continuous batching, CUDA graph capture, and efficient KV cache management. verl supports three inference backends for rollout: vLLM, SGLang, and TensorRT-LLM. All three expose the same BaseRollout interface, so switching between them requires only a single config change.

vLLM Backend

vLLM is the default rollout backend for verl and the most broadly tested option across model families and hardware configurations.
Avoid vLLM 0.7.x — it contains known out-of-memory bugs during RL training. Use vLLM 0.8.3 or later.
Set VLLM_USE_V1=1 to enable vLLM’s V1 engine, which delivers meaningfully higher throughput through more aggressive CUDA graph capture and improved scheduling.

Configuration

actor_rollout_ref:
  rollout:
    name: vllm
    tensor_model_parallel_size: 1
    gpu_memory_utilization: 0.5      # share GPU memory with the training engine
    max_num_seqs: 1024
    enforce_eager: false
    load_format: dummy               # random init; weights synced from trainer at each step

Tuning Tips

KV Cache Size

Increase gpu_memory_utilization to give vLLM a larger KV cache. Balance it against the memory needed by the training engine on the same GPUs — start around 0.5 and tune upward.
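As a back-of-the-envelope illustration of this trade-off, the sketch below estimates how many tokens of KV cache fit on one GPU for a given gpu_memory_utilization. The per-token size (keys and values for every layer and KV head) is the usual transformer estimate. The function name, the 80 GiB card, the weight and overhead sizes, and the model dimensions are all assumed values for illustration. vLLM's own memory profiling determines the real figure.

```python
def kv_cache_tokens(
    total_gpu_bytes: int,
    gpu_memory_utilization: float,
    weight_bytes: int,
    overhead_bytes: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,
) -> int:
    """Rough estimate of how many tokens fit in the KV cache on one GPU."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    # One key vector and one value vector per layer and per KV head.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    # Memory the inference engine is allowed to use on this GPU.
    budget = int(total_gpu_bytes * gpu_memory_utilization)
    # Whatever is left after weights and working memory goes to the KV cache.
    kv_bytes = budget - weight_bytes - overhead_bytes
    return max(kv_bytes, 0) // bytes_per_token


GIB = 1024**3
# Assumed numbers: 80 GiB GPU, 15 GiB of weights, 2 GiB of overhead,
# 28 layers, 4 KV heads, head dimension 128, 2-byte (BF16) cache entries.
for util in (0.5, 0.6, 0.8):
    tokens = kv_cache_tokens(80 * GIB, util, 15 * GIB, 2 * GIB, 28, 4, 128)
    print(f"gpu_memory_utilization={util}: ~{tokens:,} tokens of KV cache")
```

Under these assumptions, each step up in gpu_memory_utilization goes almost entirely to the KV cache, which is why small increases can noticeably raise the number of sequences that fit in a batch.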

Throughput vs. Latency

A smaller tensor_model_parallel_size yields more data-parallel replicas on the same node, which increases aggregate generation throughput. The trade-off is that each replica spans fewer GPUs, so it has less memory available for model weights and KV cache.
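The arithmetic behind this trade-off can be sketched as follows. The helper name is hypothetical, and it assumes every GPU in the job takes part in rollout, so the replica count is simply the total GPU count divided by the tensor-parallel size.

```python
def rollout_replicas(n_gpus_per_node: int, nnodes: int, tensor_model_parallel_size: int) -> int:
    """Number of data-parallel rollout replicas for a given tensor-parallel size."""
    world_size = n_gpus_per_node * nnodes
    if tensor_model_parallel_size <= 0 or world_size % tensor_model_parallel_size != 0:
        raise ValueError(
            f"tensor_model_parallel_size={tensor_model_parallel_size} "
            f"must evenly divide the {world_size} available GPUs"
        )
    return world_size // tensor_model_parallel_size


# One node with 4 GPUs: fewer GPUs per replica means more replicas.
for tp in (1, 2, 4):
    print(f"TP={tp}: {rollout_replicas(4, 1, tp)} replica(s), {tp} GPU(s) each")
```

The same check is a quick sanity test before a multi-node launch: a tensor-parallel size that does not divide the total GPU count cannot be laid out evenly.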

Decoding Throughput

Set max_num_batched_tokens above 2048 for better decoding throughput on long-response workloads.

CUDA Graphs

Tune cudagraph_capture_sizes to match your typical batch sizes. Pre-captured graphs eliminate kernel launch overhead for common batch sizes.
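One simple way to pick candidate sizes is sketched below. It is a heuristic, not vLLM's default: it assumes that powers of two up to your largest decode batch, plus any batch sizes you observe often, give reasonable coverage. The helper name and the example numbers are hypothetical.

```python
def capture_sizes(max_batch_size: int, extra: tuple[int, ...] = ()) -> list[int]:
    """Powers of two up to max_batch_size, plus explicitly requested sizes."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    # Always include the largest batch, and keep only in-range extras.
    sizes = {max_batch_size, *(s for s in extra if 1 <= s <= max_batch_size)}
    size = 1
    while size <= max_batch_size:
        sizes.add(size)
        size *= 2
    return sorted(sizes)


# Assumed workload: batches of up to 256 sequences, with 96 seen frequently.
print(capture_sizes(256, extra=(96,)))
```

The resulting list is the kind of value you would supply as cudagraph_capture_sizes. Each captured size costs some GPU memory, so a shorter list that matches real batch sizes is usually preferable to an exhaustive one.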

Single-Node Example (vLLM)

export VLLM_USE_V1=1

python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.model.path=Qwen/Qwen2-7B-Instruct \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1 \
    trainer.total_epochs=15

SGLang Backend

SGLang is a fully featured alternative rollout engine developed with RL workloads in mind. It covers the same basic feature set as vLLM (memory saving, multi-node rollout) and adds several capabilities aimed specifically at agentic and multi-turn RL scenarios.
SGLang’s RL integration is under active development. Features and configuration options may change between releases. Always refer to the SGLang RL tracking roadmap for the latest status.

Installation

Install verl with the SGLang extras to get the pinned-compatible version:
pip install --upgrade pip
# Installs the SGLang version pinned in setup.py (currently 0.4.8, subject to updates)
pip install -e ".[sglang]"
Required environment versions:
  • PyTorch: 2.6.0+cu124
  • CUDA: 12.4
  • flashinfer-python: 0.2.5+cu124torch2.6
  • SGLang: 0.4.6.post5 or the version pinned in setup.py
  • sgl-kernel: 0.1.4

Configuration

Switching from vLLM to SGLang requires only changing the rollout.name field:
actor_rollout_ref:
  rollout:
    name: sglang                          # switch from vllm to sglang
    tensor_model_parallel_size: 2
    gpu_memory_utilization: 0.8
    free_cache_engine: true               # release KV cache during training steps
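The nested keys above correspond to the dotted key=value overrides used in the command-line examples on this page. The sketch below shows that mapping with a small, hypothetical helper (to_overrides is not part of verl), which can be handy when launch commands are generated from a script.

```python
def to_overrides(config: dict, prefix: str = "") -> list[str]:
    """Flatten a nested dict into dotted key=value override strings."""
    overrides = []
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested sections, extending the dotted path.
            overrides.extend(to_overrides(value, path))
        else:
            overrides.append(f"{path}={value}")
    return overrides


sglang_rollout = {
    "actor_rollout_ref": {
        "rollout": {
            "name": "sglang",
            "tensor_model_parallel_size": 2,
            "gpu_memory_utilization": 0.8,
            "free_cache_engine": True,
        }
    }
}

for override in to_overrides(sglang_rollout):
    print(override)
```

Switching backends in such a script then amounts to changing the single "name" entry, mirroring the one-field change in the YAML.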

Single-Node Example (SGLang)

export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=True

PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.name=sglang \
    actor_rollout_ref.model.path=Qwen/Qwen2-7B-Instruct \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1 \
    trainer.total_epochs=15 2>&1 | tee verl_demo.log
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=True is required when using verl’s Ray-based multi-process training. Because different workers initialize the model at different times, GPU free memory levels diverge across ranks. SGLang’s DeviceMesh initialization checks for memory balance across all TP ranks and raises an error if the difference exceeds ~10%. Disabling this check allows training to proceed normally.
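To make the failure mode concrete, here is a minimal sketch of the kind of balance check described above. It is not SGLang's actual code: the function name is hypothetical, and the 10% tolerance is taken from the approximate figure quoted in the text.

```python
def check_tp_memory_balance(free_bytes_per_rank: list[int], tolerance: float = 0.10) -> None:
    """Raise if free GPU memory differs too much across tensor-parallel ranks."""
    if not free_bytes_per_rank:
        raise ValueError("need at least one rank")
    lowest, highest = min(free_bytes_per_rank), max(free_bytes_per_rank)
    # The rank with the least free memory must stay within `tolerance`
    # of the rank with the most free memory.
    if lowest < highest * (1.0 - tolerance):
        raise RuntimeError(
            f"unbalanced free memory across TP ranks: min={lowest}, max={highest}"
        )


GIB = 1024**3
# Ranks that initialized at about the same time: within tolerance.
check_tp_memory_balance([70 * GIB, 68 * GIB])

# One rank already holds training state, so it has far less free memory.
try:
    check_tp_memory_balance([70 * GIB, 52 * GIB])
except RuntimeError as err:
    print(f"check failed: {err}")
```

In a colocated Ray setup the second situation is expected rather than a sign of misconfiguration, which is why the check is disabled.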

Multi-Node Example (SGLang, TP=16)

# Node 0 — start the Ray head
ray start --head --dashboard-host=0.0.0.0

# Node 1 — join the cluster
ray start --address='<node0-ip>:6379'

# Launch training with TP=16 across 2 nodes
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.name=sglang \
    actor_rollout_ref.model.path=meta-llama/Llama-3.1-8B-Instruct \
    actor_rollout_ref.rollout.tensor_model_parallel_size=16 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
    actor_rollout_ref.rollout.free_cache_engine=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=2 \
    trainer.total_epochs=15 2>&1 | tee verl_demo.log

SGLang Features for RL

The SGLang team is actively developing RL-specific extensions. Current and in-progress features include:
  • Multi-turn agentic RL: Generate multi-turn conversations with tool calls between turns
  • Partial rollout: Generate part of a response, invoke an external tool, then continue generation, all within a single rollout step
  • Server-based async rollout: Decouple rollout generation from parameter updates via an HTTP server interface, enabling asynchronous RL pipelines
  • VLM RLHF: Vision-language model rollout support

TensorRT-LLM Backend

TensorRT-LLM is NVIDIA’s high-performance inference engine and provides state-of-the-art throughput on NVIDIA GPUs. It is particularly well-suited for FP8 quantized rollout and large-scale deployments.

Installation

Use the official verl Docker image with TensorRT-LLM pre-installed:
# Dockerfile.stable.trtllm — base image: nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6
docker/Dockerfile.stable.trtllm
Or install the Python extras into a compatible TensorRT-LLM environment:
pip install --upgrade pip
pip install -e ".[trtllm]"
Before launching the Ray cluster with TensorRT-LLM, unset all SLURM/MPI/PMIx environment variables to avoid PMIx mismatch errors:
for v in $(env | awk -F= '/^(PMI|PMIX|MPI|OMPI|SLURM)_/{print $1}'); do
    unset "$v"
done
All example scripts for TensorRT-LLM include this step automatically.
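If you launch training from a Python entry point instead of a shell script, the same cleanup can be sketched in Python. The helper below is hypothetical (it is not part of verl or TensorRT-LLM) and reuses the prefix pattern from the shell loop above. It only affects the current process and the children it spawns, so it must run before Ray is started from that process.

```python
import os
import re
from collections.abc import MutableMapping

# Same prefixes as the shell loop: PMI_, PMIX_, MPI_, OMPI_, SLURM_.
_LAUNCHER_VAR = re.compile(r"^(PMI|PMIX|MPI|OMPI|SLURM)_")


def unset_launcher_env(environ: MutableMapping = os.environ) -> list[str]:
    """Remove launcher variables in place and return the names removed."""
    # Collect names first so the mapping is not modified while iterating.
    removed = [name for name in environ if _LAUNCHER_VAR.match(name)]
    for name in removed:
        del environ[name]
    return sorted(removed)


# Demonstration on a plain dict so the real environment is left untouched.
fake_env = {"SLURM_JOB_ID": "123", "PMIX_RANK": "0", "PATH": "/usr/bin"}
print(unset_launcher_env(fake_env))
print(fake_env)
```

Calling unset_launcher_env() with no argument applies the cleanup to os.environ itself.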

Key Features

TensorRT-LLM rollout support is primarily tested on Qwen3 dense and MoE model variants and includes:
  • Synchronous training (GRPO, DAPO, etc.)
  • Cross-node inference for multi-node rollout
  • FP8 refit — quantize rollout to FP8 while keeping the trainer in BF16/FP16
  • Asynchronous training — disaggregated trainer and rollout placement
  • Preliminary VLM support

Usage

INFER_BACKEND=trtllm bash examples/grpo_trainer/run_qwen3_8b_fsdp.sh

Choosing a Backend

Criterion | vLLM | SGLang | TensorRT-LLM
Ease of setup | ✅ Easiest | ✅ Easy | ⚙️ Requires Docker
Multi-turn / agentic RL
FP8 rollout
Async disaggregated rollout
MoE support ✅ Tested on Qwen3-MoE
VLM support ✅ In progress ✅ Preliminary
Recommended for | General use | Agentic / multi-turn RL | High-throughput production, FP8

Engine Workers

See how BaseRollout integrates with ActorRolloutRefWorker and the weight sync flow.

Ray Trainer

Understand how generate_sequences() fits into the full PPO training loop.
