TRL integrates vLLM for faster generation in online reinforcement learning methods. Online methods like GRPO and RLOO require the model to generate completions during training, which quickly becomes a bottleneck. vLLM’s PagedAttention technique stores key-value tensors in non-contiguous memory, greatly improving throughput and reducing the memory footprint for generation.
Installation
Modes of operation
TRL supports two modes for integrating vLLM during training.
Colocate mode (default)
In colocate mode, vLLM runs inside the trainer process and shares GPU memory with the training model. No separate server process is required, but memory contention on the training GPUs is possible.
- GRPO
- RLOO
- OnlineDPO
- NashMD
- XPO
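As a rough sketch of what enabling colocate mode can look like with GRPO, the snippet below collects the relevant trainer options. `use_vllm`, `vllm_mode`, and `vllm_gpu_memory_utilization` are `GRPOConfig` fields in recent TRL releases; the value `0.3` and the helper name `build_colocate_config` are illustrative assumptions, and the other trainers listed above may expose slightly different options.

```python
# Options that switch GRPO generation to a colocated vLLM engine.
COLOCATE_KWARGS = {
    "use_vllm": True,  # generate completions with vLLM
    "vllm_mode": "colocate",  # run vLLM inside the trainer process
    # Assumed value: cap vLLM's share of each GPU to leave room for training.
    "vllm_gpu_memory_utilization": 0.3,
}


def build_colocate_config(output_dir: str = "grpo-colocate"):
    """Return a GRPOConfig for colocate mode (hypothetical helper; needs `trl` installed)."""
    from trl import GRPOConfig  # imported lazily so the sketch loads without TRL

    return GRPOConfig(output_dir=output_dir, **COLOCATE_KWARGS)
```

Lowering the memory fraction is one way to reduce the contention mentioned above, at the cost of a smaller KV cache for generation.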
Server mode
In server mode, vLLM runs as a separate process on dedicated GPUs and communicates with the trainer over HTTP. This suits setups where some GPUs can be reserved for inference.
Start the vLLM server on dedicated GPUs
For example, to serve the model from GPUs 0–3 with tensor parallelism across all 4 GPUs, pass `--tensor-parallel-size 4` to `trl vllm-serve`.
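A minimal Python sketch of that launch is shown here. It assumes the GPUs are selected through the `CUDA_VISIBLE_DEVICES` environment variable; the helper names are hypothetical, and `<model_name>` is a placeholder for a real model name or path.

```python
import os
import shlex
import subprocess


def build_serve_command(model, tensor_parallel_size=4):
    """Assemble the `trl vllm-serve` command line as an argument list."""
    return [
        "trl", "vllm-serve",
        "--model", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]


def build_serve_env(gpu_ids, base_env=None):
    """Copy the environment and restrict the process to the given GPU ids."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def launch_server(model, gpu_ids):
    """Start the server as a child process (not called in this sketch)."""
    command = build_serve_command(model, tensor_parallel_size=len(gpu_ids))
    return subprocess.Popen(command, env=build_serve_env(gpu_ids))


command = build_serve_command("<model_name>", tensor_parallel_size=4)
env = build_serve_env([0, 1, 2, 3])
# Print the equivalent shell invocation for inspection.
print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"], shlex.join(command))
```

Building the argument list separately from launching makes it easy to check the command before starting a long-running server.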
How it works under the hood
When you run `trl vllm-serve --model <model_name>`:
- vLLM spawns a number of workers determined by `--tensor-parallel-size` × `--data-parallel-size`. With `--tensor-parallel-size 4`, it spawns 4 workers.
- Incoming prompts are distributed across workers. The model weights are split across GPUs according to `--tensor-parallel-size`.
- GPUs communicate via NVIDIA’s NCCL library to ensure each GPU processes its correct slice of the requests.
- The trainer sends prompts to the server; the server generates completions via `vllm_client.generate`.
- Completions are used to compute the reward signal and the training loss.
- After the backward pass, the trainer pushes updated weights to the server via `vllm_client.update_named_param`.
The vLLM server handles only generation — it does not train the model. Updated weights are pushed from the trainer to the server after each backward pass.
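The generate, score, update, and sync cycle described above can be illustrated with a toy loop. This is only a sketch: `InMemoryClient` is a hypothetical in-memory stand-in, not the TRL client, and it borrows only the method names `generate` and `update_named_param` from the text. The reward and the weight update are placeholders.

```python
def worker_count(tensor_parallel_size=1, data_parallel_size=1):
    """Workers spawned by the server: tensor-parallel size times data-parallel size."""
    return tensor_parallel_size * data_parallel_size


class InMemoryClient:
    """Hypothetical stand-in for the vLLM client; only the method names follow the text."""

    def __init__(self, weights):
        self.weights = dict(weights)  # the "server" keeps its own copy of the weights

    def generate(self, prompts):
        # A real server samples from the model; this toy tags each prompt instead.
        return [f"{p} [scale={self.weights['scale']:.2f}]" for p in prompts]

    def update_named_param(self, name, value):
        self.weights[name] = value  # overwrite one named parameter on the "server"


def training_step(client, trainer_weights, prompts):
    completions = client.generate(prompts)  # server-side generation
    rewards = [float(len(c)) for c in completions]  # placeholder reward
    # Placeholder for the loss, backward pass, and optimizer step on the trainer's copy.
    trainer_weights["scale"] += 0.1
    for name, value in trainer_weights.items():  # push every updated parameter
        client.update_named_param(name, value)
    return completions, rewards


trainer_weights = {"scale": 1.0}
client = InMemoryClient(trainer_weights)
completions, rewards = training_step(
    client, trainer_weights, ["2 + 2 =", "Capital of France?"]
)
```

The point of the sketch is the ordering: the server's copy of the weights changes only when the trainer pushes them, so generation in the next step uses the updated parameters.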
Server configuration reference
All `trl vllm-serve` arguments:
| Argument | Default | Description |
|---|---|---|
| `--model` | required | Model name or path. |
| `--revision` | none | Model revision (branch, tag, or commit). |
| `--tensor-parallel-size` | 1 | Number of tensor parallel workers. |
| `--data-parallel-size` | 1 | Number of data parallel workers. For dense models, keep at 1 (required for vLLM ≥ 0.14.0). |
| `--host` | 0.0.0.0 | Server host address. |
| `--port` | 8000 | Server port. |
| `--gpu-memory-utilization` | 0.9 | Fraction of GPU memory reserved for weights, activations, and KV cache. |
| `--dtype` | auto | Data type for generation. |
| `--max-model-len` | none | Override the model’s maximum context length. |
| `--enable-prefix-caching` | none | Enable prefix caching if the model and hardware support it. |
| `--enforce-eager` | False | Disable CUDA graph and use eager mode only. |
| `--kv-cache-dtype` | auto | Data type for the KV cache. |
| `--trust-remote-code` | False | Allow executing code from model repositories. |
| `--vllm-model-impl` | vllm | Model implementation backend: `vllm` or `transformers`. |
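On the trainer side, server mode needs the address that `--host` and `--port` expose. The sketch below assumes GRPO, where `vllm_mode`, `vllm_server_host`, and `vllm_server_port` are `GRPOConfig` fields in recent TRL releases. The loopback address assumes the server runs on the same machine as the trainer, and both helper names are hypothetical.

```python
def server_mode_kwargs(host="127.0.0.1", port=8000):
    """Trainer options for reaching a running `trl vllm-serve` process (hypothetical helper)."""
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return {
        "use_vllm": True,  # generate completions with vLLM
        "vllm_mode": "server",  # talk to an external vLLM server over HTTP
        "vllm_server_host": host,  # address where the server is reachable
        "vllm_server_port": port,  # must match the server's --port
    }


def build_server_config(output_dir="grpo-server", **address):
    """Return a GRPOConfig for server mode (needs `trl` installed; not called here)."""
    from trl import GRPOConfig  # imported lazily so the sketch loads without TRL

    return GRPOConfig(output_dir=output_dir, **server_mode_kwargs(**address))


kwargs = server_mode_kwargs()
```

Note that `0.0.0.0` in the table is a bind address for the server; the trainer should use an address at which the server machine can actually be reached.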
Transformers backend
vLLM can use the Transformers backend for model implementations, including vision-language models (VLMs). Set `vllm_model_impl="transformers"` in your trainer config or pass it as a CLI argument. See the vLLM Transformers Backend blog post for details.