This guide will help you quickly get started with vLLM to perform offline batched inference and deploy an OpenAI-compatible API server.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/vllm-project/vllm/llms.txt
Use this file to discover all available pages before exploring further.
Prerequisites
Operating system
Linux (including WSL on Windows)
Python version
Python 3.10, 3.11, 3.12, or 3.13
Installation
Choose your hardware platform for installation instructions:- NVIDIA CUDA
- AMD ROCm
- Google TPU
If you are using NVIDIA GPUs, you can install vLLM using pip directly.It’s recommended to use uv, a very fast Python environment manager. Install You can also run vLLM commands without creating a permanent environment:
uv and create a new environment:If you prefer conda, you can also use it to manage environments:
Offline batched inference
With vLLM installed, you can start generating text for a list of input prompts (offline batch inference).Basic usage
The core classes you’ll use are:LLM- Main class for running offline inference with the vLLM engineSamplingParams- Parameters for the sampling process
Define prompts and sampling parameters
Create a list of input prompts and configure sampling parameters:By default, vLLM uses sampling parameters from the model’s
generation_config.json on HuggingFace if it exists. This provides optimal results recommended by the model creator.To use vLLM’s default sampling parameters instead, set generation_config="vllm" when creating the LLM instance.Initialize the LLM engine
Create an LLM instance with your chosen model:Generate outputs
Generate text for all prompts with high throughput:Complete example
See the full working example at:examples/offline_inference/basic/basic.py
OpenAI-compatible server
vLLM can be deployed as a server implementing the OpenAI API protocol, allowing it to be a drop-in replacement for OpenAI API applications.Start the server
Launch the vLLM server with a model:- Starts at
http://localhost:8000 - Hosts one model at a time
- Implements OpenAI-compatible endpoints
By default, the server applies
generation_config.json from the HuggingFace model repository. To disable this and use vLLM defaults, pass:API authentication
Enable API key checking:Query the server
List available models:Completions API
Generate completions using the/v1/completions endpoint:
Chat completions API
The chat interface enables dynamic, back-and-forth exchanges:Attention backends
vLLM automatically selects the most performant attention backend for your system. You can manually specify a backend:- Online serving
- Offline inference
Available backends
NVIDIA CUDA
FLASH_ATTN or FLASHINFERAMD ROCm
TRITON_ATTN, ROCM_ATTN, ROCM_AITER_FA, ROCM_AITER_UNIFIED_ATTN, TRITON_MLA, ROCM_AITER_MLA, or ROCM_AITER_TRITON_MLANext steps
Installation guide
Detailed installation for all platforms
Supported models
Browse all compatible models
API reference
Explore complete API documentation
Configuration
Learn about advanced configuration options