Overview

vLLM provides official Docker images for both GPU and CPU deployments. Docker containers ensure consistent environments and simplify deployment across different platforms.

Pre-built images

vLLM publishes pre-built Docker images: GPU images on Docker Hub and CPU images on the Amazon ECR Public registry.

GPU images

# Latest stable release
docker pull vllm/vllm-openai:latest

# Specific version
docker pull vllm/vllm-openai:v0.6.4

CPU images

# x86_64 CPU
docker pull public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest

# ARM64 CPU
docker pull public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:latest
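
To choose between the two CPU images in a script, the host architecture can be mapped to the matching repository. The sketch below is illustrative only: vllm_cpu_image is a hypothetical helper, not part of vLLM, and it assumes uname -m reports x86_64 or aarch64 (amd64 and arm64 are accepted as aliases).

```shell
# Hypothetical helper: map a machine architecture (as printed by `uname -m`)
# to the CPU image listed above. Unknown architectures return status 1.
vllm_cpu_image() {
  case "$1" in
    x86_64|amd64)
      echo "public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest" ;;
    aarch64|arm64)
      echo "public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:latest" ;;
    *)
      echo "unsupported architecture: $1" >&2
      return 1 ;;
  esac
}

# Example (not executed here):
#   docker pull "$(vllm_cpu_image "$(uname -m)")"
```

Returning a non-zero status for unknown architectures lets a calling script stop before it attempts a pull that cannot succeed.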

Specialized images

vLLM provides images for different hardware backends:
  • ROCm (AMD GPUs): vllm/vllm-openai:latest-rocm
  • TPU: Images available via Google Cloud Artifact Registry
  • XPU (Intel GPUs): Custom builds available

Running vLLM with Docker

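1. Basic GPU deployment

Serve a model on one GPU. The --gpus all flag exposes every GPU to the container, but vLLM uses a single GPU unless --tensor-parallel-size is set: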
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.2-1B-Instruct
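The container needs enough shared memory because vLLM's PyTorch processes use it to exchange data, particularly during tensor-parallel inference. Either --ipc=host (share the host's shared memory) or --shm-size (set a size for the container's own) provides it.

2. Configure shared memory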

For larger models, increase shared memory:
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --shm-size=10.24gb \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-2-7b-chat-hf \
  --tensor-parallel-size 2

3. Use environment variables

Pass a Hugging Face token and other settings with -e:
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN=your_token_here \
  -e VLLM_ENABLE_CUDA_COMPATIBILITY=1 \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3-8B-Instruct
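
A launch script can check that the token is set before starting the container and then wait until the server responds. The following is a minimal sketch under stated assumptions: require_env and wait_for_vllm are hypothetical helper names, the server is assumed to be published on port 8000 as in the commands above, and readiness is checked against the OpenAI-compatible server's /health endpoint with curl, which must be installed on the host.

```shell
# Hypothetical helper: fail early if a required variable is unset or empty.
# $1: variable name, for example HF_TOKEN
require_env() {
  eval "value=\${$1:-}"
  if [ -z "$value" ]; then
    echo "error: $1 is not set" >&2
    return 1
  fi
}

# Hypothetical helper: poll /health until it answers or the wait expires.
# $1: base URL (assumed default http://localhost:8000)
# $2: approximate wait in seconds (illustrative default of 300)
wait_for_vllm() {
  base_url="${1:-http://localhost:8000}"
  timeout="${2:-300}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # -f makes curl fail on HTTP errors; -s silences progress output.
    if curl -fs --max-time 2 "$base_url/health" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1
}

# Example (not executed here). Passing -e HF_TOKEN without a value forwards
# the host's value, so the token does not appear in the command line:
#   require_env HF_TOKEN &&
#     docker run -d --runtime nvidia --gpus all -e HF_TOKEN -p 8000:8000 \
#       --ipc=host vllm/vllm-openai:latest \
#       --model meta-llama/Meta-Llama-3-8B-Instruct &&
#     wait_for_vllm http://localhost:8000 600 &&
#     curl -s http://localhost:8000/v1/models
```

The wait is approximate because time spent inside each curl attempt is not counted. Model download and loading can take several minutes, so a generous limit is usually appropriate.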

CPU deployment

For CPU-only deployments:
docker run -d \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN=your_token_here \
  -e VLLM_CPU_KVCACHE_SPACE=40 \
  -e VLLM_CPU_OMP_THREADS_BIND=0-63 \
  -p 8000:8000 \
  public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest \
  --model meta-llama/Llama-3.2-1B-Instruct
CPU inference is significantly slower than GPU inference. Use CPU deployment only for testing or when GPUs are unavailable.
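
The value 0-63 in the command above binds threads to 64 cores. On a host with a different core count, the range can be derived rather than hard-coded. This is a sketch only: cpu_bind_range is a hypothetical helper, and it assumes VLLM_CPU_OMP_THREADS_BIND accepts a simple start-end core range as in the example above. Binding every logical core is not necessarily the best choice on every machine.

```shell
# Hypothetical helper: print the core range "0-(N-1)" for N cores.
# $1: number of cores to bind (a positive integer)
cpu_bind_range() {
  case "$1" in
    ''|*[!0-9]*|0)
      echo "core count must be a positive integer" >&2
      return 1 ;;
  esac
  echo "0-$(( $1 - 1 ))"
}

# Example (not executed here), using nproc from coreutils:
#   docker run -d \
#     -e VLLM_CPU_OMP_THREADS_BIND="$(cpu_bind_range "$(nproc)")" \
#     ...
```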

Building from source

1. Clone the repository

git clone https://github.com/vllm-project/vllm.git
cd vllm

2. Build GPU image

Build the default CUDA image:
docker build -f docker/Dockerfile . --tag vllm-custom:latest
The build process uses Docker BuildKit for layer caching and parallel builds.

3. Build with custom CUDA version

docker build -f docker/Dockerfile . \
  --build-arg CUDA_VERSION=12.9.1 \
  --build-arg PYTHON_VERSION=3.12 \
  --tag vllm-custom:cuda12.9

4. Build CPU image

docker build -f docker/Dockerfile.cpu . \
  --platform=linux/amd64 \
  --tag vllm-cpu:latest

5. Build ROCm image for AMD GPUs

docker build -f docker/Dockerfile.rocm . \
  --tag vllm-rocm:latest

Build arguments

Common build arguments for customization:
Argument                Default                            Description
CUDA_VERSION            12.9.1                             CUDA toolkit version
PYTHON_VERSION          3.12                               Python version
PYTORCH_NIGHTLY         0                                  Use PyTorch nightly builds
MAX_JOBS                2                                  Parallel build jobs
TORCH_CUDA_ARCH_LIST    7.0 7.5 8.0 8.9 9.0 10.0 12.0      Target GPU architectures
INSTALL_KV_CONNECTORS   false                              Install KV connector dependencies
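
Restricting the target architectures to the GPUs actually installed is one way to shorten a build. The sketch below is illustrative and rests on several assumptions: arch_list_from_caps is a hypothetical helper, the argument name TORCH_CUDA_ARCH_LIST is taken from the list above, and the host driver is recent enough for nvidia-smi to support the compute_cap query field.

```shell
# Hypothetical helper: read compute capabilities on stdin (one per line,
# for example "8.0") and print them de-duplicated and space-separated.
arch_list_from_caps() {
  sort -u | tr '\n' ' ' | sed 's/ *$//'
}

# Example (not executed here):
#   caps="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader \
#             | arch_list_from_caps)"
#   docker build -f docker/Dockerfile . \
#     --build-arg TORCH_CUDA_ARCH_LIST="$caps" \
#     --tag vllm-custom:latest
```

An image built this way only targets the listed architectures, so it may not run on other GPU generations.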

Advanced configurations

Multi-GPU deployment

docker run --runtime nvidia --gpus '"device=0,1,2,3"' \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --shm-size=16g \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 4
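
The device list passed to --gpus needs literal double quotes around it, which is why the command above wraps it in single quotes. When the number of GPUs varies, the value can be generated. This is a minimal sketch: gpu_device_value is a hypothetical helper, and it assumes the GPUs to use are numbered consecutively from 0.

```shell
# Hypothetical helper: print the --gpus value selecting devices 0..N-1,
# including the inner double quotes Docker expects around the list.
# $1: number of GPUs
gpu_device_value() {
  count="$1"
  ids=""
  i=0
  while [ "$i" -lt "$count" ]; do
    ids="${ids:+$ids,}$i"
    i=$((i + 1))
  done
  printf '"device=%s"\n' "$ids"
}

# Example (not executed here), keeping the GPU count and the
# tensor-parallel size in one variable:
#   n=4
#   docker run --runtime nvidia --gpus "$(gpu_device_value "$n")" \
#     ... \
#     --tensor-parallel-size "$n"
```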

With proxy settings

docker build -f docker/Dockerfile . \
  --build-arg http_proxy=$http_proxy \
  --build-arg https_proxy=$https_proxy \
  --tag vllm-custom:latest

Docker Compose example

version: '3.8'

services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - HF_TOKEN=${HF_TOKEN}
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    ports:
      - "8000:8000"
    shm_size: 10gb
    ipc: host
    command: >
      --model meta-llama/Llama-3.2-1B-Instruct
      --trust-remote-code

Load balancing with Nginx

For multiple vLLM instances behind a load balancer, see the production guide.

Troubleshooting

CUDA compatibility issues

Enable CUDA forward compatibility for older drivers:
docker run --runtime nvidia --gpus all \
  -e VLLM_ENABLE_CUDA_COMPATIBILITY=1 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.2-1B-Instruct

Out of memory errors

  1. Increase shared memory with --shm-size, or share the host's with --ipc=host
  2. Reduce the maximum context length with --max-model-len
  3. Enable quantization (for example INT8 or FP8)
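
The memory-related server flags can be collected in one place so they are easy to adjust between attempts. The sketch below is an assumption-laden illustration: memory_flags is a hypothetical helper, the values 4096 and 0.85 are arbitrary examples rather than recommendations, and --gpu-memory-utilization is an additional vLLM server flag that is not mentioned in the list above.

```shell
# Hypothetical helper: print memory-related server flags.
# $1: maximum model length in tokens
# $2: fraction of GPU memory vLLM may use (between 0 and 1)
memory_flags() {
  printf -- '--max-model-len %s --gpu-memory-utilization %s\n' "$1" "$2"
}

# Example (not executed here). The substitution is left unquoted on purpose
# so the shell splits it into separate arguments:
#   docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000 \
#     vllm/vllm-openai:latest \
#     --model meta-llama/Llama-3.2-1B-Instruct \
#     $(memory_flags 4096 0.85)
```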

Permission errors

Run the container with your host user and group IDs so that files written to the mounted cache are owned by you:
docker run --runtime nvidia --gpus all \
  --user $(id -u):$(id -g) \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.2-1B-Instruct
