Docker deployment

Overview

vLLM provides official Docker images for both GPU and CPU deployments. Docker containers ensure consistent environments and simplify deployment across different platforms.

Pre-built images

vLLM publishes pre-built Docker images to Docker Hub and public ECR registries:

GPU images

# Latest stable release
docker pull vllm/vllm-openai:latest

# Specific version
docker pull vllm/vllm-openai:v0.6.4

CPU images

# x86_64 CPU
docker pull public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest

# ARM64 CPU
docker pull public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:latest

Specialized images

vLLM provides images for different hardware backends:

ROCm (AMD GPUs): vllm/vllm-openai:latest-rocm
TPU: Images available via Google Cloud Artifact Registry
XPU (Intel GPUs): Custom builds available

Running vLLM with Docker

Basic GPU deployment

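Start the OpenAI-compatible server on port 8000. The arguments after the image name are passed to the vLLM server. --gpus all exposes every GPU to the container, but without --tensor-parallel-size vLLM uses a single GPU: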

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.2-1B-Instruct

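vLLM uses PyTorch, which shares data between processes through shared memory, particularly for tensor parallel inference. Give the container access to shared memory with either --ipc=host (as above) or --shm-size.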
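Downloading and loading a model can take several minutes, so the port may refuse connections for a while after the container starts. The following is a minimal sketch of a readiness check, assuming the server is published on localhost:8000 as in the command above and that curl is installed on the host. wait_for_vllm is a hypothetical helper name, not a vLLM command; it polls the server's /health endpoint.

```shell
# Poll the /health endpoint until it answers or the attempts run out.
# wait_for_vllm is a hypothetical helper, not part of vLLM.
# Usage: wait_for_vllm [base_url] [max_attempts] [interval_seconds]
wait_for_vllm() {
  local base_url="${1:-http://localhost:8000}"
  local max_attempts="${2:-60}"
  local interval="${3:-5}"
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # -f makes curl return non-zero on HTTP error statuses.
    if curl -fsS -o /dev/null --max-time 5 "${base_url}/health" >/dev/null 2>&1; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  return 1
}

# Example (commented out so that sourcing this file has no side effects):
# wait_for_vllm http://localhost:8000 60 5 && curl -s http://localhost:8000/v1/models
```

The defaults (60 attempts, 5 seconds apart) are illustrative; large models may need a longer wait.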

Configure shared memory

As an alternative to --ipc=host, give the container its own shared memory segment with --shm-size. Increase the size for larger models:

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --shm-size=10.24gb \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-2-7b-chat-hf \
  --tensor-parallel-size 2

Use environment variables

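Pass a Hugging Face token and other settings with -e. Writing -e HF_TOKEN without a value forwards the variable from the host shell, which keeps the token out of the command line: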

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN=your_token_here \
  -e VLLM_ENABLE_CUDA_COMPATIBILITY=1 \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3-8B-Instruct

CPU deployment

For CPU-only deployments:

docker run -d \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN=your_token_here \
  -e VLLM_CPU_KVCACHE_SPACE=40 \
  -e VLLM_CPU_OMP_THREADS_BIND=0-63 \
  -p 8000:8000 \
  public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest \
  --model meta-llama/Llama-3.2-1B-Instruct

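VLLM_CPU_KVCACHE_SPACE sets the KV cache size in GiB (40 GiB here), and VLLM_CPU_OMP_THREADS_BIND pins the OpenMP worker threads to the listed CPU cores (cores 0 to 63 here). Adjust both to the host.

CPU performance is significantly lower than GPU. Use CPU deployment only for testing or when GPUs are unavailable.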
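The core range passed in VLLM_CPU_OMP_THREADS_BIND has to match the host. As a rough sketch, the helper below derives an inclusive range such as 0-63 from a core count. cpu_bind_range is a hypothetical name, and binding every logical CPU reported by nproc is an assumption; you may prefer a smaller range that leaves cores free for other work.

```shell
# Print an inclusive core range such as "0-63" for a given core count.
# cpu_bind_range is a hypothetical helper, not a vLLM command.
# With no argument it falls back to the logical CPU count from nproc.
cpu_bind_range() {
  local cores="${1:-$(nproc)}"
  if [ "$cores" -lt 1 ]; then
    echo "core count must be at least 1" >&2
    return 1
  fi
  echo "0-$((cores - 1))"
}

# Example (commented out): pass the computed range to the container.
# docker run -d -e VLLM_CPU_OMP_THREADS_BIND="$(cpu_bind_range 64)" ...
```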

Building from source

Clone the repository

git clone https://github.com/vllm-project/vllm.git
cd vllm

Build GPU image

Build the default CUDA image:

docker build -f docker/Dockerfile . --tag vllm-custom:latest

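The build process uses Docker BuildKit for layer caching and parallel builds. BuildKit is the default builder in current Docker releases; on older releases, enable it by setting DOCKER_BUILDKIT=1 in the environment.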

Build with custom CUDA version

docker build -f docker/Dockerfile . \
  --build-arg CUDA_VERSION=12.9.1 \
  --build-arg PYTHON_VERSION=3.12 \
  --tag vllm-custom:cuda12.9

Build CPU image

docker build -f docker/Dockerfile.cpu . \
  --platform=linux/amd64 \
  --tag vllm-cpu:latest

Build ROCm image for AMD GPUs

docker build -f docker/Dockerfile.rocm . \
  --tag vllm-rocm:latest

Build arguments

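Common build arguments for customization. Docker build argument names are case-sensitive, so confirm the exact spelling against the ARG lines in docker/Dockerfile for your checkout: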

Argument	Default	Description
`CUDA_VERSION`	`12.9.1`	CUDA toolkit version
`PYTHON_VERSION`	`3.12`	Python version
`PYTORCH_NIGHTLY`	`0`	Use PyTorch nightly builds
`MAX_JOBS`	`2`	Parallel build jobs
`TORCH_CUDA_ARCH_LIST`	`7.0 7.5 8.0 8.9 9.0 10.0 12.0`	Target GPU architectures
`INSTALL_KV_CONNECTORS`	`false`	Install KV connector dependencies
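Restricting the target architectures to the GPUs you actually run on is a common way to shorten the build. The sketch below wraps docker build with two of the arguments from the table. build_vllm_image is a hypothetical helper, and the argument names are copied verbatim from the table, so check them against docker/Dockerfile before relying on them. It assumes you run it from the repository root.

```shell
# Build a CUDA image for a chosen set of GPU architectures.
# build_vllm_image is a hypothetical helper; argument names come from the
# table above and should be verified against docker/Dockerfile.
# Usage: build_vllm_image <tag> <arch_list> <max_jobs>
build_vllm_image() {
  local tag="$1"
  local arch_list="$2"
  local jobs="$3"
  # Quote each NAME=VALUE pair so an arch list with spaces stays one argument.
  docker build -f docker/Dockerfile . \
    --build-arg "TORCH_CUDA_ARCH_LIST=${arch_list}" \
    --build-arg "MAX_JOBS=${jobs}" \
    --tag "${tag}"
}

# Example (commented out): build only for compute capability 8.0 and 9.0.
# build_vllm_image vllm-custom:sm80-sm90 "8.0 9.0" 4
```

Higher MAX_JOBS values speed up compilation but use more memory during the build.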

Advanced configurations

Multi-GPU deployment

docker run --runtime nvidia --gpus '"device=0,1,2,3"' \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --shm-size=16g \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 4
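In the command above, --tensor-parallel-size equals the number of GPUs exposed through --gpus. A small sketch that derives one from the other so the two values cannot drift apart; count_gpus and GPU_IDS are hypothetical names, and sharding across every exposed GPU is an assumption.

```shell
# Count the GPU IDs in a comma-separated list such as "0,1,2,3".
# count_gpus is a hypothetical helper, not a Docker or vLLM command.
count_gpus() {
  local IFS=','
  # Intentionally unquoted: split the list on commas into positional parameters.
  set -- $1
  echo "$#"
}

# Example (commented out): reuse one variable for both flags.
# GPU_IDS="0,1,2,3"
# docker run --runtime nvidia --gpus "\"device=${GPU_IDS}\"" \
#   -v ~/.cache/huggingface:/root/.cache/huggingface \
#   -p 8000:8000 --ipc=host \
#   vllm/vllm-openai:latest \
#   --model meta-llama/Meta-Llama-3-70B-Instruct \
#   --tensor-parallel-size "$(count_gpus "$GPU_IDS")"
```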

With proxy settings

docker build -f docker/Dockerfile . \
  --build-arg http_proxy=$http_proxy \
  --build-arg https_proxy=$https_proxy \
  --tag vllm-custom:latest

Docker Compose example

version: '3.8'

services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - HF_TOKEN=${HF_TOKEN}
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    ports:
      - "8000:8000"
    shm_size: 10gb
    ipc: host
    command: >
      --model meta-llama/Llama-3.2-1B-Instruct
      --trust-remote-code

Load balancing with Nginx

For multiple vLLM instances behind a load balancer, see the production guide.

Troubleshooting

CUDA compatibility issues

If the host driver is older than the CUDA version in the image, enable CUDA forward compatibility:

docker run --runtime nvidia --gpus all \
  -e VLLM_ENABLE_CUDA_COMPATIBILITY=1 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.2-1B-Instruct

Out of memory errors

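Increase shared memory with --shm-size (or use --ipc=host) if worker processes run out of space in /dev/shm
Lower --max-model-len to reduce the KV cache a single request can require
Enable quantization (for example INT8 or FP8) to reduce the memory taken by model weights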
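A sketch that combines the last two mitigations in one launch command. run_vllm_low_mem is a hypothetical wrapper, the context length of 4096 is an illustrative value, and --quantization fp8 assumes that the GPU and the model support FP8; substitute another method otherwise.

```shell
# Start the server with a shorter context window and FP8 quantization.
# run_vllm_low_mem is a hypothetical wrapper; the values are illustrative.
# Usage: run_vllm_low_mem <model> [max_model_len]
run_vllm_low_mem() {
  local model="$1"
  local max_len="${2:-4096}"
  docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model "${model}" \
    --max-model-len "${max_len}" \
    --quantization fp8
}

# Example (commented out):
# run_vllm_low_mem meta-llama/Llama-3.2-1B-Instruct 2048
```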

Permission errors

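The container runs as root by default, so files it writes to the mounted cache are owned by root. To run as your host user instead, pass your UID and GID with --user, mount the cache at a path that user can access, and point HF_HOME at it. Other caches default to the home directory, so a writable HOME may also be needed:

docker run --runtime nvidia --gpus all \
  --user $(id -u):$(id -g) \
  -e HF_HOME=/hf-cache \
  -v ~/.cache/huggingface:/hf-cache \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.2-1B-Instruct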
