Production deployment guide

Overview

This guide covers best practices, architectural patterns, and operational considerations for running vLLM in production at scale.

Architecture patterns

Single-instance deployment

Simplest deployment for low-to-medium traffic:

┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       v
┌─────────────┐
│  vLLM Pod   │
│  (1x GPU)   │
└─────────────┘

Use when:

QPS < 10
Single model serving
Development/testing environments

Load-balanced deployment

Multiple replicas behind a load balancer:

┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       v
┌─────────────────┐
│ Load Balancer   │
│  (Nginx/K8s)    │
└────────┬────────┘
         │
    ┌────┴────┬────────┬────────┐
    v         v        v        v
┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐
│ vLLM  │ │ vLLM  │ │ vLLM  │ │ vLLM  │
│ Pod 1 │ │ Pod 2 │ │ Pod 3 │ │ Pod N │
└───────┘ └───────┘ └───────┘ └───────┘

Use when:

QPS > 10
High availability required
Horizontal scaling needed

Multi-model deployment

Serve multiple models with routing:

┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       v
┌─────────────────┐
│  Model Router   │
└────────┬────────┘
         │
    ┌────┴─────┬──────────┐
    v          v          v
┌────────┐ ┌────────┐ ┌────────┐
│ Model  │ │ Model  │ │ Model  │
│  7B    │ │  13B   │ │  70B   │
└────────┘ └────────┘ └────────┘

Use when:

Multiple models needed
Different performance tiers
Cost optimization

Load balancing

Nginx configuration

Create Nginx configuration

upstream vllm_backend {
    least_conn;  # Use least connections algorithm
    server vllm0:8000 max_fails=3 fail_timeout=30s;
    server vllm1:8000 max_fails=3 fail_timeout=30s;
    server vllm2:8000 max_fails=3 fail_timeout=30s;
    server vllm3:8000 max_fails=3 fail_timeout=30s;
    
    keepalive 32;  # Connection pooling
}

server {
    listen 80;
    
    location / {
        proxy_pass http://vllm_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        
        # Timeouts for long-running requests
        proxy_connect_timeout 300s;
        proxy_send_timeout 300s;
        proxy_read_timeout 300s;
    }
    
    location /health {
        proxy_pass http://vllm_backend/health;
        proxy_http_version 1.1;
    }
}

Deploy with Docker Compose

version: '3.8'

services:
  nginx:
    image: nginx:latest
    ports:
      - "8000:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - vllm0
      - vllm1
      - vllm2
      - vllm3
    networks:
      - vllm-network

  vllm0:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    shm_size: 10gb
    ipc: host
    command: --model meta-llama/Meta-Llama-3-8B-Instruct
    networks:
      - vllm-network

  vllm1:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=1
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    shm_size: 10gb
    ipc: host
    command: --model meta-llama/Meta-Llama-3-8B-Instruct
    networks:
      - vllm-network

  vllm2:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=2
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    shm_size: 10gb
    ipc: host
    command: --model meta-llama/Meta-Llama-3-8B-Instruct
    networks:
      - vllm-network

  vllm3:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=3
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    shm_size: 10gb
    ipc: host
    command: --model meta-llama/Meta-Llama-3-8B-Instruct
    networks:
      - vllm-network

networks:
  vllm-network:
    driver: bridge

Kubernetes Service with session affinity

Enable prefix caching by routing requests to the same pod:

apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
  - port: 80
    targetPort: 8000
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600  # 1 hour

Session affinity improves cache hit rates for prefix caching, reducing latency and cost.

Performance optimization

Model configuration

Optimal vLLM settings for production:

vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 256 \
  --disable-log-requests \
  --trust-remote-code

Key parameters:

Parameter	Recommended	Purpose
`gpu-memory-utilization`	`0.85-0.90`	Leave headroom for fragmentation
`max-model-len`	Model-specific	Reduce for higher throughput
`max-num-seqs`	`128-256`	Balance latency vs throughput
`enable-prefix-caching`	`true`	Cache common prompts
`enable-chunked-prefill`	`true`	Reduce TTFT for long prompts
`disable-log-requests`	`true`	Reduce logging overhead

Quantization

Reduce memory usage and increase throughput:

vllm serve TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --tensor-parallel-size 4

Quantization comparison:

Method	Memory Savings	Quality	Speed
FP16 (baseline)	0%	100%	1.0x
FP8	50%	98-99%	1.5-2.0x
AWQ/GPTQ	75%	95-98%	1.2-1.5x

Multi-GPU tensor parallelism

For large models, split across multiple GPUs:

# 70B model on 4x A100 GPUs
vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 8192

Tensor parallelism requires high-bandwidth interconnects (NVLink, InfiniBand). Use on single-node multi-GPU systems.

Monitoring and observability

Prometheus metrics

vLLM exposes Prometheus metrics at /metrics:

apiVersion: v1
kind: Service
metadata:
  name: vllm-metrics
  labels:
    app: vllm
spec:
  ports:
  - name: metrics
    port: 8000
    targetPort: 8000
  selector:
    app: vllm
---
apiVersion: v1
kind: ServiceMonitor
metadata:
  name: vllm-monitor
spec:
  selector:
    matchLabels:
      app: vllm
  endpoints:
  - port: metrics
    path: /metrics

Key metrics to monitor:

vllm:num_requests_running - Active requests
vllm:num_requests_waiting - Queued requests
vllm:gpu_cache_usage_perc - GPU memory utilization
vllm:avg_generation_throughput_toks_per_s - Throughput
vllm:time_to_first_token_seconds - TTFT latency
vllm:time_per_output_token_seconds - Generation latency

Grafana dashboard

Example Grafana queries:

# Request rate
rate(vllm:request_success_total[5m])

# Average TTFT
rate(vllm:time_to_first_token_seconds_sum[5m]) / rate(vllm:time_to_first_token_seconds_count[5m])

# P95 generation latency
histogram_quantile(0.95, rate(vllm:e2e_request_latency_seconds_bucket[5m]))

# GPU utilization
vllm:gpu_cache_usage_perc

OpenTelemetry tracing

Enable distributed tracing:

vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --otlp-traces-endpoint http://jaeger:4318/v1/traces

Health checks and probes

Kubernetes probes

livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 300
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

startupProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 60  # 10 minutes for large models

Set failureThreshold high enough for large models to load. A 70B model can take 5-10 minutes to initialize.

Autoscaling

Horizontal Pod Autoscaler (HPA)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_num_requests_running
      target:
        type: AverageValue
        averageValue: "50"  # Scale when >50 concurrent requests per pod
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30

SkyPilot autoscaling

service:
  replica_policy:
    min_replicas: 2
    max_replicas: 10
    target_qps_per_replica: 5  # Scale when QPS > 5 per replica
    upscale_delay_seconds: 60
    downscale_delay_seconds: 300

Security best practices

API authentication

Use API keys for authentication:

vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --api-key your-secret-key

Client usage:

import openai

client = openai.OpenAI(
    base_url="http://vllm:8000/v1",
    api_key="your-secret-key"
)

Network policies

Restrict pod-to-pod communication:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vllm-network-policy
spec:
  podSelector:
    matchLabels:
      app: vllm
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: api-gateway
    ports:
    - protocol: TCP
      port: 8000
  egress:
  - to:
    - podSelector:
        matchLabels:
          role: model-storage

Secrets management

Use Kubernetes secrets or cloud secret managers:

apiVersion: v1
kind: Secret
metadata:
  name: hf-token
type: Opaque
stringData:
  token: hf_xxxxxxxxxxxxx
---
env:
- name: HF_TOKEN
  valueFrom:
    secretKeyRef:
      name: hf-token
      key: token

Disaster recovery

Model checkpointing

Store models in persistent storage:

volumes:
- name: model-cache
  persistentVolumeClaim:
    claimName: vllm-models
volumeMounts:
- name: model-cache
  mountPath: /root/.cache/huggingface

Multi-region deployment

Deploy across multiple regions for high availability:

┌──────────────┐     ┌──────────────┐
│  Region 1    │     │  Region 2    │
│  (Primary)   │     │  (Failover)  │
├──────────────┤     ├──────────────┤
│ vLLM Cluster │     │ vLLM Cluster │
│  (3 pods)    │     │  (3 pods)    │
└──────────────┘     └──────────────┘
        │                    │
        └────────┬───────────┘
                 v
         ┌──────────────┐
         │ Global Load  │
         │  Balancer    │
         └──────────────┘

Cost optimization

Right-size GPU allocation

Match GPU to model size:

7B models: T4 (16GB) or L4 (24GB)
13B models: L4 (24GB) or A10G (24GB)
70B models: A100 40GB x2 or A100 80GB x1

Use quantization

Reduce GPU requirements with AWQ/GPTQ/FP8 quantization.

Enable autoscaling

Scale to zero during off-peak hours.

Batch requests

Use continuous batching to maximize throughput.

Enable prefix caching

Cache common system prompts to reduce compute.

Troubleshooting

High latency

Symptoms: Slow response times Solutions:

Check GPU utilization with nvidia-smi
Reduce max-model-len to free memory
Enable chunked prefill
Add more replicas
Enable quantization

OOM errors

Symptoms: CUDA out of memory Solutions:

Reduce gpu-memory-utilization to 0.85
Reduce max-num-seqs
Reduce max-model-len
Enable quantization
Use tensor parallelism

Request timeouts

Symptoms: 504 Gateway Timeout Solutions:

Increase proxy timeouts in Nginx/K8s
Increase readinessProbe timeout
Check for deadlocked requests with metrics
Review max-num-batched-tokens

Checklist

Before going to production:

Get Started

Core Concepts

Serving

Models

Features

Configuration

Deployment

Documentation Index

​Overview

​Architecture patterns

​Single-instance deployment

​Load-balanced deployment

​Multi-model deployment

​Load balancing

​Nginx configuration

​Kubernetes Service with session affinity

​Performance optimization

​Model configuration

​Quantization

​Multi-GPU tensor parallelism

​Monitoring and observability

​Prometheus metrics

​Grafana dashboard

​OpenTelemetry tracing

​Health checks and probes

​Kubernetes probes

​Autoscaling

​Horizontal Pod Autoscaler (HPA)

​SkyPilot autoscaling

​Security best practices

​API authentication

​Network policies

​Secrets management

​Disaster recovery

​Model checkpointing

​Multi-region deployment

​Cost optimization

​Troubleshooting

​High latency

​OOM errors

​Request timeouts

​Checklist

​Next steps

Build docs developers (and LLMs) love

Overview

Architecture patterns

Single-instance deployment

Load-balanced deployment

Multi-model deployment

Load balancing

Nginx configuration

Kubernetes Service with session affinity

Performance optimization

Model configuration

Quantization

Multi-GPU tensor parallelism

Monitoring and observability

Prometheus metrics

Grafana dashboard

OpenTelemetry tracing

Health checks and probes

Kubernetes probes

Autoscaling

Horizontal Pod Autoscaler (HPA)

SkyPilot autoscaling

Security best practices

API authentication

Network policies

Secrets management

Disaster recovery

Model checkpointing

Multi-region deployment

Cost optimization

Troubleshooting

High latency

OOM errors

Request timeouts

Checklist

Next steps