Documentation Index
Fetch the complete documentation index at: https://mintlify.com/vllm-project/vllm/llms.txt
Use this file to discover all available pages before exploring further.
Overview
This guide covers best practices, architectural patterns, and operational considerations for running vLLM in production at scale.
Architecture patterns
Single-instance deployment
Simplest deployment for low-to-medium traffic:
┌─────────────┐
│ Client │
└──────┬──────┘
│
v
┌─────────────┐
│ vLLM Pod │
│ (1x GPU) │
└─────────────┘
Use when:
- QPS < 10
- Single model serving
- Development/testing environments
Load-balanced deployment
Multiple replicas behind a load balancer:
┌─────────────┐
│ Client │
└──────┬──────┘
│
v
┌─────────────────┐
│ Load Balancer │
│ (Nginx/K8s) │
└────────┬────────┘
│
┌────┴────┬────────┬────────┐
v v v v
┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐
│ vLLM │ │ vLLM │ │ vLLM │ │ vLLM │
│ Pod 1 │ │ Pod 2 │ │ Pod 3 │ │ Pod N │
└───────┘ └───────┘ └───────┘ └───────┘
Use when:
- QPS > 10
- High availability required
- Horizontal scaling needed
Multi-model deployment
Serve multiple models with routing:
┌─────────────┐
│ Client │
└──────┬──────┘
│
v
┌─────────────────┐
│ Model Router │
└────────┬────────┘
│
┌────┴─────┬──────────┐
v v v
┌────────┐ ┌────────┐ ┌────────┐
│ Model │ │ Model │ │ Model │
│ 7B │ │ 13B │ │ 70B │
└────────┘ └────────┘ └────────┘
Use when:
- Multiple models needed
- Different performance tiers
- Cost optimization
Load balancing
Nginx configuration
Create Nginx configuration
upstream vllm_backend {
least_conn; # Use least connections algorithm
server vllm0:8000 max_fails=3 fail_timeout=30s;
server vllm1:8000 max_fails=3 fail_timeout=30s;
server vllm2:8000 max_fails=3 fail_timeout=30s;
server vllm3:8000 max_fails=3 fail_timeout=30s;
keepalive 32; # Connection pooling
}
server {
listen 80;
location / {
proxy_pass http://vllm_backend;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# Timeouts for long-running requests
proxy_connect_timeout 300s;
proxy_send_timeout 300s;
proxy_read_timeout 300s;
}
location /health {
proxy_pass http://vllm_backend/health;
proxy_http_version 1.1;
}
}
Deploy with Docker Compose
version: '3.8'
services:
nginx:
image: nginx:latest
ports:
- "8000:80"
volumes:
- ./nginx.conf:/etc/nginx/nginx.conf:ro
depends_on:
- vllm0
- vllm1
- vllm2
- vllm3
networks:
- vllm-network
vllm0:
image: vllm/vllm-openai:latest
runtime: nvidia
environment:
- NVIDIA_VISIBLE_DEVICES=0
volumes:
- ~/.cache/huggingface:/root/.cache/huggingface
shm_size: 10gb
ipc: host
command: --model meta-llama/Meta-Llama-3-8B-Instruct
networks:
- vllm-network
vllm1:
image: vllm/vllm-openai:latest
runtime: nvidia
environment:
- NVIDIA_VISIBLE_DEVICES=1
volumes:
- ~/.cache/huggingface:/root/.cache/huggingface
shm_size: 10gb
ipc: host
command: --model meta-llama/Meta-Llama-3-8B-Instruct
networks:
- vllm-network
vllm2:
image: vllm/vllm-openai:latest
runtime: nvidia
environment:
- NVIDIA_VISIBLE_DEVICES=2
volumes:
- ~/.cache/huggingface:/root/.cache/huggingface
shm_size: 10gb
ipc: host
command: --model meta-llama/Meta-Llama-3-8B-Instruct
networks:
- vllm-network
vllm3:
image: vllm/vllm-openai:latest
runtime: nvidia
environment:
- NVIDIA_VISIBLE_DEVICES=3
volumes:
- ~/.cache/huggingface:/root/.cache/huggingface
shm_size: 10gb
ipc: host
command: --model meta-llama/Meta-Llama-3-8B-Instruct
networks:
- vllm-network
networks:
vllm-network:
driver: bridge
Kubernetes Service with session affinity
Enable prefix caching by routing requests to the same pod:
apiVersion: v1
kind: Service
metadata:
name: vllm-service
spec:
selector:
app: vllm
ports:
- port: 80
targetPort: 8000
sessionAffinity: ClientIP
sessionAffinityConfig:
clientIP:
timeoutSeconds: 3600 # 1 hour
Session affinity improves cache hit rates for prefix caching, reducing latency and cost.
Model configuration
Optimal vLLM settings for production:
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--max-model-len 8192 \
--gpu-memory-utilization 0.90 \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-batched-tokens 8192 \
--max-num-seqs 256 \
--disable-log-requests \
--trust-remote-code
Key parameters:
| Parameter | Recommended | Purpose |
|---|
gpu-memory-utilization | 0.85-0.90 | Leave headroom for fragmentation |
max-model-len | Model-specific | Reduce for higher throughput |
max-num-seqs | 128-256 | Balance latency vs throughput |
enable-prefix-caching | true | Cache common prompts |
enable-chunked-prefill | true | Reduce TTFT for long prompts |
disable-log-requests | true | Reduce logging overhead |
Quantization
Reduce memory usage and increase throughput:
vllm serve TheBloke/Llama-2-70B-AWQ \
--quantization awq \
--tensor-parallel-size 4
Quantization comparison:
| Method | Memory Savings | Quality | Speed |
|---|
| FP16 (baseline) | 0% | 100% | 1.0x |
| FP8 | 50% | 98-99% | 1.5-2.0x |
| AWQ/GPTQ | 75% | 95-98% | 1.2-1.5x |
Multi-GPU tensor parallelism
For large models, split across multiple GPUs:
# 70B model on 4x A100 GPUs
vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 8192
Tensor parallelism requires high-bandwidth interconnects (NVLink, InfiniBand). Use on single-node multi-GPU systems.
Monitoring and observability
Prometheus metrics
vLLM exposes Prometheus metrics at /metrics:
apiVersion: v1
kind: Service
metadata:
name: vllm-metrics
labels:
app: vllm
spec:
ports:
- name: metrics
port: 8000
targetPort: 8000
selector:
app: vllm
---
apiVersion: v1
kind: ServiceMonitor
metadata:
name: vllm-monitor
spec:
selector:
matchLabels:
app: vllm
endpoints:
- port: metrics
path: /metrics
Key metrics to monitor:
vllm:num_requests_running - Active requests
vllm:num_requests_waiting - Queued requests
vllm:gpu_cache_usage_perc - GPU memory utilization
vllm:avg_generation_throughput_toks_per_s - Throughput
vllm:time_to_first_token_seconds - TTFT latency
vllm:time_per_output_token_seconds - Generation latency
Grafana dashboard
Example Grafana queries:
# Request rate
rate(vllm:request_success_total[5m])
# Average TTFT
rate(vllm:time_to_first_token_seconds_sum[5m]) / rate(vllm:time_to_first_token_seconds_count[5m])
# P95 generation latency
histogram_quantile(0.95, rate(vllm:e2e_request_latency_seconds_bucket[5m]))
# GPU utilization
vllm:gpu_cache_usage_perc
OpenTelemetry tracing
Enable distributed tracing:
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
--otlp-traces-endpoint http://jaeger:4318/v1/traces
Health checks and probes
Kubernetes probes
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 300
periodSeconds: 30
timeoutSeconds: 10
failureThreshold: 3
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
startupProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 60 # 10 minutes for large models
Set failureThreshold high enough for large models to load. A 70B model can take 5-10 minutes to initialize.
Autoscaling
Horizontal Pod Autoscaler (HPA)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: vllm-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: vllm
minReplicas: 2
maxReplicas: 10
metrics:
- type: Pods
pods:
metric:
name: vllm_num_requests_running
target:
type: AverageValue
averageValue: "50" # Scale when >50 concurrent requests per pod
behavior:
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5 min before scaling down
policies:
- type: Percent
value: 50
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 0 # Scale up immediately
policies:
- type: Percent
value: 100
periodSeconds: 30
SkyPilot autoscaling
service:
replica_policy:
min_replicas: 2
max_replicas: 10
target_qps_per_replica: 5 # Scale when QPS > 5 per replica
upscale_delay_seconds: 60
downscale_delay_seconds: 300
Security best practices
API authentication
Use API keys for authentication:
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
--api-key your-secret-key
Client usage:
import openai
client = openai.OpenAI(
base_url="http://vllm:8000/v1",
api_key="your-secret-key"
)
Network policies
Restrict pod-to-pod communication:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: vllm-network-policy
spec:
podSelector:
matchLabels:
app: vllm
policyTypes:
- Ingress
- Egress
ingress:
- from:
- podSelector:
matchLabels:
role: api-gateway
ports:
- protocol: TCP
port: 8000
egress:
- to:
- podSelector:
matchLabels:
role: model-storage
Secrets management
Use Kubernetes secrets or cloud secret managers:
apiVersion: v1
kind: Secret
metadata:
name: hf-token
type: Opaque
stringData:
token: hf_xxxxxxxxxxxxx
---
env:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-token
key: token
Disaster recovery
Model checkpointing
Store models in persistent storage:
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: vllm-models
volumeMounts:
- name: model-cache
mountPath: /root/.cache/huggingface
Multi-region deployment
Deploy across multiple regions for high availability:
┌──────────────┐ ┌──────────────┐
│ Region 1 │ │ Region 2 │
│ (Primary) │ │ (Failover) │
├──────────────┤ ├──────────────┤
│ vLLM Cluster │ │ vLLM Cluster │
│ (3 pods) │ │ (3 pods) │
└──────────────┘ └──────────────┘
│ │
└────────┬───────────┘
v
┌──────────────┐
│ Global Load │
│ Balancer │
└──────────────┘
Cost optimization
Right-size GPU allocation
Match GPU to model size:
- 7B models: T4 (16GB) or L4 (24GB)
- 13B models: L4 (24GB) or A10G (24GB)
- 70B models: A100 40GB x2 or A100 80GB x1
Use quantization
Reduce GPU requirements with AWQ/GPTQ/FP8 quantization.
Enable autoscaling
Scale to zero during off-peak hours.
Batch requests
Use continuous batching to maximize throughput.
Enable prefix caching
Cache common system prompts to reduce compute.
Troubleshooting
High latency
Symptoms: Slow response times
Solutions:
- Check GPU utilization with
nvidia-smi
- Reduce
max-model-len to free memory
- Enable chunked prefill
- Add more replicas
- Enable quantization
OOM errors
Symptoms: CUDA out of memory
Solutions:
- Reduce
gpu-memory-utilization to 0.85
- Reduce
max-num-seqs
- Reduce
max-model-len
- Enable quantization
- Use tensor parallelism
Request timeouts
Symptoms: 504 Gateway Timeout
Solutions:
- Increase proxy timeouts in Nginx/K8s
- Increase
readinessProbe timeout
- Check for deadlocked requests with metrics
- Review
max-num-batched-tokens
Checklist
Before going to production:
Next steps