Deploy CVAT on Kubernetes for production environments requiring high availability, horizontal scaling, and enterprise-grade reliability.
Prerequisites
Kubernetes Cluster
- Kubernetes 1.23.0 or higher
- kubectl configured and connected to your cluster
- Cluster with at least:
  - 3 nodes (for high availability)
  - 8 CPU cores total
  - 16GB RAM total
  - 200GB storage
Helm 3
# Install Helm 3
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
# Verify installation
helm version
kubectl version --client
Storage Provider
The cluster must have a default StorageClass. If it does not, set a storage class explicitly in your values file:
# Check available storage classes
kubectl get storageclass
You need storage classes that support these access modes:
- ReadWriteMany (RWX): For shared backend storage
- ReadWriteOnce (RWO): For databases (PostgreSQL, ClickHouse, Kvrocks)
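Before you install, you can check that a default StorageClass exists. The sketch below reads the standard storageclass.kubernetes.io/is-default-class annotation through kubectl's JSONPath output. The function names are hypothetical helpers and are not part of CVAT or kubectl.

```shell
# Print the StorageClass name(s) that carry the standard default-class annotation.
# Dots inside the annotation key must be escaped in kubectl JSONPath.
default_storage_class() {
  kubectl get storageclass \
    -o jsonpath='{.items[?(@.metadata.annotations.storageclass\.kubernetes\.io/is-default-class=="true")].metadata.name}'
}

# Print the default class and succeed, or fail with a hint if there is none.
require_default_storage_class() {
  local sc
  sc=$(default_storage_class) || return 1
  if [ -z "$sc" ]; then
    echo "No default StorageClass found; set storageClassName explicitly in your values file." >&2
    return 1
  fi
  echo "Default StorageClass: $sc"
}
```

A non-zero exit from require_default_storage_class means that any PVC without an explicit storage class would stay Pending.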
Ingress Controller (Optional)
For external access, install an ingress controller:
# Example: Nginx Ingress
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install nginx-ingress ingress-nginx/ingress-nginx -n ingress-nginx --create-namespace
Or enable the embedded Traefik ingress.
Installation
1. Add Helm Repository
Add the CVAT Helm chart repository:
helm repo add cvat https://cvat-ai.github.io/cvat/
helm repo update
2. Create Namespace
kubectl create namespace cvat
3. Basic Installation
Install CVAT with default configuration:
helm install cvat cvat/cvat -n cvat
This creates:
- CVAT backend deployment (server + workers)
- CVAT frontend deployment
- PostgreSQL StatefulSet
- Redis StatefulSet
- Kvrocks StatefulSet
- ClickHouse StatefulSet
- Open Policy Agent deployment
- Vector for log collection
- Grafana for analytics
- Required services and PVCs
4. Wait for Pods to Start
# Watch pod status
kubectl get pods -n cvat -w
# Check all resources
kubectl get all -n cvat
Initialization typically takes 2-5 minutes.
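In scripts, you can block until the workloads are up instead of watching them. The sketch below waits on the Deployment Available condition rather than pod readiness. The backend initializer Job's pod finishes as Completed and never becomes Ready, so a pod-level wait could hang. StatefulSets are not covered by this check. The function name is a hypothetical helper.

```shell
# Wait until every Deployment in the namespace reports the Available condition,
# or fail once the timeout expires.
wait_for_cvat() {
  local ns timeout
  ns=${1:-cvat}
  timeout=${2:-600s}
  kubectl wait deployment --all \
    --for=condition=Available \
    --namespace "$ns" \
    --timeout="$timeout"
}
```

For example, wait_for_cvat cvat 900s waits up to 15 minutes.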
5. Create Superuser
After all pods are running:
# Find the server pod
kubectl get pods -n cvat | grep cvat-backend-server
# Create superuser
kubectl exec -it -n cvat <cvat-backend-server-pod> -- \
python manage.py createsuperuser
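The two manual steps above can be combined: look up the pod by label, then exec into it. This sketch assumes the app=cvat-backend-server label used elsewhere in this guide. Labels can differ between chart versions, so confirm yours with kubectl get pods -n cvat --show-labels. Both function names are hypothetical.

```shell
# Print the name of the first Running pod that matches the label selector.
server_pod() {
  local ns selector
  ns=${1:-cvat}
  selector=${2:-app=cvat-backend-server}  # assumption: adjust to your chart's labels
  kubectl get pods --namespace "$ns" --selector "$selector" \
    --field-selector=status.phase=Running \
    -o jsonpath='{.items[0].metadata.name}'
}

# Open an interactive createsuperuser session in that pod.
create_superuser() {
  local ns pod
  ns=${1:-cvat}
  pod=$(server_pod "$ns" "$2") || return 1
  if [ -z "$pod" ]; then
    echo "No running server pod found in namespace $ns" >&2
    return 1
  fi
  kubectl exec -it --namespace "$ns" "$pod" -- python manage.py createsuperuser
}
```

Run create_superuser cvat, or pass a different selector as the second argument.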
6. Access CVAT
Port Forward (Testing):
kubectl port-forward -n cvat service/cvat-frontend 8000:8000
Then access: http://localhost:8000
Or configure Ingress for production (see below).
Configuration
Custom Values File
Create cvat-values.yaml to customize your deployment:
# cvat-values.yaml
cvat:
  backend:
    server:
      replicas: 2
      resources:
        requests:
          cpu: 500m
          memory: 1Gi
        limits:
          cpu: 2000m
          memory: 4Gi
      envs:
        ALLOWED_HOSTS: '*'
    worker:
      export:
        replicas: 3
      import:
        replicas: 3
      chunks:
        replicas: 3
    image: cvat/server
    tag: v2.10.0  # Pin a release that matches the chart's appVersion (helm show chart cvat/cvat)
    defaultStorage:
      enabled: true
      size: 50Gi
      storageClassName: fast-ssd
  frontend:
    replicas: 2
    image: cvat/ui
    tag: v2.10.0
    resources:
      requests:
        cpu: 100m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 512Mi
  kvrocks:
    enabled: true
    defaultStorage:
      enabled: true
      size: 200Gi
      storageClassName: fast-ssd
postgresql:
  enabled: true
  auth:
    username: cvat
    database: cvat
    password: changeme123  # Replace with a strong password
  primary:
    persistence:
      size: 20Gi
      storageClass: fast-ssd
redis:
  enabled: true
  auth:
    password: redis_secure_password
  master:
    persistence:
      size: 5Gi
clickhouse:
  enabled: true
  auth:
    username: user
    password: clickhouse_password
  shards: 1
  replicaCount: 1
  persistence:
    size: 50Gi
analytics:
  enabled: true
  clickhousePassword: clickhouse_password  # Same value as clickhouse.auth.password
ingress:
  enabled: true
  hostname: cvat.example.com
  className: nginx
  tls: true
  tlsSecretName: cvat-tls
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
Install with custom values:
helm install cvat cvat/cvat -n cvat -f cvat-values.yaml
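Helm does not reject misspelled value keys unless the chart ships a values schema. Before you install, it helps to compare your file against the chart defaults and render the manifests locally. This sketch uses helm show values and helm template, which do not need a cluster. The output file names and the function name are arbitrary choices.

```shell
# Dump the chart defaults and render the manifests locally so that template or
# YAML errors surface before install.
preview_cvat_values() {
  local values
  values=${1:-cvat-values.yaml}
  helm show values cvat/cvat > cvat-default-values.yaml || return 1
  helm template cvat cvat/cvat --namespace cvat -f "$values" > cvat-rendered.yaml || return 1
  echo "Defaults: cvat-default-values.yaml; rendered manifests: cvat-rendered.yaml"
}
```

Diff your values file against cvat-default-values.yaml to spot keys that do not exist in the chart.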
Ingress Configuration
Using Nginx Ingress
ingress:
  enabled: true
  hostname: cvat.example.com
  className: nginx
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
    nginx.ingress.kubernetes.io/proxy-max-temp-file-size: "0"
    nginx.ingress.kubernetes.io/client-body-buffer-size: "128k"
  tls: true
  tlsSecretName: cvat-tls-secret
Using Embedded Traefik
traefik:
  enabled: true
  service:
    type: LoadBalancer
  ports:
    web:
      exposedPort: 80
    websecure:
      exposedPort: 443
External Database
Use an external PostgreSQL database:
postgresql:
  enabled: false
  external:
    host: postgres.example.com
    port: 5432
  auth:
    username: cvat
    database: cvat
    password: secure_password
    existingSecret: cvat-postgres-secret
Create secret:
kubectl create secret generic cvat-postgres-secret -n cvat \
--from-literal=password='your-password'
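Running kubectl create secret a second time fails because the secret already exists. A common idempotent pattern is to render the secret client-side and pipe it to kubectl apply. This sketch keeps the secret name and the password key from the command above. The function name is hypothetical, and the password must match the one configured on the external database server.

```shell
# Create or update the PostgreSQL secret without failing if it already exists.
apply_postgres_secret() {
  local ns password
  ns=$1
  password=$2
  if [ -z "$ns" ] || [ -z "$password" ]; then
    echo "usage: apply_postgres_secret NAMESPACE PASSWORD" >&2
    return 1
  fi
  kubectl create secret generic cvat-postgres-secret \
    --namespace "$ns" \
    --from-literal=password="$password" \
    --dry-run=client -o yaml | kubectl apply -f -
}
```

For example: apply_postgres_secret cvat "$(openssl rand -base64 24)", then set the same password on the database.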
External Redis
redis:
  enabled: false
  external:
    host: redis.example.com
  auth:
    password: redis_password
    existingSecret: cvat-redis-secret
Scaling Workers
Adjust worker replicas based on load:
cvat:
  backend:
    worker:
      export:
        replicas: 5
        resources:
          requests:
            cpu: 1000m
            memory: 2Gi
      import:
        replicas: 5
      chunks:
        replicas: 4
      annotation:
        replicas: 2
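To change a single pool without editing the whole file, you can scale through Helm so that the change survives later upgrades. A kubectl scale change would be reset by the next helm upgrade. This sketch uses --reuse-values and --set with the key path shown above. Without --version, helm upgrade uses the newest chart in your local repo cache. The function name is hypothetical.

```shell
# Change one worker pool's replica count while keeping the release's other values.
scale_worker() {
  local pool replicas
  pool=$1        # e.g. export, import, chunks, annotation
  replicas=$2
  if [ -z "$pool" ]; then
    echo "usage: scale_worker POOL REPLICAS" >&2
    return 1
  fi
  case "$replicas" in
    ''|*[!0-9]*) echo "replicas must be a non-negative integer" >&2; return 1 ;;
  esac
  helm upgrade cvat cvat/cvat --namespace cvat --reuse-values \
    --set "cvat.backend.worker.${pool}.replicas=${replicas}"
}
```

For example, scale_worker chunks 6. Copy the new value into cvat-values.yaml afterwards so the file stays the source of truth.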
High Availability
For production HA setup:
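cvat:
  backend:
    server:
      replicas: 3  # Multiple server replicas
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - cvat-backend-server  # Must match your server pods' actual labels
                topologyKey: kubernetes.io/hostname
  frontend:
    replicas: 3
postgresql:
  enabled: true
  architecture: replication
  replication:
    enabled: true
    numSynchronousReplicas: 1
  readReplicas:
    replicaCount: 2
Note: the postgresql chart's replication architecture adds streaming read replicas, but it does not automatically fail over to a replica if the primary is lost. For automatic failover, consider a managed database (see External Database).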
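Multiple replicas only help during node maintenance if drains cannot evict all of them at once. A PodDisruptionBudget enforces that. This sketch creates one with kubectl and reuses the app=cvat-backend-server label from the affinity example. The PDB name is hypothetical. Because the PDB is created outside Helm, helm uninstall will not remove it, though deleting the namespace will.

```shell
# Keep at least one backend server pod running during voluntary disruptions
# such as node drains.
create_server_pdb() {
  local ns
  ns=${1:-cvat}
  kubectl create poddisruptionbudget cvat-backend-server-pdb \
    --namespace "$ns" \
    --selector=app=cvat-backend-server \
    --min-available=1
}
```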
Chart Structure
The CVAT Helm chart (v2.58.1) includes:
Dependencies
Automatically installed:
- postgresql (v12.1.x): Primary database
- redis (v19.6.4): Caching layer
- clickhouse (v4.1.x): Analytics database
- vector (v0.19.x): Log aggregation
- grafana (v6.60.x): Analytics UI
- traefik (v37.3.x): Optional ingress
- nuclio (v0.21.x): Optional serverless functions
Templates
Key Kubernetes resources created:
- Deployments: cvat-backend-server, cvat-frontend, cvat-opa
- StatefulSets: PostgreSQL, Redis, Kvrocks, ClickHouse
- Deployments (Workers): Export, Import, Annotation, Webhooks, Quality Reports, Chunks, Consensus, Utils
- Services: Frontend, Backend, OPA, Databases
- PersistentVolumeClaims: Backend storage, Kvrocks cache, database storage
- ConfigMaps: Application config, Vector config, Grafana dashboards
- Secrets: Database credentials, Redis passwords, ClickHouse auth
- Jobs: Backend initializer (runs migrations)
- Ingress: Optional external access
Operations
Upgrade CVAT
# Update repo
helm repo update
# Check current version
helm list -n cvat
# Upgrade to latest
helm upgrade cvat cvat/cvat -n cvat -f cvat-values.yaml
# Or upgrade to specific version
helm upgrade cvat cvat/cvat -n cvat -f cvat-values.yaml --version 2.58.1
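Before an upgrade, it is worth saving the values of the release you are replacing, alongside a database backup (see Backup and Restore). This sketch stores the output of helm get values in a timestamped file. It then upgrades with --wait and --timeout so the command returns only once resources are ready or the timeout expires. The function name and the 15-minute timeout are assumptions.

```shell
# Save the current release values, then upgrade (optionally to a pinned chart version).
upgrade_cvat() {
  local version stamp
  version=$1
  stamp=$(date +%Y%m%d-%H%M%S)
  helm get values cvat --namespace cvat > "cvat-values-$stamp.yaml" || return 1
  if [ -n "$version" ]; then
    helm upgrade cvat cvat/cvat --namespace cvat -f cvat-values.yaml \
      --version "$version" --wait --timeout 15m
  else
    helm upgrade cvat cvat/cvat --namespace cvat -f cvat-values.yaml \
      --wait --timeout 15m
  fi
}
```

For example, upgrade_cvat 2.58.1.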
Rollback
# View history
helm history cvat -n cvat
# Rollback to previous
helm rollback cvat -n cvat
# Rollback to specific revision
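helm rollback cvat 2 -n cvat
Note: helm rollback does not revert database migrations that a newer release has already applied. If the schema changed, restore a database backup taken before the upgrade.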
Uninstall
# Remove release
helm uninstall cvat -n cvat
# Delete namespace
kubectl delete namespace cvat
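Note: helm uninstall does not delete PVCs created from StatefulSet volume claim templates. Deleting the namespace removes the remaining PVCs, but PersistentVolumes with the Retain reclaim policy are left behind and must be deleted manually.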
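To find volumes left behind after the namespace is gone, you can list PersistentVolumes by the namespace of the claim they were bound to. This sketch uses kubectl custom columns and filters the rows with awk. The function name is hypothetical. Deleting a Retain volume object does not free the underlying disk, which must be cleaned up in your storage provider.

```shell
# List PVs whose claim lived in the given namespace, with phase and reclaim policy.
leftover_pvs() {
  local ns
  ns=${1:-cvat}
  kubectl get pv --no-headers \
    -o custom-columns=NAME:.metadata.name,NAMESPACE:.spec.claimRef.namespace,STATUS:.status.phase,POLICY:.spec.persistentVolumeReclaimPolicy |
    awk -v ns="$ns" '$2 == ns'
}
```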
Backup and Restore
Back up PostgreSQL:
kubectl exec -n cvat <postgresql-pod> -- \
pg_dump -U cvat -d cvat > cvat_backup.sql
Back up PVCs using your storage provider’s snapshot feature, or copy the data directory out of a backend pod:
# Example using kubectl cp
kubectl exec -n cvat <backend-pod> -- tar czf /tmp/data.tar.gz /home/django/data
kubectl cp cvat/<backend-pod>:/tmp/data.tar.gz ./data_backup.tar.gz
Restore into an empty database (restoring over existing tables fails on duplicate objects):
kubectl exec -i -n cvat <postgresql-pod> -- psql -U cvat -d cvat < cvat_backup.sql
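For scheduled backups, the steps above can be wrapped in a function that writes timestamped files. Instead of staging an archive in /tmp inside the container, this sketch streams tar to stdout (tar czf -). The pod names are passed in, and the cvat user and database names come from the values above. If your PostgreSQL pod requires a password for local connections, supply it (for example with PGPASSWORD). The function name is hypothetical.

```shell
# Write cvat-db-<timestamp>.sql and cvat-data-<timestamp>.tar.gz to the current directory.
backup_cvat() {
  local pg_pod backend_pod stamp
  pg_pod=$1
  backend_pod=$2
  stamp=$(date +%Y%m%d-%H%M%S)
  kubectl exec -n cvat "$pg_pod" -- pg_dump -U cvat -d cvat \
    > "cvat-db-$stamp.sql" || return 1
  # -C changes into /home/django so the archive contains data/ rather than an absolute path.
  kubectl exec -n cvat "$backend_pod" -- tar czf - -C /home/django data \
    > "cvat-data-$stamp.tar.gz" || return 1
  echo "cvat-db-$stamp.sql cvat-data-$stamp.tar.gz"
}
```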
View Logs
# Backend server logs
kubectl logs -n cvat -l app=cvat-backend-server -f
# Worker logs
kubectl logs -n cvat -l app=cvat-backend-worker-export -f
# All backend logs
kubectl logs -n cvat -l component=backend -f
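When a selector matches many pods, kubectl logs opens at most 5 concurrent streams by default. This sketch raises that limit and prefixes each line with its pod and container. The limit of 20 and the function name are arbitrary choices, and the selector is whatever label your pods carry.

```shell
# Follow logs from every pod and container that matches a label selector.
follow_logs() {
  local selector ns
  selector=$1
  ns=${2:-cvat}
  if [ -z "$selector" ]; then
    echo "usage: follow_logs SELECTOR [NAMESPACE]" >&2
    return 1
  fi
  kubectl logs --namespace "$ns" --selector "$selector" \
    --all-containers --prefix --follow --max-log-requests=20
}
```

For example, follow_logs app=cvat-backend-worker-export.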
Exec into Pods
# Server pod
kubectl exec -it -n cvat <server-pod> -- bash
# Run Django commands
kubectl exec -n cvat <server-pod> -- python manage.py migrate
kubectl exec -n cvat <server-pod> -- python manage.py collectstatic --noinput
Monitoring
Pod Status:
kubectl get pods -n cvat
Service Status:
kubectl get svc -n cvat
Events:
kubectl get events -n cvat --sort-by='.lastTimestamp'
Resource Usage (requires metrics-server):
kubectl top nodes
kubectl top pods -n cvat
Troubleshooting
Pods Not Starting
Check pod status:
kubectl describe pod -n cvat <pod-name>
Common issues:
- ImagePullBackOff: Check image name and registry access
- CrashLoopBackOff: Check logs for application errors
- Pending: Check storage class and resource availability
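To narrow down which pods to describe, you can filter out the healthy phases with a field selector. A pod in CrashLoopBackOff still reports phase Running, so this check does not catch it; look at the RESTARTS column of kubectl get pods for those. The function name is hypothetical.

```shell
# List pods that are neither Running nor Succeeded (for example Pending or Failed).
unhealthy_pods() {
  local ns
  ns=${1:-cvat}
  kubectl get pods --namespace "$ns" \
    --field-selector='status.phase!=Running,status.phase!=Succeeded'
}
```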
Database Connection Issues
# Check PostgreSQL pod
kubectl logs -n cvat -l app.kubernetes.io/name=postgresql
# Test connection from server
kubectl exec -it -n cvat <server-pod> -- \
python manage.py dbshell
Storage Issues
# Check PVCs
kubectl get pvc -n cvat
# Check PV status
kubectl get pv
# Describe problematic PVC
kubectl describe pvc -n cvat <pvc-name>
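This sketch prints only the PVCs that are not Bound, which are usually the ones worth describing. A Pending PVC typically means no matching StorageClass or an unsupported access mode such as ReadWriteMany. With a WaitForFirstConsumer binding mode, a PVC also stays Pending until a pod uses it. The function name is hypothetical.

```shell
# Print "NAME PHASE" for every PVC in the namespace that is not Bound.
unbound_pvcs() {
  local ns
  ns=${1:-cvat}
  kubectl get pvc --namespace "$ns" --no-headers \
    -o custom-columns=NAME:.metadata.name,STATUS:.status.phase |
    awk '$2 != "Bound" { print $1 " " $2 }'
}
```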
Worker Not Processing Jobs
# Check worker logs
kubectl logs -n cvat -l app=cvat-backend-worker-export
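# Check Redis connection (expects PONG; assumes the image exposes the password as $REDIS_PASSWORD)
kubectl exec -n cvat <redis-pod> -- sh -c 'redis-cli -a "$REDIS_PASSWORD" ping'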
# Restart workers
kubectl rollout restart deployment -n cvat -l component=backend
Ingress Not Working
# Check ingress status
kubectl get ingress -n cvat
kubectl describe ingress -n cvat cvat
# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx
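To test routing without DNS, send a request to the controller's address with the expected Host header. The address argument is an assumption: use the value from the ADDRESS column of kubectl get ingress -n cvat. A 200 or a 3xx redirect (for example to HTTPS when TLS is enabled) indicates that the rule matched. A 404 from the controller usually means the host or ingress class does not match. The function name is hypothetical.

```shell
# Print the HTTP status code returned through the ingress for the given host.
ingress_status() {
  local address host
  address=$1
  host=${2:-cvat.example.com}
  curl -sS --max-time 10 -o /dev/null -w '%{http_code}' \
    -H "Host: $host" "http://$address/"
}
```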
Advanced Configuration
Custom Storage Classes
cvat:
  backend:
    defaultStorage:
      enabled: true
      storageClassName: premium-rwx  # Must be a class that supports ReadWriteMany
      accessModes:
        - ReadWriteMany
      size: 100Gi
  kvrocks:
    defaultStorage:
      storageClassName: fast-ssd
      size: 200Gi
      volumeAttributesClass:
        create: true
        name: high-throughput
        provider: ebs.csi.aws.com
        parameters:
          type: gp3
          throughput: "250"
Node Affinity and Tolerations
cvat:
  backend:
    server:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-type
                    operator: In
                    values:
                      - compute
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "cvat"
          effect: "NoSchedule"
Additional Environment Variables
cvat:
  backend:
    additionalEnv:
      - name: DJANGO_LOG_LEVEL
        value: INFO
      - name: SMOKESCREEN_OPTS
        value: "--deny-address 169.254.169.254"
    server:
      additionalEnv:
        - name: CVAT_BASE_URL
          value: https://cvat.example.com
Custom Volumes
cvat:
  backend:
    additionalVolumes:
      - name: shared-data
        nfs:
          server: nfs.example.com
          path: /exports/cvat
    additionalVolumeMounts:
      - name: shared-data
        mountPath: /mnt/shared
Production Best Practices
- Use specific image tags: Don’t use the dev or latest tags in production
- Enable resource limits: Prevent resource exhaustion
- Configure HPA: Auto-scale based on CPU/memory
- Use external databases: For better reliability and backups
- Enable monitoring: Use Prometheus/Grafana for metrics
- Regular backups: Automate database and volume backups
- TLS everywhere: Use cert-manager for automatic certificates
- Network policies: Restrict pod-to-pod communication
- Secrets management: Use external secret managers (Vault, AWS Secrets Manager)
- Multi-zone deployment: Spread pods across availability zones
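For the HPA recommendation, a CPU-based autoscaler can be created with kubectl autoscale. This sketch assumes that metrics-server is installed, that the deployment sets CPU requests (as in the values in this guide), and that the deployment is named cvat-backend-server. The 2-10 replica range and the 70% target are illustrative. Helm also sets replicas, so a later helm upgrade may reset the count until the autoscaler adjusts it again.

```shell
# Create a CPU-based HorizontalPodAutoscaler for the backend server deployment.
autoscale_server() {
  local ns
  ns=${1:-cvat}
  kubectl autoscale deployment cvat-backend-server \
    --namespace "$ns" \
    --min=2 --max=10 \
    --cpu-percent=70
}
```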
Example production values:
cvat:
  backend:
    server:
      replicas: 5
      resources:
        requests:
          cpu: 2000m
          memory: 4Gi
        limits:
          cpu: 4000m
          memory: 8Gi
    worker:
      chunks:
        replicas: 10
        resources:
          requests:
            cpu: 1000m
            memory: 2Gi
postgresql:
  primary:
    resources:
      requests:
        cpu: 2000m
        memory: 4Gi
    persistence:
      size: 100Gi
      storageClass: premium-ssd
redis:
  master:
    resources:
      requests:
        cpu: 1000m
        memory: 2Gi
Next Steps