Kubernetes Deployment with Helm

Deploy CVAT on Kubernetes for production environments requiring high availability, horizontal scaling, and enterprise-grade reliability.

Prerequisites

Kubernetes Cluster

Kubernetes 1.23.0 or higher
kubectl configured and connected to your cluster
Cluster with at least:
- 3 nodes (for high availability)
- 8 CPU cores total
- 16GB RAM total
- 200GB storage

Required Tools

# Install Helm 3
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

# Verify installation
helm version
kubectl version --client

Storage Provider

The cluster must have a default StorageClass; if it does not, configure one:

# Check available storage classes
kubectl get storageclass

You need storage that supports both access modes:

ReadWriteMany (RWX): For shared backend storage
ReadWriteOnce (RWO): For databases (PostgreSQL, ClickHouse, Kvrocks)
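
To confirm which class is the default before installing, the storage class listing can be narrowed to a single name. This is a minimal sketch, assuming kubectl marks the default class with a "(default)" suffix in the NAME column; the helper name default_storage_class is illustrative and not part of the chart.

```shell
# Print the name of the default StorageClass, or nothing if none is set.
# default_storage_class is a hypothetical helper name.
default_storage_class() {
  # --no-headers drops the column header row; awk keeps the first column
  # of any row whose NAME column carries the "(default)" marker.
  kubectl get storageclass --no-headers | awk '/\(default\)/ { print $1 }'
}
```

An empty result means no class is marked as default; in that case, set the storage class names explicitly in your values file.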

Ingress Controller (Optional)

For external access, install an ingress controller:

# Example: Nginx Ingress
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install nginx-ingress ingress-nginx/ingress-nginx

Alternatively, enable the Traefik ingress bundled with the chart (see Using Embedded Traefik).

Installation

1. Add Helm Repository

Add the CVAT Helm chart repository:

helm repo add cvat https://cvat-ai.github.io/cvat/
helm repo update

2. Create Namespace

kubectl create namespace cvat

3. Basic Installation

Install CVAT with default configuration:

helm install cvat cvat/cvat -n cvat

This creates:

CVAT backend deployment (server + workers)
CVAT frontend deployment
PostgreSQL StatefulSet
Redis StatefulSet
Kvrocks StatefulSet
ClickHouse StatefulSet
Open Policy Agent deployment
Vector for log collection
Grafana for analytics
Required services and PVCs

4. Wait for Pods to Start

# Watch pod status
kubectl get pods -n cvat -w

# Check all resources
kubectl get all -n cvat

Initialization takes 2-5 minutes.
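
Instead of watching the pod list by hand, a script can block until each Deployment has rolled out. The helper below is a sketch under stated assumptions: wait_for_cvat is a hypothetical name, and the default namespace and timeout are illustrative values rather than chart settings.

```shell
# Wait until every Deployment in the namespace has finished rolling out.
# Usage: wait_for_cvat [namespace] [timeout]
wait_for_cvat() {
  namespace="${1:-cvat}"
  timeout="${2:-300s}"
  # "-o name" prints one "deployment.apps/<name>" entry per line.
  for deploy in $(kubectl get deployments -n "$namespace" -o name); do
    # Stop at the first Deployment that fails to become ready in time.
    kubectl rollout status "$deploy" -n "$namespace" --timeout="$timeout" || return 1
  done
}
```

For example, wait_for_cvat cvat 600s allows ten minutes per Deployment. StatefulSets are not covered by this loop and would need the same treatment if you want to wait for the databases too.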

5. Create Superuser

After all pods are running:

# Find the server pod
kubectl get pods -n cvat | grep cvat-backend-server

# Create superuser
kubectl exec -it -n cvat <cvat-backend-server-pod> -- \
  python manage.py createsuperuser
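
The two steps above can be combined so the pod name does not have to be copied by hand. This is a minimal sketch: find_server_pod and create_superuser are hypothetical helper names, and the lookup simply reuses the same "cvat-backend-server" name match as the grep shown above.

```shell
# Print the first pod whose name contains "cvat-backend-server",
# in the "pod/<name>" form that kubectl exec accepts.
find_server_pod() {
  namespace="${1:-cvat}"
  kubectl get pods -n "$namespace" -o name | grep cvat-backend-server | head -n 1
}

# Run the interactive createsuperuser command in that pod.
create_superuser() {
  namespace="${1:-cvat}"
  pod="$(find_server_pod "$namespace")"
  if [ -z "$pod" ]; then
    echo "no cvat-backend-server pod found in namespace $namespace" >&2
    return 1
  fi
  kubectl exec -it -n "$namespace" "$pod" -- python manage.py createsuperuser
}
```

Calling create_superuser with no arguments targets the cvat namespace.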

6. Access CVAT

Port Forward (Testing):

kubectl port-forward -n cvat service/cvat-frontend 8000:8000

Then open http://localhost:8000 in your browser.

For production, configure Ingress instead (see Ingress Configuration).

Configuration

Custom Values File

Create cvat-values.yaml to customize your deployment:

# cvat-values.yaml
cvat:
  backend:
    server:
      replicas: 2
      resources:
        requests:
          cpu: 500m
          memory: 1Gi
        limits:
          cpu: 2000m
          memory: 4Gi
      envs:
        ALLOWED_HOSTS: '*'  # Restrict to your hostname(s) in production
    worker:
      export:
        replicas: 3
      import:
        replicas: 3
      chunks:
        replicas: 3
    image: cvat/server
    tag: v2.10.0  # Use specific version
    defaultStorage:
      enabled: true
      size: 50Gi
      storageClassName: fast-ssd

  frontend:
    replicas: 2
    image: cvat/ui
    tag: v2.10.0
    resources:
      requests:
        cpu: 100m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 512Mi

  kvrocks:
    enabled: true
    defaultStorage:
      enabled: true
      size: 200Gi
      storageClassName: fast-ssd

postgresql:
  enabled: true
  auth:
    username: cvat
    database: cvat
    password: changeme123  # Use strong password
  primary:
    persistence:
      size: 20Gi
      storageClass: fast-ssd

redis:
  enabled: true
  auth:
    password: redis_secure_password
  master:
    persistence:
      size: 5Gi

clickhouse:
  enabled: true
  auth:
    username: user
    password: clickhouse_password
  shards: 1
  replicaCount: 1
  persistence:
    size: 50Gi

analytics:
  enabled: true
  clickhousePassword: clickhouse_password

ingress:
  enabled: true
  hostname: cvat.example.com
  className: nginx
  tls: true
  tlsSecretName: cvat-tls
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod

Install with custom values:

helm install cvat cvat/cvat -n cvat -f cvat-values.yaml
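
A values file with a YAML or templating mistake is easier to catch before anything reaches the cluster. The sketch below renders the chart locally first and only then installs; deploy_cvat is a hypothetical helper name, and using helm upgrade --install in place of helm install is a design choice made here so the same command can be re-run later, not something the chart requires.

```shell
# Render the chart with the custom values, then install or upgrade it.
# Usage: deploy_cvat [values-file] [namespace]
deploy_cvat() {
  values="${1:-cvat-values.yaml}"
  namespace="${2:-cvat}"
  # "helm template" renders the manifests locally, so errors in the
  # values file surface before any resource is created.
  helm template cvat cvat/cvat -n "$namespace" -f "$values" > /dev/null || return 1
  # "upgrade --install" installs the release if it does not exist yet
  # and upgrades it otherwise.
  helm upgrade --install cvat cvat/cvat -n "$namespace" --create-namespace -f "$values"
}
```

With --create-namespace, Helm creates the namespace if it is missing, so the helper also works on a cluster where the namespace step was skipped.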

Ingress Configuration

Using Nginx Ingress

ingress:
  enabled: true
  hostname: cvat.example.com
  className: nginx
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
    nginx.ingress.kubernetes.io/proxy-max-temp-file-size: "0"
    nginx.ingress.kubernetes.io/client-body-buffer-size: "128k"
  tls: true
  tlsSecretName: cvat-tls-secret

Using Embedded Traefik

traefik:
  enabled: true
  service:
    type: LoadBalancer
  ports:
    web:
      port: 80
    websecure:
      port: 443

External Database

Use an external PostgreSQL database:

postgresql:
  enabled: false
  external:
    host: postgres.example.com
    port: 5432
  auth:
    username: cvat
    database: cvat
    password: secure_password
    existingSecret: cvat-postgres-secret

Create secret:

kubectl create secret generic cvat-postgres-secret -n cvat \
  --from-literal=password='your-password'
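
Typing a literal password on the command line leaves it in shell history. As a hedged alternative, the sketch below generates a random value and passes it through a variable; generate_password and create_db_secret are hypothetical helper names, and the secret name and key simply mirror the command above.

```shell
# Print a random alphanumeric password of the requested length.
generate_password() {
  length="${1:-32}"
  # LC_ALL=C keeps tr byte-oriented when reading random bytes.
  LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c "$length"
}

# Create the secret referenced by postgresql.auth.existingSecret.
create_db_secret() {
  password="$(generate_password 32)"
  kubectl create secret generic cvat-postgres-secret -n cvat \
    --from-literal=password="$password"
}
```

The generated value must also be set for the cvat role on the external PostgreSQL server; the secret only tells CVAT which password to present.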

External Redis

redis:
  enabled: false
  external:
    host: redis.example.com
  auth:
    password: redis_password
    existingSecret: cvat-redis-secret

Scaling Workers

Adjust worker replicas based on load:

cvat:
  backend:
    worker:
      export:
        replicas: 5
        resources:
          requests:
            cpu: 1000m
            memory: 2Gi
      import:
        replicas: 5
      chunks:
        replicas: 4
      annotation:
        replicas: 2

High Availability

For production HA setup:

cvat:
  backend:
    server:
      replicas: 3  # Multiple server replicas
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - cvat-backend-server
              topologyKey: kubernetes.io/hostname

  frontend:
    replicas: 3

postgresql:
  enabled: true
  architecture: replication
  replication:
    enabled: true
    numSynchronousReplicas: 1
  readReplicas:
    replicaCount: 2

Chart Structure

The CVAT Helm chart (v2.58.1) includes:

Dependencies

Automatically installed:

postgresql (v12.1.x): Primary database
redis (v19.6.4): Caching layer
clickhouse (v4.1.x): Analytics database
vector (v0.19.x): Log aggregation
grafana (v6.60.x): Analytics UI
traefik (v37.3.x): Optional ingress
nuclio (v0.21.x): Optional serverless functions

Templates

Key Kubernetes resources created:

Deployments: cvat-backend-server, cvat-frontend, cvat-opa
StatefulSets: PostgreSQL, Redis, Kvrocks, ClickHouse
Deployments (Workers): Export, Import, Annotation, Webhooks, Quality Reports, Chunks, Consensus, Utils
Services: Frontend, Backend, OPA, Databases
PersistentVolumeClaims: Backend storage, Kvrocks cache, database storage
ConfigMaps: Application config, Vector config, Grafana dashboards
Secrets: Database credentials, Redis passwords, ClickHouse auth
Jobs: Backend initializer (runs migrations)
Ingress: Optional external access

Operations

Upgrade CVAT

# Update repo
helm repo update

# Check current version
helm list -n cvat

# Upgrade to latest
helm upgrade cvat cvat/cvat -n cvat -f cvat-values.yaml

# Or upgrade to specific version
helm upgrade cvat cvat/cvat -n cvat -f cvat-values.yaml --version 2.58.1

Rollback

# View history
helm history cvat -n cvat

# Rollback to previous
helm rollback cvat -n cvat

# Rollback to specific revision
helm rollback cvat 2 -n cvat

Uninstall

# Remove release
helm uninstall cvat -n cvat

# Delete namespace
kubectl delete namespace cvat

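Note: helm uninstall does not remove PVCs created from StatefulSet volume claim templates. Deleting the namespace removes the remaining PVCs, but PersistentVolumes with a Retain reclaim policy must be deleted manually.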

Backup and Restore

Back up PostgreSQL:

# Dump the cvat database (pg_dumpall would need a superuser role)
kubectl exec -n cvat <postgresql-pod> -- \
  pg_dump -U cvat cvat > cvat_backup.sql

Back up PVCs using your storage provider’s snapshot feature, or copy the data out with kubectl:

# Example using kubectl cp
kubectl exec -n cvat <backend-pod> -- tar czf /tmp/data.tar.gz /home/django/data
kubectl cp cvat/<backend-pod>:/tmp/data.tar.gz ./data_backup.tar.gz

Restore:

kubectl exec -i -n cvat <postgresql-pod> -- psql -U cvat cvat < cvat_backup.sql
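
For recurring backups it helps to timestamp each dump and to avoid leaving a truncated file behind when the dump fails. The following is a sketch, assuming a single-database dump with pg_dump is sufficient and that the database and user are both named cvat as in the examples above; backup_filename and backup_postgres are hypothetical helper names.

```shell
# Print a file name with a UTC timestamp, e.g. cvat_backup_20240131T235959Z.sql
backup_filename() {
  printf 'cvat_backup_%s.sql\n' "$(date -u +%Y%m%dT%H%M%SZ)"
}

# Dump the cvat database from the given PostgreSQL pod into the
# current directory and print the resulting file name.
backup_postgres() {
  pod="$1"
  out="$(backup_filename)"
  # Write to a temporary name first so a failed dump never leaves
  # a partial file that looks like a valid backup.
  if kubectl exec -n cvat "$pod" -- pg_dump -U cvat cvat > "$out.partial"; then
    mv "$out.partial" "$out" && echo "$out"
  else
    rm -f "$out.partial"
    return 1
  fi
}
```

Running backup_postgres with the PostgreSQL pod name as its argument writes the dump next to where the command is invoked, which suits a cron job or CI step that then uploads the file elsewhere.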

View Logs

# Backend server logs
kubectl logs -n cvat -l app=cvat-backend-server -f

# Worker logs
kubectl logs -n cvat -l app=cvat-backend-worker-export -f

# All backend logs
kubectl logs -n cvat -l component=backend -f

Exec into Pods

# Server pod
kubectl exec -it -n cvat <server-pod> -- bash

# Run Django commands
kubectl exec -n cvat <server-pod> -- python manage.py migrate
kubectl exec -n cvat <server-pod> -- python manage.py collectstatic --noinput

Monitoring

Pod Status:

kubectl get pods -n cvat

Service Status:

kubectl get svc -n cvat

Events:

kubectl get events -n cvat --sort-by='.lastTimestamp'

Resource Usage (requires the Metrics Server in the cluster):

kubectl top nodes
kubectl top pods -n cvat

Troubleshooting

Pods Not Starting

Check pod status:

kubectl describe pod -n cvat <pod-name>

Common issues:

ImagePullBackOff: Check image name and registry access
CrashLoopBackOff: Check logs for application errors
Pending: Check storage class and resource availability
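
When several pods misbehave at once, a namespace-wide view is often quicker than describing pods one at a time. The helper below is a rough sketch that relies only on standard kubectl field selectors; triage_namespace is a hypothetical name.

```shell
# Summarize pods that are not healthy and recent warning events.
# Usage: triage_namespace [namespace]
triage_namespace() {
  namespace="${1:-cvat}"
  echo "== Pods not in Running or Succeeded phase =="
  kubectl get pods -n "$namespace" \
    --field-selector=status.phase!=Running,status.phase!=Succeeded
  echo "== Warning events, oldest first =="
  kubectl get events -n "$namespace" \
    --field-selector=type=Warning --sort-by=.lastTimestamp
}
```

Pods in CrashLoopBackOff usually still report the Running phase, so they tend to show up in the warning events rather than in the first list.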

Database Connection Issues

# Check PostgreSQL pod
kubectl logs -n cvat -l app.kubernetes.io/name=postgresql

# Test connection from server
kubectl exec -n cvat <server-pod> -- \
  python manage.py dbshell

Storage Issues

# Check PVCs
kubectl get pvc -n cvat

# Check PV status
kubectl get pv

# Describe problematic PVC
kubectl describe pvc -n cvat <pvc-name>

Worker Not Processing Jobs

# Check worker logs
kubectl logs -n cvat -l app=cvat-backend-worker-export

# Check Redis connection (add -a <password> when Redis auth is enabled)
kubectl exec -n cvat <redis-pod> -- redis-cli ping

# Restart all backend deployments (server and workers)
kubectl rollout restart deployment -n cvat -l component=backend

Ingress Not Working

# Check ingress status
kubectl get ingress -n cvat
kubectl describe ingress -n cvat cvat

# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

Advanced Configuration

Custom Storage Classes

cvat:
  backend:
    defaultStorage:
      enabled: true
      storageClassName: <rwx-storage-class>  # placeholder: use a class that supports ReadWriteMany
      accessModes:
        - ReadWriteMany
      size: 100Gi

  kvrocks:
    defaultStorage:
      storageClassName: fast-ssd
      size: 200Gi
      volumeAttributesClass:
        create: true
        name: high-throughput
        provider: ebs.csi.aws.com
        parameters:
          type: gp3
          provisioned-throughput: "250"

Node Affinity and Tolerations

cvat:
  backend:
    server:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-type
                operator: In
                values:
                - compute
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "cvat"
        effect: "NoSchedule"

Additional Environment Variables

cvat:
  backend:
    additionalEnv:
    - name: DJANGO_LOG_LEVEL
      value: INFO
    - name: SMOKESCREEN_OPTS
      value: "--deny-address 169.254.169.254"
    server:
      additionalEnv:
      - name: CVAT_BASE_URL
        value: https://cvat.example.com

Custom Volumes

cvat:
  backend:
    additionalVolumes:
    - name: shared-data
      nfs:
        server: nfs.example.com
        path: /exports/cvat
    additionalVolumeMounts:
    - name: shared-data
      mountPath: /mnt/shared

Production Best Practices

Use specific image tags: Don’t use dev or latest in production
Enable resource limits: Prevent resource exhaustion
Configure HPA: Auto-scale based on CPU/memory
Use external databases: For better reliability and backups
Enable monitoring: Use Prometheus/Grafana for metrics
Regular backups: Automate database and volume backups
TLS everywhere: Use cert-manager for automatic certificates
Network policies: Restrict pod-to-pod communication
Secrets management: Use external secret managers (Vault, AWS Secrets Manager)
Multi-zone deployment: Spread pods across availability zones

Performance Tuning

cvat:
  backend:
    server:
      replicas: 5
      resources:
        requests:
          cpu: 2000m
          memory: 4Gi
        limits:
          cpu: 4000m
          memory: 8Gi
    worker:
      chunks:
        replicas: 10
        resources:
          requests:
            cpu: 1000m
            memory: 2Gi

postgresql:
  primary:
    resources:
      requests:
        cpu: 2000m
        memory: 4Gi
    persistence:
      size: 100Gi
      storageClass: premium-ssd

redis:
  master:
    resources:
      requests:
        cpu: 1000m
        memory: 2Gi
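
Before applying these values, it is worth adding up what they ask the scheduler for. The arithmetic below covers only the components that have explicit requests in the example above (server, chunks workers, PostgreSQL, Redis); every other pod in the release adds to the total.

```shell
# Requests per component: replicas x per-pod request.
# CPU is in millicores, memory in GiB.
server_cpu=$((5 * 2000));   server_mem=$((5 * 4))
chunks_cpu=$((10 * 1000));  chunks_mem=$((10 * 2))
postgres_cpu=2000;          postgres_mem=4
redis_cpu=1000;             redis_mem=2

total_cpu=$((server_cpu + chunks_cpu + postgres_cpu + redis_cpu))
total_mem=$((server_mem + chunks_mem + postgres_mem + redis_mem))

echo "CPU requests: ${total_cpu}m (about $((total_cpu / 1000)) cores)"
echo "Memory requests: ${total_mem}Gi"
```

This prints 23000m (about 23 cores) and 46Gi, well above the 8 cores and 16GB listed as the cluster minimum in the prerequisites, so a cluster sized to the minimum cannot schedule this profile.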
