This guide covers production deployment best practices, monitoring, security, and operational considerations for running NativeLink at scale.

Architecture Overview

A production NativeLink deployment typically consists of:
  • CAS Servers: Content Addressable Storage (1+ replicas)
  • Scheduler: Job scheduling and distribution (1+ replicas)
  • Workers: Build execution nodes (auto-scaled)
  • Storage Backend: S3, GCS, or distributed filesystem
  • Monitoring: Prometheus, Grafana, OpenTelemetry
  • Load Balancer: gRPC-capable load balancer
┌─────────────┐
│   Clients   │
└──────┬──────┘

┌──────▼──────────┐
│ Load Balancer   │
└──────┬──────────┘

   ┌───┴────┬────────────┬───────────┐
   │        │            │           │
┌──▼───┐ ┌─▼──┐  ┌─────▼─────┐  ┌──▼────────┐
│ CAS  │ │ AC │  │ Scheduler │  │ Telemetry │
└──┬───┘ └─┬──┘  └─────┬─────┘  └───────────┘
   │       │           │
   └───────┴─────┬─────┘

         ┌───────┴────────┐
         │                │
    ┌────▼────┐      ┌────▼────┐
    │ Worker  │ ...  │ Worker  │
    └─────────┘      └─────────┘
         │                │
    ┌────▼────────────────▼────┐
    │   Shared CAS Storage     │
    │   (S3/GCS/NFS)            │
    └──────────────────────────┘

Storage Strategy

Cloud Object Storage

For production, use cloud object storage (S3, GCS, Azure Blob) as the primary backend:
{
  stores: [
    {
      name: "CAS_PRODUCTION",
      verify: {
        verify_size: true,
        backend: {
          dedup: {
            index_store: {
              fast_slow: {
                fast: {
                  memory: { eviction_policy: { max_bytes: 2000000000 } }, // 2GB
                },
                slow: {
                  experimental_cloud_object_store: {
                    provider: "aws",
                    region: "us-east-1",
                    bucket: "nativelink-cas-index",
                    key_prefix: "prod/index/",
                    retry: {
                      max_retries: 6,
                      delay: 0.3,
                      jitter: 0.5,
                    },
                  },
                },
              },
            },
            content_store: {
              compression: {
                compression_algorithm: { lz4: {} },
                backend: {
                  fast_slow: {
                    fast: {
                      memory: { eviction_policy: { max_bytes: 5000000000 } }, // 5GB
                    },
                    slow: {
                      experimental_cloud_object_store: {
                        provider: "aws",
                        region: "us-east-1",
                        bucket: "nativelink-cas-content",
                        key_prefix: "prod/content/",
                        retry: {
                          max_retries: 6,
                          delay: 0.3,
                          jitter: 0.5,
                        },
                      },
                    },
                  },
                },
              },
            },
          },
        },
      },
    },
  ],
}

Storage Best Practices

1. Use tiered storage

Combine fast (Redis/memory) and slow (S3/GCS) tiers for optimal performance:
  • Memory tier: 2-5GB for hot data
  • Redis tier: 50-100GB for frequently accessed objects
  • Cloud storage: Unlimited long-term storage
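The three tiers above can be sketched by nesting fast_slow stores. This is a sketch only: the redis_store backend and its field names are assumptions based on NativeLink's store config schema, so check the configuration reference for your version before using it.

```json5
// Sketch: memory -> Redis -> S3 tiering via nested fast_slow stores.
// redis_store and its fields are assumed; verify against your version.
{
  fast_slow: {
    fast: {
      memory: { eviction_policy: { max_bytes: 5000000000 } }, // ~5GB hot tier
    },
    slow: {
      fast_slow: {
        fast: {
          redis_store: {
            // Warm tier (50-100GB); address is illustrative.
            addresses: ["redis://redis.internal:6379"],
          },
        },
        slow: {
          experimental_cloud_object_store: {
            provider: "aws",
            region: "us-east-1",
            bucket: "nativelink-cas-content", // cold tier, effectively unbounded
          },
        },
      },
    },
  },
}
```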
2. Enable compression

Use LZ4 compression to reduce storage costs and network bandwidth:
compression: {
  compression_algorithm: { lz4: {} },
  backend: { /* ... */ },
}
3. Configure lifecycle policies

Set up S3/GCS lifecycle policies to archive or delete old objects:
  • Transition to cheaper storage after 90 days
  • Delete objects older than 1 year
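On AWS, one way to implement both rules is a bucket lifecycle configuration; the bucket name, prefix, and storage class below are illustrative and should be adjusted to your setup.

```shell
# Sketch: transition to Glacier after 90 days, expire after 1 year.
aws s3api put-bucket-lifecycle-configuration \
  --bucket nativelink-cas-content \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "archive-then-expire",
      "Status": "Enabled",
      "Filter": { "Prefix": "prod/" },
      "Transitions": [{ "Days": 90, "StorageClass": "GLACIER" }],
      "Expiration": { "Days": 365 }
    }]
  }'
```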
4. Use separate buckets

Separate CAS index, CAS content, and AC into different buckets for better organization and access control.
Always enable verify_size: true in production to detect corrupted data.
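Following the pattern of the CAS config above, an AC store pointed at its own bucket might look like this sketch (the store name, bucket, and sizes are illustrative):

```json5
{
  name: "AC_PRODUCTION",
  fast_slow: {
    fast: {
      // Small memory tier: AC entries are tiny compared to CAS content.
      memory: { eviction_policy: { max_bytes: 500000000 } }, // 500MB
    },
    slow: {
      experimental_cloud_object_store: {
        provider: "aws",
        region: "us-east-1",
        bucket: "nativelink-ac",   // separate bucket from CAS index/content
        key_prefix: "prod/ac/",
        retry: { max_retries: 6, delay: 0.3, jitter: 0.5 },
      },
    },
  },
}
```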

High Availability

Scheduler Redundancy

Run multiple scheduler replicas with a load balancer:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nativelink-scheduler
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - nativelink-scheduler
              topologyKey: kubernetes.io/hostname
      containers:
        - name: scheduler
          image: tracemachina/nativelink:v0.5.0
          resources:
            requests:
              memory: "4Gi"
              cpu: "2000m"
            limits:
              memory: "8Gi"
              cpu: "4000m"

CAS Server Redundancy

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nativelink-cas
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: cas
          resources:
            requests:
              memory: "8Gi"
              cpu: "4000m"
            limits:
              memory: "16Gi"
              cpu: "8000m"

Load Balancing

Use gRPC-aware load balancing:
apiVersion: v1
kind: Service
metadata:
  name: nativelink
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "tcp"
spec:
  type: LoadBalancer
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800  # 3 hours
  ports:
    - name: grpc-cas
      port: 50051
      targetPort: 50051
      protocol: TCP
    - name: grpc-scheduler
      port: 50052
      targetPort: 50052
      protocol: TCP

Security

TLS/SSL Encryption

Always use TLS in production:
servers: [
  {
    listener: {
      http: {
        socket_address: "0.0.0.0:50051",
        tls: {
          cert_file: "/certs/server.crt",
          key_file: "/certs/server.key",
          ca_file: "/certs/ca.crt",
          verify_client: true,  // mTLS
        },
      },
    },
  },
]

Certificate Management

Use cert-manager for automated certificate rotation:
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: nativelink-tls
spec:
  secretName: nativelink-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - nativelink.example.com
    - cas.nativelink.example.com
    - scheduler.nativelink.example.com

Network Policies

Restrict network access:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: nativelink-policy
spec:
  podSelector:
    matchLabels:
      app: nativelink
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: build-clients
      ports:
        - protocol: TCP
          port: 50051
        - protocol: TCP
          port: 50052
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: nativelink-worker
      ports:
        - protocol: TCP
          port: 50061

Access Control

Never expose the worker API (port 50061) publicly. It should only be accessible to workers within your network.
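One way to enforce this in Kubernetes is a NetworkPolicy on the worker pods themselves, allowing ingress on 50061 only from NativeLink pods; the labels below are illustrative and should match your deployments.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: nativelink-worker-policy
spec:
  podSelector:
    matchLabels:
      app: nativelink-worker
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Only scheduler/CAS pods may reach the worker API.
        - podSelector:
            matchLabels:
              app: nativelink
      ports:
        - protocol: TCP
          port: 50061
```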

Monitoring Setup

Prometheus and Grafana

Deploy comprehensive monitoring:
version: '3.8'

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.98.0
    command: ["--config=/etc/otel-collector/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector/config.yaml:ro
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "9090:9090"   # Prometheus exporter

  prometheus:
    image: prom/prometheus:v3.0.0
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-otlp-receiver'
    volumes:
      - ./prometheus-config.yaml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    ports:
      - "9091:9090"

  grafana:
    image: grafana/grafana:12.4.0
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_INSTALL_PLUGINS=grafana-piechart-panel
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
    ports:
      - "3000:3000"

  alertmanager:
    image: prom/alertmanager:v0.27.0
    volumes:
      - ./alertmanager-config.yml:/etc/alertmanager/config.yml:ro
    ports:
      - "9093:9093"

volumes:
  prometheus_data:
  grafana_data:

Key Metrics

Monitor these critical metrics:
| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| nativelink_scheduler_queue_length | Pending jobs | > 100 |
| nativelink_worker_active_count | Active workers | < 2 |
| nativelink_cas_hit_rate | Cache hit ratio | < 0.7 |
| nativelink_request_duration_seconds | Request latency | p99 > 5s |
| nativelink_store_size_bytes | Storage usage | > 0.9 * max |
| nativelink_worker_execution_failures | Failed executions | > 5 per minute |

OpenTelemetry Configuration

Configure NativeLink to export telemetry:
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_SERVICE_NAME=nativelink
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=production,service.version=v0.5.0"
export OTEL_EXPORTER_OTLP_COMPRESSION=zstd
export RUST_LOG=info

Alert Rules

prometheus-alerts.yml
groups:
  - name: nativelink
    interval: 30s
    rules:
      - alert: HighQueueLength
        expr: nativelink_scheduler_queue_length > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High job queue length"
          description: "Queue has {{ $value }} pending jobs"

      - alert: LowCacheHitRate
        expr: rate(nativelink_cas_hits[5m]) / rate(nativelink_cas_requests[5m]) < 0.7
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low cache hit rate"
          description: "Cache hit rate is {{ $value | humanizePercentage }}"

      - alert: WorkerDown
        expr: nativelink_worker_active_count < 2
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Insufficient workers"
          description: "Only {{ $value }} workers active"

      - alert: HighErrorRate
        expr: rate(nativelink_request_errors_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate"
          description: "Error rate is {{ $value | humanizePercentage }}"

Auto-Scaling

Horizontal Pod Autoscaling

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nativelink-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nativelink-worker
  minReplicas: 5
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: nativelink_queue_length
        target:
          type: AverageValue
          averageValue: "10"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
        - type: Pods
          value: 5
          periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60

KEDA Scaling

For more advanced scaling based on queue metrics:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: nativelink-worker-scaler
spec:
  scaleTargetRef:
    name: nativelink-worker
  minReplicaCount: 5
  maxReplicaCount: 100
  cooldownPeriod: 300
  pollingInterval: 15
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: nativelink_queue_length
        query: |
          sum(nativelink_scheduler_queue_length)
        threshold: '20'

Backup and Disaster Recovery

Configuration Backup

# Backup Kubernetes configs
kubectl get configmap nativelink-config -o yaml > backup/config-$(date +%Y%m%d).yaml
kubectl get secret nativelink-tls -o yaml > backup/tls-$(date +%Y%m%d).yaml

# Version control
git add backup/
git commit -m "Backup $(date +%Y%m%d)"
git push

Data Recovery

With cloud object storage (S3/GCS), data is automatically replicated. Enable versioning:
# AWS S3
aws s3api put-bucket-versioning \
  --bucket nativelink-cas-content \
  --versioning-configuration Status=Enabled

# GCS
gsutil versioning set on gs://nativelink-prod-cas

Incident Response Plan

1. Detect incident

Monitor alerts from Prometheus/Alertmanager
2. Assess impact

Check affected services and users
3. Mitigate

  • Scale up workers if queue is backed up
  • Restart affected pods
  • Failover to backup region if available
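The first two mitigations can be run directly with kubectl, assuming the deployment names used earlier in this guide:

```shell
# Scale up workers to drain a backed-up queue.
kubectl scale deployment nativelink-worker --replicas=20

# Rolling restart of affected pods, then wait for it to complete.
kubectl rollout restart deployment nativelink-scheduler
kubectl rollout status deployment nativelink-scheduler
```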
4. Recover

  • Restore from backups if necessary
  • Verify data integrity
  • Resume normal operations
5. Post-mortem

  • Document incident
  • Identify root cause
  • Implement preventive measures

Performance Tuning

Resource Allocation

CAS Server:
  • CPU: 4-8 cores
  • Memory: 8-16GB
  • Disk I/O: High-performance SSD or network-attached storage
Scheduler:
  • CPU: 2-4 cores
  • Memory: 4-8GB
Worker:
  • CPU: Based on build workload (4-32 cores)
  • Memory: 2GB per CPU core
  • Disk: 100GB+ for work directory
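Translated into a pod spec, the worker sizing above might look like this sketch for an 8-core worker (names and mount path are illustrative):

```yaml
containers:
  - name: worker
    resources:
      requests:
        cpu: "8000m"
        memory: "16Gi"   # 2GB per CPU core
      limits:
        cpu: "8000m"
        memory: "16Gi"
    volumeMounts:
      - name: work-dir
        mountPath: /worker
volumes:
  - name: work-dir
    emptyDir:
      sizeLimit: 100Gi   # work directory
```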

Connection Pooling

servers: [
  {
    listener: {
      http: {
        socket_address: "0.0.0.0:50051",
        advanced_http: {
          http2_keepalive_interval: 30,
          http2_keepalive_timeout: 10,
          max_concurrent_streams: 1000,
        },
      },
    },
  },
]

File Descriptor Limits

global: {
  max_open_files: 65536,
}
Set system limits:
# /etc/security/limits.conf
* soft nofile 65536
* hard nofile 65536
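Note that limits.conf only applies to login sessions. If NativeLink runs as a systemd service, set the limit in the unit instead:

```ini
# /etc/systemd/system/nativelink.service.d/override.conf
[Service]
LimitNOFILE=65536
```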

Cost Optimization

1. Use spot instances for workers

Workers are stateless and can tolerate interruptions
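For example, on a cluster with a dedicated spot node pool, the worker pod spec can target it via a node selector and toleration; the label and taint names below are assumptions that vary by cloud provider.

```yaml
spec:
  nodeSelector:
    node-lifecycle: spot          # illustrative label on spot nodes
  tolerations:
    - key: node-lifecycle         # illustrative taint on the spot pool
      operator: Equal
      value: spot
      effect: NoSchedule
```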
2. Configure aggressive cache eviction

Set appropriate max_bytes to avoid over-provisioning
3. Use S3 Intelligent-Tiering

Automatically move infrequently accessed data to cheaper tiers
4. Enable compression

Reduce storage costs by 60-80% with LZ4 compression
5. Implement lifecycle policies

Delete or archive old build artifacts after 90 days
For Docker-based monitoring setup, see the complete configuration in deployment-examples/metrics/docker-compose.yaml.
