This guide covers production deployment best practices, monitoring, security, and operational considerations for running NativeLink at scale.

Architecture Overview

A production NativeLink deployment typically consists of:
  • CAS Servers: Content Addressable Storage (1+ replicas)
  • Scheduler: Job scheduling and distribution (1+ replicas)
  • Workers: Build execution nodes (auto-scaled)
  • Storage Backend: S3, GCS, or distributed filesystem
  • Monitoring: Prometheus, Grafana, OpenTelemetry
  • Load Balancer: gRPC-capable load balancer
┌─────────────┐
│   Clients   │
└──────┬──────┘
       │
┌──────▼──────────┐
│ Load Balancer   │
└──────┬──────────┘
       │
   ┌───┴────┬────────────┬───────────┐
   │        │            │           │
┌──▼───┐ ┌─▼──┐  ┌─────▼─────┐  ┌──▼────────┐
│ CAS  │ │ AC │  │ Scheduler │  │ Telemetry │
└──┬───┘ └─┬──┘  └─────┬─────┘  └───────────┘
   │       │           │
   └───────┴─────┬─────┘
                 │
         ┌───────┴────────┐
         │                │
    ┌────▼────┐      ┌────▼────┐
    │ Worker  │ ...  │ Worker  │
    └─────────┘      └─────────┘
         │                │
    ┌────▼────────────────▼────┐
    │   Shared CAS Storage     │
    │   (S3/GCS/NFS)           │
    └──────────────────────────┘

Storage Strategy

Cloud Object Storage

For production, use cloud object storage (S3, GCS, Azure Blob) as the primary backend:
{
  stores: [
    {
      name: "CAS_PRODUCTION",
      verify: {
        verify_size: true,
        backend: {
          dedup: {
            index_store: {
              fast_slow: {
                fast: {
                  memory: { eviction_policy: { max_bytes: 2000000000 } }, // 2GB
                },
                slow: {
                  experimental_cloud_object_store: {
                    provider: "aws",
                    region: "us-east-1",
                    bucket: "nativelink-cas-index",
                    key_prefix: "prod/index/",
                    retry: {
                      max_retries: 6,
                      delay: 0.3,
                      jitter: 0.5,
                    },
                  },
                },
              },
            },
            content_store: {
              compression: {
                compression_algorithm: { lz4: {} },
                backend: {
                  fast_slow: {
                    fast: {
                      memory: { eviction_policy: { max_bytes: 5000000000 } }, // 5GB
                    },
                    slow: {
                      experimental_cloud_object_store: {
                        provider: "aws",
                        region: "us-east-1",
                        bucket: "nativelink-cas-content",
                        key_prefix: "prod/content/",
                        retry: {
                          max_retries: 6,
                          delay: 0.3,
                          jitter: 0.5,
                        },
                      },
                    },
                  },
                },
              },
            },
          },
        },
      },
    },
  ],
}
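
The `retry` blocks above set `max_retries`, `delay`, and `jitter`. To estimate how long a failing request could keep retrying, the sketch below assumes the delay doubles on each attempt and that `jitter` scales each delay by a factor in [1 - jitter/2, 1 + jitter/2]. Both are assumptions for illustration, and `backoff_bounds` is a hypothetical helper; NativeLink's actual schedule may differ.

```python
def backoff_bounds(max_retries: int, delay: float, jitter: float) -> list[tuple[float, float]]:
    """Return the (min, max) wait in seconds before each retry.

    Assumption: the base delay doubles per attempt and jitter scales it
    by a factor between 1 - jitter/2 and 1 + jitter/2.
    """
    bounds = []
    for attempt in range(max_retries):
        base = delay * (2 ** attempt)
        low = base * (1 - jitter / 2)
        high = base * (1 + jitter / 2)
        bounds.append((round(low, 3), round(high, 3)))
    return bounds


# Values taken from the configuration above.
bounds = backoff_bounds(max_retries=6, delay=0.3, jitter=0.5)
worst_case_total = sum(high for _, high in bounds)
print(bounds, worst_case_total)
```

Under these assumptions the last retry waits between 7.2 and 12 seconds, so a request that exhausts all retries can stall for over 20 seconds. Keep that in mind when setting client-side timeouts.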

Storage Best Practices

1. Use tiered storage

Combine fast (Redis/memory) and slow (S3/GCS) tiers for optimal performance:
  • Memory tier: 2-5GB for hot data
  • Redis tier: 50-100GB for frequently accessed objects
  • Cloud storage: Unlimited long-term storage

2. Enable compression

Use LZ4 compression to reduce storage costs and network bandwidth:
compression: {
  compression_algorithm: { lz4: {} },
  backend: { /* ... */ },
}

3. Configure lifecycle policies

Set up S3/GCS lifecycle policies to archive or delete old objects:
  • Transition to cheaper storage after 90 days
  • Delete objects older than 1 year

4. Use separate buckets

Separate CAS index, CAS content, and AC into different buckets for better organization and access control.
Always enable verify_size: true in production to detect corrupted data.
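
The lifecycle policy described above (transition after 90 days, delete after one year) can be expressed as an S3 lifecycle configuration document. The sketch below only builds and prints that document; `lifecycle_rules` is a hypothetical helper, and the `STANDARD_IA` storage class and the `prod/content/` prefix are illustrative choices rather than recommendations from this guide.

```python
import json


def lifecycle_rules(prefix: str, transition_days: int = 90, expire_days: int = 365) -> dict:
    """Build an S3 lifecycle configuration for objects under one key prefix."""
    rule_id = "archive-and-expire-" + prefix.strip("/").replace("/", "-")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                # Move objects to a cheaper storage class once they age out.
                "Transitions": [
                    {"Days": transition_days, "StorageClass": "STANDARD_IA"}
                ],
                # Delete objects after the retention period.
                "Expiration": {"Days": expire_days},
            }
        ]
    }


config = lifecycle_rules("prod/content/")
print(json.dumps(config, indent=2))
```

Saved as `lifecycle.json`, the output can be applied with `aws s3api put-bucket-lifecycle-configuration --bucket nativelink-cas-content --lifecycle-configuration file://lifecycle.json`. Note that S3 lifecycle rules act on object age, not last access, so a frequently read blob is still expired once it is old enough.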

High Availability

Scheduler Redundancy

Run multiple scheduler replicas with a load balancer:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nativelink-scheduler
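spec:
  replicas: 3
  selector:
    matchLabels:
      app: nativelink-scheduler
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: nativelink-scheduler  # must match the selector and the anti-affinity rule
    spec: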
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - nativelink-scheduler
              topologyKey: kubernetes.io/hostname
      containers:
        - name: scheduler
          image: trace_machina/nativelink:v0.5.0
          resources:
            requests:
              memory: "4Gi"
              cpu: "2000m"
            limits:
              memory: "8Gi"
              cpu: "4000m"
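
The manifest above combines a required pod anti-affinity on `kubernetes.io/hostname` with `maxUnavailable: 0`. During a rolling update the surge pod carries the same `app` label, so it needs a node of its own before any old pod is removed. A minimal sketch of that arithmetic, assuming every node is schedulable (`min_nodes_for_rollout` is a hypothetical helper):

```python
def min_nodes_for_rollout(replicas: int, max_surge: int) -> int:
    """Nodes required when each pod must run on a distinct host and
    no old pod may be removed before its replacement is ready."""
    return replicas + max_surge


def rollout_can_proceed(schedulable_nodes: int, replicas: int, max_surge: int) -> bool:
    """True if the cluster has enough hosts for the surge pods."""
    return schedulable_nodes >= min_nodes_for_rollout(replicas, max_surge)


nodes_needed = min_nodes_for_rollout(replicas=3, max_surge=1)
print(nodes_needed, rollout_can_proceed(3, 3, 1))
```

With three replicas and `maxSurge: 1`, a three-node cluster would leave the new pod pending, so plan for at least four nodes or relax the rule to `preferredDuringSchedulingIgnoredDuringExecution`.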

CAS Server Redundancy

Run several CAS replicas as well. The manifest below is abbreviated: add a selector, pod labels, and a container image before applying it.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nativelink-cas
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: cas
          resources:
            requests:
              memory: "8Gi"
              cpu: "4000m"
            limits:
              memory: "16Gi"
              cpu: "8000m"

Load Balancing

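Expose the gRPC ports through a load balancer. The Service below uses an AWS Network Load Balancer in TCP pass-through mode with client-IP session affinity. This balances connections rather than individual requests, and gRPC multiplexes many requests over one long-lived HTTP/2 connection, so put an L7 gRPC-aware proxy in front if you need per-request balancing: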
apiVersion: v1
kind: Service
metadata:
  name: nativelink
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "tcp"
spec:
  type: LoadBalancer
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800  # 3 hours
  ports:
    - name: grpc-cas
      port: 50051
      targetPort: 50051
      protocol: TCP
    - name: grpc-scheduler
      port: 50052
      targetPort: 50052
      protocol: TCP

Security

TLS/SSL Encryption

Always use TLS in production:
servers: [
  {
    listener: {
      http: {
        socket_address: "0.0.0.0:50051",
        tls: {
          cert_file: "/certs/server.crt",
          key_file: "/certs/server.key",
          ca_file: "/certs/ca.crt",
          verify_client: true,  // mTLS
        },
      },
    },
  },
]
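
To smoke-test a TLS listener like the one above from a client machine, a client needs a context that verifies the server and, for mTLS, presents its own certificate. The sketch below uses only Python's standard `ssl` module; `build_client_context` is a hypothetical helper and any certificate paths passed to it are placeholders for your own files.

```python
import ssl
from typing import Optional


def build_client_context(ca_file: Optional[str] = None,
                         cert_file: Optional[str] = None,
                         key_file: Optional[str] = None) -> ssl.SSLContext:
    """Create a TLS client context that verifies the server certificate
    and, when a cert/key pair is given, presents a client certificate."""
    # With ca_file=None the system trust store is used.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.set_alpn_protocols(["h2"])  # gRPC runs over HTTP/2
    if cert_file and key_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


ctx = build_client_context()
print(ctx.verify_mode, ctx.check_hostname)
```

Wrapping a socket with `ctx.wrap_socket(sock, server_hostname=...)` then performs the handshake. Against a server that requires client certificates, the connection should fail unless `cert_file` and `key_file` are supplied.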

Certificate Management

Use cert-manager for automated certificate rotation:
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: nativelink-tls
spec:
  secretName: nativelink-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - nativelink.example.com
    - cas.nativelink.example.com
    - scheduler.nativelink.example.com

Network Policies

Restrict network access:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: nativelink-policy
spec:
  podSelector:
    matchLabels:
      app: nativelink
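  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Build clients reach the CAS (50051) and scheduler (50052) ports.
    - from:
        - namespaceSelector:
            matchLabels:
              name: build-clients
      ports:
        - protocol: TCP
          port: 50051
        - protocol: TCP
          port: 50052
    # Assumes workers initiate the connection to the worker API.
    - from:
        - podSelector:
            matchLabels:
              app: nativelink-worker
      ports:
        - protocol: TCP
          port: 50061
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: nativelink-worker
      ports:
        - protocol: TCP
          port: 50061
    # Listing Egress in policyTypes blocks all other outbound traffic,
    # so allow DNS and HTTPS to the object store explicitly.
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
        - protocol: TCP
          port: 443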

Access Control

Never expose the worker API (port 50061) publicly. It should only be accessible to workers within your network.
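
One way to confirm the worker API is not reachable is to attempt a TCP connection from a machine outside the cluster. This is a minimal sketch using the standard `socket` module; `is_port_open` is a hypothetical helper and the hostname in the usage note is a placeholder.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run from outside your network, `is_port_open("nativelink.example.com", 50061)` should return `False`, while ports 50051 and 50052 should be reachable only from the client networks you intend to allow.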

Monitoring Setup

Prometheus and Grafana

Deploy the monitoring stack (OpenTelemetry Collector, Prometheus, Grafana, and Alertmanager) with Docker Compose:
version: '3.8'

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.98.0
    command: ["--config=/etc/otel-collector/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector/config.yaml:ro
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "9090:9090"   # Prometheus exporter

  prometheus:
    image: prom/prometheus:v3.0.0
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-otlp-receiver'
    volumes:
      - ./prometheus-config.yaml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    ports:
      - "9091:9090"

  grafana:
    image: grafana/grafana:12.4.0
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme  # placeholder: set a strong password before deploying
      - GF_INSTALL_PLUGINS=grafana-piechart-panel
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
    ports:
      - "3000:3000"

  alertmanager:
    image: prom/alertmanager:v0.27.0
    volumes:
      - ./alertmanager-config.yml:/etc/alertmanager/config.yml:ro
    ports:
      - "9093:9093"

volumes:
  prometheus_data:
  grafana_data:

Key Metrics

Monitor these critical metrics:
| Metric | Description | Alert Threshold |
| --- | --- | --- |
| nativelink_scheduler_queue_length | Pending jobs | > 100 |
| nativelink_worker_active_count | Active workers | < 2 |
| nativelink_cas_hit_rate | Cache hit ratio | < 0.7 |
| nativelink_request_duration_seconds | Request latency | p99 > 5s |
| nativelink_store_size_bytes | Storage usage | > 0.9 * max |
| nativelink_worker_execution_failures | Failed executions | > 5 per minute |

OpenTelemetry Configuration

Configure NativeLink to export telemetry:
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_SERVICE_NAME=nativelink
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=production,service.version=v0.5.0"
export OTEL_EXPORTER_OTLP_COMPRESSION=zstd
export RUST_LOG=info

Alert Rules

prometheus-alerts.yml
groups:
  - name: nativelink
    interval: 30s
    rules:
      - alert: HighQueueLength
        expr: nativelink_scheduler_queue_length > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High job queue length"
          description: "Queue has {{ $value }} pending jobs"

      - alert: LowCacheHitRate
        expr: rate(nativelink_cas_hits[5m]) / rate(nativelink_cas_requests[5m]) < 0.7
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low cache hit rate"
          description: "Cache hit rate is {{ $value | humanizePercentage }}"

      - alert: WorkerDown
        expr: nativelink_worker_active_count < 2
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Insufficient workers"
          description: "Only {{ $value }} workers active"

      - alert: HighErrorRate
        expr: rate(nativelink_request_errors_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate"
          description: "Error rate is {{ $value | humanize }} errors per second"
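
The `LowCacheHitRate` expression divides two counter rates. The sketch below reproduces that arithmetic from two samples of each counter so you can sanity-check a threshold against real numbers. It is a simplification: Prometheus's `rate()` also extrapolates and handles counter resets, and the sample values are made up.

```python
from typing import Optional


def rate(prev: float, curr: float, seconds: float) -> float:
    """Per-second increase of a counter between two samples.
    Simplification: a decrease (counter reset) is treated as zero."""
    return max(curr - prev, 0.0) / seconds


def hit_ratio(hits_prev: float, hits_curr: float,
              reqs_prev: float, reqs_curr: float,
              window_s: float = 300.0) -> Optional[float]:
    """Approximate rate(hits[5m]) / rate(requests[5m])."""
    req_rate = rate(reqs_prev, reqs_curr, window_s)
    if req_rate == 0:
        # PromQL yields NaN here, which never satisfies `< 0.7`.
        return None
    return rate(hits_prev, hits_curr, window_s) / req_rate


# Illustrative samples: 600 hits out of 1000 requests in five minutes.
ratio = hit_ratio(hits_prev=1000, hits_curr=1600, reqs_prev=2000, reqs_curr=3000)
should_alert = ratio is not None and ratio < 0.7
print(ratio, should_alert)
```

Because the rule also has `for: 10m`, the ratio must stay below 0.7 for ten consecutive minutes before the alert fires.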

Auto-Scaling

Horizontal Pod Autoscaling

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nativelink-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nativelink-worker
  minReplicas: 5
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: nativelink_queue_length
        target:
          type: AverageValue
          averageValue: "10"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
        - type: Pods
          value: 5
          periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
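
To reason about how this autoscaler reacts, it helps to work through the standard HPA formula, `ceil(currentReplicas * currentMetric / targetMetric)`, together with the scale-up policies above. The sketch is deliberately simplified: it ignores the HPA tolerance band, the stabilization windows, and the scale-down policy, and `next_replicas` is a hypothetical helper.

```python
import math


def hpa_desired(current_replicas: int, current_value: float, target_value: float) -> int:
    """Core HPA formula: ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_value / target_value)


def next_replicas(current: int, cpu_util: float, queue_per_pod: float,
                  min_replicas: int = 5, max_replicas: int = 50) -> int:
    """Replica count after one scale-up period for the manifest above."""
    # With several metrics, the HPA acts on the largest proposal.
    desired = max(hpa_desired(current, cpu_util, 70.0),
                  hpa_desired(current, queue_per_pod, 10.0))
    # selectPolicy: Max picks the policy allowing the bigger step:
    # +100% of current replicas or +5 pods per 60 seconds.
    scale_up_cap = current + max(current, 5)
    desired = min(desired, scale_up_cap)
    return max(min_replicas, min(max_replicas, desired))


busy = next_replicas(current=10, cpu_util=90.0, queue_per_pod=35.0)
idle = next_replicas(current=5, cpu_util=35.0, queue_per_pod=2.0)
print(busy, idle)
```

In the busy case the queue metric asks for 35 replicas, but the scale-up policy caps the first step at 20. In the idle case the floor of `minReplicas: 5` applies.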

KEDA Scaling

For more advanced scaling based on queue metrics:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: nativelink-worker-scaler
spec:
  scaleTargetRef:
    name: nativelink-worker
  minReplicaCount: 5
  maxReplicaCount: 100
  cooldownPeriod: 300
  pollingInterval: 15
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: nativelink_queue_length
        query: |
          sum(nativelink_scheduler_queue_length)
        threshold: '20'
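
With a Prometheus trigger, KEDA drives an HPA that targets an average of `threshold` per replica, so the desired replica count is roughly the query result divided by the threshold, clamped to the configured bounds. A minimal sketch of that calculation with the values above (`keda_desired` is a hypothetical helper; cooldown and polling behavior are not modeled):

```python
import math


def keda_desired(queue_length: float, threshold: float = 20,
                 min_replicas: int = 5, max_replicas: int = 100) -> int:
    """Approximate replica count for a total queue length."""
    wanted = math.ceil(queue_length / threshold)
    return max(min_replicas, min(max_replicas, wanted))


for queue in (40, 200, 5000):
    print(queue, keda_desired(queue))
```

A queue of 200 jobs asks for 10 workers, while anything below 100 queued jobs stays at the five-replica floor.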

Backup and Disaster Recovery

Configuration Backup

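# Backup Kubernetes configs
mkdir -p backup
kubectl get configmap nativelink-config -o yaml > backup/config-$(date +%Y%m%d).yaml
# The TLS secret contains the private key. Encrypt this file before committing it,
# or keep it out of version control entirely.
kubectl get secret nativelink-tls -o yaml > backup/tls-$(date +%Y%m%d).yaml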

# Version control
git add backup/
git commit -m "Backup $(date +%Y%m%d)"
git push

Data Recovery

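Cloud object storage (S3/GCS) replicates data automatically, but replication does not protect against accidental deletes or overwrites. Enable versioning so earlier object versions can be restored: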
# AWS S3
aws s3api put-bucket-versioning \
  --bucket nativelink-cas-content \
  --versioning-configuration Status=Enabled

# GCS
gsutil versioning set on gs://nativelink-prod-cas

Incident Response Plan

1. Detect incident

Monitor alerts from Prometheus/Alertmanager.

2. Assess impact

Check affected services and users.

3. Mitigate

  • Scale up workers if queue is backed up
  • Restart affected pods
  • Failover to backup region if available

4. Recover

  • Restore from backups if necessary
  • Verify data integrity
  • Resume normal operations

5. Post-mortem

  • Document incident
  • Identify root cause
  • Implement preventive measures

Performance Tuning

Resource Allocation

CAS Server:
  • CPU: 4-8 cores
  • Memory: 8-16GB
  • Disk I/O: High-performance SSD or network-attached storage
Scheduler:
  • CPU: 2-4 cores
  • Memory: 4-8GB
Worker:
  • CPU: Based on build workload (4-32 cores)
  • Memory: 2GB per CPU core
  • Disk: 100GB+ for work directory
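
The worker guidance above (2GB of memory per core, 100GB or more of disk) translates directly into Kubernetes resource requests. A small sketch of that arithmetic follows; `worker_resources` is a hypothetical helper, and it uses binary units (`Gi`) to match the manifests earlier in this guide.

```python
def worker_resources(cores: int, gib_per_core: int = 2, work_dir_gib: int = 100) -> dict:
    """Derive Kubernetes resource requests for one worker from its core count."""
    return {
        "cpu": f"{cores * 1000}m",
        "memory": f"{cores * gib_per_core}Gi",
        "ephemeral-storage": f"{work_dir_gib}Gi",
    }


small = worker_resources(4)
large = worker_resources(32)
print(small, large)
```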

Connection Pooling

servers: [
  {
    listener: {
      http: {
        socket_address: "0.0.0.0:50051",
        advanced_http: {
          http2_keepalive_interval: 30,
          http2_keepalive_timeout: 10,
          max_concurrent_streams: 1000,
        },
      },
    },
  },
]

File Descriptor Limits

global: {
  max_open_files: 65536,
}
Set system limits:
# /etc/security/limits.conf
* soft nofile 65536
* hard nofile 65536
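
To verify that the limit actually applies to a running process, you can read it back with Python's standard `resource` module (Unix only). This is a minimal sketch; `nofile_ok` is a hypothetical helper. Keep in mind that `/etc/security/limits.conf` is applied through PAM login sessions, so services started by systemd or a container runtime take their limits from those systems instead.

```python
import resource

REQUIRED_NOFILE = 65536  # matches max_open_files in the config above


def nofile_ok(limits: tuple, required: int = REQUIRED_NOFILE) -> bool:
    """True if the soft open-file limit covers `required` descriptors."""
    soft, _hard = limits
    return soft == resource.RLIM_INFINITY or soft >= required


current = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={current[0]} hard={current[1]} ok={nofile_ok(current)}")
```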

Cost Optimization

1. Use spot instances for workers

Workers are stateless and can tolerate interruptions.

2. Configure aggressive cache eviction

Set max_bytes to fit your working set to avoid over-provisioning.

3. Use S3 Intelligent-Tiering

Automatically move infrequently accessed data to cheaper tiers.

4. Enable compression

LZ4 compression can reduce storage costs by 60-80%, depending on how compressible your build artifacts are.

5. Implement lifecycle policies

Delete or archive old build artifacts after 90 days.

For a Docker-based monitoring setup, see the complete configuration in deployment-examples/metrics/docker-compose.yaml.
