NativeLink provides Kubernetes manifests for scalable production deployments with support for custom workers, telemetry, and GitOps workflows.

Prerequisites

  • Kubernetes 1.24+
  • kubectl configured
  • 8GB+ RAM per node
  • StorageClass for persistent volumes
  • (Optional) Kustomize for configuration management
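
A quick way to confirm the cluster side of these prerequisites (a sketch; adjust for your environment):

kubectl version            # Client and server should report 1.24+
kubectl get nodes -o wide  # Confirm node count and capacity
kubectl get storageclass   # Confirm a StorageClass is available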

Quick Start

1. Create namespace

kubectl create namespace nativelink

2. Create ConfigMap

Create a ConfigMap with your NativeLink configuration:
kubectl create configmap nativelink-config \
  --from-file=nativelink-config.json5 \
  -n nativelink
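
If you change the configuration later, recreate the ConfigMap in place and restart the deployment so pods pick up the new file:

kubectl create configmap nativelink-config \
  --from-file=nativelink-config.json5 \
  -n nativelink \
  --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment/nativelink -n nativelink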

3. Deploy NativeLink

kubectl apply -f kubernetes/nativelink/nativelink.yaml -n nativelink

4. Verify deployment

kubectl get pods -n nativelink
kubectl logs -f deployment/nativelink -n nativelink

Core Deployment

Deployment Manifest

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nativelink
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nativelink
  template:
    metadata:
      labels:
        app: nativelink
    spec:
      containers:
        - name: nativelink
          image: trace_machina/nativelink:latest
          env:
            - name: RUST_LOG
              value: info
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: http://otel-collector-collector.default.svc:4317
            - name: OTEL_EXPORTER_OTLP_COMPRESSION
              value: zstd
          ports:
            - containerPort: 9090  # Metrics
            - containerPort: 50051 # gRPC CAS
            - containerPort: 50052 # gRPC Scheduler
            - containerPort: 50061 # Worker API
          volumeMounts:
            - name: nativelink-config
              mountPath: /nativelink-config.json5
              subPath: nativelink-config.json5
            - name: tls-volume
              mountPath: /root
              readOnly: true
          args: ["/nativelink-config.json5"]
          resources:
            requests:
              memory: "2Gi"
              cpu: "1000m"
            limits:
              memory: "4Gi"
              cpu: "2000m"
      volumes:
        - name: nativelink-config
          configMap:
            name: nativelink-config
        - name: tls-volume
          secret:
            secretName: tls-secret
---
apiVersion: v1
kind: Service
metadata:
  name: nativelink
spec:
  selector:
    app: nativelink
  ports:
    - name: metrics
      protocol: TCP
      port: 9090
      targetPort: 9090
    - name: cas
      protocol: TCP
      port: 50051
      targetPort: 50051
    - name: scheduler
      protocol: TCP
      port: 50052
      targetPort: 50052
    - name: worker-api
      protocol: TCP
      port: 50061
      targetPort: 50061
  type: LoadBalancer
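
The Deployment above mounts a Secret named tls-secret into /root. Assuming you already have a certificate and key on disk (paths here are illustrative), the Secret can be created with:

kubectl create secret tls tls-secret \
  --cert=certs/tls.crt \
  --key=certs/tls.key \
  -n nativelink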

Worker Deployment

Deploy dedicated workers that connect to the scheduler:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nativelink-worker
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nativelink-worker
  template:
    metadata:
      labels:
        app: nativelink-worker
    spec:
      containers:
        - name: worker
          image: trace_machina/nativelink:latest
          env:
            - name: RUST_LOG
              value: info
            - name: SCHEDULER_ENDPOINT
              value: nativelink.nativelink.svc.cluster.local
          volumeMounts:
            - name: worker-config
              mountPath: /worker.json5
              subPath: worker.json5
            - name: cas-storage
              mountPath: /data/cas
            - name: work-dir
              mountPath: /tmp/work
          args: ["/worker.json5"]
          resources:
            requests:
              memory: "4Gi"
              cpu: "2000m"
            limits:
              memory: "8Gi"
              cpu: "4000m"
      volumes:
        - name: worker-config
          configMap:
            name: worker-config
        - name: cas-storage
          persistentVolumeClaim:
            claimName: cas-storage-pvc
        - name: work-dir
          emptyDir: {}
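
Worker capacity can also be adjusted manually at any time; the Autoscaling section below makes this automatic:

kubectl scale deployment/nativelink-worker --replicas=5 -n nativelink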

Persistent Storage

Use shared storage (NFS, S3, GCS) for multi-worker setups to ensure all workers can access the same CAS data.

PersistentVolumeClaim

pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cas-storage-pvc
spec:
  accessModes:
    - ReadWriteMany  # Required for multi-worker
  resources:
    requests:
      storage: 100Gi
  storageClassName: nfs-client  # Use your storage class
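
Not every StorageClass supports ReadWriteMany; NFS-based provisioners, CephFS, and managed file services such as EFS or Filestore typically do. List the provisioners in your cluster to check:

kubectl get storageclass -o custom-columns=NAME:.metadata.name,PROVISIONER:.provisioner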

S3 Backend

For cloud deployments, use S3-compatible storage:
stores: [
  {
    name: "CAS_S3_STORE",
    experimental_cloud_object_store: {
      provider: "aws",
      region: "us-east-1",
      bucket: "nativelink-cas",
      key_prefix: "cas/",
      retry: {
        max_retries: 6,
        delay: 0.3,
        jitter: 0.5,
      },
    },
  },
]
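
The pods also need credentials for the bucket. On EKS, IAM Roles for Service Accounts is the cleanest option; elsewhere, a common pattern is to inject the standard AWS SDK environment variables from a Secret (the aws-credentials name below is illustrative):

env:
  - name: AWS_ACCESS_KEY_ID          # Standard AWS SDK variable
    valueFrom:
      secretKeyRef:
        name: aws-credentials        # Hypothetical Secret holding your keys
        key: access-key-id
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: aws-credentials
        key: secret-access-key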

Kustomize Setup

Use Kustomize for environment-specific configurations:
kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: nativelink

resources:
  - nativelink.yaml
  - worker.yaml
  - pvc.yaml

configMapGenerator:
  - name: nativelink-config
    files:
      - configs/nativelink-config.json5
  - name: worker-config
    files:
      - configs/worker.json5

images:
  - name: trace_machina/nativelink  # Must match the image name used in the manifests
    newTag: v0.5.0  # Pin to a specific version
Deploy with:
kubectl apply -k .
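
Environment-specific overlays can build on this base. A minimal production overlay (the directory layout and replica count are illustrative) might patch only the worker replica count:

overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../../base

patches:
  - target:
      kind: Deployment
      name: nativelink-worker
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 5

Deploy it with kubectl apply -k overlays/production.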

Autoscaling

Horizontal Pod Autoscaler

Scale workers based on CPU/memory:
hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nativelink-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nativelink-worker
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
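
Resource-based HPAs require the metrics-server. A quick way to confirm resource metrics are flowing before relying on the autoscaler:

kubectl top pods -n nativelink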

KEDA for Job-Based Scaling

Scale based on queue length using KEDA:
scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: nativelink-worker-scaler
spec:
  scaleTargetRef:
    name: nativelink-worker
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: nativelink_queue_length
        query: sum(nativelink_scheduler_queue_length)
        threshold: '10'
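
KEDA itself must be installed first; one common route is its Helm chart:

helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace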

Health Checks

Add liveness and readiness probes to the NativeLink container. These assume your NativeLink configuration enables the health service so that /status is served on port 50061:
livenessProbe:
  httpGet:
    path: /status
    port: 50061
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /status
    port: 50061
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3

Monitoring

Prometheus ServiceMonitor

servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nativelink
spec:
  selector:
    matchLabels:
      app: nativelink
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics

OpenTelemetry Collector

Traces are exported over OTLP. Install the OpenTelemetry Operator, which manages Collector instances:
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml
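
The operator manages collectors declared as OpenTelemetryCollector resources. A minimal sketch that matches the otel-collector-collector.default.svc:4317 endpoint used in the Deployment above (the debug exporter is a placeholder for your real tracing backend):

otel-collector.yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  namespace: default  # Matches the service address referenced above
spec:
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    exporters:
      debug: {}  # Placeholder; swap in your tracing backend
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [debug]
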
See Production Deployment for complete monitoring setup.

Ingress

Expose NativeLink via Ingress:
ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nativelink
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
spec:
  ingressClassName: nginx
  rules:
    - host: nativelink.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: nativelink
                port:
                  number: 50051
  tls:
    - hosts:
        - nativelink.example.com
      secretName: nativelink-tls
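
Once the Ingress is live, clients can target the public hostname. For example, Bazel's standard remote-cache flag (note that this Ingress routes only to the CAS port; remote execution would need an additional rule for the scheduler port):

# .bazelrc (illustrative)
build --remote_cache=grpcs://nativelink.example.com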

Troubleshooting

Pod Not Starting

# Check pod status
kubectl describe pod <pod-name> -n nativelink

# View logs
kubectl logs <pod-name> -n nativelink --previous

# Check events
kubectl get events -n nativelink --sort-by='.lastTimestamp'

Storage Issues

# Check PVC status
kubectl get pvc -n nativelink

# Describe PVC
kubectl describe pvc cas-storage-pvc -n nativelink

# Check storage class
kubectl get storageclass

Worker Connection Issues

# Test network connectivity
kubectl run -it --rm debug --image=busybox --restart=Never -- \
  wget -O- http://nativelink.nativelink.svc.cluster.local:50061/status

# Check DNS resolution
kubectl run -it --rm debug --image=busybox --restart=Never -- \
  nslookup nativelink.nativelink.svc.cluster.local
For GitOps deployments with Flux, see the kubernetes/resources/flux directory in the NativeLink repository.
