NativeLink provides Kubernetes manifests for scalable production deployments with support for custom workers, telemetry, and GitOps workflows.
Prerequisites
- Kubernetes 1.24+
- kubectl configured
- 8GB+ RAM per node
- StorageClass for persistent volumes
- (Optional) Kustomize for configuration management
Quick Start
Create namespace

```shell
kubectl create namespace nativelink
```

Create ConfigMap

Create a ConfigMap with your NativeLink configuration:

```shell
kubectl create configmap nativelink-config \
  --from-file=nativelink-config.json5 \
  -n nativelink
```

Deploy NativeLink

```shell
kubectl apply -f kubernetes/nativelink/nativelink.yaml -n nativelink
```

Verify deployment

```shell
kubectl get pods -n nativelink
kubectl logs -f deployment/nativelink -n nativelink
```
Core Deployment
Deployment Manifest
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nativelink
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nativelink
  template:
    metadata:
      labels:
        app: nativelink
    spec:
      containers:
        - name: nativelink
          image: trace_machina/nativelink:latest
          env:
            - name: RUST_LOG
              value: info
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: http://otel-collector-collector.default.svc:4317
            - name: OTEL_EXPORTER_OTLP_COMPRESSION
              value: zstd
          ports:
            - containerPort: 9090 # Metrics
            - containerPort: 50051 # gRPC CAS
            - containerPort: 50052 # gRPC Scheduler
            - containerPort: 50061 # Worker API
          volumeMounts:
            - name: nativelink-config
              mountPath: /nativelink-config.json5
              subPath: nativelink-config.json5
            - name: tls-volume
              mountPath: /root
              readOnly: true
          args: ["/nativelink-config.json5"]
          resources:
            requests:
              memory: "2Gi"
              cpu: "1000m"
            limits:
              memory: "4Gi"
              cpu: "2000m"
      volumes:
        - name: nativelink-config
          configMap:
            name: nativelink-config
        - name: tls-volume
          secret:
            secretName: tls-secret
---
apiVersion: v1
kind: Service
metadata:
  name: nativelink
  labels:
    app: nativelink # Lets label selectors (e.g. a ServiceMonitor) find this Service
spec:
  selector:
    app: nativelink
  ports:
    - name: metrics
      protocol: TCP
      port: 9090
      targetPort: 9090
    - name: grpc
      protocol: TCP
      port: 50051
      targetPort: 50051
    - name: grpcs
      protocol: TCP
      port: 50052
      targetPort: 50052
    - name: worker-api
      protocol: TCP
      port: 50061
      targetPort: 50061
  type: LoadBalancer
```
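The Deployment above mounts a Secret named `tls-secret` into `/root`, but that Secret is not defined anywhere in this guide. A minimal sketch of what it could look like, assuming you already have a PEM certificate and key (the base64 payloads are placeholders you must generate yourself, e.g. with `base64 -w0 tls.crt`):

```yaml
# Illustrative sketch of the tls-secret referenced by the Deployment's
# tls-volume. Replace the placeholder values with your own material.
apiVersion: v1
kind: Secret
metadata:
  name: tls-secret
  namespace: nativelink
type: kubernetes.io/tls
data:
  tls.crt: <base64-encoded certificate>
  tls.key: <base64-encoded private key>
```

If you do not use TLS, remove the `tls-volume` mount and volume from the Deployment instead.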
Worker Deployment
Deploy dedicated workers that connect to the scheduler:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nativelink-worker
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nativelink-worker
  template:
    metadata:
      labels:
        app: nativelink-worker
    spec:
      containers:
        - name: worker
          image: trace_machina/nativelink:latest
          env:
            - name: RUST_LOG
              value: info
            - name: SCHEDULER_ENDPOINT
              # <service>.<namespace>.svc.cluster.local
              value: nativelink.nativelink.svc.cluster.local
          volumeMounts:
            - name: worker-config
              mountPath: /worker.json5
              subPath: worker.json5
            - name: cas-storage
              mountPath: /data/cas
            - name: work-dir
              mountPath: /tmp/work
          args: ["/worker.json5"]
          resources:
            requests:
              memory: "4Gi"
              cpu: "2000m"
            limits:
              memory: "8Gi"
              cpu: "4000m"
      volumes:
        - name: worker-config
          configMap:
            name: worker-config
        - name: cas-storage
          persistentVolumeClaim:
            claimName: cas-storage-pvc
        - name: work-dir
          emptyDir: {}
```
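Each worker can consume up to 8Gi of memory, so co-scheduling several replicas on one node can starve it. A sketch of a standard Kubernetes `topologySpreadConstraints` stanza to spread workers across nodes (not part of the NativeLink manifests; add it under `spec.template.spec` of the worker Deployment):

```yaml
# Illustrative: spreads worker pods evenly across nodes where possible.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: nativelink-worker
```

Use `whenUnsatisfiable: DoNotSchedule` instead if you prefer pending pods over uneven placement.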
Persistent Storage
Use shared storage (NFS, S3, GCS) for multi-worker setups to ensure all workers can access the same CAS data.
PersistentVolumeClaim
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cas-storage-pvc
spec:
  accessModes:
    - ReadWriteMany # Required for multi-worker
  resources:
    requests:
      storage: 100Gi
  storageClassName: nfs-client # Use your storage class
```
S3 Backend
For cloud deployments, use S3-compatible storage:
```json5
stores: [
  {
    name: "CAS_S3_STORE",
    experimental_cloud_object_store: {
      provider: "aws",
      region: "us-east-1",
      bucket: "nativelink-cas",
      key_prefix: "cas/",
      retry: {
        max_retries: 6,
        delay: 0.3,
        jitter: 0.5,
      },
    },
  },
]
```
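The store configuration above carries no credentials. One common pattern, assuming a Secret named `aws-credentials` exists (the Secret name and key names here are illustrative, not defined by NativeLink), is to inject standard AWS environment variables into the container:

```yaml
# Illustrative sketch: add under the container's env in the Deployment.
- name: AWS_ACCESS_KEY_ID
  valueFrom:
    secretKeyRef:
      name: aws-credentials
      key: access-key-id
- name: AWS_SECRET_ACCESS_KEY
  valueFrom:
    secretKeyRef:
      name: aws-credentials
      key: secret-access-key
```

On EKS, workload identity mechanisms such as IAM Roles for Service Accounts avoid storing long-lived keys in the cluster at all.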
Kustomize Setup
Use Kustomize for environment-specific configurations:
```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: nativelink
resources:
  - nativelink.yaml
  - worker.yaml
  - pvc.yaml
configMapGenerator:
  - name: nativelink-config
    files:
      - configs/nativelink-config.json5
  - name: worker-config
    files:
      - configs/worker.json5
images:
  - name: nativelink
    newName: trace_machina/nativelink
    newTag: v0.5.0 # Pin to specific version
```
Deploy with `kubectl apply -k .` from the directory containing the `kustomization.yaml`.
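For environment-specific configuration, a conventional layout keeps the kustomization above as a base and patches it from per-environment overlays. A minimal sketch of a production overlay (the directory layout and replica count are illustrative):

```yaml
# overlays/production/kustomization.yaml (illustrative layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - target:
      kind: Deployment
      name: nativelink-worker
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 6
```

Apply the overlay with `kubectl apply -k overlays/production`.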
Autoscaling
Horizontal Pod Autoscaler
Scale workers based on CPU/memory:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nativelink-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nativelink-worker
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```
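Build traffic is bursty, and aggressive scale-down can terminate workers mid-action. The `autoscaling/v2` API's `behavior` field can dampen this; a sketch with illustrative values (add under `spec` of the HPA above):

```yaml
# Illustrative: slow down scale-in so workers aren't churned between builds.
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300 # Wait 5 min of low load before scaling down
    policies:
      - type: Pods
        value: 1
        periodSeconds: 60 # Remove at most one worker per minute
```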
KEDA for Job-Based Scaling
Scale based on queue length using KEDA:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: nativelink-worker-scaler
spec:
  scaleTargetRef:
    name: nativelink-worker
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: nativelink_queue_length
        query: sum(nativelink_scheduler_queue_length)
        threshold: '10'
```
Health Checks
Add liveness and readiness probes:
```yaml
livenessProbe:
  httpGet:
    path: /status
    port: 50061
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /status
    port: 50061
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
```
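Since the prerequisites already require Kubernetes 1.24+, native gRPC probes are also available. If you would rather probe the gRPC port directly than the HTTP `/status` endpoint, a sketch (this assumes the server implements the standard gRPC health-checking protocol on that port, which you should verify for your NativeLink version):

```yaml
# Illustrative alternative: kubelet issues a gRPC health check instead of HTTP.
livenessProbe:
  grpc:
    port: 50051
  initialDelaySeconds: 30
  periodSeconds: 10
```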
Monitoring
Prometheus ServiceMonitor
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nativelink
spec:
  selector:
    matchLabels:
      app: nativelink
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
```
OpenTelemetry Collector
Deploy the OpenTelemetry Operator, which manages collector instances for traces:

```shell
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml
```
See Production Deployment for complete monitoring setup.
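Once the operator is running, it needs an `OpenTelemetryCollector` resource to create an actual collector. The `OTEL_EXPORTER_OTLP_ENDPOINT` used earlier (`otel-collector-collector.default.svc:4317`) implies a collector named `otel-collector` in the `default` namespace, since the operator names its Service `<name>-collector`. A minimal sketch that only receives OTLP and logs it (the `debug` exporter is a placeholder; swap in your real backend):

```yaml
# Illustrative minimal collector matching the endpoint configured above.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  namespace: default
spec:
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    exporters:
      debug: {}
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [debug]
```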
Ingress
Expose NativeLink via Ingress:
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nativelink
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
spec:
  ingressClassName: nginx
  rules:
    - host: nativelink.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: nativelink
                port:
                  number: 50051
  tls:
    - hosts:
        - nativelink.example.com
      secretName: nativelink-tls
```
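The `nativelink-tls` Secret must exist before the Ingress can terminate TLS. If cert-manager is installed in the cluster (an assumption; it is not covered by this guide), it can issue the certificate automatically via an extra annotation on the Ingress; `letsencrypt-prod` is a placeholder ClusterIssuer name:

```yaml
# Illustrative: merge into metadata.annotations of the Ingress above.
annotations:
  cert-manager.io/cluster-issuer: letsencrypt-prod
```

Otherwise, create the Secret manually with `kubectl create secret tls nativelink-tls` from your own certificate and key.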
Troubleshooting
Pod Not Starting
```shell
# Check pod status
kubectl describe pod <pod-name> -n nativelink

# View logs
kubectl logs <pod-name> -n nativelink --previous

# Check events
kubectl get events -n nativelink --sort-by='.lastTimestamp'
```
Storage Issues
```shell
# Check PVC status
kubectl get pvc -n nativelink

# Describe PVC
kubectl describe pvc cas-storage-pvc -n nativelink

# Check storage class
kubectl get storageclass
```
Worker Connection Issues
```shell
# Test network connectivity (service is in the nativelink namespace)
kubectl run -it --rm debug --image=busybox --restart=Never -- \
  wget -O- http://nativelink.nativelink.svc.cluster.local:50061/status

# Check DNS resolution
kubectl run -it --rm debug --image=busybox --restart=Never -- \
  nslookup nativelink.nativelink.svc.cluster.local
```