Architecture Overview
A production NativeLink deployment typically consists of:
- CAS Servers: Content Addressable Storage (1+ replicas)
- Scheduler: Job scheduling and distribution (1+ replicas)
- Workers: Build execution nodes (auto-scaled)
- Storage Backend: S3, GCS, or distributed filesystem
- Monitoring: Prometheus, Grafana, OpenTelemetry
- Load Balancer: gRPC-capable load balancer
Storage Strategy
Cloud Object Storage
For production, use cloud object storage (S3, GCS, Azure Blob) as the primary backend.
Storage Best Practices
Use tiered storage
Combine fast (Redis/memory) and slow (S3/GCS) tiers for optimal performance:
- Memory tier: 2-5GB for hot data
- Redis tier: 50-100GB for frequently accessed objects
- Cloud storage: Unlimited long-term storage
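The tiers above can be sketched as a NativeLink store configuration. This assumes the `fast_slow`, `memory`, and `experimental_s3_store` store types seen in NativeLink's example configs; the store name, bucket, and exact field names are assumptions and may differ between versions.

```json5
{
  "stores": {
    "CAS_STORE": {
      "fast_slow": {
        // Fast tier: in-memory cache for hot objects (~5GB).
        "fast": {
          "memory": {
            "eviction_policy": { "max_bytes": 5000000000 }
          }
        },
        // Slow tier: durable, effectively unlimited cloud storage.
        "slow": {
          "experimental_s3_store": {
            "region": "us-east-1",
            "bucket": "example-nativelink-cas", // hypothetical bucket name
            "key_prefix": "cas/"
          }
        }
      }
    }
  }
}
```

A Redis tier, where used, slots in the same way as an intermediate level between memory and cloud storage.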
Configure lifecycle policies
Set up S3/GCS lifecycle policies to archive or delete old objects:
- Transition to cheaper storage after 90 days
- Delete objects older than 1 year
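For S3, the two rules above can be sketched as a lifecycle configuration (bucket prefix and rule ID are hypothetical):

```json
{
  "Rules": [
    {
      "ID": "nativelink-cas-lifecycle",
      "Status": "Enabled",
      "Filter": { "Prefix": "cas/" },
      "Transitions": [ { "Days": 90, "StorageClass": "GLACIER" } ],
      "Expiration": { "Days": 365 }
    }
  ]
}
```

Apply it with `aws s3api put-bucket-lifecycle-configuration --bucket <bucket> --lifecycle-configuration file://lifecycle.json`.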
High Availability
Scheduler Redundancy
Run multiple scheduler replicas with a load balancer.
CAS Server Redundancy
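The same multi-replica pattern applies to both the scheduler and the CAS servers. A Kubernetes sketch for the scheduler (the image tag, labels, and port are assumptions; pin a specific image version in production):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nativelink-scheduler
spec:
  replicas: 3  # survive single-pod failure
  selector:
    matchLabels:
      app: nativelink-scheduler
  template:
    metadata:
      labels:
        app: nativelink-scheduler
    spec:
      containers:
        - name: scheduler
          image: ghcr.io/tracemachina/nativelink:latest  # pin a version in production
          ports:
            - containerPort: 50052  # assumed gRPC port
```

A CAS Deployment looks the same with its own labels, port, and replica count.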
Load Balancing
Use gRPC-aware (L7) load balancing: gRPC multiplexes requests over long-lived HTTP/2 connections, so a connection-level (L4) balancer can pin all of a client's traffic to a single backend.
Security
TLS/SSL Encryption
Always use TLS in production.
Certificate Management
Use cert-manager for automated certificate rotation.
Network Policies
Restrict network access between components.
Access Control
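One concrete form of access control is a Kubernetes NetworkPolicy limiting who can reach the scheduler's gRPC port. A sketch (namespace, labels, and port are hypothetical):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: nativelink-scheduler-ingress
  namespace: nativelink
spec:
  podSelector:
    matchLabels:
      app: nativelink-scheduler
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Only workers and the gRPC load balancer may connect.
        - podSelector:
            matchLabels:
              app: nativelink-worker
        - podSelector:
            matchLabels:
              app: grpc-lb
      ports:
        - protocol: TCP
          port: 50052
```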
Monitoring Setup
Prometheus and Grafana
Deploy Prometheus for metrics collection and Grafana for dashboards.
Key Metrics
Monitor these critical metrics:

| Metric | Description | Alert Threshold |
|---|---|---|
| nativelink_scheduler_queue_length | Pending jobs | > 100 |
| nativelink_worker_active_count | Active workers | < 2 |
| nativelink_cas_hit_rate | Cache hit ratio | < 0.7 |
| nativelink_request_duration_seconds | Request latency | p99 > 5s |
| nativelink_store_size_bytes | Storage usage | > 0.9 * max |
| nativelink_worker_execution_failures | Failed executions | > 5 per minute |
OpenTelemetry Configuration
Configure NativeLink to export telemetry.
Alert Rules
prometheus-alerts.yml
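A sketch of prometheus-alerts.yml built from the thresholds in the metrics table above; the alert names, `for` durations, and severities are assumptions:

```yaml
groups:
  - name: nativelink
    rules:
      - alert: SchedulerQueueBackedUp
        expr: nativelink_scheduler_queue_length > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "More than 100 jobs pending in the scheduler queue"
      - alert: TooFewActiveWorkers
        expr: nativelink_worker_active_count < 2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Fewer than 2 workers are active"
      - alert: LowCasHitRate
        expr: nativelink_cas_hit_rate < 0.7
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "CAS cache hit rate below 70%"
```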
Auto-Scaling
Horizontal Pod Autoscaling
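A standard `autoscaling/v2` HorizontalPodAutoscaler can scale workers on CPU utilization. A sketch (the Deployment name and replica bounds are assumptions):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nativelink-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nativelink-worker
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # add pods when average CPU exceeds 70%
```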
KEDA Scaling
For more advanced scaling based on queue metrics, use KEDA, which can drive replica counts from the scheduler queue length rather than CPU.
Backup and Disaster Recovery
Configuration Backup
Data Recovery
With cloud object storage (S3/GCS), data is automatically replicated by the provider. Enable bucket versioning so deleted or overwritten objects can be recovered.
Incident Response Plan
Mitigate
- Scale up workers if queue is backed up
- Restart affected pods
- Failover to backup region if available
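The first two mitigation steps can be performed with kubectl; resource names and the namespace here are hypothetical:

```shell
# Scale up workers when the queue is backed up.
kubectl -n nativelink scale deployment nativelink-worker --replicas=20

# Restart affected components.
kubectl -n nativelink rollout restart deployment nativelink-scheduler

# Watch the rollout recover.
kubectl -n nativelink get pods -w
```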
Performance Tuning
Resource Allocation
CAS Server:
- CPU: 4-8 cores
- Memory: 8-16GB
- Disk I/O: High-performance SSD or network-attached storage
Scheduler:
- CPU: 2-4 cores
- Memory: 4-8GB
Workers:
- CPU: Based on build workload (4-32 cores)
- Memory: 2GB per CPU core
- Disk: 100GB+ for work directory
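On Kubernetes, the worker sizing above maps onto a container resource block. A sketch for an 8-core worker (2GB per core, 100GB work directory):

```yaml
resources:
  requests:
    cpu: "8"
    memory: 16Gi           # 2GB per CPU core
    ephemeral-storage: 100Gi  # work directory
  limits:
    cpu: "8"
    memory: 16Gi
```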
Connection Pooling
File Descriptor Limits
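CAS servers hold many open files and sockets, so raise the file descriptor limit above typical defaults. For example (the value 65536 is an assumption; tune to your workload):

```shell
# Check the current soft limit.
ulimit -n

# Raise it for the current shell session.
ulimit -n 65536
```

When NativeLink runs under systemd, set `LimitNOFILE=65536` in the service unit instead so the limit survives restarts.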
Cost Optimization
For Docker-based monitoring setup, see the complete configuration in deployment-examples/metrics/docker-compose.yaml.