Cache Optimization
Memory Cache Configuration
Memory caches provide the fastest access but are limited by available RAM.

nativelink-config.json
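A minimal sketch of memory-backed CAS and AC stores, assuming the nativelink-config store schema (store names and byte sizes here are illustrative; verify field names against the configuration reference):

```json
{
  "stores": {
    "CAS_MAIN": {
      "memory": {
        "eviction_policy": { "max_bytes": 10000000000 }
      }
    },
    "AC_MAIN": {
      "memory": {
        "eviction_policy": { "max_bytes": 2000000000 }
      }
    }
  }
}
```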
Sizing recommendations
Small deployments (< 10 workers):
- CAS: 5-10 GB
- AC: 1-2 GB

Medium deployments:
- CAS: 20-50 GB
- AC: 5-10 GB

Large deployments:
- CAS: 100+ GB
- AC: 20+ GB
Monitor nativelink_cache_size and eviction rates to right-size your cache.

Tiered Storage (FastSlow)
Combine a fast memory cache with slower persistent storage.

The FastSlow store:
- Checks fast tier first on reads
- Promotes slow tier hits to fast tier
- Writes to both tiers simultaneously
- Assumes fast tier presence implies slow tier presence
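A tiered CAS might be sketched as follows, assuming a memory fast tier over a filesystem slow tier (store name, paths, and sizes are illustrative):

```json
{
  "stores": {
    "CAS_MAIN": {
      "fast_slow": {
        "fast": {
          "memory": { "eviction_policy": { "max_bytes": 5000000000 } }
        },
        "slow": {
          "filesystem": {
            "content_path": "/data/cas/content",
            "temp_path": "/data/cas/tmp",
            "eviction_policy": { "max_bytes": 100000000000 }
          }
        }
      }
    }
  }
}
```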
Deduplication Store
For workloads with similar files (e.g., incremental builds):

When to use deduplication
Good for:
- Incremental builds with mostly unchanged files
- Large binary artifacts with common sections
- Uncompressed content

Not ideal for:
- Compressed or encrypted content
- Highly diverse files
- When upload/download isn't the bottleneck

Trade-offs:
- CPU overhead for rolling hash computation
- Storage reduction: 30-70% for typical builds
- Network reduction: similar to storage reduction
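A dedup store sketch, assuming the store wraps an index store and a content store with rolling-hash chunk-size bounds (names, paths, and chunk sizes are illustrative; check the store reference before use):

```json
{
  "stores": {
    "CAS_DEDUP": {
      "dedup": {
        "index_store": {
          "memory": { "eviction_policy": { "max_bytes": 500000000 } }
        },
        "content_store": {
          "filesystem": {
            "content_path": "/data/dedup/content",
            "temp_path": "/data/dedup/tmp"
          }
        },
        "min_size": 65536,
        "normal_size": 262144,
        "max_size": 524288
      }
    }
  }
}
```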
Size Partitioning
Route small and large objects to different stores.
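One possible sketch, assuming a size_partitioning store that sends objects below a size threshold to a memory store and larger objects to a filesystem store (threshold, names, and paths are illustrative):

```json
{
  "stores": {
    "CAS_PARTITIONED": {
      "size_partitioning": {
        "size": 262144,
        "lower_store": {
          "memory": { "eviction_policy": { "max_bytes": 1000000000 } }
        },
        "upper_store": {
          "filesystem": {
            "content_path": "/data/large/content",
            "temp_path": "/data/large/tmp"
          }
        }
      }
    }
  }
}
```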
Compression

Reduce network transfer and storage at the cost of CPU.

Compression algorithm comparison
LZ4:
- Compression ratio: 2-3x
- Speed: Very fast (500+ MB/s)
- CPU usage: Low
- Best for: Most use cases, hot path caches
Zstd:
- Compression ratio: 3-5x
- Speed: Fast (200-400 MB/s)
- CPU usage: Medium
- Best for: Cold storage, WAN transfers
Use compression when:
- Network bandwidth is limited
- Storage is expensive
- CPU capacity is available

Skip compression when:
- Content is already compressed (images, videos)
- CPU is constrained
- Local/datacenter networking provides high bandwidth
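A compression store wrapping a filesystem backend with LZ4 might be sketched as (store name and paths illustrative):

```json
{
  "stores": {
    "CAS_COMPRESSED": {
      "compression": {
        "compression_algorithm": { "lz4": {} },
        "backend": {
          "filesystem": {
            "content_path": "/data/cas/content",
            "temp_path": "/data/cas/tmp"
          }
        }
      }
    }
  }
}
```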
Scheduler Optimization
Worker Allocation Strategy
nativelink-config.json
least_recently_used (default)
Distributes load evenly across all workers.

Pros:
- Balanced resource utilization
- Prevents worker overload
- Better for heterogeneous workloads
Cons:
- Lower cache locality
- More cache misses on workers
most_recently_used
Prefers recently-used workers to maximize cache hits.

Pros:
- Higher cache hit rate on workers
- Better for repeated builds
- Fewer cold starts
Cons:
- Can create hot spots
- Some workers may be underutilized
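Selecting an allocation strategy in a scheduler block might look like this sketch (scheduler name is illustrative; field names assume the nativelink-config simple-scheduler schema):

```json
{
  "schedulers": {
    "MAIN_SCHEDULER": {
      "simple": {
        "allocation_strategy": "most_recently_used"
      }
    }
  }
}
```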
Timeout Configuration
worker_timeout_s (default: 5)
Time before removing unresponsive workers.

Lower values (5-10s):
- Faster failure detection
- Quicker reallocation of stuck actions
- Risk: Network hiccups remove healthy workers
Higher values:
- Tolerates transient network issues
- Reduces worker churn
- Risk: Slow to detect truly dead workers
client_action_timeout_s (default: 60)
Time before marking actions as failed if the client stops updating.

Recommendation:
- 300s (5 min) for interactive builds
- 600s (10 min) for CI/CD
- Match your client’s expected update interval
max_action_executing_timeout_s (default: 0/disabled)
Maximum execution time regardless of worker keepalives.

When to enable:
- Workers occasionally hang on specific actions
- Need hard limit on execution time
- Want to enforce build time SLOs
Recommended values:
- 1800s (30 min) for typical builds
- 3600s (1 hour) for long-running tests
- 0 (disabled) if relying only on worker_timeout_s
retain_completed_for_s (default: 60)
How long to keep completed action results in memory.

Lower values (30-60s):
- Less memory usage
- Risk: WaitExecution calls may miss results
Higher values:
- Better for slow clients
- More memory usage
- Useful for debugging
Retry Configuration
Retries apply to internal errors and timeouts. If an action fails max_job_retries times, the scheduler returns the last error to the client instead of retrying indefinitely.

Recommended values:
- 2-3 retries: Most deployments (default: 3)
- 0-1 retries: Flaky infrastructure, prefer failing fast
- 5+ retries: Very unreliable workers (investigate root cause instead)
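Putting the timeout and retry settings together, a scheduler sketch using the values recommended above (scheduler name is illustrative; verify field names against the scheduler reference):

```json
{
  "schedulers": {
    "MAIN_SCHEDULER": {
      "simple": {
        "worker_timeout_s": 30,
        "client_action_timeout_s": 300,
        "max_action_executing_timeout_s": 1800,
        "retain_completed_for_s": 120,
        "max_job_retries": 3
      }
    }
  }
}
```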
Worker Configuration
Concurrent Actions
Control how many actions a worker executes simultaneously.

worker-config.json
Sizing guidelines
CPU-bound workloads (compilation):
- 1 action per CPU core
- Example: 8-core machine → max_concurrent_actions: 8
I/O-bound workloads:
- 2-4 actions per CPU core
- Example: 8-core machine → max_concurrent_actions: 16-32

Mixed workloads:
- Start with 1.5x CPU cores
- Monitor CPU and I/O wait
- Adjust based on utilization

Memory considerations:
- Calculate per-action memory: total_memory / max_concurrent_actions
- Ensure sufficient memory for the largest expected action
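The sizing guidelines above can be sanity-checked with a small script (a hypothetical helper, not part of NativeLink): it caps concurrency at whichever of the CPU and memory bounds is tighter.

```python
def max_concurrent_actions(cpu_cores: int, total_memory_gb: float,
                           per_action_memory_gb: float,
                           actions_per_core: float = 1.0) -> int:
    """Suggested max_concurrent_actions: the tighter of the CPU and memory bounds."""
    cpu_bound = int(cpu_cores * actions_per_core)
    memory_bound = int(total_memory_gb // per_action_memory_gb)
    return max(1, min(cpu_bound, memory_bound))

# CPU-bound compilation on an 8-core, 32 GB machine at ~2 GB per action:
print(max_concurrent_actions(8, 32, 2.0))  # 8 (CPU-limited)

# I/O-bound work at 2 actions per core, but ~6 GB per action:
print(max_concurrent_actions(8, 32, 6.0, actions_per_core=2.0))  # 5 (memory-limited)
```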
Platform Properties
Optimize worker matching.

scheduler-config.json
Property type strategies
minimum:
- Worker must have at least the requested value
- Used for: cpu_count, memory_gb, disk_gb
- Example: Action requests cpu_count: 8; a worker with 16 cores matches

exact:
- Worker must exactly match the requested value
- Used for: os, cpu_arch, gpu_type
- Example: Action requests os: linux; only Linux workers match

priority:
- Informational only; doesn't restrict matching
- Passed to the worker but not enforced
- Future: May influence worker preference

ignore:
- Allows the property in actions
- Doesn't require workers to have it
- Used for optional capabilities
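A supported_platform_properties sketch combining the strategies above (scheduler name is illustrative, and build_tag is a hypothetical optional property):

```json
{
  "schedulers": {
    "MAIN_SCHEDULER": {
      "simple": {
        "supported_platform_properties": {
          "cpu_count": "minimum",
          "memory_gb": "minimum",
          "os": "exact",
          "cpu_arch": "exact",
          "gpu_type": "exact",
          "build_tag": "ignore"
        }
      }
    }
  }
}
```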
Network Optimization
gRPC Connection Pooling
connections_per_endpoint
Number of concurrent gRPC connections to each endpoint.

Lower values (1-2):
- Less memory overhead
- Fewer file descriptors
- May bottleneck on high throughput
Higher values (5-10):
- Better throughput for concurrent requests
- More resource usage
- Diminishing returns beyond 10
rpc_timeout_s
Maximum time for RPC calls.

Shorter timeouts (30s-2m):
- Fail fast on network issues
- Better for small objects
- May fail for large uploads/downloads
Longer timeouts:
- Tolerates slow networks
- Required for large objects
- Slower to detect hung connections
Recommendation:
- 5m for typical deployments
- 30m if transferring multi-GB objects
- Match to the largest expected object transfer time
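A gRPC store sketch combining connection pooling with retry settings (endpoint address and values are illustrative; verify field names against the store reference):

```json
{
  "stores": {
    "REMOTE_CAS": {
      "grpc": {
        "instance_name": "main",
        "store_type": "cas",
        "endpoints": [
          { "address": "grpc://cache.example.com:50051" }
        ],
        "connections_per_endpoint": 5,
        "retry": {
          "max_retries": 3,
          "delay": 0.5,
          "jitter": 0.5
        }
      }
    }
  }
}
```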
Retry Configuration
- max_retries: Number of retry attempts (exponential backoff)
- delay: Initial delay in seconds
- jitter: Random factor (0.0-1.0) to prevent thundering herd
The delay before each retry is:

delay * (2 ^ attempt) * (1 + random(-jitter, jitter))
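The backoff formula can be exercised directly (a hypothetical helper, not part of NativeLink):

```python
import random

def retry_delay(delay: float, attempt: int, jitter: float) -> float:
    """Backoff before a retry: delay * (2 ^ attempt) * (1 + random(-jitter, jitter))."""
    return delay * (2 ** attempt) * (1 + random.uniform(-jitter, jitter))

# With delay=0.5 and jitter=0.5, attempt 3 waits somewhere in [2.0, 6.0] seconds:
# base 0.5 * 2^3 = 4.0, scaled by a random factor in [0.5, 1.5].
for attempt in range(4):
    print(f"attempt {attempt}: {retry_delay(0.5, attempt, 0.5):.2f}s")
```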
Monitoring-Driven Optimization

Key Metrics to Track
Cache Hit Rate
Worker Utilization
Queue Depth
P95 Latency
Optimization Workflow
Identify bottleneck
Check key metrics:
- High queue depth → Need more workers
- Low cache hit rate → Increase cache size or review keys
- High P95 latency → Use tiered storage or compression
- Low worker utilization → Reduce worker count or improve allocation
Make targeted change
Apply one optimization at a time:
- Adjust configuration
- Monitor for 15-30 minutes
- Compare before/after metrics
Resource Limits
OpenTelemetry Collector
otel-collector-config.yaml
Tuning guidelines
High throughput (many workers, high QPS):
- limit_mib: 1024+
- send_batch_size: 2048
- timeout: 5s

Low throughput:
- limit_mib: 256
- send_batch_size: 512
- timeout: 30s
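A collector processors sketch using the high-throughput values above (standard OpenTelemetry Collector memory_limiter and batch processors):

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1024
  batch:
    send_batch_size: 2048
    timeout: 5s
```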
Verify that otelcol_processor_refused_metric_points stays at 0.

Prometheus Storage
prometheus-config.yaml
Estimate Prometheus storage:
samples/sec * retention_seconds * 1-2 bytes/sample

For 1000 series at a 15s scrape interval retained for 30 days: ~170 MB.
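The worked example above can be reproduced with a short script (a hypothetical helper):

```python
def prometheus_storage_bytes(series: int, scrape_interval_s: float,
                             retention_days: float,
                             bytes_per_sample: float = 1.0) -> float:
    """Rough estimate: samples/sec * retention_seconds * bytes/sample."""
    samples_per_sec = series / scrape_interval_s
    retention_s = retention_days * 24 * 3600
    return samples_per_sec * retention_s * bytes_per_sample

# 1000 series scraped every 15s, retained 30 days, at ~1 byte/sample:
print(round(prometheus_storage_bytes(1000, 15, 30) / 1e6), "MB")  # 173 MB
```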
Best Practices Summary

Cache Configuration
- Use tiered storage (memory + disk) for best performance
- Size memory cache to 10-20% of working set
- Enable compression for remote stores
- Use deduplication for incremental builds
Scheduler Tuning
- Set worker_timeout_s to 30s for production
- Use most_recently_used allocation for CI/CD
- Configure max_action_executing_timeout_s to catch hung actions
- Keep max_job_retries at 2-3
Worker Optimization
- Match max_concurrent_actions to workload type
- Define precise platform properties
- Scale workers based on queue depth
- Monitor per-worker cache hit rates
Network Performance
- Use 5 connections per gRPC endpoint
- Set appropriate RPC timeouts for object sizes
- Configure retries with jitter
- Enable compression for WAN transfers
Next Steps
Metrics Reference
Track optimization impact with metrics
Troubleshooting
Debug performance issues
Monitoring Setup
Configure alerting for performance regressions