Metrics Not Appearing
Verify OTEL Configuration
Check that NativeLink is configured with OpenTelemetry environment variables:
- OTEL_EXPORTER_OTLP_ENDPOINT
- OTEL_EXPORTER_OTLP_PROTOCOL
- OTEL_SERVICE_NAME
If variables are missing
Set the required environment variables before starting NativeLink, then restart NativeLink.
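A minimal shell sketch of setting the three variables and confirming that none is empty. The endpoint, protocol, and service name values are assumptions for a collector listening on the local host over gRPC; substitute the values for your deployment. The helper `missing_otel_vars` is a hypothetical name used only for illustration.

```shell
# Assumed values: a collector listening on localhost over gRPC.
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
export OTEL_SERVICE_NAME="nativelink"

# Print the name of every required variable that is unset or empty.
missing_otel_vars() {
  for var in OTEL_EXPORTER_OTLP_ENDPOINT OTEL_EXPORTER_OTLP_PROTOCOL OTEL_SERVICE_NAME; do
    if [ -z "$(printenv "$var")" ]; then
      echo "$var"
    fi
  done
}

missing_otel_vars
```

No output means all three variables are visible to processes started from this shell.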
Check Collector Health
Verify the OTEL Collector is receiving metrics.
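One way to check this, sketched under assumptions: the collector's own telemetry is enabled and served on its default port 8888, and it reports a counter whose name starts with `otelcol_receiver_accepted_metric_points`. The helper name `accepted_metric_points` is hypothetical.

```shell
# Keep only the collector's accepted-metric-points counter lines from
# Prometheus text read on stdin.
accepted_metric_points() {
  grep '^otelcol_receiver_accepted_metric_points'
}

# Usage, assuming the collector's own telemetry is served on localhost:8888:
#   curl -s http://localhost:8888/metrics | accepted_metric_points
```

If the counter value grows between two runs, the receiver is accepting metric points. No matching lines suggests that nothing has arrived yet or that internal telemetry is configured differently.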
If collector is not receiving metrics
- Check NativeLink logs for connection errors
- Verify network connectivity to collector endpoint
- Check firewall rules allow traffic on port 4317 (gRPC) or 4318 (HTTP)
- Ensure collector is running:
docker ps | grep otel-collector
Inspect Collector Logs
In the collector logs, look for:
- Connection errors from receivers
- Export errors to Prometheus
- Resource exhaustion warnings
Cache Metrics Missing
If you see nativelink_execution_* metrics but no nativelink_cache_* metrics:
Workaround
Use execution cache hit metrics instead.
High Memory Usage
Adjust Collector Batch Size
Reduce memory usage by decreasing the batch size in otel-collector-config.yaml.
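As an illustration, the snippet here writes an example `batch` processor fragment to a scratch file (the hypothetical `batch-fragment.yaml`) so it can be compared with, and merged by hand into, the live collector config. The option names are those of the OpenTelemetry Collector batch processor; the numbers are assumed starting points, not tuned recommendations.

```shell
# Write an example fragment to a scratch file; merge it into
# otel-collector-config.yaml by hand. The numbers are assumed, not tuned.
cat > batch-fragment.yaml <<'EOF'
processors:
  batch:
    send_batch_size: 512       # smaller batches hold fewer points in memory
    send_batch_max_size: 1024  # hard upper bound on a single batch
    timeout: 5s                # flush even if the batch is not full
EOF
```

The processor only takes effect if `batch` is also listed under the `processors` of the metrics pipeline.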
Increase Memory Limits
Increase the memory limiter threshold in otel-collector-config.yaml.
Reduce Metric Cardinality
Drop high-cardinality labels in otel-collector-config.yaml.
See the Metrics Reference for guidance on identifying high-cardinality labels.
Out-of-Order Samples
If Prometheus logs show “out of order sample” errors:
Enable Out-of-Order Ingestion
Edit prometheus-config.yaml to accept out-of-order samples.
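A sketch of the relevant Prometheus setting, written to a scratch file (the hypothetical `prometheus-ooo-fragment.yaml`) for merging by hand. `out_of_order_time_window` is the Prometheus TSDB option that controls how old an out-of-order sample may be and still be accepted; the 30m window is an assumed value.

```shell
# Write an example fragment to a scratch file; merge it into
# prometheus-config.yaml by hand. The 30m window is an assumed value.
cat > prometheus-ooo-fragment.yaml <<'EOF'
storage:
  tsdb:
    out_of_order_time_window: 30m  # accept samples up to 30 minutes late
EOF
```

Prometheus needs a restart or a config reload before the new window applies.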
Why this happens
Out-of-order samples occur when:
- Multiple NativeLink instances export metrics with slightly different timestamps
- Network delays cause metrics to arrive out of sequence
- Clock skew between instances
Worker Connection Issues
Workers Not Connecting to Scheduler
Error: “No workers available” in scheduler logs
Network connectivity
Verify that the worker can reach the scheduler, then check:
- DNS resolution of scheduler hostname
- Network policies allow traffic
- Firewall rules permit connection
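The reachability check can be sketched as a small shell helper, run from the worker host. It assumes bash (for `/dev/tcp` redirection) and the coreutils `timeout` command are available; `check_reachable` and the scheduler address in the usage comment are hypothetical.

```shell
# Report whether a TCP connection to host:port succeeds within 3 seconds.
# Relies on bash's /dev/tcp redirection and coreutils' timeout.
check_reachable() {
  host="$1"
  port="$2"
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "reachable: $host:$port"
  else
    echo "unreachable: $host:$port"
  fi
}

# Hypothetical scheduler address; replace with your own host and port.
# check_reachable scheduler.example.internal 50061
```

An "unreachable" result does not say which layer failed. A hostname that does not resolve, a blocking network policy, and a firewall drop all look the same here, so work through the checklist to narrow it down.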
Configuration mismatch
Ensure worker and scheduler have matching:
- Instance names
- Platform properties
- Authentication settings
Worker timeout
Workers are removed after worker_timeout_s (default 5s) without a keepalive. Increase the timeout in the scheduler config.
Worker Disconnecting Frequently
Error: Workers show as connected, then quickly disconnect.
Check for:
- Resource exhaustion on worker nodes
- Worker OOM kills
- Network instability
  - Check for packet loss
  - Verify network quality between worker and scheduler
Execution Failures
Actions Timing Out
Error: execution_result="timeout"
Check action timeout configuration
Increase timeouts in scheduler config:
- client_action_timeout_s: Max time without client updates
- max_action_executing_timeout_s: Max execution time on worker
Worker stuck on actions
If workers are alive but actions time out:
- Check worker resource utilization
- Review action logs for hangs
- Enable max_action_executing_timeout_s to re-queue stuck actions
High Retry Rate
Symptom: nativelink_execution_retry_count_total increasing rapidly
Check for failing workers
Query metrics for worker-specific failures. If failures are concentrated on specific workers:
- Drain the problematic worker
- Check worker logs for errors
- Verify worker has required dependencies
Reduce max retries
Prevent infinite retries by lowering the retry threshold. The default is 3. Actions that exceed this limit return the last error to the client.
Cache Issues
Low Cache Hit Rate
Symptom: Cache hit rate below expected threshold
Check cache size limits
Verify that the cache isn’t evicting entries prematurely. If usage is consistently at 100%, increase the cache size.
Review eviction rates
High eviction rates indicate an undersized cache. Solutions:
- Increase cache size
- Use tiered storage (FastSlow store)
- Implement size partitioning for large objects
Verify cache key consistency
Inconsistent action keys reduce hit rate:
- Check for non-deterministic build inputs
- Verify platform properties match across builds
- Review action digest computation
Cache Operation Errors
Error: cache_operation_result="error"
Filesystem store errors
Common filesystem issues:
- Disk full
- Permission denied
- I/O errors
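A rough shell sketch for the disk-full and permission cases. `check_store_dir` is a hypothetical helper, and the path in the usage comment is a placeholder for whatever directory your filesystem store is configured to use. I/O errors generally need the kernel log and are not covered by this check.

```shell
# Print free space for the store directory and whether the current user
# can write to it.
check_store_dir() {
  dir="$1"
  df -h "$dir"                 # a Use% near 100% points to a full disk
  if [ -w "$dir" ]; then
    echo "writable: $dir"
  else
    echo "not writable: $dir"  # consistent with "Permission denied" errors
    ls -ld "$dir"              # show owner and mode for comparison
  fi
}

# Hypothetical store path; use the directory from your own config.
# check_store_dir /var/lib/nativelink/cas
```

Run it as the same user that runs NativeLink. The writability test reflects the calling user's permissions, not the service's.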
S3/Cloud store errors
Authentication failures:
- Verify AWS credentials are valid
- Check IAM permissions for bucket access
- Ensure credentials haven’t expired
Redis store errors
- Connection refused
- Out of memory: increase Redis max memory or enable eviction
Performance Issues
High Queue Depth
Symptom: nativelink:queue_depth consistently high
Scale workers
Add more worker capacity and monitor worker utilization.
Optimize worker allocation
Change allocation strategy in scheduler:
- least_recently_used: Distribute evenly (default)
- most_recently_used: Maximize cache locality
Slow Cache Operations
Symptom: High P95 cache latency
Use tiered storage
Implement a FastSlow store for hot/cold data.
Enable compression
Reduce I/O for network-backed stores. Note: compression adds CPU overhead but reduces network transfer.
Alert Rules
Add these alert rules to prometheus-alerts.yml to catch issues early:
- High Error Rate
- Cache Miss Rate High
- Queue Backlog
- Worker Utilization Low
- Component Health Failed
Getting Help
If you’re still experiencing issues:
- GitHub Issues: Report bugs or request features
- Discord Community: Get help from the community
- Documentation: Browse full documentation
- Performance Tuning: Optimize your deployment