NativeLink exposes comprehensive metrics about cache operations and remote execution through OpenTelemetry (OTEL). This guide covers setting up monitoring infrastructure to track system health, performance, and reliability.
Overview
NativeLink provides insights into:
- Cache Performance: Hit rates, operation latencies, eviction rates
- Execution Pipeline: Queue times, stage durations, success rates
- System Health: Worker utilization, throughput, error rates
Quick Start with Docker Compose
Start the metrics stack
- OpenTelemetry Collector (ports 4317, 4318)
- Prometheus (port 9091)
- Grafana (port 3000)
- AlertManager (port 9093)
Access metrics dashboards
- Prometheus UI: http://localhost:9091
- Grafana: http://localhost:3000 (admin/admin)
- OTEL Collector health: http://localhost:13133/health
- Collector metrics: http://localhost:8888/metrics
Kubernetes Deployment
Deploy OTEL Collector
Deploy Prometheus
Configure NativeLink Deployment
Add environment variables to your NativeLink deployment.
Environment Variables
OTLP Exporter Configuration
Resource Attributes
Recommended resource attributes:
- deployment.environment: Environment name (dev, staging, prod)
- nativelink.instance_name: Instance identifier
- k8s.cluster.name: Kubernetes cluster name
- cloud.region: Cloud provider region
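As a minimal sketch, the recommended attributes can be assembled into the comma-separated `key=value` form that the standard OpenTelemetry `OTEL_RESOURCE_ATTRIBUTES` environment variable expects. The helper name and the example values below are hypothetical and are not taken from NativeLink.

```python
from urllib.parse import quote


def build_resource_attributes(attrs: dict[str, str]) -> str:
    """Join attributes as comma-separated key=value pairs."""
    # Percent-encode values so that ',' and '=' cannot break the list format.
    return ",".join(f"{key}={quote(value, safe='')}" for key, value in attrs.items())


# Hypothetical example values, for illustration only.
example = build_resource_attributes({
    "deployment.environment": "staging",
    "nativelink.instance_name": "main",
    "k8s.cluster.name": "build-cluster",
    "cloud.region": "us-east-1",
})
print(example)
```

The resulting string could then be exported as `OTEL_RESOURCE_ATTRIBUTES` in the environment of the NativeLink process.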
Metric Export Configuration
Metrics Server Options
Prometheus (Recommended)
Prometheus offers native OTLP support and excellent query capabilities.
Direct OTLP Ingestion
Via Collector Scraping
Configure Prometheus to scrape the OTEL Collector:
prometheus-config.yaml
Grafana Cloud
For managed metrics:
otel-collector-config.yaml
ClickHouse
For high-volume metrics storage:
otel-collector-config.yaml
Grafana Dashboards
Import Pre-built Dashboard
NativeLink includes a comprehensive Grafana dashboard.
Key Dashboard Panels
The included dashboard provides:
Execution Pipeline
- Queue depth over time
- Stage duration percentiles
- Success/failure rates
- Active actions by stage
Cache Performance
- Hit/miss rates by cache type
- Operation latency distributions
- Eviction rates
- Cache size utilization
Worker Metrics
- Worker utilization heatmap
- Actions per worker
- Worker timeout events
Error Tracking
- Error rates by type
- Failed execution breakdown
- Retry counts
Example Queries
Cache Hit Rate
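The arithmetic behind a hit-rate panel is the ratio of hits to total lookups over a time window. The sketch below assumes two hypothetical monotonic counters sampled at the start and end of the window; the function and parameter names are illustrative and do not correspond to NativeLink metric names.

```python
def hit_rate(hits_start, hits_end, misses_start, misses_end):
    """Return the cache hit rate over a window, or None if there was no traffic."""
    hits = hits_end - hits_start
    misses = misses_end - misses_start
    total = hits + misses
    if total <= 0:
        # No lookups in the window, so the ratio is undefined.
        return None
    return hits / total


# 900 hits and 100 misses during the window.
print(hit_rate(1000, 1900, 200, 300))
```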
Execution Success Rate
Queue Depth by Priority
P95 Cache Operation Latency
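Percentile latencies are typically estimated from cumulative histogram buckets and not from raw samples. The sketch below mirrors the linear interpolation that the Prometheus `histogram_quantile` function applies within a bucket; the bucket bounds and counts are made-up sample data, and the function name is hypothetical.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with an (inf, total) bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            in_bucket = count - lower_count
            if math.isinf(upper_bound) or in_bucket == 0:
                # Overflow or empty bucket: fall back to the last finite bound.
                return lower_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Made-up latency buckets in seconds: (upper bound, cumulative count).
sample = [(0.01, 50), (0.05, 80), (0.1, 95), (0.5, 100), (math.inf, 100)]
print(estimate_quantile(0.95, sample))
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies of interest.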
Worker Utilization
Joining with Resource Attributes
Use target_info to join metrics with resource attributes:
Alerting
See the Troubleshooting page for alert rule examples.
Health Check Endpoint
NativeLink exposes a health check endpoint that aggregates component health status.
Endpoint Details
- Path: Configured via health_status_config.path in your server configuration
- Method: GET
- Response Format: JSON
- Response Codes:
  - 200 OK: All components healthy
  - 503 Service Unavailable: One or more components failed or timed out
Configuration
nativelink-config.json
The default timeout is 5 seconds if not specified. Components that don’t respond within this timeout are marked as timed out.
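The timeout behavior can be sketched with `asyncio`: each component check runs under a deadline, and a check that misses it is reported as timed out so that it does not block the whole response. The component names, check functions, and status strings are hypothetical stand-ins, not NativeLink's implementation.

```python
import asyncio

DEFAULT_TIMEOUT_SECONDS = 5.0  # matches the documented default


async def check_with_timeout(check, timeout=DEFAULT_TIMEOUT_SECONDS):
    """Run one health check coroutine; report 'Timeout' if it misses the deadline."""
    try:
        return await asyncio.wait_for(check(), timeout=timeout)
    except asyncio.TimeoutError:
        return "Timeout"


async def collect_statuses(checks, timeout=DEFAULT_TIMEOUT_SECONDS):
    """Run all checks concurrently and map each component name to its status."""
    results = await asyncio.gather(
        *(check_with_timeout(check, timeout) for check in checks.values())
    )
    return dict(zip(checks.keys(), results))


async def fast_check():
    return "Ok"


async def slow_check():
    await asyncio.sleep(1.0)  # simulates an unresponsive component
    return "Ok"


# A short timeout keeps the demonstration quick.
statuses = asyncio.run(
    collect_statuses({"store": fast_check, "scheduler": slow_check}, timeout=0.05)
)
print(statuses)
```

Running the checks concurrently means the total response time is bounded by the timeout and does not grow with the number of components.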
Example Response
Health Status Types
Ok
Component is functioning normally.
Initializing
Component is starting up and not yet ready.
Warning
Component has non-fatal issues but is operational.
Failed
Component is not functioning. Results in 503 response.
Timeout
Component didn’t respond within the configured timeout. Results in 503 response.
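Putting the status types together, the response code follows from the component statuses: any Failed or Timeout yields 503. The sketch below uses a hypothetical enum and function name. Treating Initializing and Warning as 200 is an assumption, drawn from the descriptions above mentioning 503 only for Failed and Timeout.

```python
from enum import Enum


class HealthStatus(Enum):
    OK = "Ok"
    INITIALIZING = "Initializing"
    WARNING = "Warning"
    FAILED = "Failed"
    TIMEOUT = "Timeout"


# Statuses that the descriptions above say result in a 503 response.
UNHEALTHY = {HealthStatus.FAILED, HealthStatus.TIMEOUT}


def response_code(statuses):
    """Return 503 if any component failed or timed out, otherwise 200."""
    # Assumption: Initializing and Warning do not trigger a 503.
    return 503 if any(status in UNHEALTHY for status in statuses) else 200


print(response_code([HealthStatus.OK, HealthStatus.WARNING]))
print(response_code([HealthStatus.OK, HealthStatus.TIMEOUT]))
```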
Using Health Checks
Kubernetes Liveness Probe
Load Balancer Health Check
Next Steps
Metrics Reference
Complete list of available metrics
Troubleshooting
Common issues and solutions
Performance Tuning
Optimize your deployment