NativeLink exposes comprehensive metrics about cache operations and remote execution through OpenTelemetry (OTEL). This guide covers setting up monitoring infrastructure to track system health, performance, and reliability.
Overview
NativeLink provides insights into:
Cache Performance: Hit rates, operation latencies, eviction rates
Execution Pipeline: Queue times, stage durations, success rates
System Health: Worker utilization, throughput, error rates
Quick Start with Docker Compose
Start the metrics stack
cd deployment-examples/metrics
docker-compose up -d
This starts:
OpenTelemetry Collector (ports 4317, 4318)
Prometheus (port 9091)
Grafana (port 3000)
AlertManager (port 9093)
Configure NativeLink
Set environment variables to send metrics to the collector:

export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_SERVICE_NAME=nativelink
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=dev,nativelink.instance_name=main"
Start NativeLink
nativelink /path/to/config.json
Access metrics dashboards
Grafana is available at http://localhost:3000 and Prometheus at http://localhost:9091.
Kubernetes Deployment
Deploy OTEL Collector
kubectl apply -f deployment-examples/metrics/kubernetes/otel-collector.yaml
Deploy Prometheus
kubectl apply -f deployment-examples/metrics/kubernetes/prometheus.yaml
Add environment variables to your NativeLink deployment:
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector:4317"
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "grpc"
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "deployment.environment=prod,k8s.cluster.name=main"
Environment Variables
OTLP Exporter Configuration
# Collector endpoint
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
# Protocol (grpc or http/protobuf)
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
# Optional authentication headers
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer token"
# Compression (none, gzip)
OTEL_EXPORTER_OTLP_COMPRESSION=gzip
Resource Attributes
# Service name (fixed)
OTEL_SERVICE_NAME=nativelink
# Custom attributes for filtering and grouping
OTEL_RESOURCE_ATTRIBUTES="key1=value1,key2=value2"
Recommended resource attributes:
deployment.environment - Environment name (dev, staging, prod)
nativelink.instance_name - Instance identifier
k8s.cluster.name - Kubernetes cluster name
cloud.region - Cloud provider region
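The `OTEL_RESOURCE_ATTRIBUTES` value is a comma-separated list of `key=value` pairs. As a quick illustration of the format (a standalone sketch, not NativeLink code), a minimal parser might look like:

```python
def parse_resource_attributes(raw: str) -> dict[str, str]:
    """Parse an OTEL_RESOURCE_ATTRIBUTES-style string into a dict."""
    attrs = {}
    for pair in raw.split(","):
        if "=" in pair:
            # Split on the first '=' only, so values may contain '='.
            key, _, value = pair.partition("=")
            attrs[key.strip()] = value.strip()
    return attrs

attrs = parse_resource_attributes(
    "deployment.environment=dev,nativelink.instance_name=main"
)
print(attrs["deployment.environment"])  # dev
```

Every attribute set this way appears as a resource attribute on the exported metrics and can be used for filtering and grouping downstream.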
Metric Export Configuration
# Export interval in milliseconds (default: 60s)
OTEL_METRIC_EXPORT_INTERVAL=60000
# Export timeout in milliseconds (default: 30s)
OTEL_METRIC_EXPORT_TIMEOUT=30000
# Disable traces if only metrics are needed
OTEL_TRACES_EXPORTER=none
# Disable logs if only metrics are needed
OTEL_LOGS_EXPORTER=none
Metrics Server Options
Prometheus (Recommended)
Prometheus offers native OTLP support and excellent query capabilities.
Direct OTLP Ingestion
prometheus --web.enable-otlp-receiver \
--storage.tsdb.out-of-order-time-window=30m
Via Collector Scraping
Configure Prometheus to scrape the OTEL Collector:
scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:9090']
Grafana Cloud
For managed metrics:
otel-collector-config.yaml
exporters:
  otlphttp:
    endpoint: https://otlp-gateway-prod-us-central-0.grafana.net/otlp
    headers:
      Authorization: "Bearer ${GRAFANA_CLOUD_TOKEN}"
ClickHouse
For high-volume metrics storage:
otel-collector-config.yaml
exporters:
  clickhouse:
    endpoint: tcp://clickhouse:9000
    database: metrics
    ttl_days: 30
    metrics_table: otel_metrics
Grafana Dashboards
Import Pre-built Dashboard
NativeLink includes a comprehensive Grafana dashboard:
# Import via API
curl -X POST http://admin:admin@localhost:3000/api/dashboards/db \
-H "Content-Type: application/json" \
-d @deployment-examples/metrics/grafana/dashboards/nativelink-overview.json
Or import via the Grafana UI at http://localhost:3000
Key Dashboard Panels
The included dashboard provides:
Execution Pipeline
Queue depth over time
Stage duration percentiles
Success/failure rates
Active actions by stage
Cache Performance
Hit/miss rates by cache type
Operation latency distributions
Eviction rates
Cache size utilization
Worker Metrics
Worker utilization heatmap
Actions per worker
Worker timeout events
Error Tracking
Error rates by type
Failed execution breakdown
Retry counts
Example Queries
Cache Hit Rate
sum(rate(nativelink_cache_operations_total{cache_operation_result="hit"}[5m])) by (cache_type) /
sum(rate(nativelink_cache_operations_total{cache_operation_name="read"}[5m])) by (cache_type)
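As a sanity check on what this ratio computes, here is the same arithmetic on two hypothetical counter samples taken 5 minutes apart. Because `rate()` divides both the numerator and denominator by the same window, the window cancels out of the ratio:

```python
# Hypothetical counter samples, 300 seconds apart.
hits_then, hits_now = 1_000, 1_240    # cache_operation_result="hit"
reads_then, reads_now = 1_200, 1_500  # cache_operation_name="read"

# The hit rate is just delta_hits / delta_reads; the rate() window
# appears in both deltas and cancels.
hit_rate = (hits_now - hits_then) / (reads_now - reads_then)
print(f"{hit_rate:.0%}")  # 80%
```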
Execution Success Rate
sum(rate(nativelink_execution_completed_count_total{execution_result="success"}[5m])) /
sum(rate(nativelink_execution_completed_count_total[5m]))
Queue Depth by Priority
sum(nativelink_execution_active_count{execution_stage="queued"}) by (execution_priority)
P95 Cache Operation Latency
histogram_quantile(0.95,
sum(rate(nativelink_cache_operation_duration_bucket[5m])) by (le, cache_type)
)
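`histogram_quantile` linearly interpolates within Prometheus-style cumulative buckets. A rough sketch of that interpolation, with made-up bucket bounds and counts, is:

```python
def histogram_quantile(q, buckets):
    """Approximate a quantile from cumulative (upper_bound, count) buckets,
    sorted by bound, using linear interpolation as Prometheus does."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Interpolate linearly between the bucket's boundaries.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Hypothetical latency buckets in seconds: 90 of 100 operations took <= 50ms.
buckets = [(0.005, 10), (0.01, 40), (0.05, 90), (0.1, 100)]
print(histogram_quantile(0.95, buckets))  # ~0.075 (75ms)
```

This is why P95 estimates depend on bucket boundaries: the true P95 is only known to lie inside the 50ms-100ms bucket, and the interpolation assumes observations are spread evenly within it.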
Worker Utilization
count(nativelink_execution_active_count{execution_stage="executing"} > 0) /
count(count by (execution_worker_id) (nativelink_execution_active_count))
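In plain terms, the query divides the number of workers with at least one action in the executing stage by the total number of workers reporting the gauge. With hypothetical gauge values:

```python
# Hypothetical gauge values: execution_worker_id -> actions in
# the "executing" stage.
active = {
    "worker-a": 2,
    "worker-b": 0,
    "worker-c": 1,
}

busy = sum(1 for count in active.values() if count > 0)
utilization = busy / len(active)
print(f"{busy}/{len(active)} workers busy, utilization {utilization:.2f}")
```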
Joining with Resource Attributes
Use target_info to join metrics with resource attributes:
rate(nativelink_execution_completed_count_total[5m])
* on (job, instance) group_left (k8s_cluster_name, deployment_environment)
target_info
Alerting
See the Troubleshooting page for alert rule examples.
Health Check Endpoint
NativeLink exposes a health check endpoint that aggregates component health status.
Endpoint Details
Path: Configured via health_status_config.path in your server configuration
Method: GET
Response Format: JSON
Response Codes:
200 OK: All components healthy
503 Service Unavailable: One or more components failed or timed out
Configuration
{
  "servers": [
    {
      "listener": {
        "http": {
          "socket_address": "0.0.0.0:50051"
        }
      },
      "services": {
        "health_status": {
          "path": "/status",
          "timeout_seconds": 5
        }
      }
    }
  ]
}
The default timeout is 5 seconds if not specified. Components that don’t respond within this timeout are marked as timed out.
Example Response
[
  {
    "namespace": "/stores/CAS_MAIN_STORE",
    "status": {
      "Ok": {
        "struct_name": "nativelink_store::filesystem_store::FilesystemStore",
        "message": "Healthy"
      }
    }
  },
  {
    "namespace": "/schedulers/main",
    "status": {
      "Ok": {
        "struct_name": "nativelink_scheduler::simple_scheduler::SimpleScheduler",
        "message": "3 workers connected"
      }
    }
  }
]
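The aggregation rule (503 if any component reports Failed or Timeout, 200 otherwise) can be sketched against a payload shaped like the one above; the component names here are shortened placeholders:

```python
import json

def overall_status(payload: str) -> int:
    """Return the HTTP code implied by a health-status payload:
    503 if any component is Failed or Timeout, else 200."""
    for component in json.loads(payload):
        state = next(iter(component["status"]))  # "Ok", "Failed", ...
        if state in ("Failed", "Timeout"):
            return 503
    return 200

sample = '''[
  {"namespace": "/stores/CAS_MAIN_STORE",
   "status": {"Ok": {"struct_name": "FilesystemStore", "message": "Healthy"}}},
  {"namespace": "/schedulers/main",
   "status": {"Timeout": {"struct_name": "SimpleScheduler"}}}
]'''
print(overall_status(sample))  # 503
```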
Health Status Types
Ok: Component is functioning normally.
{
  "Ok": {
    "struct_name": "component_type",
    "message": "descriptive message"
  }
}
Initializing: Component is starting up and not yet ready.
{
  "Initializing": {
    "struct_name": "component_type",
    "message": "connecting to backend"
  }
}
Warning: Component has non-fatal issues but is operational.
{
  "Warning": {
    "struct_name": "component_type",
    "message": "high latency detected"
  }
}
Failed: Component is not functioning. Results in a 503 response.
{
  "Failed": {
    "struct_name": "component_type",
    "message": "connection refused"
  }
}
Timeout: Component didn’t respond within the configured timeout. Results in a 503 response.
{
  "Timeout": {
    "struct_name": "component_type"
  }
}
Using Health Checks
Kubernetes Liveness Probe
livenessProbe:
  httpGet:
    path: /status
    port: 50051
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
Load Balancer Health Check
curl http://nativelink:50051/status
The health check endpoint returns 503 if ANY component reports Failed or Timeout status. Ensure your health check thresholds account for temporary issues during rolling updates.
Next Steps
Metrics Reference: Complete list of available metrics
Troubleshooting: Common issues and solutions
Performance Tuning: Optimize your deployment