NativeLink exposes comprehensive metrics about cache operations and remote execution through OpenTelemetry (OTEL). This guide covers setting up monitoring infrastructure to track system health, performance, and reliability.
Overview
NativeLink provides insights into:
Cache Performance: Hit rates, operation latencies, eviction rates
Execution Pipeline: Queue times, stage durations, success rates
System Health: Worker utilization, throughput, error rates
Quick Start with Docker Compose
Start the metrics stack
cd deployment-examples/metrics
docker-compose up -d
This starts:
OpenTelemetry Collector (ports 4317, 4318)
Prometheus (port 9091)
Grafana (port 3000)
AlertManager (port 9093)
Configure NativeLink
Set environment variables to send metrics to the collector:

export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_SERVICE_NAME=nativelink
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=dev,nativelink.instance_name=main"
Start NativeLink
nativelink /path/to/config.json
Access metrics dashboards
Grafana is available at http://localhost:3000 and Prometheus at http://localhost:9091.
Kubernetes Deployment
Deploy OTEL Collector
kubectl apply -f deployment-examples/metrics/kubernetes/otel-collector.yaml
Deploy Prometheus
kubectl apply -f deployment-examples/metrics/kubernetes/prometheus.yaml
Add environment variables to your NativeLink deployment:
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector:4317"
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "grpc"
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "deployment.environment=prod,k8s.cluster.name=main"
Environment Variables
OTLP Exporter Configuration
# Collector endpoint
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
# Protocol (grpc or http/protobuf)
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
# Optional authentication headers
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer token"
# Compression (none, gzip)
OTEL_EXPORTER_OTLP_COMPRESSION=gzip
Resource Attributes
# Service name (fixed)
OTEL_SERVICE_NAME=nativelink
# Custom attributes for filtering and grouping
OTEL_RESOURCE_ATTRIBUTES="key1=value1,key2=value2"
Recommended resource attributes:
deployment.environment - Environment name (dev, staging, prod)
nativelink.instance_name - Instance identifier
k8s.cluster.name - Kubernetes cluster name
cloud.region - Cloud provider region
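The `OTEL_RESOURCE_ATTRIBUTES` value is a comma-separated list of `key=value` pairs. As a quick illustration of the format (a standalone sketch, not NativeLink code), a minimal parser might look like:

```python
def parse_resource_attributes(raw: str) -> dict[str, str]:
    """Parse an OTEL_RESOURCE_ATTRIBUTES-style string into a dict."""
    attrs = {}
    for pair in raw.split(","):
        if "=" in pair:
            # Split on the first '=' only, so values may contain '='.
            key, _, value = pair.partition("=")
            attrs[key.strip()] = value.strip()
    return attrs

attrs = parse_resource_attributes(
    "deployment.environment=dev,nativelink.instance_name=main"
)
print(attrs["deployment.environment"])  # dev
```

Every attribute set this way appears as a resource attribute on the exported metrics and can be used for filtering and grouping downstream.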
Metric Export Configuration
# Export interval in milliseconds (default: 60s)
OTEL_METRIC_EXPORT_INTERVAL=60000
# Export timeout in milliseconds (default: 30s)
OTEL_METRIC_EXPORT_TIMEOUT=30000
# Disable traces if only metrics are needed
OTEL_TRACES_EXPORTER=none
# Disable logs if only metrics are needed
OTEL_LOGS_EXPORTER=none
Metrics Server Options
Prometheus (Recommended)
Prometheus offers native OTLP support and excellent query capabilities.
Direct OTLP Ingestion
prometheus --web.enable-otlp-receiver \
--storage.tsdb.out-of-order-time-window=30m
Via Collector Scraping
Configure Prometheus to scrape the OTEL Collector:
scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:9090']
Grafana Cloud
For managed metrics:
otel-collector-config.yaml
exporters:
  otlphttp:
    endpoint: https://otlp-gateway-prod-us-central-0.grafana.net/otlp
    headers:
      Authorization: "Bearer ${GRAFANA_CLOUD_TOKEN}"
ClickHouse
For high-volume metrics storage:
otel-collector-config.yaml
exporters:
  clickhouse:
    endpoint: tcp://clickhouse:9000
    database: metrics
    ttl_days: 30
    metrics_table: otel_metrics
Grafana Dashboards
Import Pre-built Dashboard
NativeLink includes a comprehensive Grafana dashboard:
# Import via API
curl -X POST http://admin:admin@localhost:3000/api/dashboards/db \
-H "Content-Type: application/json" \
-d @deployment-examples/metrics/grafana/dashboards/nativelink-overview.json
Or import via the Grafana UI at http://localhost:3000
Key Dashboard Panels
The included dashboard provides:
Execution Pipeline
Queue depth over time
Stage duration percentiles
Success/failure rates
Active actions by stage
Cache Performance
Hit/miss rates by cache type
Operation latency distributions
Eviction rates
Cache size utilization
Worker Metrics
Worker utilization heatmap
Actions per worker
Worker timeout events
Error Tracking
Error rates by type
Failed execution breakdown
Retry counts
Example Queries
Cache Hit Rate
sum(rate(nativelink_cache_operations_total{cache_operation_result="hit"}[5m])) by (cache_type) /
sum(rate(nativelink_cache_operations_total{cache_operation_name="read"}[5m])) by (cache_type)
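As a sanity check on what this ratio computes, here is the same arithmetic on two hypothetical counter samples taken 5 minutes apart. Because `rate()` divides both the numerator and denominator by the same window, the window cancels out of the ratio:

```python
# Hypothetical counter samples, 300 seconds apart.
hits_then, hits_now = 1_000, 1_240    # cache_operation_result="hit"
reads_then, reads_now = 1_200, 1_500  # cache_operation_name="read"

# The hit rate is just delta_hits / delta_reads; the rate() window
# appears in both deltas and cancels.
hit_rate = (hits_now - hits_then) / (reads_now - reads_then)
print(f"{hit_rate:.0%}")  # 80%
```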
Execution Success Rate
sum(rate(nativelink_execution_completed_count_total{execution_result="success"}[5m])) /
sum(rate(nativelink_execution_completed_count_total[5m]))
Queue Depth by Priority
sum(nativelink_execution_active_count{execution_stage="queued"}) by (execution_priority)
P95 Cache Operation Latency
histogram_quantile(0.95,
sum(rate(nativelink_cache_operation_duration_bucket[5m])) by (le, cache_type)
)
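`histogram_quantile` linearly interpolates within Prometheus-style cumulative buckets. A rough sketch of that interpolation, with made-up bucket bounds and counts, is:

```python
def histogram_quantile(q, buckets):
    """Approximate a quantile from cumulative (upper_bound, count) buckets,
    sorted by bound, using linear interpolation as Prometheus does."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Interpolate linearly between the bucket's boundaries.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Hypothetical latency buckets in seconds: 90 of 100 operations took <= 50ms.
buckets = [(0.005, 10), (0.01, 40), (0.05, 90), (0.1, 100)]
print(histogram_quantile(0.95, buckets))  # ~0.075 (75ms)
```

This is why P95 estimates depend on bucket boundaries: the true P95 is only known to lie inside the 50ms-100ms bucket, and the interpolation assumes observations are spread evenly within it.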
Worker Utilization
count(nativelink_execution_active_count{execution_stage="executing"} > 0) /
count(count by (execution_worker_id) (nativelink_execution_active_count))
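In plain terms, the query divides the number of workers with at least one action in the executing stage by the total number of workers reporting the gauge. With hypothetical gauge values:

```python
# Hypothetical gauge values: execution_worker_id -> actions in
# the "executing" stage.
active = {
    "worker-a": 2,
    "worker-b": 0,
    "worker-c": 1,
}

busy = sum(1 for count in active.values() if count > 0)
utilization = busy / len(active)
print(f"{busy}/{len(active)} workers busy, utilization {utilization:.2f}")
```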
Joining with Resource Attributes
Use target_info to join metrics with resource attributes:
rate(nativelink_execution_completed_count_total[5m])
* on (job, instance) group_left (k8s_cluster_name, deployment_environment)
target_info
Alerting
See the Troubleshooting page for alert rule examples.
Health Check Endpoint
NativeLink exposes a health check endpoint that aggregates component health status.
Endpoint Details
Path: Configured via health_status_config.path in your server configuration
Method: GET
Response Format: JSON
Response Codes:
200 OK: All components healthy
503 Service Unavailable: One or more components failed or timed out
Configuration
{
  "servers": [
    {
      "listener": {
        "http": {
          "socket_address": "0.0.0.0:50051"
        }
      },
      "services": {
        "health_status": {
          "path": "/status",
          "timeout_seconds": 5
        }
      }
    }
  ]
}
The default timeout is 5 seconds if not specified. Components that don’t respond within this timeout are marked as timed out.
Example Response
[
  {
    "namespace": "/stores/CAS_MAIN_STORE",
    "status": {
      "Ok": {
        "struct_name": "nativelink_store::filesystem_store::FilesystemStore",
        "message": "Healthy"
      }
    }
  },
  {
    "namespace": "/schedulers/main",
    "status": {
      "Ok": {
        "struct_name": "nativelink_scheduler::simple_scheduler::SimpleScheduler",
        "message": "3 workers connected"
      }
    }
  }
]
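The aggregation rule (503 if any component reports Failed or Timeout, 200 otherwise) can be sketched against a payload shaped like the one above; the component names here are shortened placeholders:

```python
import json

def overall_status(payload: str) -> int:
    """Return the HTTP code implied by a health-status payload:
    503 if any component is Failed or Timeout, else 200."""
    for component in json.loads(payload):
        state = next(iter(component["status"]))  # "Ok", "Failed", ...
        if state in ("Failed", "Timeout"):
            return 503
    return 200

sample = '''[
  {"namespace": "/stores/CAS_MAIN_STORE",
   "status": {"Ok": {"struct_name": "FilesystemStore", "message": "Healthy"}}},
  {"namespace": "/schedulers/main",
   "status": {"Timeout": {"struct_name": "SimpleScheduler"}}}
]'''
print(overall_status(sample))  # 503
```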
Health Status Types
Ok: Component is functioning normally.
{
  "Ok": {
    "struct_name": "component_type",
    "message": "descriptive message"
  }
}
Initializing: Component is starting up and not yet ready.
{
  "Initializing": {
    "struct_name": "component_type",
    "message": "connecting to backend"
  }
}
Warning: Component has non-fatal issues but is operational.
{
  "Warning": {
    "struct_name": "component_type",
    "message": "high latency detected"
  }
}
Failed: Component is not functioning. Results in a 503 response.
{
  "Failed": {
    "struct_name": "component_type",
    "message": "connection refused"
  }
}
Timeout: Component didn’t respond within the configured timeout. Results in a 503 response.
{
  "Timeout": {
    "struct_name": "component_type"
  }
}
Using Health Checks
Kubernetes Liveness Probe
livenessProbe:
  httpGet:
    path: /status
    port: 50051
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
Load Balancer Health Check
curl http://nativelink:50051/status
The health check endpoint returns 503 if ANY component reports Failed or Timeout status. Ensure your health check thresholds account for temporary issues during rolling updates.
Next Steps
Metrics Reference: Complete list of available metrics
Troubleshooting: Common issues and solutions
Performance Tuning: Optimize your deployment