Metrics and Instrumentation

NativeLink exposes metrics through OpenTelemetry that provide detailed insights into cache operations, remote execution, and system performance.

Counter metrics are exposed with a _total suffix when using OTLP ingestion (e.g., nativelink_execution_completed_count_total). The deployment examples and included dashboards use this naming convention.

Cache Metrics

Operations Counter

Metric: nativelink_cache_operations_total
Type: Counter
Description: Total number of cache operations performed
Labels:

cache_type: Type of cache (cas, ac, memory, filesystem)
cache_operation_name: Operation type (read, write, delete, evict)
cache_operation_result: Operation outcome (hit, miss, expired, success, error)

Operation Names

read

Data retrieval operations. Check cache_operation_result for hit/miss status.

write

Data storage operations. Result is typically success or error.

delete

Explicit removal operations initiated by clients or cleanup processes.

evict

Automatic evictions due to LRU policy, TTL expiration, or size constraints.

Operation Results

hit

Data found and valid (reads only). Indicates successful cache retrieval.

miss

Data not found (reads only). Requires fetching from upstream or recomputation.

expired

Data found but stale (reads only). May trigger revalidation.

success

Operation completed successfully (writes/deletes).

error

Operation failed due to I/O error, permission issue, or other failure.
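As a minimal sketch of how the operation and result labels above combine in queries, the following expressions track failed operations and eviction pressure. They use only the metric and label names documented on this page; the 5-minute window is an arbitrary choice, not a NativeLink default.

```promql
# Per-second rate of failed cache operations, split by cache type and operation.
sum by (cache_type, cache_operation_name) (
  rate(nativelink_cache_operations_total{cache_operation_result="error"}[5m])
)

# Per-second eviction rate per cache type; a sustained rise may indicate an undersized cache.
sum by (cache_type) (
  rate(nativelink_cache_operations_total{cache_operation_name="evict"}[5m])
)
```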

Operation Duration

Metric: nativelink_cache_operation_duration
Type: Histogram
Description: Cache operation latency in milliseconds
Labels:

cache_type: Type of cache
cache_operation_name: Operation type

Buckets: Configured by OpenTelemetry SDK defaults
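A mean latency can be derived from the histogram's sum and count series. This sketch assumes the histogram is stored in Prometheus with _sum and _count series alongside the _bucket series used elsewhere on this page; if your ingestion path adds a unit suffix to the metric name, adjust the names accordingly.

```promql
# Mean cache operation latency (milliseconds) over the last 5 minutes,
# per cache type and operation.
sum by (cache_type, cache_operation_name) (
  rate(nativelink_cache_operation_duration_sum[5m])
)
/
sum by (cache_type, cache_operation_name) (
  rate(nativelink_cache_operation_duration_count[5m])
)
```

The mean hides tail latency, so it is best read next to a percentile query rather than in place of one.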

I/O Throughput

Metric: nativelink_cache_io_total
Type: Counter
Description: Total bytes read from or written to cache
Labels:

cache_type: Type of cache
cache_operation_name: Operation type (read, write)
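Because this counter accumulates bytes, applying rate() yields throughput in bytes per second. A minimal example, with the 5-minute window chosen for illustration:

```promql
# Read and write throughput in bytes per second, per cache type.
sum by (cache_type, cache_operation_name) (
  rate(nativelink_cache_io_total[5m])
)
```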

Cache Size

Metric: nativelink_cache_size
Type: Gauge
Description: Current cache size in bytes
Labels:

cache_type: Type of cache

Entry Count

Metric: nativelink_cache_entries
Type: Gauge
Description: Number of entries currently in cache
Labels:

cache_type: Type of cache
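The size and entry gauges can be combined to estimate the average entry size per cache type. This is a rough sketch: it assumes both gauges are reported for the same cache types, and it returns no result for a cache type that reports only one of the two.

```promql
# Approximate average entry size in bytes, per cache type.
sum by (cache_type) (nativelink_cache_size)
/
sum by (cache_type) (nativelink_cache_entries)
```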

Item Size Distribution

Metric: nativelink_cache_item_size
Type: Histogram
Description: Size distribution of cached entries in bytes
Labels:

cache_type: Type of cache

Execution Metrics

Stage Duration

Metric: nativelink_execution_stage_duration
Type: Histogram
Description: Time spent in each execution stage in milliseconds
Labels:

execution_stage: Stage name (unknown, cache_check, queued, executing, completed)

Execution Stages

unknown

Initial state before processing begins.

cache_check

Checking action cache for previously computed results.

queued

Waiting for an available worker to execute the action.

executing

Running on a worker. Duration depends on action complexity.

completed

Finished execution, results available.
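To compare how long actions spend in each of the stages above, a per-stage percentile can be computed from the stage duration histogram. This sketch assumes the histogram is exposed with _bucket series, as the other histograms on this page are; the quantile and window are illustrative.

```promql
# 95th percentile time spent in each execution stage (milliseconds).
histogram_quantile(0.95,
  sum by (le, execution_stage) (
    rate(nativelink_execution_stage_duration_bucket[5m])
  )
)
```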

Total Duration

Metric: nativelink_execution_total_duration
Type: Histogram
Description: Total execution time from submission to completion in milliseconds
Labels:

execution_instance: Instance identifier

Queue Time

Metric: nativelink_execution_queue_time
Type: Histogram
Description: Time spent waiting in queue before execution starts
Labels:

execution_priority: Priority level of the action

Active Count

Metric: nativelink_execution_active_count
Type: Gauge
Description: Current number of actions in each stage
Labels:

execution_stage: Stage name
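Since this is a gauge, it can be summed directly without rate(). The queries below are a small sketch of two common views, using the stage names listed under Execution Stages:

```promql
# Current number of actions in each stage, across all reporting instances.
sum by (execution_stage) (nativelink_execution_active_count)

# Actions currently in flight (waiting for or running on a worker).
sum(nativelink_execution_active_count{execution_stage=~"queued|executing"})
```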

Completed Count

Metric: nativelink_execution_completed_count_total
Type: Counter
Description: Total number of completed executions
Labels:

execution_result: Result type (success, failure, cancelled, timeout, cache_hit)
execution_action_digest: Action digest (high cardinality - use with caution)

Execution Results

success

Action completed with exit code 0.

failure

Action completed with non-zero exit code.

cancelled

Execution was cancelled by client or scheduler.

timeout

Execution exceeded configured timeout.

cache_hit

Result found in action cache, execution skipped.
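The result values above can be turned into a breakdown showing what share of completions each outcome accounts for. A minimal sketch, assuming the documented label names; the many-to-one match uses group_left because the denominator carries no labels.

```promql
# Share of completed executions per result type (values sum to 1).
sum by (execution_result) (
  rate(nativelink_execution_completed_count_total[5m])
)
/ ignoring (execution_result) group_left
sum(
  rate(nativelink_execution_completed_count_total[5m])
)
```

Filtering the output to execution_result="cache_hit" gives the fraction of actions served from the action cache without running.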

Stage Transitions

Metric: nativelink_execution_stage_transitions_total
Type: Counter
Description: Number of stage transition events
Labels:

execution_instance: Instance identifier
execution_priority: Priority level

Output Size

Metric: nativelink_execution_output_size
Type: Histogram
Description: Size of execution outputs in bytes

Retry Count

Metric: nativelink_execution_retry_count_total
Type: Counter
Description: Number of execution retries due to failures
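Retries are easier to interpret relative to completed work. The following sketch divides the retry rate by the completion rate; treating this ratio as "retries per completed execution" is an interpretation made here for illustration, not a definition from NativeLink.

```promql
# Approximate retries per completed execution over the last 5 minutes.
sum(rate(nativelink_execution_retry_count_total[5m]))
/
sum(rate(nativelink_execution_completed_count_total[5m]))
```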

Recording Rules

NativeLink includes pre-configured recording rules for common queries. These rules pre-calculate expensive queries for better dashboard performance.

Execution Recording Rules

Rule: nativelink:execution_success_rate
Expression: Success rate over 5-minute window

sum by (instance_name, execution_instance) (
  rate(nativelink_execution_completed_count_total{execution_result="success"}[5m])
) /
sum by (instance_name, execution_instance) (
  rate(nativelink_execution_completed_count_total[5m])
)

Rule: nativelink:execution_queue_time_p95
Expression: 95th percentile queue time

histogram_quantile(0.95,
  sum by (le, instance_name, execution_instance) (
    rate(nativelink_execution_queue_time_bucket[5m])
  )
)

Cache Recording Rules

Rule: nativelink:cache_hit_rate
Expression: Cache hit rate by cache type

sum by (cache_type, instance_name) (
  rate(nativelink_cache_operations_total{cache_operation_result="hit"}[5m])
) /
sum by (cache_type, instance_name) (
  rate(nativelink_cache_operations_total{cache_operation_name="read"}[5m])
)

Rule: nativelink:cache_operation_latency_p95
Expression: 95th percentile cache operation latency

histogram_quantile(0.95,
  sum by (le, cache_type, cache_operation_name, instance_name) (
    rate(nativelink_cache_operation_duration_bucket[5m])
  )
)

Performance Recording Rules

Rule: nativelink:system_throughput
Expression: Overall system throughput in actions per second

sum(rate(nativelink_execution_completed_count_total[5m]))

Rule: nativelink:worker_utilization
Expression: Fraction of workers actively executing (0 to 1; multiply by 100 for a percentage)

count by (instance_name) (
  nativelink_execution_active_count{execution_stage="executing"} > 0
) /
count by (instance_name) (
  nativelink_execution_active_count
)

Rule: nativelink:queue_depth
Expression: Number of actions waiting in queue

sum by (instance_name, execution_priority) (
  nativelink_execution_active_count{execution_stage="queued"}
)

SLO Recording Rules

Service Level Objective (SLO) rules track longer-term performance targets.

Rule: nativelink:slo_execution_success_rate
Target: 99% of executions complete successfully

sum(rate(nativelink_execution_completed_count_total{execution_result="success"}[1h])) /
sum(rate(nativelink_execution_completed_count_total[1h]))

Rule: nativelink:slo_cache_read_latency
Target: 95% of cache reads under 100ms

histogram_quantile(0.95,
  sum(rate(nativelink_cache_operation_duration_bucket{cache_operation_name="read"}[1h])) by (le)
) < 100

Rule: nativelink:slo_queue_time
Target: Queue time under 30s for 90% of actions

histogram_quantile(0.9,
  sum(rate(nativelink_execution_queue_time_bucket[1h])) by (le)
) < 30
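An SLO target is often tracked as an error-budget burn rate. The sketch below assumes the nativelink:slo_execution_success_rate rule is loaded under that exact name and uses the 99% target stated above; a value of 1 means the budget is being consumed exactly at the allowed pace, and higher values mean it is being consumed faster.

```promql
# Error-budget burn rate for the 99% execution success target.
(1 - nativelink:slo_execution_success_rate) / (1 - 0.99)
```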

Resource Attributes

Metrics include resource attributes that can be promoted to labels for filtering:

Service Attributes

service.instance.id
service.name
service.namespace
service.version

Cloud Attributes

cloud.availability_zone
cloud.region
deployment.environment

Kubernetes Attributes

k8s.cluster.name
k8s.namespace.name
k8s.pod.name
k8s.deployment.name

NativeLink Attributes

nativelink.instance_name
nativelink.worker_id
nativelink.scheduler_name

Promoting Resource Attributes

Configure Prometheus to promote resource attributes to labels:

prometheus-config.yaml

otlp:
  promote_resource_attributes:
    - nativelink.instance_name
    - nativelink.worker_id
    - k8s.cluster.name
    - deployment.environment

Collector Configuration

The OpenTelemetry Collector can transform metrics before export:

otel-collector-config.yaml

processors:
  # Add resource attributes
  resource:
    attributes:
      - key: deployment.environment
        from_attribute: deployment_environment
        action: insert

  # Transform metrics
  transform/nativelink:
    metric_statements:
      - context: datapoint
        statements:
          - set(attributes["instance_name"], resource.attributes["nativelink.instance_name"])
            where resource.attributes["nativelink.instance_name"] != nil

  # Batch for efficiency
  batch:
    timeout: 10s
    send_batch_size: 1024

Metric Cardinality

High-cardinality labels like execution_action_digest can cause performance issues in Prometheus. Use recording rules to aggregate these metrics before querying.

Low-Cardinality Labels (Safe)

cache_type
cache_operation_name
execution_stage
execution_result
instance_name

High-Cardinality Labels (Use with Caution)

execution_action_digest - Unique per action
execution_worker_id - One per worker
Individual file paths or digests

Best Practices

Use recording rules to pre-aggregate high-cardinality metrics
Drop unnecessary labels in the OTEL Collector
Set retention policies based on cardinality
Monitor Prometheus memory usage
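The practices above can be checked with a few ad hoc queries. This is a sketch using standard PromQL; the nativelink_ prefix match and the choice of the completed-count metric are assumptions based on the names documented on this page.

```promql
# Ten NativeLink metrics with the most active series.
topk(10, count by (__name__) ({__name__=~"nativelink_.+"}))

# Number of distinct action digests currently present on the completed-count metric.
count(count by (execution_action_digest) (nativelink_execution_completed_count_total))

# Completion rate with the high-cardinality digest label aggregated away.
sum without (execution_action_digest) (
  rate(nativelink_execution_completed_count_total[5m])
)
```

The last expression is a natural candidate for a recording rule, so that dashboards query the aggregated series rather than the per-digest ones.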

Next Steps

Monitoring Setup

Configure Prometheus and Grafana

Troubleshooting

Debug metrics collection issues
