NativeLink exposes metrics through OpenTelemetry, giving detailed insight into cache operations, remote execution, and system performance.
Counter metrics are exposed with a _total suffix when using OTLP ingestion (e.g., nativelink_execution_completed_count_total). The deployment examples and included dashboards use this naming convention.

Cache Metrics

Operations Counter

Metric: nativelink_cache_operations_total
Type: Counter
Description: Total number of cache operations performed
Labels:
  • cache_type: Type of cache (cas, ac, memory, filesystem)
  • cache_operation_name: Operation type (read, write, delete, evict)
  • cache_operation_result: Operation outcome (hit, miss, expired, success, error)

Operation Names

  • read: Data retrieval operations. Check cache_operation_result for hit/miss status.
  • write: Data storage operations. Result is typically success or error.
  • delete: Explicit removal operations initiated by clients or cleanup processes.
  • evict: Automatic evictions due to LRU policy, TTL expiration, or size constraints.

Operation Results

  • hit: Data found and valid (reads only). Indicates successful cache retrieval.
  • miss: Data not found (reads only). Requires fetching from upstream or recomputation.
  • expired: Data found but stale (reads only). May trigger revalidation.
  • success: Operation completed successfully (writes/deletes).
  • error: Operation failed due to I/O error, permission issue, or other failure.
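
As an illustration of combining the operation and result labels, the query below sketches a per-cache read miss ratio. It assumes the counter is stored under the name shown above and that a 5-minute window suits your scrape interval; treat both as assumptions to adjust.

```promql
# Fraction of cache reads that missed, per cache type, over the last 5 minutes.
sum by (cache_type) (
  rate(nativelink_cache_operations_total{cache_operation_name="read", cache_operation_result="miss"}[5m])
) /
sum by (cache_type) (
  rate(nativelink_cache_operations_total{cache_operation_name="read"}[5m])
)
```

Swapping the result matcher to "error" (and dropping the read matcher from both sides) gives an overall error ratio in the same shape.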

Operation Duration

Metric: nativelink_cache_operation_duration
Type: Histogram
Description: Cache operation latency in milliseconds
Labels:
  • cache_type: Type of cache
  • cache_operation_name: Operation type
Buckets: Configured by OpenTelemetry SDK defaults
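
Besides percentiles, a mean latency can be derived from the _sum and _count series that Prometheus keeps alongside the buckets of a classic histogram. The sketch below assumes this metric is stored as a classic histogram under the base name shown above; if your pipeline appends a unit suffix, the series names will differ.

```promql
# Mean cache operation latency over the last 5 minutes,
# in the histogram's own unit (milliseconds per the description above).
sum by (cache_type, cache_operation_name) (
  rate(nativelink_cache_operation_duration_sum[5m])
) /
sum by (cache_type, cache_operation_name) (
  rate(nativelink_cache_operation_duration_count[5m])
)
```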

I/O Throughput

Metric: nativelink_cache_io_total
Type: Counter
Description: Total bytes read from or written to cache
Labels:
  • cache_type: Type of cache
  • cache_operation_name: Operation type (read, write)
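
Because this is a byte counter, applying rate() turns it into throughput. A minimal sketch, assuming the metric name shown above and a 5-minute window:

```promql
# Cache throughput in bytes per second, split by cache type and direction (read or write).
sum by (cache_type, cache_operation_name) (
  rate(nativelink_cache_io_total[5m])
)
```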

Cache Size

Metric: nativelink_cache_size
Type: Gauge
Description: Current cache size in bytes
Labels:
  • cache_type: Type of cache

Entry Count

Metric: nativelink_cache_entries
Type: Gauge
Description: Number of entries currently in cache
Labels:
  • cache_type: Type of cache
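
The size and entry-count gauges can be combined into an average entry size. This is only a sketch: it assumes both gauges are exported with matching cache_type values and under the names shown in this document.

```promql
# Average bytes per cached entry, per cache type.
sum by (cache_type) (nativelink_cache_size)
/
sum by (cache_type) (nativelink_cache_entries)
```

A cache type with zero entries produces no meaningful value here (division by zero yields NaN), so dashboards may want to filter those out.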

Item Size Distribution

Metric: nativelink_cache_item_size
Type: Histogram
Description: Size distribution of cached entries in bytes
Labels:
  • cache_type: Type of cache

Execution Metrics

Stage Duration

Metric: nativelink_execution_stage_duration
Type: Histogram
Description: Time spent in each execution stage in milliseconds
Labels:
  • execution_stage: Stage name (unknown, cache_check, queued, executing, completed)

Execution Stages

  • unknown: Initial state before processing begins.
  • cache_check: Checking action cache for previously computed results.
  • queued: Waiting for available worker to execute the action.
  • executing: Running on worker. Duration depends on action complexity.
  • completed: Finished execution, results available.
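
To see which stage dominates latency, the stage histogram can be broken down by its execution_stage label. The query below is a sketch that assumes the _bucket series follows the same naming as the other histograms in this document.

```promql
# 95th percentile time spent in each execution stage over the last 5 minutes.
histogram_quantile(0.95,
  sum by (le, execution_stage) (
    rate(nativelink_execution_stage_duration_bucket[5m])
  )
)
```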

Total Duration

Metric: nativelink_execution_total_duration
Type: Histogram
Description: Total execution time from submission to completion in milliseconds
Labels:
  • execution_instance: Instance identifier

Queue Time

Metric: nativelink_execution_queue_time
Type: Histogram
Description: Time spent waiting in queue before execution starts
Labels:
  • execution_priority: Priority level of the action

Active Count

Metric: nativelink_execution_active_count
Type: Gauge
Description: Current number of actions in each stage
Labels:
  • execution_stage: Stage name

Completed Count

Metric: nativelink_execution_completed_count_total
Type: Counter
Description: Total number of completed executions
Labels:
  • execution_result: Result type (success, failure, cancelled, timeout, cache_hit)
  • execution_action_digest: Action digest (high cardinality - use with caution)

Execution Results

  • success: Action completed with exit code 0.
  • failure: Action completed with non-zero exit code.
  • cancelled: Execution was cancelled by client or scheduler.
  • timeout: Execution exceeded configured timeout.
  • cache_hit: Result found in action cache, execution skipped.
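
One way to read these result values together is as shares of all completed executions. The sketch below assumes the counter name shown above; the group_left matching is standard PromQL for dividing a labelled vector by an unlabelled total.

```promql
# Share of completed executions per result type over the last 5 minutes.
sum by (execution_result) (
  rate(nativelink_execution_completed_count_total[5m])
)
/ ignoring (execution_result) group_left
sum(
  rate(nativelink_execution_completed_count_total[5m])
)
```

The shares across all result values sum to 1, which makes the output convenient for a stacked panel.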

Stage Transitions

Metric: nativelink_execution_stage_transitions_total
Type: Counter
Description: Number of stage transition events
Labels:
  • execution_instance: Instance identifier
  • execution_priority: Priority level

Output Size

Metric: nativelink_execution_output_size
Type: Histogram
Description: Size of execution outputs in bytes

Retry Count

Metric: nativelink_execution_retry_count_total
Type: Counter
Description: Number of execution retries due to failures
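
A raw retry count is easier to interpret relative to completed work. The following sketch divides the retry rate by the completion rate, assuming both counters are exported under the names used in this document.

```promql
# Retries per completed execution over the last 5 minutes.
sum(rate(nativelink_execution_retry_count_total[5m]))
/
sum(rate(nativelink_execution_completed_count_total[5m]))
```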

Recording Rules

NativeLink includes pre-configured recording rules for common queries. These rules pre-calculate expensive queries for better dashboard performance.

Execution Recording Rules

Rule: nativelink:execution_success_rate
Expression: Success rate over 5-minute window
sum by (instance_name, execution_instance) (
  rate(nativelink_execution_completed_count_total{execution_result="success"}[5m])
) /
sum by (instance_name, execution_instance) (
  rate(nativelink_execution_completed_count_total[5m])
)
Rule: nativelink:execution_queue_time_p95
Expression: 95th percentile queue time
histogram_quantile(0.95,
  sum by (le, instance_name, execution_instance) (
    rate(nativelink_execution_queue_time_bucket[5m])
  )
)

Cache Recording Rules

Rule: nativelink:cache_hit_rate
Expression: Cache hit rate by cache type
sum by (cache_type, instance_name) (
  rate(nativelink_cache_operations_total{cache_operation_result="hit"}[5m])
) /
sum by (cache_type, instance_name) (
  rate(nativelink_cache_operations_total{cache_operation_name="read"}[5m])
)
Rule: nativelink:cache_operation_latency_p95
Expression: 95th percentile cache operation latency
histogram_quantile(0.95,
  sum by (le, cache_type, cache_operation_name, instance_name) (
    rate(nativelink_cache_operation_duration_bucket[5m])
  )
)

Performance Recording Rules

Rule: nativelink:system_throughput
Expression: Overall system throughput in actions per second
sum(rate(nativelink_execution_completed_count_total[5m]))
Rule: nativelink:worker_utilization
Expression: Fraction of workers actively executing, as a ratio between 0 and 1
count by (instance_name) (
  nativelink_execution_active_count{execution_stage="executing"} > 0
) /
count by (instance_name) (
  nativelink_execution_active_count
)
Rule: nativelink:queue_depth
Expression: Number of actions waiting in queue
sum by (instance_name, execution_priority) (
  nativelink_execution_active_count{execution_stage="queued"}
)

SLO Recording Rules

Service Level Objective (SLO) rules track longer-term performance targets.

Rule: nativelink:slo_execution_success_rate
Target: 99% of executions complete successfully
sum(rate(nativelink_execution_completed_count_total{execution_result="success"}[1h])) /
sum(rate(nativelink_execution_completed_count_total[1h]))
Rule: nativelink:slo_cache_read_latency
Target: 95% of cache reads under 100ms
histogram_quantile(0.95,
  sum(rate(nativelink_cache_operation_duration_bucket{cache_operation_name="read"}[1h])) by (le)
) < 100
Rule: nativelink:slo_queue_time
Target: Queue time under 30s for 90% of actions
histogram_quantile(0.9,
  sum(rate(nativelink_execution_queue_time_bucket[1h])) by (le)
) < 30

Resource Attributes

Metrics include resource attributes that can be promoted to labels for filtering:

Service Attributes

  • service.instance.id
  • service.name
  • service.namespace
  • service.version

Cloud Attributes

  • cloud.availability_zone
  • cloud.region
  • deployment.environment

Kubernetes Attributes

  • k8s.cluster.name
  • k8s.namespace.name
  • k8s.pod.name
  • k8s.deployment.name

NativeLink Attributes

  • nativelink.instance_name
  • nativelink.worker_id
  • nativelink.scheduler_name

Promoting Resource Attributes

Configure Prometheus to promote resource attributes to labels:
prometheus-config.yaml
otlp:
  promote_resource_attributes:
    - nativelink.instance_name
    - nativelink.worker_id
    - k8s.cluster.name
    - deployment.environment
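
Attributes that are not promoted are still stored by Prometheus on the target_info series and can be joined in at query time. The sketch below assumes Prometheus's default OTLP name translation (dots become underscores) and that job and instance labels were derived from the service attributes; verify both against your setup.

```promql
# Completed executions per second by Kubernetes namespace,
# joining the non-promoted k8s.namespace.name attribute from target_info.
sum by (k8s_namespace_name) (
  rate(nativelink_execution_completed_count_total[5m])
  * on (job, instance) group_left (k8s_namespace_name)
  target_info
)
```

Promotion is usually the simpler choice for attributes you filter on often; the join is better suited to occasional lookups.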

Collector Configuration

The OpenTelemetry Collector can transform metrics before export:
otel-collector-config.yaml
processors:
  # Add resource attributes
  resource:
    attributes:
      - key: deployment.environment
        from_attribute: deployment_environment
        action: insert

  # Transform metrics
  transform/nativelink:
    metric_statements:
      - context: datapoint
        statements:
          - set(attributes["instance_name"], resource.attributes["nativelink.instance_name"])
            where resource.attributes["nativelink.instance_name"] != nil

  # Batch for efficiency
  batch:
    timeout: 10s
    send_batch_size: 1024

Metric Cardinality

High-cardinality labels like execution_action_digest can cause performance issues in Prometheus. Use recording rules to aggregate these metrics before querying.

Low-Cardinality Labels (Safe)

  • cache_type
  • cache_operation_name
  • execution_stage
  • execution_result
  • instance_name

High-Cardinality Labels (Use with Caution)

  • execution_action_digest - Unique per action
  • execution_worker_id - One per worker
  • Individual file paths or digests

Best Practices

  1. Use recording rules to pre-aggregate high-cardinality metrics
  2. Drop unnecessary labels in the OTEL Collector
  3. Set retention policies based on cardinality
  4. Monitor Prometheus memory usage
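
The first practice can be sketched as a query that aggregates the digest label away, alongside a second query for keeping an eye on how many digest values exist. Both assume the counter name used in this document; neither is taken from the shipped recording rules.

```promql
# Completed executions per second, with the per-action digest label aggregated away.
sum without (execution_action_digest) (
  rate(nativelink_execution_completed_count_total[5m])
)

# Number of distinct digest values currently present on the counter.
count(
  count by (execution_action_digest) (nativelink_execution_completed_count_total)
)
```

Recording the first expression keeps dashboards off the raw high-cardinality series, while the second gives an early signal when the label starts to grow.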

Next Steps

Monitoring Setup

Configure Prometheus and Grafana

Troubleshooting

Debug metrics collection issues
