Counter metrics are exposed with a _total suffix when using OTLP ingestion (e.g., nativelink_execution_completed_count_total). The deployment examples and included dashboards use this naming convention.
Cache Metrics
Operations Counter
Metric: nativelink_cache_operations_total
Type: Counter
Description: Total number of cache operations performed
Labels:
- cache_type: Type of cache (cas, ac, memory, filesystem)
- cache_operation_name: Operation type (read, write, delete, evict)
- cache_operation_result: Operation outcome (hit, miss, expired, success, error)
Operation Names
read
Data retrieval operations. Check cache_operation_result for hit/miss status.
write
Data storage operations. Result is typically success or error.
delete
Explicit removal operations initiated by clients or cleanup processes.
evict
Automatic evictions due to LRU policy, TTL expiration, or size constraints.
Operation Results
hit
Data found and valid (reads only). Indicates successful cache retrieval.
miss
Data not found (reads only). Requires fetching from upstream or recomputation.
expired
Data found but stale (reads only). May trigger revalidation.
success
Operation completed successfully (writes/deletes).
error
Operation failed due to I/O error, permission issue, or other failure.
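
As an illustration of how these labels combine, a read hit rate can be derived from the operations counter by dividing hit reads by all reads. The sketch below assumes the counter totals for a single cache_type have already been fetched into a dictionary; the function name and the sample values are hypothetical and not part of NativeLink.

```python
from typing import Mapping, Tuple

# Keys are (cache_operation_name, cache_operation_result) pairs for one
# cache_type; values are counter totals. The numbers below are made up.
Counts = Mapping[Tuple[str, str], float]


def read_hit_rate(counts: Counts) -> float:
    """Return hit reads divided by all reads, or 0.0 when there were no reads."""
    reads = {result: value for (op, result), value in counts.items() if op == "read"}
    total = sum(reads.values())
    if total == 0:
        return 0.0
    return reads.get("hit", 0.0) / total


sample = {
    ("read", "hit"): 900.0,
    ("read", "miss"): 80.0,
    ("read", "expired"): 20.0,
    ("write", "success"): 300.0,  # ignored: not a read
}
print(f"{read_hit_rate(sample):.2f}")  # 0.90
```

Every read outcome (hit, miss, expired, error) counts toward the denominator here; whether expired or error reads should be excluded is a policy choice this sketch does not settle.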
Operation Duration
Metric: nativelink_cache_operation_duration
Type: Histogram
Description: Cache operation latency in milliseconds
Labels:
- cache_type: Type of cache
- cache_operation_name: Operation type
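
Percentiles are estimated from a histogram's cumulative buckets rather than read directly. The following sketch shows one common estimation method, linear interpolation inside the bucket that contains the target rank, which is roughly what Prometheus's histogram_quantile does for classic histograms. The bucket boundaries and counts are invented for illustration and are not NativeLink's actual bucket layout.

```python
import math
from typing import Sequence, Tuple


def estimate_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound_ms, count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("the last bucket must have an upper bound of +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            # In the overflow bucket, or in an empty bucket, there is nothing
            # to interpolate over, so report the bucket's lower bound.
            if math.isinf(upper_bound) or count == lower_count:
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound  # not reached: the +Inf bucket always satisfies count >= rank


# Hypothetical cumulative buckets: 50 operations took <= 10 ms, 90 took <= 50 ms, ...
sample = [(10.0, 50.0), (50.0, 90.0), (100.0, 98.0), (math.inf, 100.0)]
print(estimate_quantile(0.95, sample))
```

Because the result is interpolated, its accuracy depends on how finely the buckets are spaced around the percentile of interest.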
I/O Throughput
Metric: nativelink_cache_io_total
Type: Counter
Description: Total bytes read from or written to cache
Labels:
- cache_type: Type of cache
- cache_operation_name: Operation type (read, write)
Cache Size
Metric: nativelink_cache_size
Type: Gauge
Description: Current cache size in bytes
Labels:
- cache_type: Type of cache
Entry Count
Metric: nativelink_cache_entries
Type: Gauge
Description: Number of entries currently in cache
Labels:
- cache_type: Type of cache
Item Size Distribution
Metric: nativelink_cache_item_size
Type: Histogram
Description: Size distribution of cached entries in bytes
Labels:
- cache_type: Type of cache
Execution Metrics
Stage Duration
Metric: nativelink_execution_stage_duration
Type: Histogram
Description: Time spent in each execution stage in milliseconds
Labels:
- execution_stage: Stage name (unknown, cache_check, queued, executing, completed)
Execution Stages
unknown
Initial state before processing begins.
cache_check
Checking action cache for previously computed results.
queued
Waiting for an available worker to execute the action.
executing
Running on worker. Duration depends on action complexity.
completed
Finished execution, results available.
Total Duration
Metric: nativelink_execution_total_duration
Type: Histogram
Description: Total execution time from submission to completion in milliseconds
Labels:
- execution_instance: Instance identifier
Queue Time
Metric: nativelink_execution_queue_time
Type: Histogram
Description: Time spent waiting in queue before execution starts
Labels:
- execution_priority: Priority level of the action
Active Count
Metric: nativelink_execution_active_count
Type: Gauge
Description: Current number of actions in each stage
Labels:
- execution_stage: Stage name
Completed Count
Metric: nativelink_execution_completed_count_total
Type: Counter
Description: Total number of completed executions
Labels:
- execution_result: Result type (success, failure, cancelled, timeout, cache_hit)
- execution_action_digest: Action digest (high cardinality - use with caution)
Execution Results
success
Action completed with exit code 0.
failure
Action completed with non-zero exit code.
cancelled
Execution was cancelled by client or scheduler.
timeout
Execution exceeded configured timeout.
cache_hit
Result found in action cache, execution skipped.
Stage Transitions
Metric: nativelink_execution_stage_transitions_total
Type: Counter
Description: Number of stage transition events
Labels:
- execution_instance: Instance identifier
- execution_priority: Priority level
Output Size
Metric: nativelink_execution_output_size
Type: Histogram
Description: Size of execution outputs in bytes
Retry Count
Metric: nativelink_execution_retry_count_total
Type: Counter
Description: Number of execution retries due to failures
Recording Rules
NativeLink includes pre-configured recording rules for common queries. These rules pre-calculate expensive queries for better dashboard performance.
Execution Recording Rules
Rule: nativelink:execution_success_rate
Expression: Success rate over 5-minute window
Rule: nativelink:execution_queue_time_p95
Expression: 95th percentile queue time
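
The rule expressions themselves are not reproduced on this page. As a rough sketch of the arithmetic behind a windowed success rate, the example below takes two snapshots of the completed-executions counter (per execution_result) at the start and end of a window and divides the increase in successful results by the total increase. Which results count as successful is an assumption exposed as a parameter, and the helper names and sample values are hypothetical.

```python
from typing import Dict, Iterable


def counter_increase(earlier: float, later: float) -> float:
    """Increase between two counter samples, treating a drop as a reset to zero."""
    return later - earlier if later >= earlier else later


def success_rate(
    earlier: Dict[str, float],
    later: Dict[str, float],
    successful: Iterable[str] = ("success",),
) -> float:
    """Share of executions completed in the window whose result is in `successful`."""
    increases = {
        result: counter_increase(earlier.get(result, 0.0), value)
        for result, value in later.items()
    }
    total = sum(increases.values())
    if total == 0:
        return 0.0
    return sum(increases.get(result, 0.0) for result in successful) / total


# Hypothetical counter values at the start and end of a 5-minute window.
start = {"success": 1000.0, "failure": 40.0, "timeout": 10.0}
end = {"success": 1190.0, "failure": 48.0, "timeout": 12.0}
print(f"{success_rate(start, end):.2f}")  # 0.95
```

Passing successful=("success", "cache_hit") would treat cache hits as successes as well; this sketch does not assume which definition the shipped rule uses.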
Cache Recording Rules
Rule: nativelink:cache_hit_rate
Expression: Cache hit rate by cache type
Rule: nativelink:cache_operation_latency_p95
Expression: 95th percentile cache operation latency
Performance Recording Rules
Rule: nativelink:system_throughput
Expression: Overall system throughput in actions per second
Rule: nativelink:worker_utilization
Expression: Percentage of workers actively executing
Rule: nativelink:queue_depth
Expression: Number of actions waiting in queue
SLO Recording Rules
Service Level Objective (SLO) rules track longer-term performance targets.
Rule: nativelink:slo_execution_success_rate
Target: 99% of executions complete successfully
Rule: nativelink:slo_cache_read_latency
Target: 95% of cache reads under 100ms
Rule: nativelink:slo_queue_time
Target: Queue time under 30s for 90% of actions
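
A latency target such as "95% of cache reads under 100ms" can be checked against a histogram by comparing the cumulative count at the threshold bucket with the total count. The sketch below assumes the threshold coincides with a bucket boundary; the bucket layout, counts, and function names are hypothetical and do not describe how the shipped rules are written.

```python
import math
from typing import Mapping


def fraction_within(threshold: float, cumulative: Mapping[float, float]) -> float:
    """Fraction of observations at or below `threshold`.

    `cumulative` maps bucket upper bounds to cumulative counts and must contain
    both `threshold` and math.inf as keys (a KeyError is raised otherwise).
    """
    total = cumulative[math.inf]
    if total == 0:
        return 1.0  # assumption: a window with no traffic does not violate the target
    return cumulative[threshold] / total


def meets_slo(threshold: float, target: float, cumulative: Mapping[float, float]) -> bool:
    """True when at least `target` (0..1) of observations are within `threshold`."""
    return fraction_within(threshold, cumulative) >= target


# Hypothetical cache read latency buckets in milliseconds.
read_latency = {25.0: 700.0, 100.0: 960.0, 500.0: 995.0, math.inf: 1000.0}
print(meets_slo(100.0, 0.95, read_latency))  # True: 96% of reads took <= 100 ms
```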
Resource Attributes
Metrics include resource attributes that can be promoted to labels for filtering:
Service Attributes
- service.instance.id
- service.name
- service.namespace
- service.version
Cloud Attributes
- cloud.availability_zone
- cloud.region
- deployment.environment
Kubernetes Attributes
- k8s.cluster.name
- k8s.namespace.name
- k8s.pod.name
- k8s.deployment.name
NativeLink Attributes
- nativelink.instance_name
- nativelink.worker_id
- nativelink.scheduler_name
Promoting Resource Attributes
Configure Prometheus to promote resource attributes to labels in prometheus-config.yaml.
Collector Configuration
The OpenTelemetry Collector can transform metrics before export, configured in otel-collector-config.yaml.
Metric Cardinality
Low-Cardinality Labels (Safe)
- cache_type
- cache_operation_name
- execution_stage
- execution_result
- instance_name
High-Cardinality Labels (Use with Caution)
- execution_action_digest: Unique per action
- execution_worker_id: One per worker
- Individual file paths or digests
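
To see why the distinction matters, the worst-case number of time series for a metric is the product of the number of distinct values each label can take. The sketch below uses the label value counts listed on this page for the cache operations counter; the figure of 100,000 distinct action digests is a made-up number for illustration.

```python
import math
from typing import Mapping


def max_series(label_values: Mapping[str, int]) -> int:
    """Upper bound on series count: product of distinct values per label."""
    return math.prod(label_values.values())


# Value counts taken from the label descriptions on this page.
cache_operations = {
    "cache_type": 4,              # cas, ac, memory, filesystem
    "cache_operation_name": 4,    # read, write, delete, evict
    "cache_operation_result": 5,  # hit, miss, expired, success, error
}
print(max_series(cache_operations))  # 80

# Hypothetical: 100,000 distinct action digests seen during the retention window.
completed = {"execution_result": 5, "execution_action_digest": 100_000}
print(max_series(completed))  # 500000
```

This is an upper bound; not every label combination occurs in practice (a write never produces a hit, for example).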
Best Practices
- Use recording rules to pre-aggregate high-cardinality metrics
- Drop unnecessary labels in the OTEL Collector
- Set retention policies based on cardinality
- Monitor Prometheus memory usage
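
Pre-aggregating or dropping a label from a counter means that series which become identical must be summed, otherwise samples collide. The following is a conceptual sketch of that aggregation in plain Python; it is not Collector or Prometheus configuration, and the function name and sample series are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Tuple

LabelSet = FrozenSet[Tuple[str, str]]


def drop_label(series: Dict[LabelSet, float], label: str) -> Dict[LabelSet, float]:
    """Remove one label and sum the values of series that collapse together."""
    merged: Dict[LabelSet, float] = defaultdict(float)
    for labels, value in series.items():
        kept = frozenset((key, val) for key, val in labels if key != label)
        merged[kept] += value
    return dict(merged)


# Hypothetical completed-execution counters, one series per action digest.
raw = {
    frozenset({("execution_result", "success"), ("execution_action_digest", "aaa")}): 3.0,
    frozenset({("execution_result", "success"), ("execution_action_digest", "bbb")}): 5.0,
    frozenset({("execution_result", "failure"), ("execution_action_digest", "ccc")}): 1.0,
}
aggregated = drop_label(raw, "execution_action_digest")
print(len(raw), "->", len(aggregated))  # 3 -> 2
```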
Next Steps
- Monitoring Setup: Configure Prometheus and Grafana
- Troubleshooting: Debug metrics collection issues