Metrics

Karpenter exposes metrics in Prometheus format to allow monitoring of cluster provisioning status.

Metrics endpoint

Metrics are available by default at:

karpenter.kube-system.svc.cluster.local:8080/metrics

The port is configurable via the METRICS_PORT environment variable (default 8080). See Settings for more information.
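
To check the endpoint from a pod inside the cluster, a minimal Python sketch along these lines can build the default URL and parse the Prometheus text format into samples. The helper names (`metrics_url`, `parse_samples`, `fetch_samples`) are illustrative and not part of Karpenter, and the parser is deliberately simplified: it skips comment lines, leaves escaped label values as-is, and ignores timestamps.

```python
import re
import urllib.request

# One sample line of the Prometheus text format: name{label="value",...} value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def metrics_url(host="karpenter.kube-system.svc.cluster.local", port=8080):
    """Build the metrics URL; pass the port you set via METRICS_PORT, if any."""
    return f"http://{host}:{port}/metrics"


def parse_samples(text):
    """Return (name, labels, value) tuples, skipping # HELP / # TYPE comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def fetch_samples(url=None, timeout=5):
    """Fetch and parse the endpoint; works only where the service DNS name resolves."""
    with urllib.request.urlopen(url or metrics_url(), timeout=timeout) as resp:
        return parse_samples(resp.read().decode("utf-8"))
```

Calling `fetch_samples()` from inside the cluster returns a list of tuples that can be filtered by metric name, for example to read `karpenter_build_info`. For anything beyond a quick check, a maintained Prometheus client library is a better fit than this hand-rolled parser.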

Setting up Prometheus scraping

Add the following scrape config to your Prometheus configuration to collect Karpenter metrics:

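scrape_configs:
  - job_name: karpenter
    # Discover endpoints only in the namespace where Karpenter runs.
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - kube-system
    relabel_configs:
      # Keep only targets that belong to the karpenter Service.
      - source_labels: [__meta_kubernetes_service_name]
        action: keep
        regex: karpenter
      # Keep only the port named http-metrics.
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        action: keep
        regex: http-metrics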
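
The two `keep` rules in the scrape config act as successive filters on discovered targets. The following Python sketch approximates that behavior to show which targets survive; it is an illustration, not Prometheus code. It relies on two documented Prometheus behaviors: relabel regexes are anchored at both ends, and a missing source label is treated as an empty string. Prometheus uses RE2 syntax, so Python's `re` is only a stand-in for simple patterns like these, and `keep_target` is a hypothetical helper name.

```python
import re

# The same two keep rules as in the scrape config.
RULES = [
    {"source_labels": ["__meta_kubernetes_service_name"], "regex": "karpenter"},
    {"source_labels": ["__meta_kubernetes_endpoint_port_name"], "regex": "http-metrics"},
]


def keep_target(labels, rules=RULES, separator=";"):
    """Return True if a discovered target passes every keep rule."""
    for rule in rules:
        # Source label values are joined with the separator (";" by default).
        value = separator.join(labels.get(name, "") for name in rule["source_labels"])
        # fullmatch mirrors the fully anchored matching used for relabel regexes.
        if re.fullmatch(rule["regex"], value) is None:
            return False
    return True
```

Because matching is anchored, a Service named `karpenter-extra` would be dropped by the first rule even though its name starts with `karpenter`.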

If you use the Prometheus Operator, create a ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: karpenter
  namespace: kube-system
spec:
  # Select the Karpenter Service by its name label.
  selector:
    matchLabels:
      app.kubernetes.io/name: karpenter
  endpoints:
    # Scrape the Service port named http-metrics.
    - port: http-metrics

Stability levels

Each metric carries a stability level that indicates how likely it is to change:

Level	Meaning
`STABLE`	The metric name and labels are stable and will not change without a deprecation period.
`BETA`	The metric may change in a future release but will be announced in release notes.
`ALPHA`	The metric is experimental and may be removed or renamed without notice.
`DEPRECATED`	The metric will be removed in a future release. Migrate to the replacement metric.
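
One way to act on stability levels is to audit dashboard and alert queries for metrics that may change or disappear. The sketch below assumes you maintain your own mapping from metric name to stability level, copied from the tables on this page; the mapping shown is a small excerpt, and `risky_metrics` is a hypothetical helper, not a Karpenter or Prometheus API.

```python
import re

# Excerpt of the stability levels listed in the tables on this page.
STABILITY = {
    "karpenter_nodes_created_total": "STABLE",
    "karpenter_pods_state": "BETA",
    "karpenter_nodes_drained_total": "ALPHA",
    "operator_termination_duration_seconds": "DEPRECATED",
}

RISKY_LEVELS = {"ALPHA", "DEPRECATED"}
IDENTIFIER_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def risky_metrics(queries, stability=STABILITY, risky=RISKY_LEVELS):
    """Map each metric used in the queries to its level if that level is risky."""
    found = {}
    for query in queries:
        # Tokenize into identifiers so that only whole metric names match.
        for token in IDENTIFIER_RE.findall(query):
            level = stability.get(token)
            if level in risky:
                found[token] = level
    return found
```

The check matches exact metric names only, so histogram series with `_bucket`, `_sum`, or `_count` suffixes would need those suffixes stripped first.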

General metrics

Metric	Description	Stability
`karpenter_ignored_pod_count`	Number of pods ignored during scheduling by Karpenter	ALPHA
`karpenter_build_info`	A metric with a constant `1` value, labeled by the version from which Karpenter was built	STABLE

NodeClaims metrics

NodeClaim lifecycle and disruption

Metric	Description	Stability
`karpenter_nodeclaims_termination_duration_seconds`	Duration of NodeClaim termination in seconds	BETA
`karpenter_nodeclaims_terminated_total`	Number of nodeclaims terminated in total by Karpenter. Labeled by the owning nodepool	STABLE
`karpenter_nodeclaims_instance_termination_duration_seconds`	Duration of CloudProvider Instance termination in seconds	BETA
`karpenter_nodeclaims_disrupted_total`	Number of nodeclaims disrupted in total by Karpenter. Labeled by reason the nodeclaim was disrupted and the owning nodepool	ALPHA
`karpenter_nodeclaims_created_total`	Number of nodeclaims created in total by Karpenter. Labeled by reason the nodeclaim was created and the owning nodepool	STABLE

NodeClaim operator status conditions

Metric	Description	Stability
`operator_nodeclaim_status_condition_transitions_total`	The count of transitions of a nodeclaim, type and status. Labeled by the type, reason, and status	BETA
`operator_nodeclaim_status_condition_transition_seconds`	The amount of time a condition was in a given state before transitioning. Labeled by the name of the nodeclaim, and the namespace	BETA
`operator_nodeclaim_status_condition_current_status_seconds`	The current amount of time in seconds that a status condition has been in a specific state. Labeled by the name of the nodeclaim, namespace, type, status, and reason	BETA
`operator_nodeclaim_status_condition_count`	The number of a condition for a nodeclaim, type and status. Labeled by the name, namespace, type, status, and reason	BETA
`operator_nodeclaim_termination_current_time_seconds`	The current amount of time in seconds that a nodeclaim has been in terminating state. Labeled by name, and namespace	BETA
`operator_nodeclaim_termination_duration_seconds`	The amount of time taken by a nodeclaim to terminate completely	BETA

Nodes metrics

Node resource and lifecycle

Metric	Description	Stability
`karpenter_nodes_total_pod_requests`	Node total pod requests are the resources requested by pods bound to nodes, including the DaemonSet pods	BETA
`karpenter_nodes_total_pod_limits`	Node total pod limits are the resources specified by pod limits, including the DaemonSet pods	BETA
`karpenter_nodes_total_daemon_requests`	Node total daemon requests are the resources requested by DaemonSet pods bound to nodes	BETA
`karpenter_nodes_total_daemon_limits`	Node total daemon limits are the resources specified by DaemonSet pod limits	BETA
`karpenter_nodes_termination_duration_seconds`	The time taken between a node’s deletion request and the removal of its finalizer	BETA
`karpenter_nodes_terminated_total`	Number of nodes terminated in total by Karpenter. Labeled by owning nodepool	STABLE
`karpenter_nodes_system_overhead`	Node system daemon overhead is the resources reserved for system overhead: the difference between the node’s capacity and allocatable values, as reported in the node status	BETA
`karpenter_nodes_lifetime_duration_seconds`	The lifetime duration of the nodes since creation	ALPHA
`karpenter_nodes_eviction_requests_total`	The total number of eviction requests made by Karpenter	ALPHA
`karpenter_nodes_drained_total`	The total number of nodes drained by Karpenter	ALPHA
`karpenter_nodes_current_lifetime_seconds`	Node age in seconds	ALPHA
`karpenter_nodes_created_total`	Number of nodes created in total by Karpenter. Labeled by owning nodepool	STABLE
`karpenter_nodes_allocatable`	Node allocatable is the resources allocatable by nodes	BETA

Node operator status conditions

Metric	Description	Stability
`operator_node_status_condition_transitions_total`	The count of transitions of a node, type and status	BETA
`operator_node_status_condition_transition_seconds`	The amount of time a condition was in a given state before transitioning. Labeled by the name of the node and the namespace	BETA
`operator_node_status_condition_current_status_seconds`	The current amount of time in seconds that a status condition has been in a specific state. Labeled by the name of the node, namespace, type, status, and reason	BETA
`operator_node_status_condition_count`	The number of a condition for a node, type and status. Labeled by the name, namespace, type, status, and reason	BETA
`operator_node_termination_current_time_seconds`	The current amount of time in seconds that a node has been in terminating state. Labeled by name, and namespace	BETA
`operator_node_termination_duration_seconds`	The amount of time taken by a node to terminate completely	BETA
`operator_node_event_count`	The number of events for a node	BETA

Pods metrics

Metric	Description	Stability
`karpenter_pods_state`	Pod state is the current state of pods. This metric can be used in several ways, as it is labeled by the pod name, namespace, owner, node, nodepool name, zone, architecture, capacity type, instance type, and pod phase	BETA
`karpenter_pods_startup_duration_seconds`	The time from pod creation until the pod is running	STABLE

Voluntary disruption metrics

Metric	Description	Stability
`karpenter_voluntary_disruption_queue_failures_total`	The number of times that an enqueued disruption decision failed. Labeled by disruption method	BETA
`karpenter_voluntary_disruption_eligible_nodes`	Number of nodes eligible for disruption by Karpenter. Labeled by disruption reason	BETA
`karpenter_voluntary_disruption_decisions_total`	Number of disruption decisions performed. Labeled by disruption decision, reason, and consolidation type	STABLE
`karpenter_voluntary_disruption_decision_evaluation_duration_seconds`	Duration of the disruption decision evaluation process in seconds. Labeled by method and consolidation type	BETA
`karpenter_voluntary_disruption_consolidation_timeouts_total`	Number of times the Consolidation algorithm has reached a timeout. Labeled by consolidation type	BETA

Scheduler metrics

Metric	Description	Stability
`karpenter_scheduler_scheduling_duration_seconds`	Duration of scheduling simulations used for deprovisioning and provisioning in seconds	STABLE
`karpenter_scheduler_queue_depth`	The number of pods currently waiting to be scheduled	BETA

NodePools metrics

NodePool resource and disruption

Metric	Description	Stability
`karpenter_nodepools_usage`	The amount of resources that have been provisioned for a nodepool. Labeled by nodepool name and resource type	ALPHA
`karpenter_nodepools_limit`	Limits specified on the nodepool that restrict the quantity of resources provisioned. Labeled by nodepool name and resource type	ALPHA
`karpenter_nodepools_allowed_disruptions`	The number of nodes for a given NodePool that can be concurrently disrupting at a point in time. Labeled by NodePool. Note that allowed disruptions can change very rapidly, as new nodes may be created and others may be deleted at any point	ALPHA
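
Dividing `karpenter_nodepools_usage` by `karpenter_nodepools_limit` shows how close each NodePool is to its configured limits. The Python sketch below assumes the samples have already been collected into dictionaries keyed by `(nodepool, resource_type)`; that key shape and the function name `usage_ratio` are assumptions for illustration, and the actual label names on these metrics should be checked against your own scrape output.

```python
def usage_ratio(usage, limit):
    """Return usage / limit per (nodepool, resource_type) key.

    usage, limit: dicts mapping (nodepool, resource_type) -> float.
    Keys without a positive limit are skipped, since no ratio is defined.
    """
    ratios = {}
    for key, limit_value in limit.items():
        if limit_value <= 0:
            continue
        # A NodePool with a limit but no reported usage counts as 0 used.
        ratios[key] = usage.get(key, 0.0) / limit_value
    return ratios


# Illustrative values only: 40 of 100 CPU cores provisioned for "default".
example = usage_ratio({("default", "cpu"): 40.0}, {("default", "cpu"): 100.0})
```

A ratio approaching 1 means the NodePool is close to its limit for that resource, at which point Karpenter would be restricted from provisioning more of it.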

NodePool operator status conditions

Metric	Description	Stability
`operator_nodepool_status_condition_transitions_total`	The count of transitions of a nodepool, type and status. Labeled by the type, reason, and status	BETA
`operator_nodepool_status_condition_transition_seconds`	The amount of time a condition was in a given state before transitioning. Labeled by the name of the nodepool and the namespace	BETA
`operator_nodepool_status_condition_current_status_seconds`	The current amount of time in seconds that a status condition has been in a specific state. Labeled by the name of the nodepool, namespace, type, status, and reason	BETA
`operator_nodepool_status_condition_count`	The number of a condition for a nodepool, type and status. Labeled by the name, namespace, type, status, and reason	BETA
`operator_nodepool_termination_current_time_seconds`	The current amount of time in seconds that a nodepool has been in terminating state. Labeled by name, and namespace	BETA
`operator_nodepool_termination_duration_seconds`	Duration of NodePool termination in seconds	BETA

EC2NodeClass metrics

Metric	Description	Stability
`operator_ec2nodeclass_status_condition_transitions_total`	The count of transitions of an ec2nodeclass, type and status. Labeled by the type, reason, and status	BETA
`operator_ec2nodeclass_status_condition_transition_seconds`	The amount of time a condition was in a given state before transitioning. Labeled by the name of the ec2nodeclass and the namespace	BETA
`operator_ec2nodeclass_status_condition_current_status_seconds`	The current amount of time in seconds that a status condition has been in a specific state. Labeled by the name of the ec2nodeclass, namespace, type, status, and reason	BETA
`operator_ec2nodeclass_status_condition_count`	The number of a condition for an ec2nodeclass, type and status. Labeled by the name, namespace, type, status, and reason	BETA
`operator_ec2nodeclass_termination_current_time_seconds`	The current amount of time in seconds that an ec2nodeclass has been in terminating state. Labeled by name, and namespace	BETA
`operator_ec2nodeclass_termination_duration_seconds`	Duration of ec2nodeclass termination in seconds	BETA

Interruption metrics

Metric	Description	Stability
`karpenter_interruption_received_messages_total`	Count of messages received from the SQS queue. Broken down by message type and whether the message was actionable	STABLE
`karpenter_interruption_message_queue_duration_seconds`	Amount of time an interruption message is on the queue before it is processed by Karpenter	STABLE
`karpenter_interruption_deleted_messages_total`	Count of messages deleted from the SQS queue	STABLE

Cluster metrics

Metric	Description	Stability
`karpenter_cluster_utilization_percent`	Utilization of allocatable resources by pod requests	ALPHA

Cluster state metrics

Metric	Description	Stability
`karpenter_cluster_state_unsynced_time_seconds`	The time for which cluster state is not synced	ALPHA
`karpenter_cluster_state_synced`	Returns `1` if cluster state is synced and `0` otherwise. Synced checks that nodeclaims and nodes that are stored in the APIServer have the same representation as Karpenter’s cluster state	STABLE
`karpenter_cluster_state_node_count`	Current count of nodes in cluster state	STABLE

Cloud provider metrics

CloudProvider calls

Metric	Description	Stability
`karpenter_cloudprovider_instance_type_offering_price_estimate`	Instance type offering estimated hourly price used when making informed decisions on node cost calculation, based on instance type, capacity type, and zone	BETA
`karpenter_cloudprovider_instance_type_offering_available`	Instance type offering availability, based on instance type, capacity type, and zone	BETA
`karpenter_cloudprovider_instance_type_memory_bytes`	Memory, in bytes, for a given instance type	BETA
`karpenter_cloudprovider_instance_type_cpu_cores`	vCPU cores for a given instance type	BETA
`karpenter_cloudprovider_errors_total`	Total number of errors returned from CloudProvider calls	BETA
`karpenter_cloudprovider_duration_seconds`	Duration of cloud provider method calls. Labeled by the controller, method name and provider	BETA

CloudProvider batcher

Metric	Description	Stability
`karpenter_cloudprovider_batcher_batch_time_seconds`	Duration of the batching window per batcher	BETA
`karpenter_cloudprovider_batcher_batch_size`	Size of the request batch per batcher	BETA

Controller runtime metrics

Metric	Description	Stability
`controller_runtime_terminal_reconcile_errors_total`	Total number of terminal reconciliation errors per controller	STABLE
`controller_runtime_reconcile_total`	Total number of reconciliations per controller	STABLE
`controller_runtime_reconcile_time_seconds`	Length of time per reconciliation per controller	STABLE
`controller_runtime_reconcile_panics_total`	Total number of reconciliation panics per controller	STABLE
`controller_runtime_reconcile_errors_total`	Total number of reconciliation errors per controller	STABLE
`controller_runtime_max_concurrent_reconciles`	Maximum number of concurrent reconciles per controller	STABLE
`controller_runtime_conversion_webhook_panics_total`	Total number of conversion webhook panics	STABLE
`controller_runtime_active_workers`	Number of currently used workers per controller	STABLE

Workqueue metrics

Metric	Description	Stability
`workqueue_work_duration_seconds`	How long in seconds processing an item from workqueue takes	STABLE
`workqueue_unfinished_work_seconds`	How many seconds of work has been done that is in progress and hasn’t been observed by work_duration. Large values indicate stuck threads	STABLE
`workqueue_retries_total`	Total number of retries handled by workqueue	STABLE
`workqueue_queue_duration_seconds`	How long in seconds an item stays in workqueue before being requested	STABLE
`workqueue_longest_running_processor_seconds`	How many seconds has the longest running processor for workqueue been running	STABLE
`workqueue_depth`	Current depth of workqueue by workqueue and priority	STABLE
`workqueue_adds_total`	Total number of adds handled by workqueue	STABLE

Status condition metrics (deprecated)

These metrics are deprecated. Migrate to the per-resource status condition metrics (e.g., `operator_nodeclaim_status_condition_*`, `operator_node_status_condition_*`).

Metric	Description	Stability
`operator_status_condition_transitions_total`	The count of transitions of a given object, type and status	DEPRECATED
`operator_status_condition_transition_seconds`	The amount of time a condition was in a given state before transitioning	DEPRECATED
`operator_status_condition_current_status_seconds`	The current amount of time in seconds that a status condition has been in a specific state	DEPRECATED
`operator_status_condition_count`	The number of a condition for a given object, type and status	DEPRECATED

Termination metrics (deprecated)

These metrics are deprecated. Use the per-resource termination metrics instead.

Metric	Description	Stability
`operator_termination_duration_seconds`	The amount of time taken by an object to terminate completely	DEPRECATED
`operator_termination_current_time_seconds`	The current amount of time in seconds that an object has been in terminating state	DEPRECATED
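
The per-resource replacements for the two deprecated families above follow a regular naming pattern: the resource kind is inserted after the `operator_` prefix. A small Python sketch of that renaming, useful when updating dashboards in bulk, could look like this. The function name `per_resource_name` is hypothetical, and the sketch only rewrites names; it does not verify that the label sets of the old and new metrics match, so each migrated query should still be reviewed.

```python
# Resource kinds that have per-resource operator metrics on this page.
RESOURCES = ("nodeclaim", "node", "nodepool", "ec2nodeclass")

# Name prefixes of the two deprecated metric families.
DEPRECATED_PREFIXES = ("operator_status_condition_", "operator_termination_")


def per_resource_name(deprecated, resource):
    """Map a deprecated operator_* metric name to its per-resource name."""
    if resource not in RESOURCES:
        raise ValueError(f"unknown resource kind: {resource}")
    if not deprecated.startswith(DEPRECATED_PREFIXES):
        raise ValueError(f"not a deprecated operator metric: {deprecated}")
    # Insert the resource kind right after the leading "operator_".
    return "operator_" + resource + "_" + deprecated[len("operator_"):]
```

For example, `per_resource_name("operator_status_condition_count", "nodeclaim")` yields `operator_nodeclaim_status_condition_count`, which appears in the NodeClaim operator status conditions table.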

Kubernetes client metrics

Metric	Description	Stability
`client_go_request_total`	Number of HTTP requests, partitioned by status code and method	STABLE
`client_go_request_duration_seconds`	Request latency in seconds. Broken down by verb, group, version, kind, and subresource	STABLE

AWS SDK metrics

Metric	Description	Stability
`aws_sdk_go_request_total`	The total number of AWS SDK Go requests	STABLE
`aws_sdk_go_request_retry_count`	The total number of AWS SDK Go retry attempts per request	STABLE
`aws_sdk_go_request_duration_seconds`	Latency of AWS SDK Go requests	STABLE
`aws_sdk_go_request_attempt_total`	The total number of AWS SDK Go request attempts	STABLE
`aws_sdk_go_request_attempt_duration_seconds`	Latency of AWS SDK Go request attempts	STABLE

Leader election metrics

Metric	Description	Stability
`leader_election_slowpath_total`	Total number of times the slow path was exercised while renewing leader leases. `name` is the string used to identify the lease. Group by name	STABLE
`leader_election_master_status`	Gauge indicating whether the reporting system is master of the relevant lease. `0` indicates backup, `1` indicates master. `name` is the string used to identify the lease. Group by name	STABLE
