Karpenter exposes metrics in Prometheus format to allow monitoring of cluster provisioning status.

Metrics endpoint

Metrics are available by default at:
karpenter.kube-system.svc.cluster.local:8080/metrics
The port is configurable via the METRICS_PORT environment variable (default 8080). See Settings for more information.
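
The override is an ordinary container environment variable. Below is a minimal sketch of the relevant fragment of the Karpenter Deployment, assuming you edit the container spec directly (an installer such as a Helm chart may expose its own setting for this). The container name `controller` and the value `9090` are illustrative assumptions, not taken from this page.

```yaml
spec:
  template:
    spec:
      containers:
        - name: controller        # assumed container name
          env:
            - name: METRICS_PORT
              value: "9090"       # illustrative non-default port; env values must be strings
```

If the port changes, whatever Service fronts Karpenter would likely need to expose the same port for scraping to keep working.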

Setting up Prometheus scraping

Add the following scrape config to your Prometheus configuration to collect Karpenter metrics:
scrape_configs:
  - job_name: karpenter
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - kube-system
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        action: keep
        regex: karpenter
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        action: keep
        regex: http-metrics
If you use the Prometheus Operator, create a ServiceMonitor:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: karpenter
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: karpenter
  endpoints:
    - port: http-metrics

Stability levels

Each metric carries a stability level that indicates how likely it is to change:
| Level | Meaning |
| --- | --- |
| STABLE | The metric name and labels are stable and will not change without a deprecation period. |
| BETA | The metric may change in a future release, but any change will be announced in release notes. |
| ALPHA | The metric is experimental and may be removed or renamed without notice. |
| DEPRECATED | The metric will be removed in a future release. Migrate to the replacement metric. |

General metrics

| Metric | Description | Stability |
| --- | --- | --- |
| karpenter_ignored_pod_count | Number of pods ignored during scheduling by Karpenter | ALPHA |
| karpenter_build_info | A metric with a constant value of 1, labeled by the version from which Karpenter was built | STABLE |
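
Because karpenter_build_info is a constant series labeled by version, it can flag a rollout that has left replicas on different builds. The following is a sketch of a Prometheus alerting rule, assuming the label is literally named `version`. The group name, alert name, duration, and severity are hypothetical.

```yaml
groups:
  - name: karpenter-general              # hypothetical group name
    rules:
      - alert: KarpenterMixedVersions    # hypothetical alert name
        # Counts distinct values of the (assumed) `version` label across scraped replicas.
        expr: count(count by (version) (karpenter_build_info)) > 1
        for: 30m
        labels:
          severity: info
        annotations:
          summary: More than one Karpenter build is reporting metrics.
```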

NodeClaims metrics

| Metric | Description | Stability |
| --- | --- | --- |
| karpenter_nodeclaims_termination_duration_seconds | Duration of NodeClaim termination in seconds | BETA |
| karpenter_nodeclaims_terminated_total | Number of nodeclaims terminated in total by Karpenter. Labeled by the owning nodepool | STABLE |
| karpenter_nodeclaims_instance_termination_duration_seconds | Duration of CloudProvider Instance termination in seconds | BETA |
| karpenter_nodeclaims_disrupted_total | Number of nodeclaims disrupted in total by Karpenter. Labeled by reason the nodeclaim was disrupted and the owning nodepool | ALPHA |
| karpenter_nodeclaims_created_total | Number of nodeclaims created in total by Karpenter. Labeled by reason the nodeclaim was created and the owning nodepool | STABLE |

| Metric | Description | Stability |
| --- | --- | --- |
| operator_nodeclaim_status_condition_transitions_total | The count of transitions of a nodeclaim, type and status. Labeled by the type, reason, and status | BETA |
| operator_nodeclaim_status_condition_transition_seconds | The amount of time a condition was in a given state before transitioning. Labeled by the name of the nodeclaim, and the namespace | BETA |
| operator_nodeclaim_status_condition_current_status_seconds | The current amount of time in seconds that a status condition has been in a specific state. Labeled by the name of the nodeclaim, namespace, type, status, and reason | BETA |
| operator_nodeclaim_status_condition_count | The number of a condition for a nodeclaim, type and status. Labeled by the name, namespace, type, status, and reason | BETA |
| operator_nodeclaim_termination_current_time_seconds | The current amount of time in seconds that a nodeclaim has been in terminating state. Labeled by name, and namespace | BETA |
| operator_nodeclaim_termination_duration_seconds | The amount of time taken by a nodeclaim to terminate completely | BETA |

Nodes metrics

| Metric | Description | Stability |
| --- | --- | --- |
| karpenter_nodes_total_pod_requests | Node total pod requests are the resources requested by pods bound to nodes, including the DaemonSet pods | BETA |
| karpenter_nodes_total_pod_limits | Node total pod limits are the resources specified by pod limits, including the DaemonSet pods | BETA |
| karpenter_nodes_total_daemon_requests | Node total daemon requests are the resources requested by DaemonSet pods bound to nodes | BETA |
| karpenter_nodes_total_daemon_limits | Node total daemon limits are the resources specified by DaemonSet pod limits | BETA |
| karpenter_nodes_termination_duration_seconds | The time taken between a node’s deletion request and the removal of its finalizer | BETA |
| karpenter_nodes_terminated_total | Number of nodes terminated in total by Karpenter. Labeled by owning nodepool | STABLE |
| karpenter_nodes_system_overhead | Node system daemon overhead is the resources reserved for system overhead: the difference between the node’s capacity and allocatable values as reported by the status | BETA |
| karpenter_nodes_lifetime_duration_seconds | The lifetime duration of the nodes since creation | ALPHA |
| karpenter_nodes_eviction_requests_total | The total number of eviction requests made by Karpenter | ALPHA |
| karpenter_nodes_drained_total | The total number of nodes drained by Karpenter | ALPHA |
| karpenter_nodes_current_lifetime_seconds | Node age in seconds | ALPHA |
| karpenter_nodes_created_total | Number of nodes created in total by Karpenter. Labeled by owning nodepool | STABLE |
| karpenter_nodes_allocatable | Node allocatable are the resources allocatable by nodes | BETA |

| Metric | Description | Stability |
| --- | --- | --- |
| operator_node_status_condition_transitions_total | The count of transitions of a node, type and status | BETA |
| operator_node_status_condition_transition_seconds | The amount of time a condition was in a given state before transitioning. Labeled by the name of the nodeclaim, and the namespace | BETA |
| operator_node_status_condition_current_status_seconds | The current amount of time in seconds that a status condition has been in a specific state. Labeled by the name of the nodeclaim, namespace, type, status, and reason | BETA |
| operator_node_status_condition_count | The number of a condition for a node, type and status. Labeled by the name, namespace, type, status, and reason | BETA |
| operator_node_termination_current_time_seconds | The current amount of time in seconds that a node has been in terminating state. Labeled by name, and namespace | BETA |
| operator_node_termination_duration_seconds | The amount of time taken by a node to terminate completely | BETA |
| operator_node_event_count | The number of events for a node | BETA |
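
The request and allocatable gauges can be combined into a per-node utilization ratio. This sketch of a Prometheus recording rule assumes both gauges expose identical label sets, so the division matches series one-to-one; if they do not, the expression returns nothing. The group and rule names are hypothetical.

```yaml
groups:
  - name: karpenter-node-utilization     # hypothetical group name
    rules:
      # Fraction of each node's allocatable resources requested by pods.
      # Assumes identical label sets on both sides of the division.
      - record: karpenter:nodes_pod_requests:ratio   # hypothetical rule name
        expr: karpenter_nodes_total_pod_requests / karpenter_nodes_allocatable
```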

Pods metrics

| Metric | Description | Stability |
| --- | --- | --- |
| karpenter_pods_state | Pod state is the current state of pods. This metric can be used in several ways, as it is labeled by the pod name, namespace, owner, node, nodepool name, zone, architecture, capacity type, instance type and pod phase | BETA |
| karpenter_pods_startup_duration_seconds | The time from pod creation until the pod is running | STABLE |

Voluntary disruption metrics

| Metric | Description | Stability |
| --- | --- | --- |
| karpenter_voluntary_disruption_queue_failures_total | The number of times that an enqueued disruption decision failed. Labeled by disruption method | BETA |
| karpenter_voluntary_disruption_eligible_nodes | Number of nodes eligible for disruption by Karpenter. Labeled by disruption reason | BETA |
| karpenter_voluntary_disruption_decisions_total | Number of disruption decisions performed. Labeled by disruption decision, reason, and consolidation type | STABLE |
| karpenter_voluntary_disruption_decision_evaluation_duration_seconds | Duration of the disruption decision evaluation process in seconds. Labeled by method and consolidation type | BETA |
| karpenter_voluntary_disruption_consolidation_timeouts_total | Number of times the Consolidation algorithm has reached a timeout. Labeled by consolidation type | BETA |
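
The two failure-style counters in this table lend themselves to simple alerts. A sketch, treating both as monotonically increasing counters as their descriptions suggest. Windows, thresholds, and all names other than the metrics are hypothetical and should be tuned per cluster.

```yaml
groups:
  - name: karpenter-disruption           # hypothetical group name
    rules:
      - alert: KarpenterDisruptionQueueFailures      # hypothetical
        # Any failed disruption decision in the last 15 minutes, across all methods.
        expr: sum(increase(karpenter_voluntary_disruption_queue_failures_total[15m])) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Enqueued disruption decisions are failing.
      - alert: KarpenterConsolidationTimeouts        # hypothetical
        # More than 5 consolidation timeouts per hour (illustrative threshold).
        expr: sum(increase(karpenter_voluntary_disruption_consolidation_timeouts_total[1h])) > 5
        labels:
          severity: info
        annotations:
          summary: Consolidation is repeatedly timing out.
```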

Scheduler metrics

| Metric | Description | Stability |
| --- | --- | --- |
| karpenter_scheduler_scheduling_duration_seconds | Duration of scheduling simulations used for deprovisioning and provisioning in seconds | STABLE |
| karpenter_scheduler_queue_depth | The number of pods currently waiting to be scheduled | BETA |
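
A queue depth that stays above zero suggests pods are waiting longer than a single scheduling pass. One possible alert is sketched below; the 15-minute duration, alert name, and severity are assumptions.

```yaml
groups:
  - name: karpenter-scheduler            # hypothetical group name
    rules:
      - alert: KarpenterSchedulerBacklog # hypothetical
        # max() collapses any per-series labels so one alert fires for the whole cluster.
        expr: max(karpenter_scheduler_queue_depth) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Pods have been waiting to be scheduled for 15 minutes.
```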

NodePools metrics

| Metric | Description | Stability |
| --- | --- | --- |
| karpenter_nodepools_usage | The amount of resources that have been provisioned for a nodepool. Labeled by nodepool name and resource type | ALPHA |
| karpenter_nodepools_limit | Limits specified on the nodepool that restrict the quantity of resources provisioned. Labeled by nodepool name and resource type | ALPHA |
| karpenter_nodepools_allowed_disruptions | The number of nodes for a given NodePool that can be concurrently disrupting at a point in time. Labeled by NodePool. Note that allowed disruptions can change very rapidly, as new nodes may be created and others may be deleted at any point | ALPHA |

| Metric | Description | Stability |
| --- | --- | --- |
| operator_nodepool_status_condition_transitions_total | The count of transitions of a nodepool, type and status. Labeled by the type, reason, and status | BETA |
| operator_nodepool_status_condition_transition_seconds | The amount of time a condition was in a given state before transitioning. Labeled by the name of the nodeclaim, and the namespace | BETA |
| operator_nodepool_status_condition_current_status_seconds | The current amount of time in seconds that a status condition has been in a specific state. Labeled by the name of the nodeclaim, namespace, type, status, and reason | BETA |
| operator_nodepool_status_condition_count | The number of a condition for a nodepool, type and status. Labeled by the name, namespace, type, status, and reason | BETA |
| operator_nodepool_termination_current_time_seconds | The current amount of time in seconds that a nodepool has been in terminating state. Labeled by name, and namespace | BETA |
| operator_nodepool_termination_duration_seconds | Duration of NodePool termination in seconds | BETA |
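
Since usage and limit are both labeled by nodepool name and resource type, dividing one by the other gives how close each NodePool is to its limit. A sketch, assuming the two gauges share exactly the same label names so the series pair up. The 90% threshold and the rule names are hypothetical.

```yaml
groups:
  - name: karpenter-nodepools            # hypothetical group name
    rules:
      - alert: KarpenterNodePoolNearLimit   # hypothetical
        # Fires per nodepool and resource type once usage exceeds 90% of the limit.
        # NodePools without a limit produce no limit series and so never match.
        expr: karpenter_nodepools_usage / karpenter_nodepools_limit > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: A NodePool is above 90% of a configured resource limit.
```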

EC2NodeClass metrics

| Metric | Description | Stability |
| --- | --- | --- |
| operator_ec2nodeclass_status_condition_transitions_total | The count of transitions of an ec2nodeclass, type and status. Labeled by the type, reason, and status | BETA |
| operator_ec2nodeclass_status_condition_transition_seconds | The amount of time a condition was in a given state before transitioning. Labeled by the name of the nodeclaim, and the namespace | BETA |
| operator_ec2nodeclass_status_condition_current_status_seconds | The current amount of time in seconds that a status condition has been in a specific state. Labeled by the name of the nodeclaim, namespace, type, status, and reason | BETA |
| operator_ec2nodeclass_status_condition_count | The number of a condition for an ec2nodeclass, type and status. Labeled by the name, namespace, type, status, and reason | BETA |
| operator_ec2nodeclass_termination_current_time_seconds | The current amount of time in seconds that an ec2nodeclass has been in terminating state. Labeled by name, and namespace | BETA |
| operator_ec2nodeclass_termination_duration_seconds | Duration of ec2nodeclass termination in seconds | BETA |

Interruption metrics

| Metric | Description | Stability |
| --- | --- | --- |
| karpenter_interruption_received_messages_total | Count of messages received from the SQS queue. Broken down by message type and whether the message was actionable | STABLE |
| karpenter_interruption_message_queue_duration_seconds | Amount of time an interruption message is on the queue before it is processed by Karpenter | STABLE |
| karpenter_interruption_deleted_messages_total | Count of messages deleted from the SQS queue | STABLE |

Cluster metrics

| Metric | Description | Stability |
| --- | --- | --- |
| karpenter_cluster_utilization_percent | Utilization of allocatable resources by pod requests | ALPHA |

Cluster state metrics

| Metric | Description | Stability |
| --- | --- | --- |
| karpenter_cluster_state_unsynced_time_seconds | The time for which cluster state is not synced | ALPHA |
| karpenter_cluster_state_synced | Returns 1 if cluster state is synced and 0 otherwise. Synced checks that nodeclaims and nodes that are stored in the APIServer have the same representation as Karpenter’s cluster state | STABLE |
| karpenter_cluster_state_node_count | Current count of nodes in cluster state | STABLE |
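
karpenter_cluster_state_synced reports 0 whenever Karpenter's view diverges from the API server, which makes it a natural health signal. A minimal alerting sketch follows; the 5-minute grace period, alert name, and severity are assumptions.

```yaml
groups:
  - name: karpenter-cluster-state        # hypothetical group name
    rules:
      - alert: KarpenterClusterStateNotSynced   # hypothetical
        # The gauge is 1 when synced and 0 otherwise, per the table above.
        expr: karpenter_cluster_state_synced == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Karpenter cluster state has been out of sync for 5 minutes.
```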

Cloud provider metrics

| Metric | Description | Stability |
| --- | --- | --- |
| karpenter_cloudprovider_instance_type_offering_price_estimate | Instance type offering estimated hourly price used when making informed decisions on node cost calculation, based on instance type, capacity type, and zone | BETA |
| karpenter_cloudprovider_instance_type_offering_available | Instance type offering availability, based on instance type, capacity type, and zone | BETA |
| karpenter_cloudprovider_instance_type_memory_bytes | Memory, in bytes, for a given instance type | BETA |
| karpenter_cloudprovider_instance_type_cpu_cores | vCPU cores for a given instance type | BETA |
| karpenter_cloudprovider_errors_total | Total number of errors returned from CloudProvider calls | BETA |
| karpenter_cloudprovider_duration_seconds | Duration of cloud provider method calls. Labeled by the controller, method name and provider | BETA |

| Metric | Description | Stability |
| --- | --- | --- |
| karpenter_cloudprovider_batcher_batch_time_seconds | Duration of the batching window per batcher | BETA |
| karpenter_cloudprovider_batcher_batch_size | Size of the request batch per batcher | BETA |
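
A sustained error rate from cloud provider calls is often the first sign of throttling or permission problems. This sketch alerts on any non-zero rate held for ten minutes. The window, duration, and names are hypothetical, and occasional errors may be normal in some environments.

```yaml
groups:
  - name: karpenter-cloudprovider        # hypothetical group name
    rules:
      - alert: KarpenterCloudProviderErrors   # hypothetical
        # Aggregates over all labels; add a `by (...)` clause to split by label if needed.
        expr: sum(rate(karpenter_cloudprovider_errors_total[5m])) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: CloudProvider calls have been returning errors for 10 minutes.
```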

Controller runtime metrics

| Metric | Description | Stability |
| --- | --- | --- |
| controller_runtime_terminal_reconcile_errors_total | Total number of terminal reconciliation errors per controller | STABLE |
| controller_runtime_reconcile_total | Total number of reconciliations per controller | STABLE |
| controller_runtime_reconcile_time_seconds | Length of time per reconciliation per controller | STABLE |
| controller_runtime_reconcile_panics_total | Total number of reconciliation panics per controller | STABLE |
| controller_runtime_reconcile_errors_total | Total number of reconciliation errors per controller | STABLE |
| controller_runtime_max_concurrent_reconciles | Maximum number of concurrent reconciles per controller | STABLE |
| controller_runtime_conversion_webhook_panics_total | Total number of conversion webhook panics | STABLE |
| controller_runtime_active_workers | Number of currently used workers per controller | STABLE |
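
Dividing the error counter by the reconcile counter gives a per-controller error ratio. A sketch, assuming the per-controller label is named `controller`. The 10% threshold and rule names are hypothetical.

```yaml
groups:
  - name: karpenter-controller-runtime   # hypothetical group name
    rules:
      - alert: KarpenterReconcileErrorRatioHigh   # hypothetical
        # Share of reconciliations that ended in an error, per controller.
        # Controllers with no reconciliations in the window yield NaN and do not fire.
        expr: |
          sum by (controller) (rate(controller_runtime_reconcile_errors_total[5m]))
            /
          sum by (controller) (rate(controller_runtime_reconcile_total[5m]))
            > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: More than 10% of reconciliations are failing for a controller.
```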

Workqueue metrics

| Metric | Description | Stability |
| --- | --- | --- |
| workqueue_work_duration_seconds | How long in seconds processing an item from workqueue takes | STABLE |
| workqueue_unfinished_work_seconds | How many seconds of work has been done that is in progress and hasn’t been observed by work_duration. Large values indicate stuck threads | STABLE |
| workqueue_retries_total | Total number of retries handled by workqueue | STABLE |
| workqueue_queue_duration_seconds | How long in seconds an item stays in workqueue before being requested | STABLE |
| workqueue_longest_running_processor_seconds | How many seconds has the longest running processor for workqueue been running | STABLE |
| workqueue_depth | Current depth of workqueue by workqueue and priority | STABLE |
| workqueue_adds_total | Total number of adds handled by workqueue | STABLE |
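
The note that large workqueue_unfinished_work_seconds values indicate stuck threads can be turned into an alert. In this sketch the 300-second threshold is purely illustrative, and the names are hypothetical.

```yaml
groups:
  - name: karpenter-workqueue            # hypothetical group name
    rules:
      - alert: KarpenterWorkqueueStuck   # hypothetical
        # In-progress work that has not completed for over 5 minutes (illustrative threshold).
        expr: workqueue_unfinished_work_seconds > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: A workqueue has in-progress work that is not completing.
```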

Status condition metrics (deprecated)

These metrics are deprecated. Migrate to the per-resource status condition metrics (e.g., operator_nodeclaim_status_condition_*, operator_node_status_condition_*).
| Metric | Description | Stability |
| --- | --- | --- |
| operator_status_condition_transitions_total | The count of transitions of a given object, type and status | DEPRECATED |
| operator_status_condition_transition_seconds | The amount of time a condition was in a given state before transitioning | DEPRECATED |
| operator_status_condition_current_status_seconds | The current amount of time in seconds that a status condition has been in a specific state | DEPRECATED |
| operator_status_condition_count | The number of a condition for a given object, type and status | DEPRECATED |
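
Migrating usually means swapping the generic metric name for the per-resource one in existing rules and dashboards. The following sketch shows the idea for NodeClaims, assuming the type and status labels are literally named `type` and `status`. The group and rule names are hypothetical, and any extra filtering your old query applied would need to be carried over.

```yaml
groups:
  - name: karpenter-status-conditions    # hypothetical group name
    rules:
      - record: karpenter:nodeclaim_conditions:count   # hypothetical rule name
        # Before (deprecated): sum by (type, status) (operator_status_condition_count)
        # After: the NodeClaim-specific metric from this reference.
        expr: sum by (type, status) (operator_nodeclaim_status_condition_count)
```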

Termination metrics (deprecated)

These metrics are deprecated. Use the per-resource termination metrics instead.
| Metric | Description | Stability |
| --- | --- | --- |
| operator_termination_duration_seconds | The amount of time taken by an object to terminate completely | DEPRECATED |
| operator_termination_current_time_seconds | The current amount of time in seconds that an object has been in terminating state | DEPRECATED |

Kubernetes client metrics

| Metric | Description | Stability |
| --- | --- | --- |
| client_go_request_total | Number of HTTP requests, partitioned by status code and method | STABLE |
| client_go_request_duration_seconds | Request latency in seconds. Broken down by verb, group, version, kind, and subresource | STABLE |

AWS SDK metrics

| Metric | Description | Stability |
| --- | --- | --- |
| aws_sdk_go_request_total | The total number of AWS SDK Go requests | STABLE |
| aws_sdk_go_request_retry_count | The total number of AWS SDK Go retry attempts per request | STABLE |
| aws_sdk_go_request_duration_seconds | Latency of AWS SDK Go requests | STABLE |
| aws_sdk_go_request_attempt_total | The total number of AWS SDK Go request attempts | STABLE |
| aws_sdk_go_request_attempt_duration_seconds | Latency of AWS SDK Go request attempts | STABLE |

Leader election metrics

| Metric | Description | Stability |
| --- | --- | --- |
| leader_election_slowpath_total | Total number of slow paths exercised in renewing leader leases. name is the string used to identify the lease. Group by name | STABLE |
| leader_election_master_status | Gauge of whether the reporting system is master of the relevant lease. 0 indicates backup, 1 indicates master. Group by name | STABLE |
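
Summing leader_election_master_status by lease name shows how many replicas believe they hold each lease; anything other than exactly one is worth a look. A sketch, using the `name` label mentioned in the table. The duration, alert name, and severity are assumptions, and if no replica is scraped at all the expression returns nothing rather than firing.

```yaml
groups:
  - name: karpenter-leader-election      # hypothetical group name
    rules:
      - alert: KarpenterLeaderCountUnexpected   # hypothetical
        # 0 means no replica holds the lease; more than 1 suggests a split view.
        expr: sum by (name) (leader_election_master_status) != 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: A leader lease does not have exactly one holder.
```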
