This guide covers common issues you may encounter when running Karpenter, grouped by category. Expand any section to see the problem description and resolution steps.

Controller

Update the LOG_LEVEL environment variable on the Karpenter deployment, then restart it. You can also enable debug logging at install time with Helm:
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --set logLevel=debug \
  ...
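If you prefer to patch the running deployment instead of reinstalling, the environment variable looks like this (a minimal fragment of the controller container spec; container and key names may vary by chart version):

```yaml
# Fragment of the Karpenter Deployment's container spec:
# setting LOG_LEVEL to debug enables verbose logging.
env:
  - name: LOG_LEVEL
    value: "debug"
```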

Installation

Unless your AWS account has already onboarded to EC2 Spot, you need to create the service linked role to avoid ServiceLinkedRoleCreationNotPermitted.
AuthFailure.ServiceLinkedRoleCreationNotPermitted: The provided credentials do not have permission to create the service-linked role for EC2 Spot Instances
Create the Service Linked Role:
aws iam create-service-linked-role --aws-service-name spot.amazonaws.com
Checking EC2 API connectivity, WebIdentityErr: failed to retrieve credentials
caused by: RequestError: send request failed
caused by: Post "https://sts.us-east-1.amazonaws.com/": dial tcp: lookup sts.us-east-1.amazonaws.com: i/o timeout
This error means Karpenter cannot reach the STS endpoint due to failed DNS resolution. This typically happens when Karpenter is running with dnsPolicy: ClusterFirst and your in-cluster DNS service is not yet running. You have two options to resolve this:
  1. Let Karpenter manage your in-cluster DNS service — Change Karpenter’s dnsPolicy to Default (--set dnsPolicy=Default with Helm). This causes Karpenter to use the VPC DNS service directly, allowing it to start up without the DNS application pods running.
  2. Let MNG/Fargate manage your in-cluster DNS service — If running with MNG, ensure the node group has enough capacity to support the DNS application pods with the correct tolerations. If running with Fargate, ensure you have a Fargate profile that selects the DNS application pods.
If you use a tool like AWS CDK to generate your cluster name, the resulting Karpenter node role name may exceed IAM's 64-character limit. Node role names follow the pattern KarpenterNodeRole-${Cluster_Name}. If a long cluster name pushes this past 64 characters, object creation will fail.
KarpenterNodeRole- is just a recommendation from the getting started guide. You can shorten the name to anything you like, as long as it has the correct permissions.
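A quick way to check whether a generated name will fit, shown as a plain-shell sketch (the cluster name below is a placeholder for your own value):

```shell
# IAM role names are limited to 64 characters.
CLUSTER_NAME="my-cdk-generated-cluster-name-that-is-quite-long-indeed-yes"
ROLE_NAME="KarpenterNodeRole-${CLUSTER_NAME}"
if [ "${#ROLE_NAME}" -gt 64 ]; then
  echo "too long: ${#ROLE_NAME} characters"
else
  echo "ok: ${#ROLE_NAME} characters"
fi
```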
When upgrading from an older version of Karpenter, CRD changes between versions may cause this error:
Error from server (BadRequest): error when creating "STDIN": NodePool in version "v1" cannot be handled as a NodePool: strict decoding error: unknown field "spec.template.spec.nodeClassRef.foo"
Follow the Custom Resource Definition Upgrade Guidance to resolve this. Check the Release Notes for CRD changes between versions.
Karpenter 0.16.0 changed the default replica count from 1 to 2. Karpenter will not launch capacity to run itself (due to the karpenter.sh/nodepool DoesNotExist requirement), so it cannot provision for the second Karpenter pod. To resolve this, either:
  • Reduce the replica count from 2 to 1, or
  • Ensure there is enough non-Karpenter-managed capacity to run both pods. On AWS, increase the minimum and desired parameters on the node group autoscaling group to launch at least 2 instances.
If Helm shows an error when installing Karpenter Helm charts:
  • Ensure you are using Helm 3.8.0 or newer (OCI image support was added in this release).
  • Helm does not have a helm repo add concept for OCI, so you no longer need that step.
  • If you see Error: public.ecr.aws/karpenter/karpenter:0.34.0: not found, add a v prefix for Karpenter versions between 0.17.0 and 0.34.x.
  • Verify the image exists in gallery.ecr.aws/karpenter.
  • Add the --debug flag to Helm commands for more verbose error messages.
  • For 403 Forbidden errors, run docker logout public.ecr.aws as described in the ECR troubleshooting docs.
Karpenter 0.26.1 introduced the karpenter-crd Helm chart. If you previously added Karpenter CRDs to your cluster through the controller chart or via kubectl replace, Helm will reject the install due to invalid ownership metadata. For invalid ownership metadata; label validation error: missing key "app.kubernetes.io/managed-by":
kubectl label crd ec2nodeclasses.karpenter.k8s.aws nodepools.karpenter.sh nodeclaims.karpenter.sh \
  app.kubernetes.io/managed-by=Helm --overwrite
For annotation validation error: missing key "meta.helm.sh/release-namespace":
KARPENTER_NAMESPACE=kube-system
kubectl annotate crd ec2nodeclasses.karpenter.k8s.aws nodepools.karpenter.sh nodeclaims.karpenter.sh \
  meta.helm.sh/release-name=karpenter-crd --overwrite
kubectl annotate crd ec2nodeclasses.karpenter.k8s.aws nodepools.karpenter.sh nodeclaims.karpenter.sh \
  meta.helm.sh/release-namespace="${KARPENTER_NAMESPACE}" --overwrite

Uninstallation

Karpenter adds a finalizer to nodes it provisions to support graceful termination. After uninstalling Karpenter, these finalizers cause the API Server to block deletion until they are removed. Fix this by patching the node objects using either method:
  • Edit the node manually and remove the karpenter.sh/termination line from the finalizers field:
    kubectl edit node <node_name>
    
  • Or run the following script to remove the finalizer from all Karpenter-managed nodes:
    This removes ALL finalizers from nodes that have the Karpenter finalizer.
    kubectl get nodes -ojsonpath='{range .items[*].metadata}{@.name}:{@.finalizers}{"\n"}' \
      | grep "karpenter.sh/termination" \
      | cut -d ':' -f 1 \
      | xargs kubectl patch node --type='json' -p='[{"op": "remove", "path": "/metadata/finalizers"}]'
    

Provisioning

Some instance types (e.g. c1.medium and m1.small) are configured with a swap volume, which causes the kubelet to fail on launch:
"command failed" err="failed to run Kubelet: running with swap on is not supported, please disable swap!..."
Disabling swap allows the kubelet to join the cluster. Consider adjusting your NodePool requirements to use larger instance types and avoid these swap-equipped instances. See Instance Store swap volumes for details.
For Karpenter versions earlier than 0.5.3, DaemonSets were not properly considered when provisioning nodes, sometimes causing nodes to be deployed that could not meet DaemonSet and workload requirements. This issue was resolved in 0.5.3 (PR #1155). If you are on a pre-0.5.3 version, a workaround is to configure your NodePool to use only instance types that you know will be big enough for both the DaemonSets and the workload.
If pods have very low or non-existent resource requests, Karpenter will pack too many pods onto the same node, leading to CPU throttling or OOM kills. This behavior is not unique to Karpenter — it also affects the standard kube-scheduler.
Use Kubernetes LimitRanges to enforce minimum resource request sizes on a per-namespace basis. See the Karpenter Best Practices Guide for more information.
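As a sketch, a LimitRange that applies default requests to containers that specify none might look like this (the namespace and values are illustrative):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-requests
  namespace: my-app        # illustrative namespace
spec:
  limits:
    - type: Container
      defaultRequest:      # applied when a container sets no requests
        cpu: 100m
        memory: 128Mi
```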
When using Security Groups for Pods, pods may be stuck in ContainerCreating for up to 30 minutes before transitioning to Running. This is caused by an interaction between Karpenter and the amazon-vpc-resource-controller when a pod requests vpc.amazonaws.com/pod-eni resources. As a workaround, add the vpc.amazonaws.com/has-trunk-attached: "false" label to your NodePool spec and ensure instance type requirements include instance types that support ENI trunking:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    metadata:
      labels:
        vpc.amazonaws.com/has-trunk-attached: "false"
When scheduling a large number of pods with PersistentVolumes, pods may co-locate on the same node and report errors like:
Warning   FailedAttachVolume    pod/example-pod    AttachVolume.Attach failed for volume "***" : ...
Warning   FailedMount           pod/example-pod    Unable to attach or mount volumes: ...
There are two causes:

In-tree storage plugins (unsupported by Karpenter)
Karpenter does not support in-tree storage plugins. If you use a StorageClass with a provisioner like kubernetes.io/aws-ebs or a PV with AWSElasticBlockStore, Karpenter cannot discover volume attachment limits and may schedule too many pods to a node. You will see log messages like:
ERROR   controller.node_state   StorageClass .spec.provisioner uses an in-tree storage plugin which is unsupported by Karpenter...
Upgrade your StorageClasses and PersistentVolumes to use CSI drivers (e.g. ebs.csi.aws.com).

Race condition between the scheduler and CSINode
Due to a race condition in Kubernetes, the scheduler may assume a node can mount more volumes than it actually supports. Enforce topologySpreadConstraints and podAntiAffinity on workloads using PVCs to reduce co-location. Some CSI drivers support a startupTaint to eliminate this race; configure these via startupTaints on your NodePool. For EBS:
apiVersion: karpenter.sh/v1
kind: NodePool
spec:
  template:
    spec:
      startupTaints:
        - key: ebs.csi.aws.com/agent-not-ready
          effect: NoExecute
This guidance is specific to the VPC CNI shipped by default with EKS clusters. If you are using a custom CNI, some of this may not apply.
You may see an error like:
time=2023-06-12T19:18:15Z type=Warning reason=FailedCreatePodSandBox from=kubelet message=Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "...": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
maxPods is greater than the node's supported pod density
The number of pods on a node is limited by the number of ENIs attachable to the instance type and the number of IPs per ENI. If maxPods in your EC2NodeClass kubeletConfiguration exceeds the supported IP count for a given instance type, the CNI will fail to assign an IP and pods will be stuck in ContainerCreating.

If you have enabled Security Groups per Pod, one ENI is reserved as the trunk interface, which further reduces the available IPs. Karpenter does not account for this reservation.

To resolve:
  1. Enable Prefix Delegation to increase allocatable IPs per ENI.
  2. Reduce your maxPods value to be within the instance type’s pod density limit.
  3. Remove the maxPods value from kubeletConfiguration to rely on Karpenter and EKS AMI defaults.
  4. Set RESERVED_ENIS=1 in your Karpenter configuration when using Security Groups for Pods.
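The ENI-based pod density limit (without prefix delegation) follows the standard formula maxPods = ENIs * (IPs per ENI - 1) + 2. A plain-shell sketch using m5.large's published limits (3 ENIs, 10 IPv4 addresses per ENI) for illustration:

```shell
# ENI-based pod density: maxPods = enis * (ips_per_eni - 1) + 2
# Values below are the published limits for m5.large.
ENIS=3
IPS_PER_ENI=10
MAX_PODS=$(( ENIS * (IPS_PER_ENI - 1) + 2 ))
echo "max pods: ${MAX_PODS}"

# With Security Groups for Pods, one ENI is reserved as the trunk,
# reducing the ENIs available for pod IPs (a sketch of the effect):
MAX_PODS_TRUNK=$(( (ENIS - 1) * (IPS_PER_ENI - 1) + 2 ))
echo "max pods with trunk ENI reserved: ${MAX_PODS_TRUNK}"
```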
IP exhaustion in a subnet
When a subnet becomes IP-constrained, EC2 may still launch an instance, but the CNI cannot assign IPs to pods. Pods will remain in ContainerCreating until an IP is freed.

To resolve:
  1. Use topologySpreadConstraints on topology.kubernetes.io/zone to spread pods and nodes across zones.
  2. Increase the IP address space (CIDR) for subnets in your EC2NodeClass.
  3. Use custom networking to assign separate IP spaces to pods and nodes.
  4. Run your EKS cluster on IPv6.
See the EKS CreatePodSandbox Knowledge Center Post for additional guidance.
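For the first option above, a zonal spread constraint on a workload might look like this (a minimal sketch; the label selector is illustrative):

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app   # illustrative label
```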
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "...": plugin type="vpc-bridge" name="vpc" failed (add): failed to parse Kubernetes args: pod does not have label vpc.amazonaws.com/PrivateIPv4Address
This typically occurs when Windows support has not been enabled on the cluster. See Enabling Windows support for instructions.
Failed to pull image "mcr.microsoft.com/windows/servercore:xxx": rpc error: code = NotFound desc = failed to pull and unpack image "...": no match for platform in manifest: not found
This occurs when a pod with a given container OS version is scheduled on an incompatible Windows host OS version. Windows requires the host OS version to match the container OS version. Define your pod's nodeSelector to ensure containers are scheduled on a compatible host OS version. See Windows container version compatibility for details.
If DNS resolution works for Linux pods but not for Windows pods, verify that the instance role of the Windows node includes the RBAC permission group eks:kube-proxy-windows:
username: system:node:{{EC2PrivateDNSName}}
groups:
  - system:bootstrappers
  - system:nodes
  - eks:kube-proxy-windows # Required for Windows DNS resolution
This group is required because in Windows, kube-proxy runs as a process on the node and needs RBAC cluster permissions to access required resources. See the EKS Windows support docs for more information.
The allocatable resources Karpenter computes (visible in logs and nodeClaim.status.allocatable) may not always match the actual allocatable resources on the node (node.status.allocatable) due to memory reserved by the hypervisor and OS.

Karpenter uses ec2:DescribeInstanceTypes and a cache of observed memory capacity. For the first launch of a given instance type + AMI pair, the VM_MEMORY_OVERHEAD_PERCENT setting is used as a fallback (default: 7.5%). After a node is created, the actual memory capacity is cached and used for future launches of the same pair.

The default 7.5% value is tuned to avoid overestimation across most instance types, meaning Karpenter will typically slightly underestimate available memory. If you know the exact overhead for your instances, you can tune this value, but do so with caution: overestimating memory can cause Karpenter to launch nodes that are too small for your workloads.

To detect cases where Karpenter is overestimating resource availability, monitor this status condition:
kubectl get nodeclaim $NODECLAIM_NAME -o jsonpath='{.status.conditions[?(@.type=="ConsistentStateFound")]}'
Or monitor via the metric:
operator_status_condition_count{type="ConsistentStateFound",kind="NodeClaim",status="False"}
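The fallback estimate described above can be sketched as plain arithmetic: the instance type's advertised memory is discounted by VM_MEMORY_OVERHEAD_PERCENT before other reservations are subtracted. For illustration, with an instance type advertising 8192 MiB:

```shell
# Fallback capacity estimate (sketch): advertised memory reduced by
# VM_MEMORY_OVERHEAD_PERCENT (default 7.5%).
ADVERTISED_MIB=8192
OVERHEAD_TENTHS=75    # 7.5% expressed in tenths of a percent
ESTIMATED_MIB=$(( ADVERTISED_MIB * (1000 - OVERHEAD_TENTHS) / 1000 ))
echo "estimated memory capacity: ${ESTIMATED_MIB} MiB"
```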
When scheduling pods with TopologySpreadConstraints, Karpenter derives eligible domains from the pod's requirements, not from the compatible NodePools. This can result in Karpenter attempting to provision capacity in domains that no compatible NodePool can actually serve.

For example, if a pod has no zonal constraints but its only compatible NodePool is restricted to two out of three zones, Karpenter will succeed for the first two replicas but fail for any replica that requires placement in the third zone.

To resolve this, ensure that all eligible domains for a pod can be provisioned by compatible NodePools, or add matching zonal constraints to the pod spec:
nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
      - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ['us-east-1a', 'us-east-1b']
This error indicates that no available instance type meets the pod's scheduling requirements. Common causes:
  • The pod has resource requests that require a minimum instance size, but the NodePool is restricted to an instance family or size that cannot satisfy them.
  • DaemonSet resource requests are accounted for when evaluating instance compatibility and may push the minimum required size above what is available.
  • The pod is restricted to a specific availability zone where the required capacity type is not available. This commonly happens with StatefulSet pods that had an EBS volume attached in a different AZ than the one currently being targeted.

Deprovisioning

Several conditions can prevent Karpenter from deprovisioning a node:

Node not initialized
Karpenter only considers nodes for deprovisioning that have the karpenter.sh/initialized label set. If this label is absent, the node will not be eligible for deprovisioning. See the Nodes not initialized section for details.

Pod Disruption Budgets (PDBs)
Karpenter respects PDBs using a backoff retry eviction strategy. Pods that fail to shut down will block node deprovisioning. For example:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  minAvailable: 4
  selector:
    matchLabels:
      app: myapp
Review what disruptions are and how to configure PDBs.

karpenter.sh/do-not-disrupt annotation
If any pod on a node has the annotation karpenter.sh/do-not-disrupt: "true", Karpenter will not drain pods from or delete that node. To resume deprovisioning, remove the annotation from the pod.

Scheduling constraints (consolidation only)
Consolidation will not proceed if its scheduling simulation determines that the pods on a node cannot run on other nodes due to inter-pod affinity/anti-affinity, topology spread constraints, or other scheduling restrictions.
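For reference, the annotation on a pod looks like this (a minimal fragment; the name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: critical-job   # illustrative name
  annotations:
    karpenter.sh/do-not-disrupt: "true"
spec:
  containers:
    - name: main
      image: public.ecr.aws/docker/library/busybox:latest  # illustrative image
```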

Node launch and readiness

Karpenter may fail to start a node due to a misconfiguration. For example, providing an incorrect block storage device name in a custom launch template produces an error like:
2022-01-19T18:22:23.366Z ERROR controller.provisioning Could not launch node, launching instances, with fleet error(s), InvalidBlockDeviceMapping: Invalid device name /dev/xvda; ...
View Karpenter controller logs to diagnose:
kubectl get pods -A | grep karpenter
kubectl logs karpenter-XXXX -c controller -n karpenter | less
Karpenter uses node initialization to determine when to use the real node capacity for scheduling and when to begin considering nodes for consolidation. A node is considered initialized when all three of the following conditions are true:
  1. Node readiness — The Ready condition type is True.
  2. Expected resources are registered — All expected resources from ec2:DescribeInstanceTypes appear in node.status.allocatable with a non-zero quantity.
  3. Startup taints are removed — All taints in .spec.template.spec.startupTaints of the NodePool have been removed from node.spec.taints.
Common resources that prevent initialization:
  • nvidia.com/gpu: GPU instance launched but the device plugin DaemonSet is not installed.
  • vpc.amazonaws.com/pod-eni: Instance launched but ENABLE_POD_ENI is set to false in the vpc-cni plugin, so the resource is never registered.
A node may start but fail to join the cluster and be marked NotReady. Common causes include misconfigured permissions, security groups, or networking.
Step 1: Connect to the instance and check kubelet logs

For an AL2-based node:
# List nodes managed by Karpenter
kubectl get node -l karpenter.sh/nodepool

# Extract the instance ID
INSTANCE_ID=$(kubectl get node <node-name> -ojson | jq -r ".spec.providerID" | cut -d \/ -f5)

# Connect to the instance
aws ssm start-session --target $INSTANCE_ID

# Check kubelet logs
sudo journalctl -u kubelet
For a Bottlerocket node:
INSTANCE_ID=$(kubectl get node <node-name> -ojson | jq -r ".spec.providerID" | cut -d \/ -f5)
aws ssm start-session --target $INSTANCE_ID
enter-admin-container
journalctl -D /.bottlerocket/rootfs/var/log/journal -u kubelet.service
Step 2: Check for CNI/IAM errors

If you see:
KubeletNotReady runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
This can reflect an IAM role permissions issue. See Amazon EKS node IAM role. Also check the CNI plugin logs:
kubectl get pods -n kube-system | grep aws-node
kubectl logs aws-node-????? -n kube-system
Step 3: Check for API server authorization errors

If you see:
Unable to register node with API server" err="Unauthorized"
Failed to contact API server when waiting for CSINode publishing: Unauthorized
Check the aws-auth ConfigMap to ensure the Karpenter node role is mapped correctly:
kubectl get configmaps -n kube-system aws-auth -o yaml
The ConfigMap should include a mapRoles entry for your KarpenterNodeRole:
mapRoles: |
  - groups:
      - system:bootstrappers
      - system:nodes
    rolearn: arn:aws:iam::ACCOUNT_ID:role/KarpenterNodeRole-CLUSTER_NAME
    username: system:node:{{EC2PrivateDNSName}}
Step 4: Collect logs for further analysis

If the issue persists, run the EKS Logs Collector (for EKS optimized AMIs) and review:
  • UserData: /var_log/cloud-init-output.log and /var_log/cloud-init.log
  • Kubelet logs: /kubelet/kubelet.log
  • Networking pod logs: /var_log/aws-node
Reach out on Slack or GitHub if you remain stuck.
If an EC2 instance is launched but stuck in pending and the kubelet never starts, you may see this in /var/log/user-data.log:
No entry for c6i.xlarge in /etc/eks/eni-max-pods.txt
Your CNI plugin is out of date. Update it following the EKS VPC CNI update instructions.
If you are using a custom launch template or Block Device Mappings with an encrypted EBS volume, the IAM principal launching the node may lack sufficient permissions to use the KMS customer managed key (CMK) for the root volume. The node terminates almost immediately upon creation.
EBS encryption may be enabled without your knowledge — an account administrator may have enabled it by default for a region. See Encryption by default.
Apply the following policy to your KMS key to allow all authorized principals in the account to use it via EBS:
[
  {
    "Sid": "Allow access through EBS for all principals in the account that are authorized to use EBS",
    "Effect": "Allow",
    "Principal": { "AWS": "*" },
    "Action": [
      "kms:Encrypt",
      "kms:Decrypt",
      "kms:ReEncrypt*",
      "kms:GenerateDataKey*",
      "kms:CreateGrant",
      "kms:DescribeKey"
    ],
    "Resource": "*",
    "Condition": {
      "StringEquals": {
        "kms:ViaService": "ec2.${AWS_REGION}.amazonaws.com",
        "kms:CallerAccount": "${AWS_ACCOUNT_ID}"
      }
    }
  },
  {
    "Sid": "Allow direct access to key metadata to the account",
    "Effect": "Allow",
    "Principal": {
      "AWS": "arn:aws:iam::${AWS_ACCOUNT_ID}:root"
    },
    "Action": [
      "kms:Describe*",
      "kms:Get*",
      "kms:List*",
      "kms:RevokeGrant"
    ],
    "Resource": "*"
  }
]
This typically means the node has not been considered fully initialized. Check the Karpenter logs for a message like:
Inflight check failed for node...
This will provide more detail about what is preventing the node from being considered initialized.
The vpc.amazonaws.com/pod-eni resource was never reported on the node. You need to enable security groups for pods in the VPC CNI, which will cause this resource to be registered on nodes.
Karpenter does not currently support draining and terminating on spot rebalance recommendations. Users who want drain-and-terminate behavior for both spot interruptions and spot rebalance recommendations may install Node Termination Handler (NTH) alongside Karpenter.
Running both NTH and Karpenter with drain-and-terminate enabled for spot events can result in a loop: NTH removes a node for a spot rebalance recommendation, Karpenter re-launches the same instance type, which triggers another rebalance recommendation, and so on.
Karpenter does not recommend reacting to spot rebalance recommendations when running spot nodes. If you require this functionality, you can mitigate the loop by setting the following NTH values:
# Do not drain on spot interruption termination notice (IMDS mode only)
enableSpotInterruptionDraining: false

# Do not drain on rebalance recommendation (IMDS mode only)
enableRebalanceDraining: false
Alternatively, remove NTH entirely and rely on Karpenter’s built-in interruption handling.

EC2NodeClass validation

If you believe Karpenter’s EC2NodeClass validation cache is stale (for example, after updating IAM permissions), force a refresh by adding any annotation to the EC2NodeClass object. This causes Karpenter to re-validate and update its cache.

Pricing

The following error occurs when Karpenter runs in an isolated private subnet with no internet egress via an IGW or NAT gateway:
ERROR   controller.aws.pricing  updating on-demand pricing, RequestError: send request failed
caused by: Post "https://api.pricing.us-east-1.amazonaws.com/": dial tcp 52.94.231.236:443: i/o timeout
caused by: ..., using existing pricing data from 2022-08-17T00:19:52Z
This timeout occurs because there is no VPC endpoint available for the Price List Query API. Karpenter ships updated on-demand pricing data as part of its binary, so when the pricing API is unreachable, pricing data is only refreshed on Karpenter version upgrades. To suppress the error messages, set the AWS_ISOLATED_VPC environment variable (or the --aws-isolated-vpc flag) to true. See Environment variables and CLI flags for details.
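As a sketch, the environment variable on the Karpenter controller container would look like this (setting the equivalent value in your Helm values also works; the exact values key depends on your chart version):

```yaml
# Fragment of the controller container spec: suppress pricing-update
# errors when running in an isolated VPC.
env:
  - name: AWS_ISOLATED_VPC
    value: "true"
```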
