Controller
Enable debug logging
To enable debug logging, set the LOG_LEVEL environment variable to debug on the Karpenter deployment, then restart it. You can also enable debug logging at install time with Helm, for example by passing --set logLevel=debug.

Installation
Missing service linked role
If your AWS account has never used EC2 Spot before, provisioning Spot capacity can fail with the error ServiceLinkedRoleCreationNotPermitted, because the Spot service-linked role does not yet exist.
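The Spot service-linked role can be created once per account with the AWS CLI (requires IAM permission to create service-linked roles):

```shell
# Create the EC2 Spot service-linked role (one-time, per account).
aws iam create-service-linked-role --aws-service-name spot.amazonaws.com
```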
Failed resolving STS credentials with I/O timeout
This can occur if Karpenter is deployed with dnsPolicy: ClusterFirst and your in-cluster DNS service is not yet running. You have two options to resolve this:
- Let Karpenter manage your in-cluster DNS service: change Karpenter's dnsPolicy to Default (--set dnsPolicy=Default with Helm). This causes Karpenter to use the VPC DNS service directly, allowing it to start up without the DNS application pods running.
- Let MNG/Fargate manage your in-cluster DNS service: if running with MNG, ensure the node group has enough capacity to support the DNS application pods with the correct tolerations. If running with Fargate, ensure you have a Fargate profile that selects the DNS application pods.
Karpenter role names exceeding 64-character limit
IAM role names are limited to 64 characters, and the getting-started guide names the node role KarpenterNodeRole-${Cluster_Name}. If a long cluster name causes this to exceed 64 characters, object creation will fail.

Unknown field in NodePool or EC2NodeClass spec
Unable to schedule pod due to insufficient node group instances
Karpenter version 0.16.0 changed the default replica count from 1 to 2. Karpenter will not launch capacity to run itself (due to the karpenter.sh/nodepool DoesNotExist requirement), so it cannot provision for the second Karpenter pod. To resolve this, either:
- Reduce the replica count from 2 to 1, or
- Ensure there is enough non-Karpenter-managed capacity to run both pods. On AWS, increase the minimum and desired parameters on the node group autoscaling group to launch at least 2 instances.
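For the first option, the replica count can be lowered at install or upgrade time; a sketch, assuming the standard OCI chart location and a karpenter namespace:

```shell
# Run a single Karpenter replica so pre-existing capacity
# only needs to host one pod.
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter \
  --set replicas=1
```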
Helm error when pulling the chart
- Ensure you are using Helm 3.8.0 or newer (OCI image support was added in this release).
- Helm does not have a helm repo add concept for OCI, so you no longer need that step.
- If you see Error: public.ecr.aws/karpenter/karpenter:0.34.0: not found, add a v prefix for Karpenter versions between 0.17.0 and 0.34.x.
- Verify the image exists in gallery.ecr.aws/karpenter.
- Add the --debug flag to Helm commands for more verbose error messages.
- For 403 Forbidden errors, run docker logout public.ecr.aws as described in the ECR troubleshooting docs.
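For example, pulling a chart version in the affected range requires the v prefix (the version number here is illustrative):

```shell
# Karpenter versions 0.17.0 through 0.34.x are tagged with a leading "v".
helm pull oci://public.ecr.aws/karpenter/karpenter --version v0.32.1
```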
Helm error when installing the karpenter-crd chart
Karpenter version 0.26.1 introduced the karpenter-crd Helm chart. If you previously added Karpenter CRDs to your cluster through the controller chart or via kubectl replace, Helm will reject the install due to invalid ownership metadata, with errors such as:

label validation error: missing key "app.kubernetes.io/managed-by"
annotation validation error: missing key "meta.helm.sh/release-namespace"

To resolve this, add the Helm ownership metadata that the chart expects to the existing CRDs (the app.kubernetes.io/managed-by=Helm label and the meta.helm.sh/release-name and meta.helm.sh/release-namespace annotations) before installing the karpenter-crd chart.

Uninstallation
Unable to delete nodes after uninstalling Karpenter
- Edit the node manually and remove the karpenter.sh/termination line from the finalizers field, or
- Run the following script to remove the finalizer from all Karpenter-managed nodes:
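A sketch of such a script; note that the patch clears the entire finalizers field on matching nodes, so avoid it if other controllers also add finalizers to your nodes:

```shell
# Find nodes whose finalizers include karpenter.sh/termination
# and clear their finalizers so the nodes can be deleted.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}:{.metadata.finalizers}{"\n"}{end}' \
  | grep 'karpenter.sh/termination' \
  | cut -d ':' -f 1 \
  | xargs -r -n 1 kubectl patch node --type=json \
      -p='[{"op": "remove", "path": "/metadata/finalizers"}]'
```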
Provisioning
Instances with swap volumes fail to register with the control plane
Some older instance types (such as c1.medium and m1.small) are configured with a swap volume, which causes the kubelet to fail on launch.
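One way to avoid these instance types is to exclude their families in the NodePool requirements; a sketch using the karpenter.k8s.aws/instance-family label (adjust the list to the families you see failing):

```yaml
spec:
  template:
    spec:
      requirements:
        # Exclude the old c1/m1 families that ship with a swap volume.
        - key: karpenter.k8s.aws/instance-family
          operator: NotIn
          values: ["c1", "m1"]
```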
DaemonSets can result in deployment failures
In Karpenter versions before 0.5.3, DaemonSets were not properly considered when provisioning nodes, sometimes causing nodes to be deployed that could not meet both DaemonSet and workload requirements. This issue was resolved in 0.5.3 (PR #1155). If you are on a pre-0.5.3 version, a workaround is to configure your NodePool to only use larger instance types that you know will be big enough for the DaemonSet and the workload.
Unspecified resource requests cause scheduling/bin-pack failures
Karpenter simulates scheduling using pod resource requests, similar to kube-scheduler. Pods that do not specify resource requests can be bin-packed onto nodes that lack the capacity to actually run them. Specify resource requests on your workloads to avoid this.
Pods using Security Groups for Pods stuck in ContainerCreating for up to 30 minutes
Pods using Security Groups for Pods can remain stuck in ContainerCreating for up to 30 minutes before transitioning to Running. This is caused by an interaction between Karpenter and the amazon-vpc-resource-controller when a pod requests vpc.amazonaws.com/pod-eni resources. As a workaround, add the vpc.amazonaws.com/has-trunk-attached: "false" label to your NodePool spec and ensure your instance type requirements include instance types that support ENI trunking.
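A sketch of such a NodePool fragment; the instance-category values are an assumption, so confirm ENI trunking support for the types you allow:

```yaml
spec:
  template:
    metadata:
      labels:
        # Tell the vpc-resource-controller that no trunk ENI is attached yet.
        vpc.amazonaws.com/has-trunk-attached: "false"
    spec:
      requirements:
        # c, m, and r families generally support ENI trunking.
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
```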
Pods using PVCs can hit volume limits and fail to scale-up
If a pod uses a PVC backed by the in-tree provider kubernetes.io/aws-ebs or a PV with AWSElasticBlockStore, Karpenter cannot discover volume attachment limits and may schedule too many pods to a node. Migrate these volumes to the EBS CSI driver (ebs.csi.aws.com), whose attachment limits Karpenter can discover.

Race condition between the scheduler and CSINode

Due to a race condition in Kubernetes, the scheduler may assume a node can mount more volumes than it actually supports. Enforce topologySpreadConstraints and podAntiAffinity on workloads using PVCs to reduce co-location. Some CSI drivers also support a startupTaint that eliminates this race; configure these via startupTaints on your NodePool. For EBS:
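A sketch of the EBS startup taint on a NodePool; the taint key is per the aws-ebs-csi-driver documentation, so verify it against your driver version:

```yaml
spec:
  template:
    spec:
      startupTaints:
        # Removed by the EBS CSI driver once its node agent is ready,
        # so pods with volumes are not scheduled prematurely.
        - key: ebs.csi.aws.com/agent-not-ready
          effect: NoExecute
```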
CNI is unable to allocate IPs to pods
maxPods is greater than the node's supported pod density

The number of pods on a node is limited by the number of ENIs attachable to the instance type and the number of IPs per ENI. If maxPods in your EC2NodeClass kubeletConfiguration exceeds the supported IP count for a given instance type, the CNI will fail to assign an IP and pods will be stuck in ContainerCreating.

If you have enabled Security Groups per Pod, one ENI is reserved as the trunk interface, which further reduces the available IPs. Karpenter does not account for this reservation.

To resolve:
- Enable Prefix Delegation to increase allocatable IPs per ENI.
- Reduce your maxPods value to be within the instance type's pod density limit.
- Remove the maxPods value from kubeletConfiguration to rely on Karpenter and EKS AMI defaults.
- Set RESERVED_ENIS=1 in your Karpenter configuration when using Security Groups for Pods.
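For the last option, one way to set RESERVED_ENIS on an existing install; the namespace and deployment name are assumptions based on the standard install:

```shell
# Tell Karpenter to subtract one ENI (the trunk interface)
# from its pod-density calculations.
kubectl -n karpenter set env deployment/karpenter RESERVED_ENIS=1
```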
If the subnets used by your nodes run out of available IP addresses, the CNI cannot assign pod IPs and pods will be stuck in ContainerCreating until an IP is freed. To resolve:
- Use topologySpreadConstraints on topology.kubernetes.io/zone to spread pods and nodes across zones.
- Increase the IP address space (CIDR) for subnets in your EC2NodeClass.
- Use custom networking to assign separate IP spaces to pods and nodes.
- Run your EKS cluster on IPv6.
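A sketch of a zonal spread constraint on a workload; the app label is a placeholder:

```yaml
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-app   # placeholder: match your pod labels
```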
Windows pods failing with FailedCreatePodSandbox
Windows pods fail to launch with image pull error
Windows container images must match the host OS version. Use a nodeSelector to ensure containers are scheduled on a compatible host OS version. See Windows container version compatibility for details.
Windows pods unable to resolve DNS
This can happen when kube-proxy on Windows lacks cluster permissions; ensure the Windows node role is included in the eks:kube-proxy-windows RBAC group. kube-proxy runs as a process on the node and needs RBAC cluster permissions to access required resources. See the EKS Windows support docs for more information.
Karpenter incorrectly computes available resources for a node
The allocatable resources Karpenter computes for a node (nodeClaim.status.allocatable) may not always match the actual allocatable resources on the node (node.status.allocatable), due to memory reserved by the hypervisor and OS.

Karpenter uses ec2:DescribeInstanceTypes and a cache of observed memory capacity. For the first launch of a given instance type + AMI pair, the VM_MEMORY_OVERHEAD_PERCENT setting is used as a fallback (default: 7.5%). After a node is created, the actual memory capacity is cached and used for future launches of the same pair.

The default 7.5% value is tuned to avoid overestimation across most instance types, meaning Karpenter will typically slightly underestimate available memory. If you know the exact overhead for your instances, you can tune this value, but do so with caution: overestimating memory can cause Karpenter to launch nodes that are too small for your workloads.

To detect cases where Karpenter is overestimating resource availability, monitor the status conditions on the NodeClaim.
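One way to compare a NodeClaim's expected capacity against what the node actually registered; the names here are placeholders:

```shell
# Expected memory, as computed by Karpenter at launch:
kubectl get nodeclaim <nodeclaim-name> -o jsonpath='{.status.allocatable.memory}'
# Actual allocatable memory reported by the kubelet:
kubectl get node <node-name> -o jsonpath='{.status.allocatable.memory}'
```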
Karpenter is unable to satisfy topology spread constraints
When scheduling pods with TopologySpreadConstraints, Karpenter derives eligible domains from the pod's requirements, not from the compatible NodePools. This can result in Karpenter attempting to provision capacity in domains that no compatible NodePool can actually serve.

For example, if a pod has no zonal constraints but its only compatible NodePool is restricted to two out of three zones, Karpenter will succeed for the first two replicas but fail for any replica that requires placement in the third zone.

To resolve this, ensure that all eligible domains for a pod can be provisioned by compatible NodePools, or add matching zonal constraints to the pod spec.
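For example, constraining the pod to the zones its NodePool can actually provision; the zone names are placeholders:

```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              # Restrict spread domains to zones the NodePool serves.
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-west-2a", "us-west-2b"]
```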
'No instance type met the scheduling requirements or had a required offering'
- The pod has resource requests that require a minimum instance size, but the NodePool is restricted to an instance family or size that cannot satisfy them.
- DaemonSet resource requests are accounted for when evaluating instance compatibility and may push the minimum required size above what is available.
- The pod is restricted to a specific availability zone where the required capacity type is not available. This commonly happens with StatefulSet pods that had an EBS volume attached in a different AZ than the one currently being targeted.
Deprovisioning
Nodes not deprovisioned
Node initialization
To be eligible for deprovisioning, a node must have the karpenter.sh/initialized label set. If this label is absent, the node will not be deprovision-eligible. See the Nodes not initialized section for details.

Pod Disruption Budgets (PDBs)
Karpenter respects PDBs using a backoff retry eviction strategy. Pods that fail to shut down will block node deprovisioning.

karpenter.sh/do-not-disrupt annotation
If any pod on a node has the annotation karpenter.sh/do-not-disrupt: "true", Karpenter will not drain pods from or delete that node. To resume deprovisioning, remove the annotation from the pod.

Scheduling constraints (consolidation only)
Consolidation will not proceed if its scheduling simulation determines that the pods on a node cannot run on other nodes due to inter-pod affinity/anti-affinity, topology spread constraints, or other scheduling restrictions.

Node launch and readiness
Node not created
Nodes not initialized
Karpenter considers a node initialized when all of the following are true:
- Node readiness: the Ready condition type is True.
- Expected resources are registered: all expected resources from ec2:DescribeInstanceTypes appear in node.status.allocatable with a non-zero quantity.
- Startup taints are removed: all taints in .spec.template.spec.startupTaints of the NodePool have been removed from node.spec.taints.
Common examples of resources that fail to register:
- nvidia.com/gpu: a GPU instance launched, but the device plugin DaemonSet is not installed.
- vpc.amazonaws.com/pod-eni: the instance launched, but ENABLE_POD_ENI is set to false in the vpc-cni plugin, so the resource is never registered.
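One way to enable the pod-eni setting, assuming the VPC CNI runs as the standard aws-node DaemonSet in kube-system:

```shell
# Enable pod ENIs so vpc.amazonaws.com/pod-eni is registered on nodes.
kubectl -n kube-system set env daemonset aws-node ENABLE_POD_ENI=true
```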
Node NotReady
A node may register with the cluster but remain NotReady. Common causes include misconfigured permissions, security groups, or networking.

Check for CNI/IAM errors
Check for API server authorization errors
Verify the aws-auth ConfigMap in the kube-system namespace to ensure the Karpenter node role is mapped correctly, with a mapRoles entry for your KarpenterNodeRole that includes the system:bootstrappers and system:nodes groups and the username system:node:{{EC2PrivateDNSName}}.

Collect logs for further analysis
- UserData: /var_log/cloud-init-output.log and /var_log/cloud-init.log
- Kubelet logs: /kubelet/kubelet.log
- Networking pod logs: /var_log/aws-node
Nodes stuck in pending due to outdated CNI
If the CNI plugin on the node is outdated, bootstrap can fail; inspect /var/log/user-data.log on the instance for the relevant error output.
Node terminates before ready on failed encrypted EBS volume
Node is not deleted even though ttlSecondsUntilExpired is set or the node is empty
'Expected resource vpc.amazonaws.com/pod-eni didn't register on the node'
This error means the vpc.amazonaws.com/pod-eni resource was never reported on the node. You need to enable security groups for pods in the VPC CNI, which will cause this resource to be registered on nodes.
AWS Node Termination Handler (NTH) interactions
EC2NodeClass validation
Force validation refresh
Pricing
Stale pricing data on isolated subnet
Karpenter refreshes pricing data from the AWS pricing endpoint, which is unreachable from an isolated subnet, so pricing data goes stale. To prevent these failed lookups, set the AWS_ISOLATED_VPC environment variable (or the --aws-isolated-vpc flag) to true. See Environment variables and CLI flags for details.
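One way to set this on an existing install; the namespace and deployment name are assumptions based on the standard install:

```shell
# Disable pricing-endpoint lookups for isolated VPCs.
kubectl -n karpenter set env deployment/karpenter AWS_ISOLATED_VPC=true
```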