
Overview

This guide covers common issues you may encounter when operating the Kimbernetes cluster and how to resolve them.

Diagnostic Commands

Check Flux Status

# Check all Flux resources
flux check

# View Flux controller status
flux get all

# Check specific Kustomization
flux get kustomization flux-system

# Check HelmReleases
flux get helmreleases -A

View Logs

# View all Flux controller logs
flux logs --all-namespaces --level=error

# View specific controller logs
kubectl -n flux-system logs -l app=source-controller --tail=100
kubectl -n flux-system logs -l app=kustomize-controller --tail=100
kubectl -n flux-system logs -l app=helm-controller --tail=100

# Follow logs in real-time
flux logs --follow --kind=Kustomization --name=flux-system

Inspect Resources

# Describe a HelmRelease
kubectl -n flux-system describe helmrelease cert-manager

# Check pod status
kubectl get pods -A | grep -v Running

# View events
kubectl get events -A --sort-by='.lastTimestamp' | tail -20

Common Issues

Kustomization Not Reconciling

Symptoms:
  • Kustomization shows False status
  • Changes in Git are not applied to cluster
  • Error: “kustomize build failed”
Diagnosis:
flux get kustomization flux-system
kubectl -n flux-system describe kustomization flux-system
flux logs --kind=Kustomization --name=flux-system
Common Causes:
  1. Invalid YAML syntax
    # Validate YAML locally
    kubectl kustomize overlays/kimawesome/
    
    Fix syntax errors in your YAML files and commit.
  2. Missing resource files
    # Check kustomization.yaml references
    cat overlays/base/myapp/kustomization.yaml
    
    Ensure all referenced files exist.
  3. Namespace doesn’t exist
    Create the namespace first or add it to your kustomization:
    # overlays/base/myapp/namespace.yaml
    apiVersion: v1
    kind: Namespace
    metadata:
      name: myapp
    
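If you add the namespace through the kustomization, the referencing kustomization.yaml might look like this (file names are illustrative):

```yaml
# overlays/base/myapp/kustomization.yaml (hypothetical layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - namespace.yaml    # creates the myapp namespace before the workload
  - deployment.yaml
```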
Resolution:
  1. Fix the issue in your Git repository
  2. Commit and push changes
  3. Force reconciliation:
    flux reconcile kustomization flux-system --with-source
    
HelmRelease Install or Upgrade Fails

Symptoms:
  • HelmRelease status shows False
  • Error: “installation failed” or “upgrade failed”
  • Application not running
Diagnosis:
flux get helmreleases -A
kubectl -n flux-system describe helmrelease cert-manager
flux logs --kind=HelmRelease --name=cert-manager -n flux-system
Common Causes:
  1. Chart version not found
    Error: chart version "=1.99.0" not found
    
    Check available versions:
    helm search repo cert-manager --versions
    
    Update to a valid version in helm-release.yaml.
  2. HelmRepository not ready
    flux get sources helm -A
    
    Reconcile the repository:
    flux reconcile source helm cert-manager -n cert-manager
    
  3. Invalid Helm values
    # Test Helm values locally
    helm template myapp charts/myapp --values test-values.yaml
    
    Fix invalid values in the HelmRelease spec.
  4. Resource conflicts
    Error: rendered manifests contain a resource that already exists
    
    Check for duplicate resources:
    kubectl get all -A | grep myapp
    
    Delete conflicting resources or adjust the HelmRelease.
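For cause 1, the chart version is pinned on the HelmRelease spec. A minimal sketch (names, namespace, and version are illustrative, not this cluster's actual values):

```yaml
# Sketch of a HelmRelease with a pinned chart version
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: cert-manager
  namespace: flux-system
spec:
  interval: 10m
  chart:
    spec:
      chart: cert-manager
      version: "1.14.x"   # must match a version listed by 'helm search repo'
      sourceRef:
        kind: HelmRepository
        name: cert-manager
```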
Resolution:
  1. Fix the issue in the HelmRelease definition
  2. Commit and push
  3. Force reconciliation:
    flux reconcile helmrelease cert-manager -n flux-system
    
If stuck, uninstall and reinstall:
flux suspend helmrelease myapp -n flux-system
helm uninstall myapp -n myapp
flux resume helmrelease myapp -n flux-system

Flux Controllers Not Running

Symptoms:
  • No reconciliation happening
  • Flux pods in CrashLoopBackOff
  • Error: “connection refused” to Flux API
Diagnosis:
kubectl -n flux-system get pods
kubectl -n flux-system logs <pod-name>
flux check
Common Causes:
  1. Resource limits
    Pods being OOMKilled:
    kubectl -n flux-system describe pod <pod-name> | grep -A 5 "Last State"
    
    Increase memory limits in cluster/kimawesome/flux-system/gotk-components.yaml.
  2. Network policy blocking
    Check network policies:
    kubectl -n flux-system get networkpolicies
    
    Ensure allow-egress policy exists (already configured in Flux v2.7.5).
  3. Image pull issues
    kubectl -n flux-system describe pod <pod-name> | grep -A 5 "Events"
    
    Check image registry connectivity.
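For cause 1, the usual Flux pattern is to patch the controller Deployment from the flux-system kustomization rather than editing gotk-components.yaml by hand, so the change survives upgrades. A sketch, assuming the standard bootstrap layout (the memory value is illustrative):

```yaml
# cluster/kimawesome/flux-system/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - target:
      kind: Deployment
      name: kustomize-controller
    patch: |
      # use 'add' instead of 'replace' if no limit is currently set
      - op: replace
        path: /spec/template/spec/containers/0/resources/limits/memory
        value: 2Gi
```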
Resolution:
# Restart Flux controllers
flux suspend kustomization flux-system
kubectl -n flux-system delete pod -l app=source-controller
kubectl -n flux-system delete pod -l app=kustomize-controller
flux resume kustomization flux-system
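The allow-egress policy shipped with Flux has roughly this shape (reproduced from memory of gotk-components.yaml; verify against your cluster's copy):

```yaml
# Default Flux allow-egress policy (approximate sketch)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress
  namespace: flux-system
spec:
  podSelector: {}       # applies to all Flux controller pods
  egress:
    - {}                # allow all outbound traffic
  ingress:
    - from:
        - podSelector: {}   # allow traffic between Flux pods
  policyTypes:
    - Ingress
    - Egress
```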

Git Authentication Failures

Symptoms:
  • Error: “authentication required” or “permission denied”
  • GitRepository shows False status
  • No reconciliation from Git
Diagnosis:
flux get sources git
kubectl -n flux-system describe gitrepository flux-system
Resolution:
  1. Check secret exists:
    kubectl -n flux-system get secret flux-system
    
  2. Verify SSH key:
    kubectl -n flux-system get secret flux-system -o jsonpath='{.data.identity}' | base64 -d
    
  3. Test Git access manually:
    ssh -T git@github.com
  4. Regenerate deploy key:
    flux create secret git flux-system \
      --url=ssh://git@github.com/kim-ae/kimbernetes-k8s-flux \
      --namespace=flux-system
    
    Add the public key to GitHub repository deploy keys.

TLS Certificate Issues

Symptoms:
  • Ingress shows certificate errors
  • Error: “certificate not ready”
  • cert-manager pods failing
Diagnosis:
kubectl get certificates -A
kubectl describe certificate <cert-name> -n <namespace>
kubectl get certificaterequests -A
kubectl -n cert-manager logs -l app=cert-manager
Common Causes:
  1. DNS not propagated
    Wait for DNS to propagate:
    nslookup version-management.kim.tec.br
    
  2. HTTP01 challenge failed
    Check ingress is accessible:
    curl -v http://version-management.kim.tec.br/.well-known/acme-challenge/test
    
  3. Rate limit hit
    Let’s Encrypt rate limits reached. Wait or use staging:
    spec:
      issuerRef:
        name: letsencrypt-staging
    
Resolution:
  1. Delete failed certificate:
    kubectl delete certificate <cert-name> -n <namespace>
    
  2. Delete certificate request:
    kubectl delete certificaterequest --all -n <namespace>
    
  3. Let cert-manager retry automatically

Network Connectivity Issues

Symptoms:
  • Pods cannot reach external services
  • DNS resolution failing
  • Inter-pod communication broken
Diagnosis:
# Test DNS
kubectl run test --image=busybox --rm -it -- nslookup google.com

# Test external connectivity
kubectl run test --image=curlimages/curl --rm -it -- curl -v https://google.com

# Check Cilium status
cilium status
cilium connectivity test

# Check network policies
kubectl get networkpolicies -A
Resolution:
  1. Restart Cilium:
    kubectl -n kube-system rollout restart deployment/cilium-operator
    kubectl -n kube-system rollout restart daemonset/cilium
    
  2. Check IP forwarding:
    sysctl net.ipv4.ip_forward
    # Should be 1
    
  3. Verify CoreDNS:
    kubectl -n kube-system get pods -l k8s-app=kube-dns
    kubectl -n kube-system logs -l k8s-app=kube-dns
    
Sealed Secrets Not Decrypting

Symptoms:
  • SealedSecret exists but Secret not created
  • Error: “no key could decrypt secret”
Diagnosis:
kubectl get sealedsecrets -A
kubectl -n sealed-secrets logs -l app.kubernetes.io/name=sealed-secrets
Common Causes:
  1. Sealed secret encrypted with wrong key
    Re-encrypt with current cluster key:
    kubeseal --fetch-cert > pub-cert.pem
    kubectl create secret generic mysecret --dry-run=client -o yaml | \
      kubeseal --cert=pub-cert.pem -o yaml > sealed-secret.yaml
    
  2. Sealed secrets controller not ready
    kubectl -n sealed-secrets get pods
    flux reconcile helmrelease sealed-secrets -n sealed-secrets
    
Resolution: See Backup and Restore for recovering the private key.

Resource Debugging

Pod Failing to Start

# Check pod status
kubectl get pod <pod-name> -n <namespace>

# Describe pod for events
kubectl describe pod <pod-name> -n <namespace>

# Check logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous  # Previous container

# Check resource constraints
kubectl top pod <pod-name> -n <namespace>

Service Not Accessible

# Check service
kubectl get svc <service-name> -n <namespace>

# Check endpoints
kubectl get endpoints <service-name> -n <namespace>

# Test from within cluster
kubectl run test --image=curlimages/curl --rm -it -- \
  curl -v http://<service-name>.<namespace>.svc.cluster.local

Ingress Not Working

# Check ingress
kubectl get ingress -A
kubectl describe ingress <ingress-name> -n <namespace>

# Check gateway
kubectl get gateway -A
kubectl describe gateway <gateway-name> -n <namespace>

# Check if service is accessible
kubectl port-forward svc/<service-name> 8080:80 -n <namespace>
curl http://localhost:8080

Performance Issues

High Memory Usage

# Check resource usage
kubectl top nodes
kubectl top pods -A --sort-by=memory

# Identify memory leaks
kubectl describe node <node-name> | grep -A 10 "Allocated resources"

Slow Reconciliation

# Check Flux controller resources
kubectl -n flux-system top pods

# Increase controller resources in gotk-components.yaml
# Adjust interval in Kustomizations:
spec:
  interval: 5m  # Increase from 1m if too frequent
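In context, the interval lives at the top level of the Kustomization spec. A minimal sketch, with path and names illustrative:

```yaml
# Sketch of a Flux Kustomization with a relaxed reconcile interval
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 5m              # how often to reconcile against the source
  path: ./overlays/kimawesome
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```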

Emergency Procedures

Rollback a Change

# Find the bad commit
git log --oneline

# Revert the commit
git revert <commit-hash>
git push origin main

# Or reset to previous commit (dangerous)
git reset --hard <good-commit-hash>
git push --force origin main

# Force immediate reconciliation
flux reconcile kustomization flux-system --with-source

Bypass Flux Temporarily

# Suspend Flux reconciliation
flux suspend kustomization flux-system

# Make manual changes
kubectl apply -f emergency-fix.yaml

# Resume when ready
flux resume kustomization flux-system
Manual changes will be reverted by Flux once resumed. Always update Git to make changes permanent.

Getting Help

  • Check Flux documentation
  • View Flux GitHub issues
  • Enable debug logging:
    flux logs --all-namespaces --level=debug
    
