Overview

This guide provides systematic troubleshooting approaches for common vCluster issues. Use the diagnostic commands and solutions to quickly identify and resolve problems.

General Troubleshooting Approach

1. Identify the Issue

Clearly define what’s not working:
  • What were you trying to do?
  • What happened instead?
  • When did it start?
  • Has it ever worked?

2. Gather Information

Collect diagnostic data:
vcluster debug collect my-vcluster --namespace production

3. Check Basics

Verify fundamental components:
vcluster list
kubectl get pods -n production -l release=my-vcluster
kubectl logs -n production -l app=vcluster,release=my-vcluster --tail=100

4. Review Changes

What changed recently?
  • Configuration updates
  • Version upgrades
  • Infrastructure changes
  • New deployments

5. Isolate the Problem

Test components individually:
  • Host cluster connectivity
  • Virtual cluster API server
  • Resource syncing
  • Network policies

6. Apply Solution

Try fixes from least to most invasive:
  1. Configuration changes
  2. Pod restarts
  3. Resource recreation
  4. Full restoration from backup
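
The "Check Basics" commands from step 3 can be wrapped in a small helper that stops at the first failing command. This is a minimal sketch in POSIX shell; the function name `vcluster_basic_checks` is hypothetical, and the label selectors are the ones used in the commands above.

```shell
# Hypothetical helper: run the "Check Basics" commands in order and stop at the
# first one that fails. Arguments: <namespace> <vcluster-name>
vcluster_basic_checks() {
  local ns="$1" name="$2"

  echo "== vcluster list"
  vcluster list || { echo "FAILED: vcluster list" >&2; return 1; }

  echo "== control plane pods"
  kubectl get pods -n "$ns" -l "release=$name" \
    || { echo "FAILED: kubectl get pods" >&2; return 1; }

  echo "== recent control plane logs"
  kubectl logs -n "$ns" -l "app=vcluster,release=$name" --tail=100 \
    || { echo "FAILED: kubectl logs" >&2; return 1; }

  echo "basic checks completed"
}
```

Call it as `vcluster_basic_checks production my-vcluster`. A clean run only means the commands succeeded; the pod status and log output still need to be read.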

Common Issues and Solutions

Connection and Access Issues

Symptoms:
  • vcluster connect hangs or times out
  • Connection refused errors
  • Authentication failures
Diagnosis:
# Check if vCluster pods are running
kubectl get pods -n production -l release=my-vcluster

# Check service endpoints
kubectl get svc -n production -l release=my-vcluster
kubectl get endpoints -n production

# Check port-forward manually
kubectl port-forward -n production svc/my-vcluster 8443:443
Solutions:
  1. Restart vCluster pods:
    kubectl rollout restart statefulset -n production my-vcluster
    
  2. Check service account permissions:
    kubectl get sa -n production
    kubectl describe sa vc-my-vcluster -n production
    
  3. Verify network policies:
    kubectl get networkpolicy -n production
    kubectl describe networkpolicy -n production
    
  4. Check for resource constraints:
    kubectl describe pod -n production -l release=my-vcluster
    kubectl top pods -n production
    
  5. Try reconnecting with verbose output:
    vcluster connect my-vcluster --namespace production --debug
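
After a restart (solution 1), reconnecting too early tends to reproduce the same connection errors. The sketch below restarts the StatefulSet and waits for the rollout before returning. It assumes, as the commands above do, that the control plane runs as a StatefulSet named after the vCluster; `restart_and_wait` is a hypothetical name.

```shell
# Hypothetical helper: restart the vCluster StatefulSet and block until the
# rollout finishes. Arguments: <namespace> <vcluster-name> [timeout, default 180s]
restart_and_wait() {
  local ns="$1" name="$2" timeout="${3:-180s}"

  kubectl rollout restart statefulset "$name" -n "$ns" || return 1

  # rollout status waits until the StatefulSet is rolled out or the timeout expires
  if kubectl rollout status "statefulset/$name" -n "$ns" --timeout="$timeout"; then
    echo "rollout complete"
  else
    echo "rollout did not finish within $timeout" >&2
    return 1
  fi
}
```

For example, `restart_and_wait production my-vcluster && vcluster connect my-vcluster --namespace production` only attempts the connection once the pods are back.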
    
Symptoms:
  • Commands hang indefinitely
  • Timeouts after several seconds
  • Intermittent connectivity
Diagnosis:
# Test API server health
kubectl get --raw /healthz
kubectl get --raw /readyz

# Check API server logs
kubectl logs -n production -l app=vcluster,release=my-vcluster | grep -i error

# Test specific operations
time kubectl get nodes
time kubectl get pods
Solutions:
  1. Increase timeout:
    kubectl get pods --request-timeout=60s
    
  2. Check etcd health:
    # Access vCluster pod
    kubectl exec -it -n production my-vcluster-0 -- sh
    
    # Inside pod, check etcd
    ETCDCTL_API=3 etcdctl --endpoints=https://localhost:2379 \
      --cert=/pki/etcd/tls.crt \
      --key=/pki/etcd/tls.key \
      --cacert=/pki/etcd/ca.crt \
      endpoint health
    
  3. Reduce cluster load:
    • Scale down non-critical workloads
    • Check for resource-intensive operations
    • Review API server logs for high-frequency requests
  4. Increase API server resources:
    # vcluster.yaml
    controlPlane:
      statefulSet:
        resources:
          requests:
            cpu: 200m
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 2Gi
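
To compare API latency before and after a change, a single `time kubectl ...` run is noisy. The following sketch averages several runs. `measure_latency` is a hypothetical helper, and it assumes GNU `date` (the `%N` nanosecond format is not available in the BSD/macOS `date`).

```shell
# Hypothetical helper: run a command N times and print the average wall-clock
# time in milliseconds plus the number of failed runs.
# Arguments: <runs> <command> [args...]
measure_latency() {
  local runs="$1" total=0 failures=0 i=0 start end
  shift
  [ "$runs" -gt 0 ] || { echo "usage: measure_latency <runs> <command...>" >&2; return 2; }

  while [ "$i" -lt "$runs" ]; do
    start=$(date +%s%N)   # nanoseconds since the epoch (GNU date)
    "$@" >/dev/null 2>&1 || failures=$((failures + 1))
    end=$(date +%s%N)
    total=$((total + (end - start) / 1000000))
    i=$((i + 1))
  done

  echo "avg_ms=$((total / runs)) failures=$failures"
}
```

For example, `measure_latency 5 kubectl get --raw /readyz` run while connected to the virtual cluster gives a baseline to compare against after raising the resource requests.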
    
Symptoms:
  • “Unauthorized” or “Forbidden” errors
  • Permission denied messages
  • RBAC violations
Diagnosis:
# Check current user
kubectl auth whoami

# Test permissions
kubectl auth can-i get pods
kubectl auth can-i create deployments
kubectl auth can-i '*' '*' --all-namespaces

# View role bindings
kubectl get rolebindings,clusterrolebindings -o wide
Solutions:
  1. Verify kubeconfig context:
    kubectl config current-context
    kubectl config view
    
  2. Reconnect to vCluster:
    vcluster disconnect
    vcluster connect my-vcluster --namespace production
    
  3. Check certificate validity:
    # View certificate details
    kubectl config view --raw -o jsonpath='{.users[0].user.client-certificate-data}' | \
      base64 -d | openssl x509 -text -noout
    
  4. Grant necessary permissions:
    # Create role binding
    kubectl create rolebinding dev-admin \
      --clusterrole=admin \
      --user=user@example.com \
      --namespace=default
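
The individual `kubectl auth can-i` checks from the diagnosis step can be run as a batch. This is a sketch only: `check_permissions` is a hypothetical helper that takes "verb resource" pairs and relies on `kubectl auth can-i` exiting non-zero when the answer is "no".

```shell
# Hypothetical helper: print one line per "verb resource" pair and return
# non-zero if any of them is denied.
# Example: check_permissions "get pods" "create deployments"
check_permissions() {
  local pair verb resource answer denied=0

  for pair in "$@"; do
    verb=${pair%% *}      # text before the first space
    resource=${pair#* }   # text after the first space
    # can-i prints yes/no and exits non-zero when the action is not allowed
    answer=$(kubectl auth can-i "$verb" "$resource" 2>/dev/null) || denied=$((denied + 1))
    printf '%-8s %-14s %s\n' "$verb" "$resource" "${answer:-no}"
  done

  [ "$denied" -eq 0 ]
}
```

Run it in the vCluster context before and after creating a role binding to confirm that the binding had the intended effect.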
    

Resource Syncing Issues

Symptoms:
  • Resources created in vCluster don’t appear in host namespace
  • Pods stay pending indefinitely
  • Services not accessible from host
Diagnosis:
# Check syncer logs
kubectl logs -n production -l app=vcluster,release=my-vcluster -c syncer

# Compare resources
vcluster connect my-vcluster --namespace production
kubectl get pods -o wide
vcluster disconnect
kubectl get pods -n production

# Check sync configuration
vcluster describe my-vcluster --config-only
Solutions:
  1. Verify sync configuration:
    # vcluster.yaml
    sync:
      toHost:
        pods:
          enabled: true
        services:
          enabled: true
        persistentVolumeClaims:
          enabled: true
    
  2. Restart syncer:
    kubectl delete pod -n production -l app=vcluster,release=my-vcluster
    
  3. Check service account permissions:
    kubectl auth can-i create pods \
      --as=system:serviceaccount:production:vc-my-vcluster \
      -n production
    
  4. Review resource quotas:
    kubectl get resourcequota -n production
    kubectl describe resourcequota -n production
    
  5. Check for naming conflicts:
    # Synced resources have specific naming patterns
    kubectl get pods -n production -o jsonpath='{.items[*].metadata.name}'
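
When comparing the two pod lists, it helps to know which host name to look for. The sketch below assumes the default name translation, in which a synced pod appears in the host namespace as `<pod>-x-<virtual namespace>-x-<vcluster name>`. Names that would exceed 63 characters are shortened by the syncer in a way this hypothetical helper does not reproduce.

```shell
# Hypothetical helper: print the expected host-side name of a synced pod.
# Arguments: <pod-name> <virtual-namespace> <vcluster-name>
# Assumes the default translation: <pod>-x-<virtual namespace>-x-<vcluster>
host_pod_name() {
  local pod="$1" vns="$2" vcluster="$3"
  local translated="${pod}-x-${vns}-x-${vcluster}"

  if [ "${#translated}" -gt 63 ]; then
    # Over-long names are shortened by the syncer; this helper cannot predict them
    echo "translated name exceeds 63 characters" >&2
    return 1
  fi
  echo "$translated"
}
```

For example, `kubectl get pod "$(host_pod_name nginx default my-vcluster)" -n production` looks up the host copy of the virtual pod `default/nginx`.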
    
Symptoms:
  • Pods created but never start
  • Status remains “Pending”
  • Containers not running
Diagnosis:
# Check pod status and events
kubectl describe pod <pod-name>
kubectl get events --sort-by='.lastTimestamp'

# Check node resources
kubectl top nodes
kubectl describe nodes

# Check PVC status
kubectl get pvc
kubectl describe pvc <pvc-name>
Solutions:
  1. Insufficient resources:
    # Scale down other workloads or add nodes
    kubectl scale deployment other-app --replicas=0
    
  2. PVC not bound:
    # Check storage class
    kubectl get storageclass
    
    # Create PVC manually if needed
    kubectl apply -f pvc.yaml
    
  3. Image pull failures:
    # Check image pull secrets
    kubectl get secrets
    kubectl describe pod <pod-name> | grep -A 10 "Events"
    
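    # Test image accessibility with a throwaway pod (--dry-run=client never pulls the image)
    kubectl run image-test --image=<image> --restart=Never
    kubectl get pod image-test
    kubectl delete pod image-test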
    
  4. Node selectors/taints:
    # Check node selectors and taints
    kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
    
    # Remove taint if needed
    kubectl taint nodes <node-name> key:NoSchedule-
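
The four causes above usually show up as recognizable phrases in the Events section of `kubectl describe pod`. The sketch below maps those phrases to a cause. The matched strings are common scheduler and kubelet messages, but their wording varies between Kubernetes versions, so treat the result as a hint; `classify_pending` is a hypothetical name.

```shell
# Hypothetical helper: read `kubectl describe pod` output on stdin and print
# the most likely reason the pod is stuck in Pending.
classify_pending() {
  local text
  text=$(cat)

  case "$text" in
    *"Insufficient cpu"*|*"Insufficient memory"*)
      echo "insufficient-resources" ;;
    *"unbound immediate PersistentVolumeClaims"*)
      echo "pvc-not-bound" ;;
    *ErrImagePull*|*ImagePullBackOff*)
      echo "image-pull-failure" ;;
    *"untolerated taint"*|*"didn't tolerate"*|*"node affinity"*|*"node selector"*)
      echo "node-selector-or-taint" ;;
    *)
      echo "unknown" ;;
  esac
}
```

Usage: `kubectl describe pod <pod-name> | classify_pending`.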
    
Symptoms:
  • Services created but not reachable
  • Connection refused or timeout
  • DNS resolution failures
Diagnosis:
# Check service and endpoints
kubectl get svc,endpoints
kubectl describe svc <service-name>

# Test DNS resolution
kubectl run test-dns --image=busybox --rm -it -- nslookup <service-name>

# Check network policies
kubectl get networkpolicy
kubectl describe networkpolicy
Solutions:
  1. Verify service sync:
    # vcluster.yaml
    sync:
      toHost:
        services:
          enabled: true
    
  2. Check endpoints:
    kubectl get endpoints <service-name>
    # Should show pod IPs
    
  3. Test connectivity:
    # From within cluster
    kubectl run test-curl --image=curlimages/curl --rm -it -- \
      curl http://<service-name>:<port>
    
  4. Review network policies:
    # Temporarily remove network policies to test
    # (keep their manifests at hand so they can be re-applied afterwards)
    kubectl delete networkpolicy --all
    # Test connectivity, then restore policies
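
The endpoint check from solution 2 can be scripted so that an empty endpoint list is reported as a failure instead of being read off manually. A minimal sketch, with `service_has_endpoints` as a hypothetical name:

```shell
# Hypothetical helper: succeed only if the Service has at least one ready
# endpoint address. Arguments: <service-name> [namespace, default "default"]
service_has_endpoints() {
  local svc="$1" ns="${2:-default}" ips

  ips=$(kubectl get endpoints "$svc" -n "$ns" \
    -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null) || return 2

  if [ -n "$ips" ]; then
    echo "$svc: $ips"
  else
    echo "$svc: no ready endpoints (check the selector and pod readiness)" >&2
    return 1
  fi
}
```

A return code of 1 means the Service exists but selects no ready pods; 2 means the endpoints object could not be read at all.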
    

Performance Issues

Symptoms:
  • Slow response times
  • OOMKilled pods
  • Throttling warnings
Diagnosis:
# Check current usage
kubectl top pods -n production -l release=my-vcluster
kubectl top nodes

# Get resource limits
kubectl get pod -n production -l release=my-vcluster \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources}{"\n"}{end}'

# Snapshot process memory usage (repeat over time to spot a leak)
kubectl exec -it -n production my-vcluster-0 -- top -b -n 1
Solutions:
  1. Increase resource limits:
    # vcluster.yaml
    controlPlane:
      statefulSet:
        resources:
          limits:
            cpu: 2000m
            memory: 4Gi
          requests:
            cpu: 500m
            memory: 1Gi
    
  2. Enable resource limiting in virtual cluster:
    policies:
      resourceQuota:
        enabled: true
        quota:
          requests.cpu: "10"
          requests.memory: 20Gi
          limits.cpu: "20"
          limits.memory: 40Gi
    
  3. Optimize workload placement:
    # Use pod anti-affinity to spread replicas across nodes
    kubectl patch deployment app \
      -p '{"spec":{"template":{"spec":{"affinity":{"podAntiAffinity":{"preferredDuringSchedulingIgnoredDuringExecution":[{"weight":100,"podAffinityTerm":{"labelSelector":{"matchExpressions":[{"key":"app","operator":"In","values":["app"]}]},"topologyKey":"kubernetes.io/hostname"}}]}}}}}}'
    
  4. Profile and optimize:
    # Enable profiling
    kubectl port-forward -n production my-vcluster-0 6060:6060
    # Access http://localhost:6060/debug/pprof/
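
To judge how close a pod is to its memory limit, compare the figure from `kubectl top pods` with the configured limit. The sketch below does the arithmetic for whole-number quantities with `Ki`, `Mi`, or `Gi` suffixes; fractional values and other suffixes are not handled. Both function names are hypothetical.

```shell
# Hypothetical helper: convert a memory quantity such as 512Mi or 2Gi to MiB.
# Only whole numbers with a Ki, Mi, or Gi suffix are supported.
to_mib() {
  local q="$1"
  case "$q" in
    *Gi) echo $(( ${q%Gi} * 1024 )) ;;
    *Mi) echo $(( ${q%Mi} )) ;;
    *Ki) echo $(( ${q%Ki} / 1024 )) ;;
    *)   echo "unsupported quantity: $q" >&2; return 1 ;;
  esac
}

# Hypothetical helper: print usage as a whole-number percentage of the limit.
# Arguments: <used> <limit>, for example: memory_usage_percent 1536Mi 2Gi
memory_usage_percent() {
  local used limit
  used=$(to_mib "$1") || return 1
  limit=$(to_mib "$2") || return 1
  [ "$limit" -gt 0 ] || return 1
  echo $(( used * 100 / limit ))
}
```

For example, `memory_usage_percent 1536Mi 2Gi` prints `75`. Sustained values close to 100 are a reason to raise the limit before the pod is OOMKilled.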
    
Symptoms:
  • kubectl commands take long to complete
  • API timeouts
  • Unresponsive control plane
Diagnosis:
# Measure API latency
time kubectl get nodes
time kubectl get pods --all-namespaces

# Check API server metrics
kubectl get --raw /metrics | grep apiserver_request_duration

# Check etcd performance
kubectl logs -n production my-vcluster-0 | grep -i "etcd"
Solutions:
  1. Increase API server resources (see above)
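  2. Optimize the backing store (choose one option; do not enable both):
    # vcluster.yaml, option A: embedded etcd
    controlPlane:
      backingStore:
        etcd:
          embedded:
            enabled: true
            migrateFromDeployedEtcd: true
    
    # vcluster.yaml, option B: external database
    controlPlane:
      backingStore:
        database:
          external:
            enabled: true
            dataSource: postgres://...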
    
  3. Reduce API load:
    # Find high-frequency API callers
    kubectl logs -n production my-vcluster-0 | \
      grep "requestInfo" | \
      awk '{print $NF}' | sort | uniq -c | sort -rn | head -20
    
  4. Disable service links on the control plane pod to reduce injected environment variables:
    controlPlane:
      statefulSet:
        enableServiceLinks: false
    

Stability Issues

Symptoms:
  • Pods restarting repeatedly
  • CrashLoopBackOff status
  • High restart count
Diagnosis:
# Check restart count
kubectl get pods -n production -l release=my-vcluster

# View crash logs
kubectl logs -n production my-vcluster-0 --previous

# Check for OOM kills
kubectl describe pod -n production my-vcluster-0 | grep -A 5 "Last State"

# Review events
kubectl get events -n production --sort-by='.lastTimestamp' | grep my-vcluster
Solutions:
  1. OOM kills - increase memory:
    controlPlane:
      statefulSet:
        resources:
          limits:
            memory: 4Gi
    
  2. Liveness probe too aggressive:
    controlPlane:
      statefulSet:
        probes:
          livenessProbe:
            initialDelaySeconds: 60
            periodSeconds: 20
            failureThreshold: 5
    
  3. Application errors - check logs:
    kubectl logs -n production my-vcluster-0 --previous | tail -100
    
  4. Resource contention:
    # Check node pressure
    kubectl describe nodes | grep -A 5 "Conditions"
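
Grepping `kubectl describe` output for "Last State" works, but the same information is available in structured form from the pod status. The sketch below lists each container with its restart count and last termination reason, then filters for OOM kills. Both function names are hypothetical.

```shell
# Hypothetical helper: print "<container>\t<restarts>\t<last termination reason>"
# for every container in the pod. Arguments: <namespace> <pod-name>
container_restarts() {
  local ns="$1" pod="$2"
  kubectl get pod "$pod" -n "$ns" -o \
    jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# Hypothetical helper: read that tab-separated output on stdin and print the
# names of containers whose last termination reason was OOMKilled.
oom_killed_containers() {
  awk -F'\t' '$3 == "OOMKilled" { print $1 }'
}
```

Usage: `container_restarts production my-vcluster-0 | oom_killed_containers`. Any container name printed points to solution 1; an empty result with a high restart count points to the other causes.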
    
Symptoms:
  • Resources disappearing
  • Configuration resets
  • State inconsistencies
Diagnosis:
# Check PVC status
kubectl get pvc -n production
kubectl describe pvc -n production

# Verify backup storage
kubectl logs -n production my-vcluster-0 | grep -i "backup\|snapshot\|etcd"

# Check for volume mount issues
kubectl describe pod -n production my-vcluster-0 | grep -A 10 "Volumes"
Solutions:
  1. Restore from backup:
    vcluster restore my-vcluster \
      oci://ghcr.io/my-org/backups:latest \
      --namespace production
    
  2. Fix PVC issues:
    # Check storage class
    kubectl get storageclass
    kubectl describe storageclass
    
    # Recreate PVC if corrupted (this permanently deletes the data on the volume)
    kubectl delete pvc data-my-vcluster-0 -n production
    kubectl rollout restart statefulset my-vcluster -n production
    
  3. Enable persistent storage:
    controlPlane:
      backingStore:
        etcd:
          embedded:
            enabled: true
      statefulSet:
        persistence:
          volumeClaim:
            enabled: true
            size: 10Gi
            storageClass: fast-ssd
    

Advanced Debugging

Enable Debug Logging

# vcluster.yaml
controlPlane:
  statefulSet:
    env:
    - name: DEBUG
      value: "true"
    - name: LOG_LEVEL
      value: "debug"

Interactive Debugging Shell

# Shell into control plane pod
vcluster debug shell my-vcluster --namespace production

# Or use kubectl directly
kubectl exec -it -n production my-vcluster-0 -- /bin/sh

Network Debugging

# Deploy debug pod
kubectl run debug --image=nicolaka/netshoot -it --rm -- /bin/bash

# Inside debug pod:
ping <service-name>
nslookup <service-name>
curl http://<service-name>:<port>
traceroute <service-name>

Collect Comprehensive Debug Info

# Generate debug bundle
vcluster debug collect my-vcluster \
  --namespace production \
  --output-filename debug-$(date +%Y%m%d-%H%M%S).tar.gz

# Extract and review
tar -xzf debug-*.tar.gz
cd debug/
ls -R

Getting Help

If you’re still experiencing issues:

GitHub Issues

Search or create an issue: github.com/loft-sh/vcluster/issues

Slack Community

Join the community: vcluster.com/slack

Documentation

Browse docs: vcluster.com/docs

Support

Enterprise support: Contact your account team

When Reporting Issues

Include:
  1. Environment details:
    • vCluster version
    • Kubernetes version (host and virtual)
    • Cloud provider/platform
    • Installation method (Helm, Platform, etc.)
  2. Reproduction steps:
    • What you did
    • What you expected
    • What actually happened
  3. Debug information:
    • Output of vcluster debug collect
    • Relevant logs
    • Configuration files (sanitized)
    • Error messages
  4. Attempted solutions:
    • What you’ve tried
    • Results of each attempt

Next Steps

Monitoring

Set up monitoring to catch issues early

Managing vClusters

Return to general management operations
