This guide covers common issues you may encounter when deploying and managing Shipyard infrastructure, along with their solutions.
Tailscale Issues
Tailscale Not Connecting
Symptoms:
- Subnet router doesn’t appear in Tailscale admin console
- Device shows as offline
- Cannot ping VPC private IPs
Solutions:
Check instance has internet access
Verify the Tailscale router instance is in a public subnet with internet gateway access:
# SSH to the instance (if possible)
aws ec2-instance-connect ssh --instance-id i-xxxxx
# Test internet connectivity
ping 8.8.8.8
Verify auth key is valid
Check that your auth key:
- Hasn’t expired
- Is properly set in environment variables
- Has correct permissions and tags
Create a new key at Tailscale Admin if needed.
Check Tailscale service logs
# SSH to the instance
sudo journalctl -u tailscaled -f
# Check user-data script execution
cat /var/log/user-data.log
Restart Tailscale service
sudo systemctl restart tailscaled
sudo tailscale up --authkey=$TAILSCALE_AUTH_KEY --advertise-routes=10.0.0.0/16
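Before re-running `tailscale up`, it can help to confirm that the auth key variable is actually populated. The sketch below is a hypothetical helper (`check_authkey` is not part of Shipyard or Tailscale). It assumes that auth keys begin with `tskey-auth-`, which matches keys generated by the current admin console but may not hold for older keys. It cannot detect an expired or revoked key.

```shell
# Hypothetical helper: sanity-check a Tailscale auth key before using it.
# Assumption: current auth keys start with "tskey-auth-".
check_authkey() {
  local key="${1:-}"
  if [ -z "$key" ]; then
    echo "missing: auth key is empty or unset"
    return 1
  fi
  case "$key" in
    tskey-auth-*) echo "ok: key has the expected prefix" ;;
    *) echo "suspect: key does not start with tskey-auth-"; return 1 ;;
  esac
}

# Usage: check_authkey "$TAILSCALE_AUTH_KEY"
```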
Subnet Routes Not Working
Symptoms:
- Can see subnet router in Tailscale admin
- Cannot ping VPC private IPs
- kubectl cannot connect to EKS
Solutions:
Verify routes are approved
- Go to Tailscale Machines
- Find your subnet router
- Check that subnet routes are shown and approved
- If not approved, click “Review” and approve them manually
Ensure your Tailscale ACL includes:
{
  "autoApprovers": {
    "routes": {
      "10.0.0.0/8": ["tag:aws-router"],
      "172.16.0.0/12": ["tag:aws-router"],
      "192.168.0.0/16": ["tag:aws-router"]
    }
  },
  "tagOwners": {
    "tag:aws-router": ["autogroup:admin"]
  }
}
Update the policy on the Tailscale ACLs page.
Check that the subnet router security group allows:
- Outbound: All traffic to 0.0.0.0/0
- Inbound: All traffic from VPC CIDR (10.0.0.0/16)
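When routes are approved but a private IP is still unreachable, one thing worth ruling out is that the target address lies outside the advertised CIDR. The following is a minimal sketch with hypothetical helper names (`ip_to_int`, `ip_in_cidr`); it handles plain dotted-quad IPv4 only and assumes octets are written without leading zeros.

```shell
# Convert a dotted-quad IPv4 address to a single integer.
ip_to_int() {
  local IFS=.
  # Unquoted on purpose: split the address into its four octets.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed (exit 0) if the IP in $1 falls inside the CIDR in $2.
ip_in_cidr() {
  local ip_int net_int bits mask
  ip_int=$(ip_to_int "$1")
  net_int=$(ip_to_int "${2%/*}")
  bits="${2#*/}"
  if [ "$bits" -eq 0 ]; then
    mask=0
  else
    # Build a 32-bit netmask with the top $bits bits set.
    mask=$(( (4294967295 << (32 - $bits)) & 4294967295 ))
  fi
  [ $(( $ip_int & $mask )) -eq $(( $net_int & $mask )) ]
}

# Usage: ip_in_cidr 10.0.1.10 10.0.0.0/16 && echo "covered by advertised route"
```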
EKS Issues
EKS API Not Accessible
Symptoms:
- kubectl commands time out
- “Unable to connect to the server” errors
- Connection refused errors
Solutions:
Verify Tailscale connection
tailscale status
# Test VPC connectivity
ping 10.0.1.10
Update kubeconfig
aws eks update-kubeconfig --name dev-eks-cluster --region us-east-2
Verify AWS credentials
aws sts get-caller-identity
Ensure the returned identity has EKS access.
Check cluster endpoint
aws eks describe-cluster --name dev-eks-cluster --region us-east-2 \
  --query 'cluster.endpoint' --output text
Verify this is a private endpoint within your VPC.
Verify security group rules
Check that the EKS cluster security group allows:
- Port 443 from VPC CIDR
- All traffic from node security group
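The endpoint URL alone does not show whether private access is enabled. The same `describe-cluster` call also returns the `endpointPrivateAccess` and `endpointPublicAccess` flags. The sketch below wraps that query in a hypothetical helper (`endpoint_access_mode` is an illustrative name); it assumes the AWS CLI's text output renders the two booleans on one line.

```shell
# Hypothetical helper: report how the EKS API endpoint is exposed.
endpoint_access_mode() {
  local cluster="$1" region="$2" flags
  flags=$(aws eks describe-cluster --name "$cluster" --region "$region" \
    --query 'cluster.resourcesVpcConfig.[endpointPrivateAccess,endpointPublicAccess]' \
    --output text) || return 1
  # Assumption: text output looks like "True False" (private, public).
  set -- $flags
  case "${1:-}/${2:-}" in
    [Tt]rue/[Ff]alse) echo "private-only" ;;
    [Tt]rue/[Tt]rue)  echo "public-and-private" ;;
    [Ff]alse/[Tt]rue) echo "public-only" ;;
    *) echo "unknown: $flags"; return 1 ;;
  esac
}

# Usage: endpoint_access_mode dev-eks-cluster us-east-2
```

A result of `private-only` means kubectl can reach the API only through the VPC, so the Tailscale subnet route has to be working first.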
Pods Not Starting
Symptoms:
- Pods stuck in Pending state
- ImagePullBackOff errors
- CrashLoopBackOff errors
Solutions:
kubectl describe pod <pod-name> -n <namespace>
Look for errors in the Events section.
# Check node resources
kubectl top nodes
# Check pod resource requests
kubectl describe pod <pod-name> -n <namespace> | grep -A5 Requests
If nodes are at capacity, scale your node group.
# Check if image exists
docker pull <image-name>
# Verify imagePullSecrets if using private registry
kubectl get secrets -n <namespace>
# Check pod logs
kubectl logs <pod-name> -n <namespace>
# Check previous instance logs if pod is restarting
kubectl logs <pod-name> -n <namespace> --previous
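The commands above map roughly onto the three symptoms. As a rough illustration, the hypothetical helper below (`next_steps` is not a kubectl feature) prints the commands that usually apply to a given pod status. The mapping is an assumption based on the symptoms listed in this section, not an exhaustive rule.

```shell
# Hypothetical helper: suggest diagnostic commands for a pod status.
next_steps() {
  local status="$1" pod="$2" ns="$3"
  case "$status" in
    Pending)
      # Usually scheduling or capacity: inspect events and node usage.
      echo "kubectl describe pod $pod -n $ns"
      echo "kubectl top nodes" ;;
    ImagePullBackOff|ErrImagePull)
      # Usually a missing image or missing registry credentials.
      echo "kubectl describe pod $pod -n $ns"
      echo "kubectl get secrets -n $ns" ;;
    CrashLoopBackOff)
      # The container starts and exits: read the previous run's logs.
      echo "kubectl logs $pod -n $ns --previous" ;;
    *)
      echo "kubectl describe pod $pod -n $ns" ;;
  esac
}

# Usage: next_steps CrashLoopBackOff <pod-name> <namespace>
```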
Vault Issues
Vault Not Initializing
Symptoms:
- Vault pods show 0/1 ready
- vault status shows sealed
- Initialization fails
Solutions:
Check pod status
kubectl get pods -n vault
kubectl describe pod vault-0 -n vault
Check pod logs
kubectl logs -n vault vault-0
Look for errors related to KMS or DynamoDB.
Verify KMS key permissions
Ensure the EKS node IAM role has the following permissions on the KMS key used for Vault auto-unseal:
- kms:Decrypt
- kms:Encrypt
- kms:DescribeKey
Verify DynamoDB access
# Check if table exists
aws dynamodb describe-table --table-name vault-storage-dev
Ensure the node IAM role has DynamoDB permissions.
Manual initialization (if needed)
# Run the init command inside the Vault pod
kubectl exec -n vault vault-0 -- vault operator init
# Save the output securely!
Vault Sealed
Symptoms:
- Vault status shows Sealed: true
- Applications cannot access secrets
Solutions:
# Check Vault status
kubectl exec -n vault vault-0 -- vault status
# Vault should auto-unseal with KMS
# If it remains sealed, check KMS permissions
# Check Vault logs for unseal errors
kubectl logs -n vault vault-0 | grep -i unseal
# Restart Vault pod if needed
kubectl delete pod -n vault vault-0
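For scripted checks, `vault status` also reports the seal state through its exit code: 0 when unsealed, 2 when sealed, and 1 on error. The sketch below relies on `kubectl exec` passing that exit code through. The helper name `vault_seal_state` and its default pod and namespace are assumptions taken from the commands above.

```shell
# Hypothetical helper: classify Vault's seal state from the exit code.
vault_seal_state() {
  local pod="${1:-vault-0}" ns="${2:-vault}" rc=0
  kubectl exec -n "$ns" "$pod" -- vault status >/dev/null 2>&1 || rc=$?
  case "$rc" in
    0) echo "unsealed" ;;
    2) echo "sealed" ;;
    *) echo "error (exit code $rc)"; return 1 ;;
  esac
}

# Usage: vault_seal_state            # checks vault-0 in the vault namespace
#        vault_seal_state vault-1    # checks another replica
```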
Certificate Issues
Certificates Not Issuing
Symptoms:
- Certificate shows Ready: False
- Let’s Encrypt challenges fail
- TLS errors when accessing services
Solutions:
Check certificate status
kubectl get certificates -A
kubectl describe certificate <cert-name> -n <namespace>
Check challenges
kubectl get challenges -A
kubectl describe challenge <challenge-name> -n <namespace>
Verify Cloudflare API token
# Test API token
curl -X GET "https://api.cloudflare.com/client/v4/user/tokens/verify" \
-H "Authorization: Bearer $CLOUDFLARE_API_TOKEN"
Ensure the token has:
- Zone:DNS:Edit
- Zone:Zone:Read
Check cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager
Check rate limits
Let’s Encrypt has rate limits:
- 50 certificates per registered domain per week
- 5 failed validations per account, per hostname, per hour
Use the staging issuer for testing:
issuerRef:
  name: letsencrypt-staging
  kind: ClusterIssuer
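After working through the steps above, a quick way to confirm that a certificate has actually been issued is to read its Ready condition directly. This is a minimal sketch; `cert_ready` is a hypothetical helper name, and it assumes the cert-manager Certificate resource exposes a `Ready` condition in `.status.conditions`.

```shell
# Hypothetical helper: succeed only if the Certificate's Ready condition is True.
cert_ready() {
  local name="$1" ns="$2" status
  status=$(kubectl get certificate "$name" -n "$ns" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}') || return 1
  [ "$status" = "True" ]
}

# Usage: cert_ready <cert-name> <namespace> || echo "certificate not ready yet"
```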
Terraform Issues
State Lock Errors
Symptoms:
- “Error acquiring the state lock”
- Terraform operations hang
Solutions:
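# Check lock status. With the S3 backend, the LockID key in DynamoDB is
# the state path (<bucket>/<state-key>), not the lock UUID from the error message
aws dynamodb get-item --table-name shipyard-terraform-locks-dev \
  --key '{"LockID": {"S": "<bucket>/<state-key>"}}'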
# Force unlock (use with caution!)
terraform force-unlock <lock-id>
Only use force-unlock if you’re certain no other Terraform process is running.
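The lock ID that `terraform force-unlock` expects is printed in the "Lock Info" block of the error message. The sketch below extracts it from captured output. `extract_lock_id` is a hypothetical helper, and it assumes the output was produced with `-no-color` so that no terminal escape codes are mixed into the text.

```shell
# Hypothetical helper: read Terraform error text on stdin, print the lock ID.
extract_lock_id() {
  awk '/Lock Info:/ { in_info = 1; next }
       in_info {
         # The ID line may carry a leading box-drawing character,
         # so look for the "ID:" field anywhere on the line.
         for (i = 1; i < NF; i++)
           if ($i == "ID:") { print $(i + 1); exit }
       }'
}

# Usage: terraform plan -no-color 2>&1 | extract_lock_id
```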
Resource Already Exists
Symptoms:
- “AlreadyExists” errors
- “Resource already exists” errors
Solutions:
terraform import <resource-type>.<resource-name> <resource-id>
Example:
terraform import aws_s3_bucket.state_bucket shipyard-terraform-state-dev
If the resource should not be managed by Terraform:
terraform state rm <resource-type>.<resource-name>
If there’s a naming conflict, update the resource name in your Terraform code.
Provider Configuration Errors
Symptoms:
- “Error configuring provider”
- Authentication errors
Solutions:
# Verify AWS credentials
aws sts get-caller-identity
# Verify environment variables
env | grep TF_VAR
env | grep AWS
# Re-initialize Terraform
rm -rf .terraform
terraform init
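Scanning `env | grep` output by eye makes it easy to miss a variable that is absent altogether. As a rough sketch, the hypothetical helper below (`missing_vars`) prints the names of any variables that are unset or empty. The variable names in the usage line are placeholders; substitute whichever ones your configuration actually requires.

```shell
# Hypothetical helper: print the names of unset or empty variables.
# Only pass trusted variable names, since the lookup uses eval.
missing_vars() {
  local name value missing=""
  for name in "$@"; do
    eval "value=\${$name:-}"
    [ -n "$value" ] || missing="$missing $name"
  done
  # Trim the leading space; exit non-zero if anything is missing.
  missing="${missing# }"
  [ -z "$missing" ] || { echo "$missing"; return 1; }
}

# Usage (placeholder names): missing_vars AWS_REGION TF_VAR_example
```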
ArgoCD Issues
Applications Out of Sync
Symptoms:
- Application shows “OutOfSync” status
- Deployed resources don’t match Git
Solutions:
# Check application status
kubectl get application <app-name> -n argocd -o yaml
# Sync application
argocd app sync <app-name>
# Force sync (uses a force apply, which may delete and re-create resources)
argocd app sync <app-name> --force
# Check sync errors
argocd app get <app-name>
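If the argocd CLI is not logged in, the same status can be read from the Application resource with kubectl. This is a minimal sketch; `app_state` is a hypothetical helper name, and it assumes ArgoCD runs in the `argocd` namespace as in the commands above.

```shell
# Hypothetical helper: print "<sync status> <health status>" for an Application.
app_state() {
  local app="$1" ns="${2:-argocd}"
  kubectl get application "$app" -n "$ns" \
    -o jsonpath='{.status.sync.status}{" "}{.status.health.status}'
}

# Usage: app_state <app-name>
```

An application that is OutOfSync but Healthy usually points to drift from Git, while a Degraded health status suggests the workload itself is failing.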
GitHub Integration Not Working
Symptoms:
- ApplicationSets not discovering repos
- “Failed to list repositories” errors
Solutions:
Verify GitHub App credentials
kubectl get secret -n argocd github-app-secret -o yaml
Ensure:
- App ID is correct
- Installation ID is correct
- Private key is valid
Check GitHub App permissions
In GitHub, verify the app has:
- Repository: Contents (Read)
- Repository: Metadata (Read)
Check ArgoCD application controller logs
kubectl logs -n argocd deployment/argocd-application-controller
Network Issues
DNS Not Resolving
Symptoms:
- Services not accessible by domain name
- “Name or service not known” errors
Solutions:
# Check external-dns pod
kubectl get pods -n external-dns
# Check external-dns logs
kubectl logs -n external-dns deployment/external-dns
# Verify Cloudflare DNS records
curl -X GET "https://api.cloudflare.com/client/v4/zones/<zone-id>/dns_records" \
-H "Authorization: Bearer $CLOUDFLARE_API_TOKEN"
# Test DNS resolution
nslookup vault.yourdomain.com
dig vault.yourdomain.com
Load Balancer Not Created
Symptoms:
- Ingress has no external IP/hostname
- Service of type LoadBalancer stuck pending
Solutions:
# Check AWS Load Balancer Controller logs
kubectl logs -n kube-system deployment/aws-load-balancer-controller
# Verify ingress annotations
kubectl get ingress <ingress-name> -n <namespace> -o yaml
# Check service events
kubectl describe service <service-name> -n <namespace>
# Verify subnet tags
aws ec2 describe-subnets --filters Name=vpc-id,Values=<vpc-id> \
--query 'Subnets[*].[SubnetId,Tags]'
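The AWS Load Balancer Controller discovers subnets by role tag: `kubernetes.io/role/elb` for public subnets and `kubernetes.io/role/internal-elb` for private ones. The sketch below lists the subnets in a VPC that carry one of those tags. `tagged_subnets` is a hypothetical helper name, and the filter matches on the tag key only, so any tag value is accepted.

```shell
# Hypothetical helper: list subnet IDs in a VPC that carry a load balancer role tag.
# $2 is "elb" (public subnets) or "internal-elb" (private subnets).
tagged_subnets() {
  local vpc_id="$1" role="$2"
  aws ec2 describe-subnets \
    --filters "Name=vpc-id,Values=$vpc_id" \
              "Name=tag-key,Values=kubernetes.io/role/$role" \
    --query 'Subnets[].SubnetId' --output text
}

# Usage: tagged_subnets <vpc-id> internal-elb
```

Empty output means no subnet carries the tag, which would leave the controller without a subnet to place the load balancer in.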
Getting More Help
If you continue to experience issues:
- Check logs: Most issues can be diagnosed from pod and service logs
- Review AWS Console: Check CloudWatch logs, security groups, and IAM permissions
- Verify prerequisites: Ensure all required tools and accounts are properly configured
- Check resource quotas: AWS service quotas may limit resource creation
Destroying Resources
If you need to start over, learn how to safely tear down infrastructure