Application Issues
Backend Not Starting
Symptom: Tasks immediately stop after starting
- Missing secrets - Check SSM Parameter Store
- Database connection failure
  - Verify RDS endpoint is correct in task definition
  - Check security group allows backend → RDS connection
  - Test connectivity from ECS task
- Insufficient resources
  - Increase cpu and memory in task definition
  - Check CloudWatch logs for OOM (Out of Memory) errors
Frontend Not Accessible
Symptom: ALB returns 503 Service Unavailable
API Returns 500 Errors
Symptom: Backend endpoints return Internal Server Error
- Database connection pool exhausted
  - Solution: Increase Hikari pool size in application.yml
- JWT validation failures
  - Verify JWT_SECRET matches between environments
  - Check token expiration (JWT_EXPIRATION_MS)
  - Test token generation
- CORS errors
  - Check CORS_ALLOWED_ORIGINS includes frontend domain
  - Verify ALB listener rules route correctly
  - Test CORS preflight
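The CORS preflight check can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the project's own test: the helper names `build_preflight` and `preflight_allows` are hypothetical, and the sketch only inspects the `Access-Control-Allow-Origin` header.

```python
import urllib.request


def build_preflight(url, origin, method="POST"):
    """Build the OPTIONS request a browser sends before a cross-origin call."""
    return urllib.request.Request(
        url,
        method="OPTIONS",
        headers={
            "Origin": origin,
            "Access-Control-Request-Method": method,
        },
    )


def preflight_allows(response_headers, origin):
    """Return True if the response headers allow the given origin."""
    # Header names are case-insensitive, so normalise before looking up.
    headers = {k.lower(): v for k, v in response_headers.items()}
    allowed = headers.get("access-control-allow-origin")
    return allowed in ("*", origin)
```

To use it, pass the request to `urllib.request.urlopen(req, timeout=5)` and feed `dict(resp.headers)` to `preflight_allows`, substituting your own API URL and frontend origin. A `False` result suggests the frontend domain is missing from CORS_ALLOWED_ORIGINS or the request never reached the backend.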
Database Issues
Connection Timeout
Symptom: Connection to RDS times out
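A quick way to narrow down a timeout is to attempt a plain TCP connection to the database endpoint from a host in the same subnet as the backend. The sketch below is one possible probe, not a prescribed procedure; the function name `check_tcp` is hypothetical, and the endpoint and port are whatever your task definition uses.

```python
import socket


def check_tcp(host, port, timeout=3.0):
    """Try a TCP connection and report how it ended.

    Returns "open", "timeout", "refused", or "error: <detail>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No reply at all: packets are probably being dropped on the way.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # Covers DNS failures and unreachable networks.
        return f"error: {exc}"
```

As a rough guide, "timeout" usually points at security groups, network ACLs, or routing, while "refused" means the host was reached but the port is closed. An "error" result that mentions name resolution suggests the endpoint hostname itself is wrong.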
High Connection Count
Symptom: Database rejecting new connections
- Increase RDS max_connections
- Reduce Hikari pool size per task in application.yml
- Scale down backend tasks (if overprovisioned)
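All three options adjust the same inequality: the number of tasks multiplied by the pool size per task has to fit within max_connections. The sketch below works through that arithmetic; the function names and the `reserved` parameter (connections held back for admin sessions or migrations) are illustrative assumptions, and the numbers in the usage note are made up.

```python
def connection_budget(task_count, pool_size_per_task, max_connections, reserved=0):
    """Compare worst-case pooled connections against the database limit."""
    demand = task_count * pool_size_per_task
    available = max_connections - reserved
    return {
        "demand": demand,
        "available": available,
        "headroom": available - demand,
        "within_limit": demand <= available,
    }


def max_pool_size(task_count, max_connections, reserved=0):
    """Largest per-task pool size that keeps total demand within the limit."""
    if task_count <= 0:
        raise ValueError("task_count must be positive")
    return max((max_connections - reserved) // task_count, 0)
```

For example, 8 tasks with a pool of 20 can open up to 160 connections; against a limit of 100 that is 60 over budget, and `max_pool_size(8, 100, reserved=10)` suggests capping each pool at 11.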
Slow Query Performance
Symptom: API response times > 2 seconds
- Add indexes for frequent lookups
- Upgrade RDS instance class
Networking Issues
ALB Not Routing Correctly
Symptom: API calls return 404 or route to wrong service
- Priority 10: /api/* → backend target group
- Priority 20: /{shortcode} → backend target group
- Default: /* → frontend target group
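To reason about which rule a given path should hit, the priority order above can be simulated locally. This is only an approximation of ALB behaviour: it uses Python's `fnmatchcase`, which also treats `[...]` specially, and the six-character `/??????` pattern is an assumed stand-in for `/{shortcode}`, since the real pattern is not given here.

```python
from fnmatch import fnmatchcase

# (priority, path pattern, target group); lower priority numbers are evaluated first.
RULES = [
    (10, "/api/*", "backend"),
    # Assumption: short codes are exactly six characters; "?" matches one character.
    (20, "/??????", "backend"),
]
DEFAULT_TARGET = "frontend"


def route(path, rules=RULES, default=DEFAULT_TARGET):
    """Return (priority, target) of the first matching rule, or (None, default)."""
    for priority, pattern, target in sorted(rules):
        if fnmatchcase(path, pattern):
            return priority, target
    return None, default
```

With these rules, `route("/api/links")` matches priority 10, `route("/abc123")` falls through to priority 20, and `route("/index.html")` reaches the default frontend action. If a real request lands somewhere else, compare the listener's actual priorities and patterns against this list.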
HTTPS Certificate Issues
Symptom: Browser shows SSL/TLS errors
DNS Resolution Failures
Symptom: Frontend cannot resolve backend hostname
Deployment Issues
Task Won’t Reach Steady State
Symptom: Deployment stuck in progress
| Event Message | Cause | Solution |
|---|---|---|
| service ... has reached a steady state | Deployment successful | None |
| (service ...) was unable to place a task | Resource constraints | Check subnet IPs, service quotas |
| (service ...) failed to launch a task with (error ECS was unable to assume the role) | IAM permissions | Fix task execution role |
| (service ...) has started 2 tasks: (task abc123) | Tasks starting | Wait or check task logs |
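When scanning many service events, the table above can be turned into a simple lookup. The sketch below matches on substrings taken from the table; the `classify_event` name and the "Unknown" fallback are illustrative additions, and real event messages may be worded differently.

```python
# Substring -> (cause, solution), taken from the table above.
EVENT_PATTERNS = [
    ("has reached a steady state", ("Deployment successful", "None")),
    ("was unable to place a task", ("Resource constraints", "Check subnet IPs, service quotas")),
    ("unable to assume the role", ("IAM permissions", "Fix task execution role")),
    ("has started", ("Tasks starting", "Wait or check task logs")),
]


def classify_event(message):
    """Map an ECS service event message to a (cause, solution) pair."""
    for needle, result in EVENT_PATTERNS:
        if needle in message:
            return result
    # Fallback for messages the table does not cover (an addition, not from the table).
    return ("Unknown", "Inspect the full event message and task logs")
```

Feeding it the most recent event messages for the service gives a first guess at why a deployment has stalled.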
Circuit Breaker Triggered
Symptom: Deployment automatically rolls back
- Health check failures - Adjust startPeriod in task definition
- Container crashes - Check application logs
- Resource constraints - Increase task CPU/memory
Image Pull Errors
Symptom: CannotPullContainerError
Scaling Issues
Auto Scaling Not Triggering
Symptom: Service stays at min capacity despite high load
Scheduled Tasks Not Running
Symptom: Cleanup job not executing
Verify the @Scheduled annotation on the cleanup job.
Quick Diagnostics Commands
Run these commands for a quick health check, or save them as health-check.sh and run the script.
Rollback Procedures
Emergency Rollback Steps
Rollback Using Specific Image Tag
Getting Help
If you’re still experiencing issues:
- Check AWS Service Health: https://status.aws.amazon.com/
- Review CloudWatch Logs: Look for stack traces and error patterns
- Enable ECS Exec: Connect directly to containers for debugging
- Contact Support: AWS Support Console or your team’s operations channel
Enabling ECS Exec for Debugging
Next Steps
- Review monitoring setup to catch issues proactively
- Configure automated alerts for critical metrics
- Document your team’s incident response procedures