Troubleshooting Guide

Use this guide to diagnose and resolve common issues with your ADMA URL shortener deployment.

Application Issues

Backend Not Starting

Symptom: Tasks stop immediately after starting

Check stopped task reason:

# Get the most recent stopped task ID
TASK_ID=$(aws ecs list-tasks \
  --cluster adma-cluster \
  --service-name adma-backend \
  --desired-status STOPPED \
  --max-items 1 \
  --query 'taskArns[0]' \
  --output text)

# Describe the stopped task
aws ecs describe-tasks \
  --cluster adma-cluster \
  --tasks $TASK_ID \
  --query 'tasks[0].{StoppedReason:stoppedReason,Containers:containers[].{Name:name,Reason:reason,ExitCode:exitCode}}'
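
The ExitCode field returned above is often the fastest clue. As a rough sketch (the function name `explain_exit_code` is hypothetical, and the mapping relies only on the common convention that codes above 128 mean 128 plus a signal number), the value can be turned into a likely cause:

```shell
# Map a container exit code (the ExitCode field above) to a likely cause.
# Codes above 128 follow the convention 128 + signal number.
explain_exit_code() {
  code="$1"
  case "$code" in
    0)   echo "exited cleanly" ;;
    1)   echo "general error, check the container logs" ;;
    137) echo "SIGKILL (128+9), often the OOM killer" ;;
    139) echo "SIGSEGV (128+11), segmentation fault" ;;
    143) echo "SIGTERM (128+15), a requested stop" ;;
    *)
      # Anything else above 128 is still a signal; report its number.
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "terminated by signal $((code - 128))"
      else
        echo "application-specific exit code $code"
      fi
      ;;
  esac
}

explain_exit_code 137
```

An exit code of 137 combined with a memory-related stopped reason usually points at the "Insufficient resources" cause described in this section.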

Common causes:

Missing secrets - Check SSM Parameter Store

aws ssm get-parameter --name /adma/prod/JWT_SECRET
aws ssm get-parameter --name /adma/prod/DB_PASSWORD

Database connection failure

Verify RDS endpoint is correct in task definition
Check security group allows backend → RDS connection

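Test connectivity from a running ECS task (replace TASK_ID with the ID of a running task; stopped tasks cannot be connected to, and ECS Exec must be enabled on the service):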

aws ecs execute-command \
  --cluster adma-cluster \
  --task TASK_ID \
  --container backend \
  --interactive \
  --command "/bin/sh"
# Inside container:
nc -zv $DB_HOST $DB_PORT

Insufficient resources
Increase cpu and memory in task definition
Check CloudWatch logs for OOM (Out of Memory) errors

View logs:

aws logs tail /ecs/adma-prod-backend --follow --since 30m

Frontend Not Accessible

Symptom: ALB returns 503 Service Unavailable

Check target health:

# Get target group ARN
TARGET_GROUP=$(aws elbv2 describe-target-groups \
  --names adma-prod-feg \
  --query 'TargetGroups[0].TargetGroupArn' \
  --output text)

# Check target health status
aws elbv2 describe-target-health \
  --target-group-arn $TARGET_GROUP

Possible states:

State	Description	Action
`initial`	Target is registering	Wait 30-60 seconds
`healthy`	Target is serving traffic	No action needed
`unhealthy`	Health checks failing	Check container logs
`draining`	Target is deregistering	Wait for replacement task
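`unused`	Target is not registered, or its target group is not used by a listener rule	Check ECS service desired count and listener rules
`unavailable`	Health checks are disabled for the target group	Enable health checks on the target group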

Debug unhealthy targets:

Check container health check:

aws ecs describe-tasks \
  --cluster adma-cluster \
  --tasks TASK_ID \
  --query 'tasks[0].containers[0].healthStatus'

Test health endpoint manually:

# Get task private IP
PRIVATE_IP=$(aws ecs describe-tasks \
  --cluster adma-cluster \
  --tasks TASK_ID \
  --query 'tasks[0].attachments[0].details[?name==`privateIPv4Address`].value' \
  --output text)

# Test from bastion or another ECS task
curl -I http://$PRIVATE_IP:80/

Review frontend logs:

aws logs tail /ecs/adma-prod-frontend --follow

API Returns 500 Errors

Symptom: Backend endpoints return Internal Server Error

View application logs:

# Filter for ERROR level logs
aws logs filter-log-events \
  --log-group-name /ecs/adma-prod-backend \
  --filter-pattern "ERROR" \
  --start-time $(date -u -d '1 hour ago' +%s)000 \
  --query 'events[*].[timestamp,message]' \
  --output text
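
The `date -u -d '1 hour ago'` call above uses GNU date syntax, which BSD/macOS `date` does not accept. A minimal sketch of a helper that avoids the `-d` flag (the name `epoch_ms_ago` is hypothetical; it assumes a `date` that supports `+%s`, which both GNU and BSD versions do):

```shell
# Print the Unix time in milliseconds for a moment N minutes in the past.
# Uses only `date +%s` and shell arithmetic, so no GNU-specific -d flag.
epoch_ms_ago() {
  minutes="$1"
  now=$(date +%s)
  echo "$(( (now - minutes * 60) * 1000 ))"
}

# Example: start of a one-hour window.
START_MS=$(epoch_ms_ago 60)
echo "$START_MS"
```

The result can be passed directly, for example `--start-time "$(epoch_ms_ago 60)"`.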

Common issues:

Database connection pool exhausted

-- Check active connections
SELECT count(*) FROM pg_stat_activity WHERE datname = 'urlshortener';

-- Check max connections
SHOW max_connections;

Solution: Increase Hikari pool size in application.yml:

spring:
  datasource:
    hikari:
      maximum-pool-size: 20  # Increase from 10

JWT validation failures

Verify JWT_SECRET matches between environments
Check token expiration (JWT_EXPIRATION_MS)

Test token generation:

curl -X POST https://api.yourdomain.com/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com","password":"password123"}'

CORS errors

Check CORS_ALLOWED_ORIGINS includes frontend domain
Verify ALB listener rules route correctly

Test CORS preflight:

curl -X OPTIONS https://api.yourdomain.com/api/urls \
  -H "Origin: https://yourdomain.com" \
  -H "Access-Control-Request-Method: POST" \
  -v

Database Issues

Connection Timeout

Symptom: Connection to RDS times out

Check security groups:

# Get RDS security group
RDS_SG=$(aws rds describe-db-instances \
  --db-instance-identifier adma-prod-postgres \
  --query 'DBInstances[0].VpcSecurityGroups[0].VpcSecurityGroupId' \
  --output text)

# Check inbound rules
aws ec2 describe-security-groups \
  --group-ids $RDS_SG \
  --query 'SecurityGroups[0].IpPermissions[?ToPort==`5432`]'

Verify rule allows backend security group:

Expected Rule

{
  "FromPort": 5432,
  "ToPort": 5432,
  "IpProtocol": "tcp",
  "UserIdGroupPairs": [
    {
      "GroupId": "sg-backend-xxxxx"
    }
  ]
}

Add missing rule:

BACKEND_SG=$(aws ec2 describe-security-groups \
  --filters "Name=tag:Name,Values=adma-prod-backend-sg" \
  --query 'SecurityGroups[0].GroupId' \
  --output text)

aws ec2 authorize-security-group-ingress \
  --group-id $RDS_SG \
  --protocol tcp \
  --port 5432 \
  --source-group $BACKEND_SG

High Connection Count

Symptom: Database rejecting new connections

Monitor connection metrics:

aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=adma-prod-postgres \
  --start-time $(date -u -d '1 hour ago' --iso-8601=seconds) \
  --end-time $(date -u --iso-8601=seconds) \
  --period 300 \
  --statistics Maximum,Average

Check max connections:

# Connect to RDS
psql -h adma-postgres.xxxxx.eu-west-1.rds.amazonaws.com \
     -U appuser \
     -d urlshortener

-- Inside psql:
SHOW max_connections;
SELECT count(*) FROM pg_stat_activity;
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;

Solutions:

Increase RDS max_connections (a static parameter, so the change takes effect only after the instance reboots):

aws rds modify-db-parameter-group \
  --db-parameter-group-name adma-prod-postgres-params \
  --parameters "ParameterName=max_connections,ParameterValue=200,ApplyMethod=pending-reboot"

Reduce Hikari pool size per task:

application.yml

spring:
  datasource:
    hikari:
      maximum-pool-size: 5  # Reduce from 10
      minimum-idle: 2

Scale down backend tasks (if overprovisioned):

aws ecs update-service \
  --cluster adma-cluster \
  --service adma-backend \
  --desired-count 1
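
The three solutions above trade off against each other: the total number of connections is roughly the task count multiplied by the pool size, and it has to stay below max_connections. A small sketch of that arithmetic (the function name and the sample figures are hypothetical, not values from this deployment):

```shell
# Largest per-task pool size that keeps all tasks within the database limit.
# Arguments: max_connections, connections reserved for admin/monitoring, task count.
max_pool_per_task() {
  max_conn="$1"
  reserved="$2"
  tasks="$3"
  echo "$(( (max_conn - reserved) / tasks ))"
}

# Hypothetical figures: max_connections=100, 10 reserved, 4 backend tasks.
max_pool_per_task 100 10 4   # prints 22
```

If the result is smaller than the configured maximum-pool-size, either the pool size or the task count needs to come down, or max_connections needs to go up.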

Slow Query Performance

Symptom: API response times > 2 seconds

Enable RDS Performance Insights:

aws rds modify-db-instance \
  --db-instance-identifier adma-prod-postgres \
  --enable-performance-insights \
  --performance-insights-retention-period 7

Check slow query logs:

aws logs tail /aws/rds/instance/adma-prod-postgres/postgresql --follow

Analyze query patterns:

-- Enable pg_stat_statements extension
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Find slowest queries
SELECT 
  query,
  calls,
  total_exec_time,
  mean_exec_time,
  max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

Common optimizations:

Add indexes for frequent lookups:

CREATE INDEX idx_short_urls_short_code ON short_urls(short_code);
CREATE INDEX idx_short_urls_user_status ON short_urls(user_id, link_status);

Upgrade RDS instance class:

aws rds modify-db-instance \
  --db-instance-identifier adma-prod-postgres \
  --db-instance-class db.t3.small \
  --apply-immediately

Networking Issues

ALB Not Routing Correctly

Symptom: API calls return 404 or route to wrong service

Check listener rules:

# Get ALB ARN
ALB_ARN=$(aws elbv2 describe-load-balancers \
  --names adma-prod-alb \
  --query 'LoadBalancers[0].LoadBalancerArn' \
  --output text)

# List listeners
aws elbv2 describe-listeners \
  --load-balancer-arn $ALB_ARN

# Get HTTPS listener ARN (port 443)
LISTENER_ARN=$(aws elbv2 describe-listeners \
  --load-balancer-arn $ALB_ARN \
  --query 'Listeners[?Port==`443`].ListenerArn' \
  --output text)

# Check rules
aws elbv2 describe-rules \
  --listener-arn $LISTENER_ARN \
  --query 'Rules[*].{Priority:Priority,Conditions:Conditions,Actions:Actions}'

Expected rules:

Priority 10: /api/* → backend target group
Priority 20: /{shortcode} → backend target group
Default: /* → frontend target group
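
To reason about which target group a given path should reach, the expected rules can be mimicked locally. This is only a sketch under two assumptions that the rules above do not spell out: `/{shortcode}` is taken to mean exactly one non-empty path segment, and the bare root `/` is taken to fall through to the default rule. The function name `route_for_path` is hypothetical.

```shell
# Return the target group a request path would reach under the expected rules.
# Assumption: /{shortcode} means exactly one non-empty path segment.
route_for_path() {
  case "$1" in
    /api/*) echo "backend" ;;    # priority 10
    /)      echo "frontend" ;;   # bare root, assumed to hit the default rule
    /*/*)   echo "frontend" ;;   # nested paths are not short codes
    /*)     echo "backend" ;;    # priority 20: single segment
    *)      echo "frontend" ;;   # default
  esac
}

for p in /api/urls /abc123 / /assets/app.js; do
  printf '%s -> %s\n' "$p" "$(route_for_path "$p")"
done
```

Comparing this expectation with the output of describe-rules makes a missing or misordered rule easier to spot.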

Fix missing rule:

BACKEND_TG=$(aws elbv2 describe-target-groups \
  --names adma-prod-backend-tg \
  --query 'TargetGroups[0].TargetGroupArn' \
  --output text)

aws elbv2 create-rule \
  --listener-arn $LISTENER_ARN \
  --priority 10 \
  --conditions '[{"Field":"path-pattern","Values":["/api/*"]}]' \
  --actions Type=forward,TargetGroupArn=$BACKEND_TG

HTTPS Certificate Issues

Symptom: Browser shows SSL/TLS errors

Check ACM certificate status:

aws acm list-certificates --region eu-west-1

# Describe specific certificate
aws acm describe-certificate \
  --certificate-arn arn:aws:acm:eu-west-1:ACCOUNT:certificate/xxxxx \
  --query 'Certificate.{Domain:DomainName,Status:Status,InUse:InUseBy}'

Validate certificate is attached to listener:

aws elbv2 describe-listeners \
  --listener-arns $LISTENER_ARN \
  --query 'Listeners[0].Certificates[0].CertificateArn'

Reissue certificate if expired:

aws acm request-certificate \
  --domain-name yourdomain.com \
  --subject-alternative-names "*.yourdomain.com" \
  --validation-method DNS \
  --region eu-west-1

Update listener with new certificate:

aws elbv2 modify-listener \
  --listener-arn $LISTENER_ARN \
  --certificates CertificateArn=arn:aws:acm:...

DNS Resolution Failures

Symptom: Frontend cannot resolve backend hostname

Check Service Discovery:

# Get private DNS namespace
aws servicediscovery list-namespaces \
  --filters Name=TYPE,Values=DNS_PRIVATE

# Check backend service registration
aws servicediscovery list-services

# Test DNS resolution from frontend container
aws ecs execute-command \
  --cluster adma-cluster \
  --task FRONTEND_TASK_ID \
  --container frontend \
  --interactive \
  --command "/bin/sh"

# Inside container:
nslookup backend.adma.local
wget -O- http://backend.adma.local:8080/actuator/health

Verify environment variable:

aws ecs describe-task-definition \
  --task-definition adma-prod-frontend \
  --query 'taskDefinition.containerDefinitions[0].environment[?name==`BACKEND_UPSTREAM`]'

Expected value:

{
  "name": "BACKEND_UPSTREAM",
  "value": "backend.adma.local:8080"
}

Deployment Issues

Task Won’t Reach Steady State

Symptom: Deployment stuck in progress

Check service events:

aws ecs describe-services \
  --cluster adma-cluster \
  --services adma-backend \
  --query 'services[0].events[:10]'

Common messages:

Event Message	Cause	Solution
`service ... has reached a steady state`	Deployment successful	None
`(service ...) was unable to place a task`	Resource constraints	Check subnet IPs, service quotas
`(service ...) failed to launch a task with (error ECS was unable to assume the role)`	IAM permissions	Fix task execution role
`(service ...) has started 2 tasks: (task abc123)`	Tasks starting	Wait or check task logs

Force rollback if stuck:

# Get previous stable task definition revision
PREVIOUS_REV=$(aws ecs list-task-definitions \
  --family-prefix adma-backend \
  --status ACTIVE \
  --sort DESC \
  --max-items 2 \
  --query 'taskDefinitionArns[1]' \
  --output text)

# Update service to previous revision
aws ecs update-service \
  --cluster adma-cluster \
  --service adma-backend \
  --task-definition $PREVIOUS_REV \
  --force-new-deployment

Circuit Breaker Triggered

Symptom: Deployment automatically rolls back

View deployment circuit breaker events:

aws ecs describe-services \
  --cluster adma-cluster \
  --services adma-backend \
  --query 'services[0].deployments[*].{Status:status,RolloutState:rolloutState,FailedTasks:failedTasks}'

Investigate failed tasks:

# failedTasks is only a count, so list the service's recently stopped tasks instead
FAILED_TASKS=$(aws ecs list-tasks \
  --cluster adma-cluster \
  --service-name adma-backend \
  --desired-status STOPPED \
  --query 'taskArns[]' \
  --output text)

# Describe each stopped task
for TASK in $FAILED_TASKS; do
  aws ecs describe-tasks \
    --cluster adma-cluster \
    --tasks "$TASK" \
    --query 'tasks[0].{StoppedReason:stoppedReason,Containers:containers[*].reason}'
done

Common causes:

Health check failures - Adjust startPeriod in task definition
Container crashes - Check application logs
Resource constraints - Increase task CPU/memory

Disable circuit breaker temporarily:

aws ecs update-service \
  --cluster adma-cluster \
  --service adma-backend \
  --deployment-configuration \
    "deploymentCircuitBreaker={enable=false,rollback=false}"

Only disable circuit breaker for debugging. Re-enable after identifying root cause.

Image Pull Errors

Symptom: CannotPullContainerError

Check ECR repository permissions:

# Verify image exists
aws ecr describe-images \
  --repository-name adma/backend \
  --image-ids imageTag=latest

# Check repository policy
aws ecr get-repository-policy \
  --repository-name adma/backend

Verify task execution role has ECR permissions:

aws iam get-role \
  --role-name ecsTaskExecutionRole \
  --query 'Role.RoleName'

aws iam list-attached-role-policies \
  --role-name ecsTaskExecutionRole

Attach missing ECR policy:

aws iam attach-role-policy \
  --role-name ecsTaskExecutionRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly

Test image pull manually:

aws ecr get-login-password --region eu-west-1 \
  | docker login --username AWS --password-stdin ACCOUNT_ID.dkr.ecr.eu-west-1.amazonaws.com

docker pull ACCOUNT_ID.dkr.ecr.eu-west-1.amazonaws.com/adma/backend:latest

Scaling Issues

Auto Scaling Not Triggering

Symptom: Service stays at min capacity despite high load

Check scaling policies:

# List scaling policies
aws application-autoscaling describe-scaling-policies \
  --service-namespace ecs \
  --resource-id service/adma-cluster/adma-backend

# Check scaling alarms
aws cloudwatch describe-alarms \
  --alarm-name-prefix adma-prod-backend

Verify metrics are publishing:

aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name CPUUtilization \
  --dimensions Name=ServiceName,Value=adma-backend Name=ClusterName,Value=adma-cluster \
  --start-time $(date -u -d '1 hour ago' --iso-8601=seconds) \
  --end-time $(date -u --iso-8601=seconds) \
  --period 60 \
  --statistics Average

Manually test scaling:

# Scale up
aws ecs update-service \
  --cluster adma-cluster \
  --service adma-backend \
  --desired-count 3

# Check if tasks start successfully
aws ecs describe-services \
  --cluster adma-cluster \
  --services adma-backend \
  --query 'services[0].{Running:runningCount,Desired:desiredCount}'

Adjust target values if needed:

aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/adma-cluster/adma-backend \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name adma-backend-cpu \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration file://scaling-policy.json

Scheduled Tasks Not Running

Symptom: Cleanup job not executing

Check scheduled task configuration:

The cleanup job runs inside the backend container via Spring’s @Scheduled annotation:

ExpiredUrlCleanupService.java

@Scheduled(cron = "0 */15 * * * *")  // Every 15 minutes
public void cleanupExpiredUrls() {
  // ...
}

Verify backend is running:

aws ecs describe-services \
  --cluster adma-cluster \
  --services adma-backend \
  --query 'services[0].{Running:runningCount,Desired:desiredCount}'

Search logs for cleanup execution:

aws logs filter-log-events \
  --log-group-name /ecs/adma-prod-backend \
  --filter-pattern "ExpiredUrlCleanupService" \
  --start-time $(date -u -d '1 hour ago' +%s)000

Expected log output:

2026-03-04 10:15:00.123 INFO  ExpiredUrlCleanupService - Starting cleanup of expired URLs
2026-03-04 10:15:00.456 INFO  ExpiredUrlCleanupService - Marked 5 expired URLs as EXPIRED
2026-03-04 10:15:00.789 INFO  ExpiredUrlCleanupService - Deleted 3 EXPIRED and DELETED URLs
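
To check these counts from a script rather than by eye, the log lines can be summarized with awk. This is a sketch that assumes the exact message wording shown in the expected output above; the function name `cleanup_counts` is hypothetical.

```shell
# Sum the "Marked N" and "Deleted N" counts from cleanup log lines on stdin.
# Assumes the message wording shown in the expected log output.
cleanup_counts() {
  awk '
    {
      for (i = 1; i < NF; i++) {
        if ($i == "Marked")  marked  += $(i + 1)
        if ($i == "Deleted") deleted += $(i + 1)
      }
    }
    END { printf "marked=%d deleted=%d\n", marked, deleted }
  '
}

# Sample input copied from the expected log output.
cleanup_counts <<'EOF'
2026-03-04 10:15:00.123 INFO  ExpiredUrlCleanupService - Starting cleanup of expired URLs
2026-03-04 10:15:00.456 INFO  ExpiredUrlCleanupService - Marked 5 expired URLs as EXPIRED
2026-03-04 10:15:00.789 INFO  ExpiredUrlCleanupService - Deleted 3 EXPIRED and DELETED URLs
EOF
```

A result of `marked=0 deleted=0` over a window longer than 15 minutes can mean either that nothing expired or that the job did not log at all, so check for the "Starting cleanup" line as well.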

If you scale backend beyond desired_count = 1, the job will run on every task simultaneously. Use ShedLock or migrate to ECS Scheduled Tasks to prevent duplicate execution.

Quick Diagnostics Commands

Run these commands for a quick health check:

Quick Health Check

#!/bin/bash

CLUSTER="adma-cluster"
REGION="eu-west-1"
export AWS_DEFAULT_REGION="$REGION"  # Make every aws call below use this region

echo "=== ECS Services ==="
aws ecs describe-services \
  --cluster $CLUSTER \
  --services adma-backend adma-frontend \
  --query 'services[*].{Name:serviceName,Running:runningCount,Desired:desiredCount,Status:status}' \
  --output table

printf '\n=== ALB Target Health ===\n'
for TG in $(aws elbv2 describe-target-groups --query 'TargetGroups[*].TargetGroupArn' --output text); do
  aws elbv2 describe-target-health --target-group-arn $TG --query 'TargetHealthDescriptions[*].{Target:Target.Id,Port:Target.Port,Health:TargetHealth.State}' --output table
done

printf '\n=== RDS Status ===\n'
aws rds describe-db-instances \
  --db-instance-identifier adma-prod-postgres \
  --query 'DBInstances[0].{Status:DBInstanceStatus,Endpoint:Endpoint.Address,PendingChanges:PendingModifiedValues}' \
  --output table

printf '\n=== Recent Errors ===\n'
aws logs filter-log-events \
  --log-group-name /ecs/adma-prod-backend \
  --filter-pattern "ERROR" \
  --start-time $(date -u -d '10 minutes ago' +%s)000 \
  --query 'events[*].[timestamp,message]' \
  --output text | head -10

printf '\n=== Service Events ===\n'
aws ecs describe-services \
  --cluster $CLUSTER \
  --services adma-backend \
  --query 'services[0].events[:5]' \
  --output table

Save as health-check.sh and run:

chmod +x health-check.sh
./health-check.sh

Rollback Procedures

Emergency Rollback Steps

Identify last stable version

git log --oneline -10
# Note the commit SHA of the last known good deployment

Find task definition revision

aws ecs list-task-definitions \
  --family-prefix adma-backend \
  --status ACTIVE \
  --sort DESC \
  --max-items 5

Update service to previous revision

aws ecs update-service \
  --cluster adma-cluster \
  --service adma-backend \
  --task-definition adma-backend:12 \
  --force-new-deployment

Monitor rollback progress

aws ecs describe-services \
  --cluster adma-cluster \
  --services adma-backend \
  --query 'services[0].deployments[*].{Status:status,TaskDef:taskDefinition,Running:runningCount}'

Verify application health

curl https://api.yourdomain.com/actuator/health
curl https://api.yourdomain.com/api/stats

Rollback Using Specific Image Tag

Update task definition JSON

Edit infrastructure/task-def-backend.json:

"image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/adma/backend:a1b2c3d"

Register the new task definition revision

aws ecs register-task-definition \
  --cli-input-json file://infrastructure/task-def-backend.json

Deploy

NEW_REV=$(aws ecs describe-task-definition \
  --task-definition adma-backend \
  --query 'taskDefinition.revision' \
  --output text)

aws ecs update-service \
  --cluster adma-cluster \
  --service adma-backend \
  --task-definition adma-backend:$NEW_REV \
  --force-new-deployment

Getting Help

If you’re still experiencing issues:

Check AWS Service Health: https://status.aws.amazon.com/
Review CloudWatch Logs: Look for stack traces and error patterns
Enable ECS Exec: Connect directly to containers for debugging
Contact Support: AWS Support Console or your team’s operations channel

Enabling ECS Exec for Debugging

# Enable ECS Exec on service
aws ecs update-service \
  --cluster adma-cluster \
  --service adma-backend \
  --enable-execute-command

# Connect to running task
TASK_ID=$(aws ecs list-tasks \
  --cluster adma-cluster \
  --service-name adma-backend \
  --desired-status RUNNING \
  --query 'taskArns[0]' \
  --output text | awk -F/ '{print $NF}')

aws ecs execute-command \
  --cluster adma-cluster \
  --task $TASK_ID \
  --container backend \
  --interactive \
  --command "/bin/bash"
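
The awk step above keeps only the last segment of the task ARN. The same result is available with plain shell parameter expansion, sketched here with a hypothetical helper name and a made-up ARN:

```shell
# Strip everything up to the last "/" of a task ARN, leaving the task ID.
task_id_from_arn() {
  echo "${1##*/}"
}

# Hypothetical ARN for illustration only.
task_id_from_arn "arn:aws:ecs:eu-west-1:123456789012:task/adma-cluster/0123456789abcdef0123456789abcdef"
```

This avoids spawning awk and works in any POSIX shell.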

ECS Exec requires Session Manager plugin:

# macOS
brew install --cask session-manager-plugin

# Linux (Debian/Ubuntu, 64-bit)
curl "https://s3.amazonaws.com/session-manager-downloads/plugin/latest/ubuntu_64bit/session-manager-plugin.deb" -o "session-manager-plugin.deb"
sudo dpkg -i session-manager-plugin.deb

Next Steps

Review monitoring setup to catch issues proactively
Configure automated alerts for critical metrics
Document your team’s incident response procedures
