Use this guide to diagnose and resolve issues with your ADMA URL shortener deployment.

Application Issues

Backend Not Starting

Check stopped task reason:
# Get the ARN of a recently stopped task
# (--max-items is omitted: combined with --output text it can add an extra line to the output)
TASK_ID=$(aws ecs list-tasks \
  --cluster adma-cluster \
  --service-name adma-backend \
  --desired-status STOPPED \
  --query 'taskArns[0]' \
  --output text)

# Describe the stopped task
aws ecs describe-tasks \
  --cluster adma-cluster \
  --tasks $TASK_ID \
  --query 'tasks[0].{StoppedReason:stoppedReason,Containers:containers[].{Name:name,Reason:reason,ExitCode:exitCode}}'
Common causes:
  1. Missing secrets - Check SSM Parameter Store
    aws ssm get-parameter --name /adma/prod/JWT_SECRET
    aws ssm get-parameter --name /adma/prod/DB_PASSWORD
    
  2. Database connection failure
    • Verify RDS endpoint is correct in task definition
    • Check security group allows backend → RDS connection
    • Test connectivity from ECS task:
      aws ecs execute-command \
        --cluster adma-cluster \
        --task TASK_ID \
        --container backend \
        --interactive \
        --command "/bin/sh"
      # Inside container:
      nc -zv $DB_HOST $DB_PORT
      
  3. Insufficient resources
    • Increase cpu and memory in task definition
    • Check CloudWatch logs for OOM (Out of Memory) errors
View logs:
aws logs tail /ecs/adma-prod-backend --follow --since 30m
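The ExitCode field returned by describe-tasks often narrows the cause before you read any logs. The helper below is a minimal sketch of that lookup; the function name explain_exit_code is hypothetical, and the mapping covers only the common Unix convention that exit codes above 128 mean "terminated by signal (code minus 128)".

```shell
# Map a container exit code (ExitCode in the describe-tasks output) to a likely cause.
# Usage: explain_exit_code CODE
explain_exit_code() {
  case "$1" in
    0)   echo "clean exit: check whether the process is meant to keep running" ;;
    1)   echo "application error: check the container logs" ;;
    137) echo "SIGKILL (128+9): often the container exceeded its memory limit" ;;
    139) echo "SIGSEGV (128+11): segmentation fault in the process" ;;
    143) echo "SIGTERM (128+15): the container was asked to stop" ;;
    *)   echo "unrecognised exit code: check stoppedReason and the logs" ;;
  esac
}
```

For example, `explain_exit_code 137` points towards the "Insufficient resources" cause above, while exit code 1 usually means the application itself failed during startup.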

Frontend Not Accessible

Check target health:
# Get target group ARN
TARGET_GROUP=$(aws elbv2 describe-target-groups \
  --names adma-prod-feg \
  --query 'TargetGroups[0].TargetGroupArn' \
  --output text)

# Check target health status
aws elbv2 describe-target-health \
  --target-group-arn $TARGET_GROUP
Possible states:
State | Description | Action
--- | --- | ---
initial | Target is registering | Wait 30-60 seconds
healthy | Target is serving traffic | No action needed
unhealthy | Health checks failing | Check container logs
draining | Target is deregistering | Wait for replacement task
unused | Target is not registered | Check ECS service desired count
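When targets are still registering, it can help to poll instead of re-running the command by hand. The following is a sketch only: wait_for_healthy is a hypothetical helper, and the default of 12 attempts at 10-second intervals is an assumption you should tune to your health check settings.

```shell
# Poll a target group until every target reports "healthy", or give up.
# Usage: wait_for_healthy TARGET_GROUP_ARN [ATTEMPTS] [DELAY_SECONDS]
wait_for_healthy() {
  local tg_arn="$1" attempts="${2:-12}" delay="${3:-10}" states i=1
  while [ "$i" -le "$attempts" ]; do
    states=$(aws elbv2 describe-target-health \
      --target-group-arn "$tg_arn" \
      --query 'TargetHealthDescriptions[].TargetHealth.State' \
      --output text)
    echo "attempt $i: ${states:-no targets registered}"
    # Succeed only when at least one target exists and none is in another state.
    if [ -n "$states" ] && ! printf '%s\n' "$states" | tr '\t' '\n' | grep -qvx healthy; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

A typical call would be `wait_for_healthy "$TARGET_GROUP" || echo "targets did not become healthy"`, after which the debugging steps for unhealthy targets apply.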
Debug unhealthy targets:
  1. Check container health check:
    aws ecs describe-tasks \
      --cluster adma-cluster \
      --tasks TASK_ID \
      --query 'tasks[0].containers[0].healthStatus'
    
  2. Test health endpoint manually:
    # Get task private IP
    PRIVATE_IP=$(aws ecs describe-tasks \
      --cluster adma-cluster \
      --tasks TASK_ID \
      --query 'tasks[0].attachments[0].details[?name==`privateIPv4Address`].value' \
      --output text)
    
    # Test from bastion or another ECS task
    curl -I http://$PRIVATE_IP:80/
    
  3. Review frontend logs:
    aws logs tail /ecs/adma-prod-frontend --follow
    

API Returns 500 Errors

View application logs:
# Filter for ERROR level logs
aws logs filter-log-events \
  --log-group-name /ecs/adma-prod-backend \
  --filter-pattern "ERROR" \
  --start-time $(date -u -d '1 hour ago' +%s)000 \
  --query 'events[*].[timestamp,message]' \
  --output text
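The `date -u -d '1 hour ago'` form relies on GNU date and fails with the BSD date shipped on macOS. As a portable alternative, the sketch below computes the same millisecond timestamp with `date +%s` and shell arithmetic; epoch_ms_ago is a hypothetical helper name.

```shell
# Print the epoch time in milliseconds for N minutes ago (default: 60).
# Uses only `date +%s` and shell arithmetic, so it does not need GNU `date -d`.
epoch_ms_ago() {
  local minutes="${1:-60}"
  echo $(( ( $(date +%s) - $minutes * 60 ) * 1000 ))
}
```

It can then replace the inline date call, for example `--start-time "$(epoch_ms_ago 60)"`.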
Common issues:
  1. Database connection pool exhausted
    -- Check active connections
    SELECT count(*) FROM pg_stat_activity WHERE datname = 'urlshortener';
    
    -- Check max connections
    SHOW max_connections;
    
    Solution: Increase Hikari pool size in application.yml:
    spring:
      datasource:
        hikari:
          maximum-pool-size: 20  # Increase from 10
    
  2. JWT validation failures
    • Verify JWT_SECRET matches between environments
    • Check token expiration (JWT_EXPIRATION_MS)
    • Test token generation:
      curl -X POST https://api.yourdomain.com/api/auth/login \
        -H "Content-Type: application/json" \
        -d '{"email":"test@example.com","password":"password123"}'
      
  3. CORS errors
    • Check CORS_ALLOWED_ORIGINS includes frontend domain
    • Verify ALB listener rules route correctly
    • Test CORS preflight:
      curl -X OPTIONS https://api.yourdomain.com/api/urls \
        -H "Origin: https://yourdomain.com" \
        -H "Access-Control-Request-Method: POST" \
        -v
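Reading the verbose preflight output by eye is error-prone, so the check can be scripted. This is a minimal sketch: cors_allows_origin is a hypothetical helper that reads response headers on standard input, and it only inspects Access-Control-Allow-Origin (not the allowed methods or headers).

```shell
# Read HTTP response headers on stdin; succeed if Access-Control-Allow-Origin
# allows the given origin (exact match or the "*" wildcard).
# Usage: curl ... -D - | cors_allows_origin ORIGIN
cors_allows_origin() {
  local origin="$1" allowed
  allowed=$(tr -d '\r' | awk '{
    i = index($0, ":")
    if (i && tolower(substr($0, 1, i - 1)) == "access-control-allow-origin") {
      v = substr($0, i + 1)
      sub(/^[ \t]+/, "", v)
      print v
      exit
    }
  }')
  [ "$allowed" = "$origin" ] || [ "$allowed" = "*" ]
}
```

To use it against the preflight request above, dump the headers with `curl -s -o /dev/null -D - -X OPTIONS ...` and pipe the result into `cors_allows_origin https://yourdomain.com`.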
      

Database Issues

Connection Timeout

Check security groups:
# Get RDS security group
RDS_SG=$(aws rds describe-db-instances \
  --db-instance-identifier adma-prod-postgres \
  --query 'DBInstances[0].VpcSecurityGroups[0].VpcSecurityGroupId' \
  --output text)

# Check inbound rules
aws ec2 describe-security-groups \
  --group-ids $RDS_SG \
  --query 'SecurityGroups[0].IpPermissions[?ToPort==`5432`]'
Verify rule allows backend security group:
Expected Rule
{
  "FromPort": 5432,
  "ToPort": 5432,
  "IpProtocol": "tcp",
  "UserIdGroupPairs": [
    {
      "GroupId": "sg-backend-xxxxx"
    }
  ]
}
Add missing rule:
BACKEND_SG=$(aws ec2 describe-security-groups \
  --filters "Name=tag:Name,Values=adma-prod-backend-sg" \
  --query 'SecurityGroups[0].GroupId' \
  --output text)

aws ec2 authorize-security-group-ingress \
  --group-id $RDS_SG \
  --protocol tcp \
  --port 5432 \
  --source-group $BACKEND_SG
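authorize-security-group-ingress returns an error if the rule already exists, so a script may want to check first. The sketch below wraps the inbound-rule query in a hypothetical helper, rds_allows_backend, which succeeds when the backend group is already listed for port 5432.

```shell
# Return 0 if security group $1 already allows TCP 5432 from security group $2.
# Usage: rds_allows_backend RDS_SG_ID BACKEND_SG_ID
rds_allows_backend() {
  local rds_sg="$1" backend_sg="$2"
  aws ec2 describe-security-groups \
    --group-ids "$rds_sg" \
    --query 'SecurityGroups[0].IpPermissions[?ToPort==`5432`].UserIdGroupPairs[].GroupId' \
    --output text | tr '\t' '\n' | grep -qx "$backend_sg"
}
```

With it, the rule can be added conditionally: `rds_allows_backend "$RDS_SG" "$BACKEND_SG" || aws ec2 authorize-security-group-ingress ...` using the same arguments as the command above.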

High Connection Count

Monitor connection metrics:
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=adma-prod-postgres \
  --start-time $(date -u -d '1 hour ago' --iso-8601=seconds) \
  --end-time $(date -u --iso-8601=seconds) \
  --period 300 \
  --statistics Maximum,Average
Check max connections:
# Connect to RDS
psql -h adma-prod-postgres.xxxxx.eu-west-1.rds.amazonaws.com \
     -U appuser \
     -d urlshortener

-- Inside psql:
SHOW max_connections;
SELECT count(*) FROM pg_stat_activity;
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;
Solutions:
  1. Increase RDS max_connections (a static parameter, so the new value applies only after the instance reboots):
    aws rds modify-db-parameter-group \
      --db-parameter-group-name adma-prod-postgres-params \
      --parameters "ParameterName=max_connections,ParameterValue=200,ApplyMethod=pending-reboot"
    
  2. Reduce Hikari pool size per task:
    application.yml
    spring:
      datasource:
        hikari:
          maximum-pool-size: 5  # Reduce from 10
          minimum-idle: 2
    
  3. Scale down backend tasks (if overprovisioned):
    aws ecs update-service \
      --cluster adma-cluster \
      --service adma-backend \
      --desired-count 1
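The three solutions are linked by simple arithmetic: the pool size per task multiplied by the number of tasks has to stay below max_connections, with some headroom for admin sessions and monitoring. A minimal sketch of that budget follows; max_pool_per_task is a hypothetical helper, and the size of the reserve is an assumption you choose yourself.

```shell
# Largest per-task Hikari pool that keeps all tasks within the database limit.
# Usage: max_pool_per_task MAX_CONNECTIONS RESERVED TASK_COUNT
max_pool_per_task() {
  local max_conn="$1" reserved="$2" tasks="$3"
  [ "$tasks" -gt 0 ] || { echo "task count must be positive" >&2; return 1; }
  echo $(( ($max_conn - $reserved) / $tasks ))
}
```

For example, `max_pool_per_task 100 10 3` prints 30. Use the peak task count rather than the steady-state one, since old and new tasks can run side by side during a rolling deployment.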
    

Slow Query Performance

Enable RDS Performance Insights:
aws rds modify-db-instance \
  --db-instance-identifier adma-prod-postgres \
  --enable-performance-insights \
  --performance-insights-retention-period 7
Check slow query logs:
aws logs tail /aws/rds/instance/adma-prod-postgres/postgresql --follow
Analyze query patterns:
-- Enable pg_stat_statements extension
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Find slowest queries
SELECT 
  query,
  calls,
  total_exec_time,
  mean_exec_time,
  max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
Common optimizations:
  1. Add indexes for frequent lookups:
    CREATE INDEX idx_short_urls_short_code ON short_urls(short_code);
    CREATE INDEX idx_short_urls_user_status ON short_urls(user_id, link_status);
    
  2. Upgrade RDS instance class:
    aws rds modify-db-instance \
      --db-instance-identifier adma-prod-postgres \
      --db-instance-class db.t3.small \
      --apply-immediately
    

Networking Issues

ALB Not Routing Correctly

Check listener rules:
# Get ALB ARN
ALB_ARN=$(aws elbv2 describe-load-balancers \
  --names adma-prod-alb \
  --query 'LoadBalancers[0].LoadBalancerArn' \
  --output text)

# List listeners
aws elbv2 describe-listeners \
  --load-balancer-arn $ALB_ARN

# Get HTTPS listener ARN (port 443)
LISTENER_ARN=$(aws elbv2 describe-listeners \
  --load-balancer-arn $ALB_ARN \
  --query 'Listeners[?Port==`443`].ListenerArn' \
  --output text)

# Check rules
aws elbv2 describe-rules \
  --listener-arn $LISTENER_ARN \
  --query 'Rules[*].{Priority:Priority,Conditions:Conditions,Actions:Actions}'
Expected rules:
  1. Priority 10: /api/* → backend target group
  2. Priority 20: /{shortcode} → backend target group
  3. Default: /* → frontend target group
Fix missing rule:
BACKEND_TG=$(aws elbv2 describe-target-groups \
  --names adma-prod-backend-tg \
  --query 'TargetGroups[0].TargetGroupArn' \
  --output text)

aws elbv2 create-rule \
  --listener-arn $LISTENER_ARN \
  --priority 10 \
  --conditions '[{"Field":"path-pattern","Values":["/api/*"]}]' \
  --actions Type=forward,TargetGroupArn=$BACKEND_TG

HTTPS Certificate Issues

Check ACM certificate status:
aws acm list-certificates --region eu-west-1

# Describe specific certificate
aws acm describe-certificate \
  --certificate-arn arn:aws:acm:eu-west-1:ACCOUNT:certificate/xxxxx \
  --query 'Certificate.{Domain:DomainName,Status:Status,InUse:InUseBy}'
Validate certificate is attached to listener:
aws elbv2 describe-listeners \
  --listener-arns $LISTENER_ARN \
  --query 'Listeners[0].Certificates[0].CertificateArn'
Reissue certificate if expired:
aws acm request-certificate \
  --domain-name yourdomain.com \
  --subject-alternative-names "*.yourdomain.com" \
  --validation-method DNS \
  --region eu-west-1
Update listener with new certificate:
aws elbv2 modify-listener \
  --listener-arn $LISTENER_ARN \
  --certificates CertificateArn=arn:aws:acm:...

DNS Resolution Failures

Check Service Discovery:
# Get private DNS namespace
aws servicediscovery list-namespaces \
  --filters Name=TYPE,Values=DNS_PRIVATE

# Check backend service registration
aws servicediscovery list-services

# Test DNS resolution from frontend container
aws ecs execute-command \
  --cluster adma-cluster \
  --task FRONTEND_TASK_ID \
  --container frontend \
  --interactive \
  --command "/bin/sh"

# Inside container:
nslookup backend.adma.local
wget -O- http://backend.adma.local:8080/actuator/health
Verify environment variable:
aws ecs describe-task-definition \
  --task-definition adma-prod-frontend \
  --query 'taskDefinition.containerDefinitions[0].environment[?name==`BACKEND_UPSTREAM`]'
Expected value:
{
  "name": "BACKEND_UPSTREAM",
  "value": "backend.adma.local:8080"
}

Deployment Issues

Task Won’t Reach Steady State

Check service events:
aws ecs describe-services \
  --cluster adma-cluster \
  --services adma-backend \
  --query 'services[0].events[:10]'
Common messages:
Event Message | Cause | Solution
--- | --- | ---
service ... has reached a steady state | Deployment successful | None
(service ...) was unable to place a task | Resource constraints | Check subnet IPs, service quotas
(service ...) failed to launch a task with (error ECS was unable to assume the role) | IAM permissions | Fix task execution role
(service ...) has started 2 tasks: (task abc123) | Tasks starting | Wait or check task logs
Force rollback if stuck:
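# Get the previous task definition revision (the second newest; confirm it is a known good one)
# (--max-items is omitted: combined with --output text it can add an extra line to the output)
PREVIOUS_REV=$(aws ecs list-task-definitions \
  --family-prefix adma-backend \
  --status ACTIVE \
  --sort DESC \
  --query 'taskDefinitionArns[1]' \
  --output text)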

# Update service to previous revision
aws ecs update-service \
  --cluster adma-cluster \
  --service adma-backend \
  --task-definition $PREVIOUS_REV \
  --force-new-deployment

Circuit Breaker Triggered

View deployment circuit breaker events:
aws ecs describe-services \
  --cluster adma-cluster \
  --services adma-backend \
  --query 'services[0].deployments[*].{Status:status,RolloutState:rolloutState,FailedTasks:failedTasks}'
Investigate failed tasks:
# failedTasks in a deployment is a count, not a list of task IDs,
# so list the service's recently stopped tasks instead
FAILED_TASKS=$(aws ecs list-tasks \
  --cluster adma-cluster \
  --service-name adma-backend \
  --desired-status STOPPED \
  --query 'taskArns[]' \
  --output text)

# Describe each stopped task
for TASK in $FAILED_TASKS; do
  aws ecs describe-tasks \
    --cluster adma-cluster \
    --tasks "$TASK" \
    --query 'tasks[0].{StoppedReason:stoppedReason,Containers:containers[*].reason}'
done
Common causes:
  1. Health check failures - Adjust startPeriod in task definition
  2. Container crashes - Check application logs
  3. Resource constraints - Increase task CPU/memory
Disable circuit breaker temporarily:
aws ecs update-service \
  --cluster adma-cluster \
  --service adma-backend \
  --deployment-configuration \
    "deploymentCircuitBreaker={enable=false,rollback=false}"
Only disable circuit breaker for debugging. Re-enable after identifying root cause.
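Since the circuit breaker has to be switched back on afterwards, it helps to build the setting from one place. The sketch below is a hypothetical helper, circuit_breaker_config, that only formats the shorthand string with both the enable and rollback fields set explicitly. Whether rollback was enabled in your original configuration is an assumption to verify against your service definition.

```shell
# Build the --deployment-configuration value for the deployment circuit breaker.
# Usage: circuit_breaker_config ENABLE ROLLBACK   (each "true" or "false")
circuit_breaker_config() {
  echo "deploymentCircuitBreaker={enable=$1,rollback=$2}"
}
```

To re-enable it once the root cause is fixed, pass the result to the same command: `aws ecs update-service --cluster adma-cluster --service adma-backend --deployment-configuration "$(circuit_breaker_config true true)"`.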

Image Pull Errors

Check ECR repository permissions:
# Verify image exists
aws ecr describe-images \
  --repository-name adma/backend \
  --image-ids imageTag=latest

# Check repository policy
aws ecr get-repository-policy \
  --repository-name adma/backend
Verify task execution role has ECR permissions:
aws iam get-role \
  --role-name ecsTaskExecutionRole \
  --query 'Role.RoleName'

aws iam list-attached-role-policies \
  --role-name ecsTaskExecutionRole
Attach missing ECR policy:
aws iam attach-role-policy \
  --role-name ecsTaskExecutionRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
Test image pull manually:
aws ecr get-login-password --region eu-west-1 \
  | docker login --username AWS --password-stdin ACCOUNT_ID.dkr.ecr.eu-west-1.amazonaws.com

docker pull ACCOUNT_ID.dkr.ecr.eu-west-1.amazonaws.com/adma/backend:latest

Scaling Issues

Auto Scaling Not Triggering

Check scaling policies:
# List scaling policies
aws application-autoscaling describe-scaling-policies \
  --service-namespace ecs \
  --resource-id service/adma-cluster/adma-backend

# Check scaling alarms
aws cloudwatch describe-alarms \
  --alarm-name-prefix adma-prod-backend
Verify metrics are publishing:
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name CPUUtilization \
  --dimensions Name=ServiceName,Value=adma-backend Name=ClusterName,Value=adma-cluster \
  --start-time $(date -u -d '1 hour ago' --iso-8601=seconds) \
  --end-time $(date -u --iso-8601=seconds) \
  --period 60 \
  --statistics Average
Manually test scaling:
# Scale up
aws ecs update-service \
  --cluster adma-cluster \
  --service adma-backend \
  --desired-count 3

# Check if tasks start successfully
aws ecs describe-services \
  --cluster adma-cluster \
  --services adma-backend \
  --query 'services[0].{Running:runningCount,Desired:desiredCount}'
Adjust target values if needed:
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/adma-cluster/adma-backend \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name adma-backend-cpu \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration file://scaling-policy.json
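The command above reads its settings from scaling-policy.json, which is not shown here. As an illustration only, the sketch below writes a CPU-based target-tracking configuration; write_scaling_policy is a hypothetical helper, and the cooldown values (60 and 300 seconds) are assumptions rather than the project's actual settings.

```shell
# Write a target-tracking configuration for use with put-scaling-policy.
# Usage: write_scaling_policy FILE TARGET_CPU_PERCENT
write_scaling_policy() {
  cat > "$1" <<EOF
{
  "TargetValue": $2,
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
  },
  "ScaleOutCooldown": 60,
  "ScaleInCooldown": 300
}
EOF
}
```

Running `write_scaling_policy scaling-policy.json 70` produces a file that targets 70% average CPU. A lower target scales out earlier, at the cost of running more tasks.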

Scheduled Tasks Not Running

Check scheduled task configuration:
The cleanup job runs inside the backend container via Spring’s @Scheduled annotation:
ExpiredUrlCleanupService.java
@Scheduled(cron = "0 */15 * * * *")  // Every 15 minutes
public void cleanupExpiredUrls() {
  // ...
}
Verify backend is running:
aws ecs describe-services \
  --cluster adma-cluster \
  --services adma-backend \
  --query 'services[0].{Running:runningCount,Desired:desiredCount}'
Search logs for cleanup execution:
aws logs filter-log-events \
  --log-group-name /ecs/adma-prod-backend \
  --filter-pattern "ExpiredUrlCleanupService" \
  --start-time $(date -u -d '1 hour ago' +%s)000
Expected log output:
2026-03-04 10:15:00.123 INFO  ExpiredUrlCleanupService - Starting cleanup of expired URLs
2026-03-04 10:15:00.456 INFO  ExpiredUrlCleanupService - Marked 5 expired URLs as EXPIRED
2026-03-04 10:15:00.789 INFO  ExpiredUrlCleanupService - Deleted 3 EXPIRED and DELETED URLs
If you scale the backend beyond one task (desired_count > 1), the job runs on every task at the same time. Use ShedLock or migrate to ECS Scheduled Tasks to prevent duplicate execution.

Quick Diagnostics Commands

Run these commands for a quick health check:
Quick Health Check
#!/bin/bash

CLUSTER="adma-cluster"
REGION="eu-west-1"
export AWS_DEFAULT_REGION="$REGION"  # apply the region to every aws call below

echo "=== ECS Services ==="
aws ecs describe-services \
  --cluster $CLUSTER \
  --services adma-backend adma-frontend \
  --query 'services[*].{Name:serviceName,Running:runningCount,Desired:desiredCount,Status:status}' \
  --output table

printf '\n=== ALB Target Health ===\n'
for TG in $(aws elbv2 describe-target-groups --query 'TargetGroups[*].TargetGroupArn' --output text); do
  aws elbv2 describe-target-health --target-group-arn $TG --query 'TargetHealthDescriptions[*].{Target:Target.Id,Port:Target.Port,Health:TargetHealth.State}' --output table
done

printf '\n=== RDS Status ===\n'
aws rds describe-db-instances \
  --db-instance-identifier adma-prod-postgres \
  --query 'DBInstances[0].{Status:DBInstanceStatus,Endpoint:Endpoint.Address,PendingChanges:PendingModifiedValues}' \
  --output table

printf '\n=== Recent Errors ===\n'
aws logs filter-log-events \
  --log-group-name /ecs/adma-prod-backend \
  --filter-pattern "ERROR" \
  --start-time $(date -u -d '10 minutes ago' +%s)000 \
  --query 'events[*].[timestamp,message]' \
  --output text | head -10

printf '\n=== Service Events ===\n'
aws ecs describe-services \
  --cluster $CLUSTER \
  --services adma-backend \
  --query 'services[0].events[:5]' \
  --output table
Save as health-check.sh and run:
chmod +x health-check.sh
./health-check.sh

Rollback Procedures

Emergency Rollback Steps

1. Identify last stable version

git log --oneline -10
# Note the commit SHA of the last known good deployment
2. Find task definition revision

aws ecs list-task-definitions \
  --family-prefix adma-backend \
  --status ACTIVE \
  --sort DESC \
  --max-items 5
3. Update service to previous revision

aws ecs update-service \
  --cluster adma-cluster \
  --service adma-backend \
  --task-definition adma-backend:12 \
  --force-new-deployment
4. Monitor rollback progress

aws ecs describe-services \
  --cluster adma-cluster \
  --services adma-backend \
  --query 'services[0].deployments[*].{Status:status,TaskDef:taskDefinition,Running:runningCount}'
5. Verify application health

curl https://api.yourdomain.com/actuator/health
curl https://api.yourdomain.com/api/stats
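Instead of polling describe-services by hand during a rollback, the AWS CLI waiter `aws ecs wait services-stable` can block until the service settles. The sketch below combines it with a check of which task definition ended up active; confirm_rollback is a hypothetical helper name.

```shell
# Block until the service is stable again, then print the task definition in use.
# Usage: confirm_rollback CLUSTER SERVICE
confirm_rollback() {
  local cluster="$1" service="$2"
  # The waiter exits non-zero if the service does not stabilise in time.
  aws ecs wait services-stable --cluster "$cluster" --services "$service" || return 1
  aws ecs describe-services \
    --cluster "$cluster" \
    --services "$service" \
    --query 'services[0].deployments[?status==`PRIMARY`].taskDefinition' \
    --output text
}
```

Calling `confirm_rollback adma-cluster adma-backend` after the update-service step should print the ARN of the revision you rolled back to. If it prints a different revision, the rollback did not take effect.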

Rollback Using Specific Image Tag

1. Update task definition JSON

Edit infrastructure/task-def-backend.json:
"image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/adma/backend:a1b2c3d"
2. Register new task definition

aws ecs register-task-definition \
  --cli-input-json file://infrastructure/task-def-backend.json
3. Deploy

NEW_REV=$(aws ecs describe-task-definition \
  --task-definition adma-backend \
  --query 'taskDefinition.revision' \
  --output text)

aws ecs update-service \
  --cluster adma-cluster \
  --service adma-backend \
  --task-definition adma-backend:$NEW_REV \
  --force-new-deployment

Getting Help

If you’re still experiencing issues:
  1. Check AWS Service Health: https://status.aws.amazon.com/
  2. Review CloudWatch Logs: Look for stack traces and error patterns
  3. Enable ECS Exec: Connect directly to containers for debugging
  4. Contact Support: AWS Support Console or your team’s operations channel

Enabling ECS Exec for Debugging

# Enable ECS Exec on service
aws ecs update-service \
  --cluster adma-cluster \
  --service adma-backend \
  --enable-execute-command

# Connect to running task
TASK_ID=$(aws ecs list-tasks \
  --cluster adma-cluster \
  --service-name adma-backend \
  --desired-status RUNNING \
  --query 'taskArns[0]' \
  --output text | awk -F/ '{print $NF}')

aws ecs execute-command \
  --cluster adma-cluster \
  --task $TASK_ID \
  --container backend \
  --interactive \
  --command "/bin/bash"
ECS Exec requires Session Manager plugin:
# macOS
brew install --cask session-manager-plugin

# Linux (Debian/Ubuntu)
curl "https://s3.amazonaws.com/session-manager-downloads/plugin/latest/ubuntu_64bit/session-manager-plugin.deb" -o "session-manager-plugin.deb"
sudo dpkg -i session-manager-plugin.deb
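A missing plugin is a common reason for execute-command to fail on a new workstation, so a quick pre-flight check can save time. This is a sketch under the assumption that the installers above place a session-manager-plugin binary on your PATH; require_session_manager_plugin is a hypothetical helper name.

```shell
# Return 0 if the Session Manager plugin is on PATH; otherwise print a hint and return 1.
require_session_manager_plugin() {
  if command -v session-manager-plugin >/dev/null 2>&1; then
    return 0
  fi
  echo "session-manager-plugin not found: install it before running 'aws ecs execute-command'" >&2
  return 1
}
```

It can guard the connection step, for example `require_session_manager_plugin && aws ecs execute-command ...` with the arguments shown earlier.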
