Email sending must be reliable because client applications depend on Datamailer for email verification, password reset, course notifications, and campaigns. This guide covers the monitoring signals to watch, the throttling controls available, the idempotency rules all workers must follow, and the recovery procedures available when things go wrong.
Monitoring Checklist
Every environment should have CloudWatch alarms and a dashboard covering the following signals. The CloudFormation template at infra/cloudformation/datamailer-mvp.json codifies these alarms.
SES Delivery
- Bounce rate (alarm if rising toward SES account threshold)
- Complaint rate (alarm immediately on any meaningful rise)
- Send failures from SES API errors
SQS Queues
- Age of oldest message per queue (transactional queue especially)
- DLQ depth per queue (any depth above zero should alert)
Lambda Workers
- Error count and error rate per worker function
- Throttle count (indicates concurrency limits are being hit)
- Duration approaching the function timeout
Postgres
- CPU utilization
- Storage space remaining
- Active connection count
- Slow query log entries
Campaigns
- Campaigns stuck in sending status longer than expected
- Transactional email queued longer than acceptable latency threshold
Web Host
- /health/ endpoint returning non-200
- Gunicorn error rate and response latency
Throttling Controls
SES accounts have a daily sending quota and a per-second send rate. Lambda can scale out faster than SES accepts sends. The following controls are applied in combination to stay within SES limits.
- Lambda reserved concurrency: hard cap on the number of concurrent Lambda executions per worker. Start with transactional 4, campaign 2, SES webhooks 2, email events 1.
- SQS event-source maximum concurrency: limits how many Lambda instances the SQS event-source mapping will invoke simultaneously, independent of reserved concurrency.
- Small batch sizes: keep the number of campaign_recipient_ids per SQS message small so each Lambda invocation does bounded work.
- Per-client limits: apply an app-level token bucket or per-client/audience campaign limits if a single client's volume risks the overall account send rate.
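The app-level token bucket mentioned in the last item could look roughly like the sketch below. The class name, parameters, and the injectable clock are illustrative assumptions, not Datamailer code. An in-process bucket only limits a single worker instance; a limit shared across Lambda instances would need shared state such as a Postgres row.

```python
import time


class TokenBucket:
    """Allow roughly `rate` sends per second, with bursts of at most `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum tokens the bucket can hold
        self.clock = clock        # injectable so the bucket can be tested
        self.tokens = capacity
        self.updated = clock()

    def try_acquire(self, cost=1.0):
        """Take `cost` tokens if available; return False when the caller should wait."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A worker would call try_acquire() before each SES send and leave the message on the queue (or sleep briefly) when it returns False.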
Transactional and campaign queues must have separate concurrency controls. A newsletter campaign blast must not delay password reset or email verification messages. If a campaign is overwhelming the system, lower campaign-email concurrency first; never touch transactional-email concurrency in response to campaign pressure.
Idempotency Rules
SQS provides at-least-once delivery, so every worker must be idempotent: a duplicate message must either no-op or converge on the same database state.
Campaign Recipients
Before sending, load the campaign_recipients row from Postgres and check its current status. Rows in any terminal state (sent, skipped, bounced, complained) must be acknowledged without another SES API call.
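A minimal sketch of that check, assuming the row has already been loaded into a dict with a status key. The function names and the send callback are hypothetical; treating a missing row as a no-op is also an assumption.

```python
from typing import Callable, Optional

# Terminal states as listed in this guide.
TERMINAL_STATUSES = {"sent", "skipped", "bounced", "complained"}


def should_send(recipient_row: Optional[dict]) -> bool:
    """Return True only when the row exists and is not in a terminal state."""
    if recipient_row is None:
        return False  # assumption: a missing row is acknowledged without sending
    return recipient_row["status"] not in TERMINAL_STATUSES


def handle_job(recipient_row: Optional[dict], send: Callable[[dict], None]) -> str:
    """Acknowledge duplicates without calling SES; send everything else once."""
    if not should_send(recipient_row):
        return "acknowledged"
    send(recipient_row)
    return "sent"
```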
Transactional Messages
Before sending, load the transactional_messages row and check the (client_id, idempotency_key) pair. If a message with the same pair has already reached a terminal state, acknowledge the job without another SES call.
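The lookup might be sketched as follows, using sqlite3 as a stand-in for Postgres so the example runs anywhere. The status column and the set of terminal states for transactional messages are assumptions; this guide does not list them.

```python
import sqlite3

# Assumed terminal states for transactional messages (not specified in this guide).
TERMINAL_STATUSES = ("sent", "bounced", "complained")


def already_handled(conn, client_id, idempotency_key):
    """Return True if a message with this pair has already reached a terminal state."""
    placeholders = ", ".join("?" for _ in TERMINAL_STATUSES)
    row = conn.execute(
        "SELECT 1 FROM transactional_messages "
        "WHERE client_id = ? AND idempotency_key = ? "
        f"AND status IN ({placeholders}) LIMIT 1",
        (client_id, idempotency_key, *TERMINAL_STATUSES),
    ).fetchone()
    return row is not None
```

A worker would acknowledge the SQS message immediately when already_handled() returns True.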
Tracking Events
Tracking events (opens, clicks, unsubscribes) may be appended to email_events more than once when product policy allows. However, summary fields (first open timestamp, unique click flag, unsubscribe state) must use the idempotency_key, tracking_token, and source row IDs to distinguish total counts from unique counts and avoid double-counting.
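One way to keep totals and unique counts apart is sketched below for opens. The OpenSummary structure and its in-memory sets are illustrative assumptions; in practice the same distinctions would be enforced against Postgres rows.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class OpenSummary:
    total_opens: int = 0
    unique_opens: int = 0
    first_open_at: Optional[datetime] = None
    seen_keys: set = field(default_factory=set)    # idempotency keys already applied
    seen_tokens: set = field(default_factory=set)  # tracking tokens already counted


def record_open(summary, idempotency_key, tracking_token, occurred_at):
    """Apply one open event; a redelivered job with the same key changes nothing."""
    if idempotency_key in summary.seen_keys:
        return
    summary.seen_keys.add(idempotency_key)
    summary.total_opens += 1  # every distinct event counts toward the total
    if tracking_token not in summary.seen_tokens:
        summary.seen_tokens.add(tracking_token)
        summary.unique_opens += 1  # each token counts once toward uniques
    if summary.first_open_at is None or occurred_at < summary.first_open_at:
        summary.first_open_at = occurred_at
```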
SES Webhook Events
Deduplicate SES webhook events by provider_event_id when present. The webhook processor must correlate ses_message_id to campaign_recipients or transactional_messages before appending email_events or updating summary columns.
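The ordering of those two checks can be sketched as follows. The event dict shape (including the type key), the in-memory seen_event_ids set, and the message_index mapping are hypothetical stand-ins for database lookups.

```python
def process_ses_event(event, seen_event_ids, message_index, email_events):
    """Deduplicate, correlate, then append. Returns what happened to the event."""
    event_id = event.get("provider_event_id")
    # Step 1: drop events already processed (only possible when an ID is present).
    if event_id is not None and event_id in seen_event_ids:
        return "duplicate"
    # Step 2: correlate ses_message_id to a known recipient or message row.
    target = message_index.get(event["ses_message_id"])
    if target is None:
        return "uncorrelated"  # nothing is appended for unknown messages
    # Step 3: only now record the event.
    if event_id is not None:
        seen_event_ids.add(event_id)
    email_events.append({"target": target, "type": event["type"]})
    return "appended"
```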
Postgres Connection Management
Lambda concurrency can exhaust Postgres connections quickly if unchecked. Apply these controls from the start:
- Conservative concurrency at launch (see throttling controls above).
- Short database transactions: open a connection, do the work, close it promptly. Avoid holding connections across SES API calls.
- Small batch sizes: limit per-invocation connection hold time.
- Separate credentials: datamailer_app for the web host, datamailer_worker for Lambda, with least privilege for each.
- RDS Proxy: add when sustained Lambda worker pressure triggers DatabaseConnections alarms or connection wait errors. Lower event-source maximum concurrency first as the immediate mitigation.
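The "short database transactions" rule can be illustrated with a worker that never holds a connection while waiting on SES. This is a sketch only: sqlite3 stands in for Postgres, and the connect and ses_send callables, the email column, and the ses_message_id column on campaign_recipients are assumptions.

```python
import sqlite3
from contextlib import closing


def send_recipient(connect, ses_send, recipient_id):
    """Read, send, and record in separate steps so no connection spans the SES call."""
    # Transaction 1: read the row, then release the connection immediately.
    with closing(connect()) as conn, conn:
        row = conn.execute(
            "SELECT email, status FROM campaign_recipients WHERE id = ?",
            (recipient_id,),
        ).fetchone()
    if row is None or row[1] != "pending":
        return "skipped"

    # No database connection is open while waiting on the network call.
    ses_message_id = ses_send(row[0])

    # Transaction 2: record the outcome; the status guard avoids overwriting
    # a row that another worker has already finished.
    with closing(connect()) as conn, conn:
        conn.execute(
            "UPDATE campaign_recipients SET status = 'sent', ses_message_id = ? "
            "WHERE id = ? AND status = 'pending'",
            (ses_message_id, recipient_id),
        )
    return "sent"
```

Two duplicate jobs could still both read pending before either writes, so a production worker would likely also claim the row in the first transaction.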
Queue Cost
SQS queue cost is not a meaningful cost driver relative to SES delivery and database costs. For reference:
- Idle polling: with 4 queues, 2 Lambda pollers per queue, and 20-second long polling, approximately 1,071,360 receive requests/month are generated. After the 1M request free tier, this costs roughly $0.67/month.
- Campaign sends: at 720,000 campaign emails/month with 10 messages per SQS batch, the combined SendMessageBatch, ReceiveMessage, and DeleteMessage requests total approximately 216,000/month, adding roughly $0.09/month before the free tier.
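The request counts above can be reproduced with a few lines of arithmetic. The sketch assumes each poller issues one receive request per long-poll interval and that each campaign batch costs three requests (one send, one receive, one delete). The idle-polling figure matches a 31-day month.

```python
SECONDS_PER_DAY = 86_400


def idle_receive_requests(queues, pollers_per_queue, wait_seconds, days):
    """Receive requests issued by pollers against empty queues over `days`."""
    polls_per_poller = days * SECONDS_PER_DAY // wait_seconds
    return queues * pollers_per_queue * polls_per_poller


def campaign_requests(emails, batch_size, calls_per_batch=3):
    """Send + receive + delete requests needed to move `emails` through SQS."""
    batches = -(-emails // batch_size)  # ceiling division
    return batches * calls_per_batch


print(idle_receive_requests(queues=4, pollers_per_queue=2, wait_seconds=20, days=31))  # 1071360
print(campaign_requests(emails=720_000, batch_size=10))  # 216000
```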
Recovery Procedures
Retry Failed Campaign Recipients
- Pause the campaign in the product UI or Django admin to stop new send attempts.
- Investigate the root cause — check Lambda logs, DLQ messages, and SES error responses.
- Fix the root cause before retrying.
- Re-enqueue campaign-email messages only for recipients still in a non-terminal state (pending or failed). Idempotency prevents re-sending rows already marked sent.
- Resume the campaign and confirm the campaign-email queue drains and the DLQ stays empty.
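The re-enqueue step could be sketched as below. The message body shape (a campaign_id plus a list of campaign_recipient_ids) and the default batch size are assumptions; only the filtering on pending and failed comes from the procedure above.

```python
import json

# Only these states are eligible for a retry, per the procedure above.
RETRYABLE_STATUSES = {"pending", "failed"}


def build_retry_messages(recipients, campaign_id, batch_size=10):
    """Return SQS message bodies covering only non-terminal recipients."""
    ids = [r["id"] for r in recipients if r["status"] in RETRYABLE_STATUSES]
    return [
        json.dumps(
            {
                "campaign_id": campaign_id,
                "campaign_recipient_ids": ids[i:i + batch_size],
            }
        )
        for i in range(0, len(ids), batch_size)
    ]
```

Each returned body would then be sent to the campaign-email queue.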
Replay DLQ Messages
- Identify the DLQ with the active alarm.
- Sample messages using aws sqs receive-message without deleting them; inspect the body and check Lambda logs for the matching messageId or idempotency_key.
- Fix the root cause before replaying anything.
- Replay by sending the same message body back to the source queue, then delete the DLQ copy.
- For campaign jobs, verify recipient state in Postgres first so idempotency prevents duplicate sends.
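The replay step might be scripted roughly as follows. The sqs argument is assumed to be a boto3 SQS client (its receive_message, send_message, and delete_message methods are used as documented); the function name and queue URL parameters are hypothetical. Message attributes are not copied in this sketch.

```python
def replay_dlq(sqs, dlq_url, source_url, max_messages=10):
    """Move up to `max_messages` (SQS allows at most 10 per receive) back to the source queue."""
    response = sqs.receive_message(
        QueueUrl=dlq_url,
        MaxNumberOfMessages=max_messages,
        WaitTimeSeconds=1,
    )
    replayed = 0
    for message in response.get("Messages", []):
        # Send the same body back first, so a failure here leaves the DLQ copy intact.
        sqs.send_message(QueueUrl=source_url, MessageBody=message["Body"])
        sqs.delete_message(QueueUrl=dlq_url, ReceiptHandle=message["ReceiptHandle"])
        replayed += 1
    return replayed
```

Sending before deleting means a crash between the two calls produces a duplicate rather than a lost message, which the idempotency rules are designed to absorb.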
Recompute Campaign Aggregate Counters
If campaign stats diverge from the underlying campaign_recipients and email_events tables (e.g. after a DLQ replay or partial batch failure), recompute aggregate counters from the source rows. This is a read-then-write operation and is safe to run multiple times.
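A sketch of the recomputation, assuming rows are available as dicts. The campaign_recipient_id and event_type keys and the particular counters chosen are hypothetical. Because the result is derived entirely from source rows and overwrites rather than increments, running it twice yields the same values.

```python
from collections import Counter


def recompute_counters(recipient_rows, event_rows):
    """Derive campaign aggregates purely from source rows."""
    status_counts = Counter(r["status"] for r in recipient_rows)
    opens = [e for e in event_rows if e["event_type"] == "open"]
    return {
        "sent": status_counts["sent"],
        "bounced": status_counts["bounced"],
        "complained": status_counts["complained"],
        "total_opens": len(opens),
        # Unique opens count each recipient at most once.
        "unique_opens": len({e["campaign_recipient_id"] for e in opens}),
    }
```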
Manually Suppress a Contact
If the ses-webhooks worker is degraded or DLQ messages require replay, it may be necessary to manually suppress a contact to prevent further sends to a bounced or complained address. Add a suppression record directly through the Django admin or via a management command before resuming sends.
Pause and Resume a Campaign
- Pause the campaign through the product UI or Django admin.
- If queue pressure continues, disable or reduce campaign-email event-source maximum concurrency.
- Investigate failed recipients and DLQ messages.
- Resume only pending or failed recipients after the root cause is confirmed fixed.
- Confirm the transactional-email queue age stayed below its alarm threshold throughout the campaign incident; transactional email must not have been delayed.
Local Development with LocalStack
For local development, Datamailer supports LocalStack to emulate SQS and SES without real AWS credentials. The docker-compose.yml aws-local profile starts LocalStack on port 4566 with SQS and SES services enabled. Point the application at LocalStack by setting AWS_ENDPOINT_URL=http://localhost:4566 in your .env file alongside the standard SQS_*_QUEUE_URL environment variables.
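How the application consumes AWS_ENDPOINT_URL is not described here; one plausible approach is to build the boto3 client arguments from the environment, as in the sketch below. The helper name, the default region, and the placeholder credentials are assumptions.

```python
import os


def aws_client_kwargs(env=os.environ):
    """Build keyword arguments for boto3.client() from environment settings."""
    kwargs = {"region_name": env.get("AWS_DEFAULT_REGION", "us-east-1")}
    endpoint = env.get("AWS_ENDPOINT_URL")
    if endpoint:
        # Placeholder credentials for the local emulator (assumed values).
        kwargs.update(
            endpoint_url=endpoint,
            aws_access_key_id="test",
            aws_secret_access_key="test",
        )
    return kwargs


# Usage (requires boto3):
#   import boto3
#   sqs = boto3.client("sqs", **aws_client_kwargs())
```

When AWS_ENDPOINT_URL is unset, the helper returns only the region, so the same code path works against real AWS with the normal credential chain.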