Temporal Server uses Dead Letter Queues (DLQs) to handle tasks that fail processing after exhausting retry attempts. This prevents poison messages from blocking queue progress.
Overview
DLQs store failed tasks for:
- Replication Tasks - Cross-cluster replication failures
- History Tasks - Transfer, timer, visibility, archival task failures
Tasks move to DLQ when:
- Processing fails repeatedly
- Permanent error detected (e.g., corrupted data)
- Target namespace deleted
DLQ Types
Replication DLQ
Stores failed cross-cluster replication tasks:
- Namespace replication events
- Workflow history replication
- Task queue replication
Location: Per source cluster, per target cluster
History Task DLQ
Stores failed internal task processing:
- Transfer tasks (cross-workflow operations)
- Timer tasks (scheduled operations)
- Visibility tasks (search indexing)
- Archival tasks (long-term storage)
Location: Per history shard, per task category
Monitoring DLQs
DLQ Metrics
dlq_message_count
Type: Gauge
Description: Number of messages in DLQ by task category
Tags: task_category
Update Frequency: Every 3 hours from shard 1 owner
# View DLQ messages by category
dlq_message_count
# Alert on non-zero DLQ count
dlq_message_count > 0
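The alert condition above can also be checked by hand against a scraped metrics dump. The following is a minimal sketch, assuming the gauge is exposed in Prometheus text format as `dlq_message_count{task_category="..."} <value>` with no trailing timestamp. The exact metric prefix and label layout may differ in your deployment, and `nonzero_dlq_categories` is a hypothetical helper name.

```shell
# Hypothetical helper: reads Prometheus text-format metrics on stdin and
# prints "<task_category> <count>" for every dlq_message_count sample above zero.
# Assumes lines shaped like: dlq_message_count{task_category="transfer"} 3
nonzero_dlq_categories() {
  awk '
    /^dlq_message_count[{]/ {
      value = $NF
      if (value + 0 > 0 && match($0, /task_category="[^"]*"/)) {
        # The match spans task_category="<name>"; skip the 15-character
        # prefix and drop the closing quote to keep only <name>.
        print substr($0, RSTART + 15, RLENGTH - 16), value
      }
    }'
}
```

Pipe the output of your metrics endpoint into the function. Because the gauge is refreshed only every 3 hours, an empty result may lag behind the real DLQ state.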
Persistence Metrics
PersistenceEnqueueMessageToDLQ # Tasks moved to DLQ
PersistenceReadMessagesFromDLQ # DLQ reads
PersistenceDeleteMessageFromDLQ # Individual message deletion
PersistenceRangeDeleteMessagesFromDLQ # Bulk deletion
Task Categories
transfer # Transfer tasks (activities, child workflows, signals)
timer # Timer tasks (timeouts, retries, scheduled events)
visibility # Visibility updates (search indexing)
archival # Archival tasks (history upload)
Inspecting DLQ
Replication DLQ
List DLQ Messages
tctl admin dlq read \
--cluster source-cluster \
--namespace my-namespace
Output:
[
{
"taskId": 12345,
"taskType": "HistoryReplicationTask",
"namespaceId": "abc-123",
"workflowId": "my-workflow",
"runId": "run-456",
"firstEventId": 1,
"nextEventId": 10,
"version": 5
}
]
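To feed individual tasks into the merge or purge commands, the task IDs can be pulled out of this output. This is a rough sketch that assumes the JSON shape shown above, with numeric `taskId` fields. `dlq_task_ids` is a hypothetical helper, and a real JSON parser such as `jq` is more robust if it is available.

```shell
# Hypothetical helper: reads `tctl admin dlq read` output on stdin and prints
# one taskId per line. Assumes numeric "taskId" fields as in the sample output.
dlq_task_ids() {
  grep -o '"taskId": *[0-9][0-9]*' | grep -o '[0-9][0-9]*$'
}

# Usage (not run here):
#   tctl admin dlq read --cluster source-cluster --namespace my-namespace | dlq_task_ids
```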
Get DLQ Size
tctl admin dlq count \
--cluster source-cluster \
--namespace my-namespace
History Task DLQ
List Messages by Shard and Category
tctl admin queue dlq read \
--shard-id 1 \
--category transfer
Count Messages
tctl admin queue dlq count \
--shard-id 1 \
--category transfer
Recovering from DLQ
Replication DLQ
Merge Single Task
Reprocess one failed task:
tctl admin dlq merge \
--cluster source-cluster \
--namespace my-namespace \
--task-id 12345
Merge All Tasks
Reprocess all DLQ messages:
tctl admin dlq merge \
--cluster source-cluster \
--namespace my-namespace
Note: Large DLQs may take time to process. Monitor progress with the count command.
Merge with Filtering
# Merge tasks for specific workflow
tctl admin dlq merge \
--cluster source-cluster \
--namespace my-namespace \
--workflow-id my-workflow
History Task DLQ
Merge Tasks by Category
tctl admin queue dlq merge \
--shard-id 1 \
--category transfer
Merge Specific Task
tctl admin queue dlq merge \
--shard-id 1 \
--category transfer \
--min-message-id 12345 \
--max-message-id 12345
Merge Range of Tasks
tctl admin queue dlq merge \
--shard-id 1 \
--category transfer \
--min-message-id 10000 \
--max-message-id 20000
Purging DLQ
Warning: Purging permanently deletes tasks. Purge only when tasks cannot be recovered or are no longer needed.
Replication DLQ
Delete Single Task
tctl admin dlq purge \
--cluster source-cluster \
--namespace my-namespace \
--task-id 12345
Delete All Tasks
tctl admin dlq purge \
--cluster source-cluster \
--namespace my-namespace
History Task DLQ
Purge by Category
tctl admin queue dlq purge \
--shard-id 1 \
--category transfer
Purge Range
tctl admin queue dlq purge \
--shard-id 1 \
--category transfer \
--min-message-id 10000 \
--max-message-id 20000
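Because a purge cannot be undone, it can help to preview the exact command before running it. The wrapper below is a sketch of that idea: it reuses the purge flags shown above and prints the command unless an environment variable is set. The function name `purge_dlq_range` and the `CONFIRM` variable are hypothetical, not part of `tctl`.

```shell
# Hypothetical wrapper around the History Task DLQ purge command shown above.
# Prints the command by default; runs it only when CONFIRM=yes is set.
# The body runs in a subshell so its variables do not leak to the caller.
purge_dlq_range() (
  shard="$1"; category="$2"; min_id="$3"; max_id="$4"
  set -- tctl admin queue dlq purge \
    --shard-id "$shard" --category "$category" \
    --min-message-id "$min_id" --max-message-id "$max_id"
  if [ "${CONFIRM:-no}" = "yes" ]; then
    "$@"
  else
    echo "DRY RUN: $*"
  fi
)

# Usage (not run here):
#   purge_dlq_range 1 transfer 10000 20000              # preview only
#   CONFIRM=yes purge_dlq_range 1 transfer 10000 20000  # actually purge
```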
Common DLQ Scenarios
Scenario 1: Namespace Deleted on Target Cluster
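Cause: Replication tasks reference a namespace that no longer exists on the target cluster
Solution:
Recreate the namespace and merge the tasks, or purge the tasks if the namespace is no longer needed: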
# Recreate namespace on target cluster
tctl --cluster target-cluster namespace register my-namespace \
--clusters source-cluster target-cluster \
--active-cluster source-cluster
# Merge DLQ tasks
tctl admin dlq merge \
--cluster source-cluster \
--namespace my-namespace
# If namespace no longer needed, purge DLQ
tctl admin dlq purge \
--cluster source-cluster \
--namespace my-namespace
Scenario 2: Corrupted Task Data
Cause: Data corruption or schema mismatch
Solution:
# Read task details
tctl admin dlq read \
--cluster source-cluster \
--namespace my-namespace
# If corrupted, purge specific task
tctl admin dlq purge \
--cluster source-cluster \
--namespace my-namespace \
--task-id 12345
# Investigate root cause in logs
kubectl logs -l app=temporal-history | grep "task-id=12345"
Scenario 3: Transient Target Cluster Outage
Cause: Target cluster was unavailable during replication
Solution:
# Wait for target cluster to recover
# Verify target cluster health
tctl --cluster target-cluster cluster health
# Merge all DLQ tasks
tctl admin dlq merge \
--cluster source-cluster \
--namespace my-namespace
Scenario 4: High Volume of Transfer Tasks in DLQ
Cause: Downstream service failures or rate limiting
Solution:
# Check for pattern in failed tasks
tctl admin queue dlq read \
--shard-id 1 \
--category transfer | jq '.[] | .taskType' | sort | uniq -c
# Fix underlying issue (e.g., scale matching service)
# Merge the DLQ one shard at a time (shards 1 to 10 shown)
for i in {1..10}; do
  tctl admin queue dlq merge \
    --shard-id "$i" \
    --category transfer
done
Scenario 5: Visibility Task Failures
Cause: Elasticsearch indexing errors
Solution:
# Check Elasticsearch health
curl http://elasticsearch:9200/_cluster/health
# Recreate index if corrupted (deleting the index removes all visibility data stored in it)
curl -X DELETE http://elasticsearch:9200/temporal_visibility_v1
temporal-elasticsearch-setup --version v1
# Merge visibility DLQ on every shard (4096 shards assumed; use your cluster's shard count)
for shard in {1..4096}; do
  tctl admin queue dlq merge \
    --shard-id "$shard" \
    --category visibility
done
Automation
Automated DLQ Monitoring
Create monitoring script:
#!/bin/bash
# dlq-monitor.sh: warn when a namespace's replication DLQ is non-empty
CLUSTER="source-cluster"
NAMESPACES=("production" "staging")
for ns in "${NAMESPACES[@]}"; do
  # Take the first number in the output; an empty result is treated as 0 below
  count=$(tctl admin dlq count --cluster "$CLUSTER" --namespace "$ns" 2>/dev/null | grep -oE '[0-9]+' | head -n 1)
  if [ "${count:-0}" -gt 0 ]; then
    echo "WARNING: DLQ for namespace $ns has $count messages"
    # Alert to PagerDuty, Slack, etc.
  fi
done
Run via cron:
*/15 * * * * /usr/local/bin/dlq-monitor.sh
Automated DLQ Merge
Auto-merge after transient failures:
#!/bin/bash
# dlq-auto-merge.sh: merge small History Task DLQs left by transient failures
SHARD_COUNT=4096
CATEGORIES=("transfer" "timer" "visibility")
THRESHOLD=100 # Auto-merge only if the count is below this threshold
for shard in $(seq 1 "$SHARD_COUNT"); do
  for category in "${CATEGORIES[@]}"; do
    count=$(tctl admin queue dlq count \
      --shard-id "$shard" \
      --category "$category" 2>/dev/null | grep -oE '[0-9]+' | head -n 1)
    count=${count:-0} # Treat empty or failed output as zero
    if [ "$count" -gt 0 ] && [ "$count" -lt "$THRESHOLD" ]; then
      echo "Auto-merging $count tasks from shard $shard category $category"
      tctl admin queue dlq merge \
        --shard-id "$shard" \
        --category "$category"
    fi
  done
done
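The decision rule in the script above (merge only when the count is above zero and below the threshold) is easier to test when it is split out from the `tctl` calls. A minimal sketch, assuming the count command prints the number somewhere in its output; `first_number` and `should_auto_merge` are hypothetical helper names.

```shell
# first_number: prints the first run of digits found on stdin, or 0 if none.
first_number() {
  n=$(grep -oE '[0-9]+' | head -n 1) || true
  echo "${n:-0}"
}

# should_auto_merge COUNT THRESHOLD: succeeds only when 0 < COUNT < THRESHOLD.
should_auto_merge() {
  [ "$1" -gt 0 ] && [ "$1" -lt "$2" ]
}

# Usage (not run here):
#   count=$(tctl admin queue dlq count --shard-id 1 --category transfer | first_number)
#   if should_auto_merge "$count" 100; then
#     tctl admin queue dlq merge --shard-id 1 --category transfer
#   fi
```

Keeping the upper bound means a large backlog, which usually signals a persistent problem rather than a transient one, is left for a human to inspect.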
Dynamic Configuration
Tune DLQ behavior:
# config/dynamicconfig/production.yaml
# Max retries before moving to DLQ
history.transferProcessorMaxRetryCount:
  - value: 100
    constraints: {}
history.timerProcessorMaxRetryCount:
  - value: 100
    constraints: {}
# DLQ message batch size
history.dlqMaxMessageCount:
  - value: 1000
    constraints: {}
# Enable DLQ metrics emission
worker.dlqMetricsEmitterEnabled:
  - value: true
    constraints: {}
Best Practices
1. Monitor DLQ Size
Set up alerts on the dlq_message_count gauge so that any non-zero value (dlq_message_count > 0) is noticed.
2. Investigate Before Purging
Always inspect tasks before deletion:
tctl admin dlq read --cluster source-cluster --namespace my-namespace | less
3. Fix Root Cause
DLQ is a symptom, not the problem:
- Check target cluster health
- Verify namespace configuration
- Review error logs
- Check resource availability
4. Merge in Batches
For large DLQs, process incrementally:
# Process 1000 message IDs at a time, in non-overlapping inclusive ranges
for i in {0..10}; do
  start=$((i * 1000))
  end=$((start + 999))
  tctl admin queue dlq merge \
    --shard-id 1 \
    --category transfer \
    --min-message-id "$start" \
    --max-message-id "$end"
  sleep 60 # Pause between batches
done
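When the message ID span does not start at zero or is not a multiple of the batch size, the ranges are easier to generate than to hard-code. The sketch below assumes that `--min-message-id` and `--max-message-id` are both inclusive, as the "Merge Specific Task" example suggests by using the same ID for both. `batch_ranges` is a hypothetical helper name.

```shell
# Hypothetical helper: prints "start end" pairs covering MIN..MAX in
# consecutive, non-overlapping inclusive ranges of at most SIZE message IDs.
batch_ranges() (
  min="$1"; max="$2"; size="$3"
  start="$min"
  while [ "$start" -le "$max" ]; do
    end=$((start + size - 1))
    if [ "$end" -gt "$max" ]; then
      end="$max" # Clamp the final batch to the requested upper bound
    fi
    echo "$start $end"
    start=$((end + 1))
  done
)

# Usage (not run here):
#   batch_ranges 10000 20000 1000 | while read -r start end; do
#     tctl admin queue dlq merge --shard-id 1 --category transfer \
#       --min-message-id "$start" --max-message-id "$end"
#     sleep 60
#   done
```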
5. Regular DLQ Audits
Schedule weekly reviews:
# Weekly DLQ report
for ns in $(tctl namespace list | awk '/Name/ {print $2}'); do
  count=$(tctl admin dlq count --cluster source-cluster --namespace "$ns" 2>/dev/null)
  echo "$ns: $count"
done
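For a weekly review it is often enough to see only the namespaces with a backlog, plus a total. A small sketch that post-processes the `namespace: count` lines printed by the loop above, assuming each count starts with a number. `summarize_dlq_report` is a hypothetical helper name.

```shell
# Hypothetical helper: reads "namespace: count" lines on stdin, prints the
# namespaces whose count is above zero, then a final "total: N" line.
summarize_dlq_report() {
  awk -F': *' '
    { count = $2 + 0; total += count }
    count > 0 { print $1 ": " count }
    END { print "total: " (total + 0) }'
}
```

Lines whose count is empty or non-numeric are treated as zero, so a namespace whose count command failed is silently skipped. Check such namespaces separately.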
Troubleshooting
DLQ Metrics Not Updating
Cause: Metrics are emitted only by the owner of shard 1, and only every 3 hours
Solution:
# Find shard 1 owner
tctl admin shard describe --shard-id 1
# Check logs on that host
kubectl logs <pod-name> | grep DLQMetricsEmitter
Cannot Read DLQ
Cause: Insufficient permissions or wrong cluster
Solution:
# Verify cluster connectivity
tctl cluster health
# Check admin permissions
tctl admin cluster describe
Merge Fails
Cause: Tasks still failing for same reason
Solution:
# Check recent errors
tctl admin queue dlq read --shard-id 1 --category transfer
# Review history service logs
kubectl logs -l app=temporal-history --tail=1000 | grep DLQ
# Fix underlying issue before retrying merge