Temporal Server exposes metrics, structured logs, and distributed traces for observing cluster health and performance.
Metrics Collection
Temporal emits metrics using either Prometheus or StatsD backends. The metrics framework can be configured to use either Tally or OpenTelemetry.
Prometheus Configuration
Configure Prometheus metrics in your config.yaml:
global:
  metrics:
    prometheus:
      framework: "opentelemetry" # or "tally"
      listenAddress: "127.0.0.1:8000"
      handlerPath: "/metrics"
      loggerRPS: 0 # 0 means no limit
Framework Options:
tally - Legacy framework using uber-go/tally
opentelemetry - Modern OpenTelemetry-based metrics (recommended)
Configuration Fields:
listenAddress - Address where Prometheus scrapes metrics
handlerPath - HTTP endpoint path (default: /metrics)
loggerRPS - Rate limit for metric logger (0 = unlimited)
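To check what the endpoint at listenAddress and handlerPath actually serves, it can help to parse the Prometheus text exposition format into structured samples. The following is a minimal sketch, not part of Temporal: `parse_metrics` is a hypothetical helper, and it handles only simple `name{labels} value` lines.

```python
import re

# Matches `name{labels} value` or `name value`; a trailing timestamp is ignored.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# Matches one `key="value"` pair inside the braces, allowing escaped characters.
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match is None:  # skip anything that is not a sample line
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    # Hypothetical scrape output; real metric names depend on your configuration.
    sample = (
        "# TYPE service_requests counter\n"
        'service_requests{operation="StartWorkflowExecution",service_role="frontend"} 42\n'
        "service_pending_requests 3\n"
    )
    for name, labels, value in parse_metrics(sample):
        print(name, labels, value)
```

Against a running server, the input text could come from `urllib.request.urlopen("http://127.0.0.1:8000/metrics").read().decode()`, assuming the listenAddress and handlerPath shown above.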
StatsD Configuration
For StatsD integration:
global:
  metrics:
    statsd:
      framework: "opentelemetry"
      hostPort: "127.0.0.1:8125"
      prefix: "temporal"
      flushInterval: "1s"
      flushBytes: 1432
      reporter:
        tagSeparator: ":"
Configuration Fields:
hostPort - StatsD server address
prefix - Metric name prefix
flushInterval - Batch flush interval (default: 1s)
flushBytes - Maximum UDP packet size (default: 1432)
tagSeparator - Character to separate tags (optional)
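The flushBytes limit keeps each UDP datagram under a typical network MTU. As an illustration of that idea only (this is not Temporal's reporter code), the sketch below packs newline-separated StatsD lines into payloads no larger than the limit. `batch_lines` is a hypothetical helper, and the metric line in the usage example is made up.

```python
def batch_lines(lines, flush_bytes=1432):
    """Group metric lines into newline-joined payloads of at most flush_bytes.

    A single line longer than flush_bytes is emitted alone in its own
    (oversized) payload rather than being split.
    """
    payloads, current, size = [], [], 0
    for line in lines:
        encoded = line.encode("utf-8")
        # One extra byte for the newline separator when the payload is non-empty.
        extra = len(encoded) + (1 if current else 0)
        if current and size + extra > flush_bytes:
            payloads.append(b"\n".join(current))
            current, size = [], 0
            extra = len(encoded)
        current.append(encoded)
        size += extra
    if current:
        payloads.append(b"\n".join(current))
    return payloads


if __name__ == "__main__":
    # Hypothetical counter increments using the "temporal" prefix from the config.
    lines = ["temporal.service_requests:1|c"] * 100
    for payload in batch_lines(lines):
        print(len(payload), "bytes")
```

Each returned payload could then be sent with a single `socket.sendto` call to the hostPort address.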
Common Metrics Configuration
global:
  metrics:
    clientConfig:
      tags:
        environment: "production"
        cluster: "us-east-1"
      excludeTags:
        namespace:
          - "system-namespace" # allow-listed value; other values are replaced
      prefix: "temporal_"
      perUnitHistogramBoundaries:
        dimensionless: [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]
        milliseconds: [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000]
        bytes: [1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288]
Options:
tags - Global tags added to all metrics
excludeTags - For each listed tag, values not in its allow-list are replaced with _tag_excluded_
prefix - Prefix for all metric names
perUnitHistogramBoundaries - Custom histogram buckets by unit type
withoutUnitSuffix - Remove unit suffixes (OpenTelemetry only)
withoutCounterSuffix - Remove _total suffix from counters (OpenTelemetry only)
recordTimerInSeconds - Emit timers in seconds instead of milliseconds
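The interaction between tags and excludeTags can be sketched in a few lines. This is an illustration under assumptions, not Temporal's implementation: `apply_tags` is a hypothetical helper, and letting per-metric tags override global tags is an assumed precedence.

```python
EXCLUDED = "_tag_excluded_"


def apply_tags(metric_tags, global_tags, exclude_tags):
    """Merge global tags into a metric's tags, then mask non-allow-listed values.

    exclude_tags maps a tag name to the list of values that may be reported
    as-is; any other value for that tag is replaced with EXCLUDED.
    """
    # Assumption: a tag set on the metric itself wins over a global tag.
    merged = {**global_tags, **metric_tags}
    for tag, allowed in exclude_tags.items():
        if tag in merged and merged[tag] not in allowed:
            merged[tag] = EXCLUDED
    return merged


if __name__ == "__main__":
    global_tags = {"environment": "production", "cluster": "us-east-1"}
    exclude_tags = {"namespace": ["system-namespace"]}
    print(apply_tags({"namespace": "customer-a"}, global_tags, exclude_tags))
    print(apply_tags({"namespace": "system-namespace"}, global_tags, exclude_tags))
```

Masking tag values this way bounds the number of distinct time series, which matters when a tag such as namespace can take many values.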
Key Metrics by Service
Service Health Metrics
These metrics track overall service health:
Service Request Metrics
service_requests # Total RPC requests received
service_pending_requests # Current pending requests (gauge)
service_errors # Unexpected service errors
service_error_with_type # Errors by error type
service_latency # Request latency
service_latency_nouserlatency # Server-side latency only
service_latency_userlatency # User workflow latency
Common Tags:
operation - API method name
service_role - Service type (frontend, history, matching, worker)
Persistence Layer Metrics
Track database operations:
# Shard Operations
GetOrCreateShard
UpdateShard
AssertShardOwnership
# Workflow Execution
CreateWorkflowExecution
GetWorkflowExecution
UpdateWorkflowExecution
DeleteWorkflowExecution
# Task Queue Operations
CreateTaskQueue
GetTaskQueue
UpdateTaskQueue
DeleteTaskQueue
# Task Operations
GetTransferTasks
CompleteTransferTask
GetTimerTasks
CompleteTimerTask
GetVisibilityTasks
GetReplicationTasks
# History Operations
AppendHistoryNodes
ReadHistoryBranch
DeleteHistoryBranch
Each persistence operation emits:
Request count
Error count
Latency histogram
db_kind tag (cassandra, mysql, postgres, sqlite)
History Service Metrics
# Core Operations
StartWorkflowExecution
RecordActivityTaskHeartbeat
RespondWorkflowTaskCompleted
RespondActivityTaskCompleted
# Shard Management
ShardController
ShardInfo
# Task Processing
TransferQueueProcessor
TimerQueueProcessor
VisibilityQueueProcessor
ArchivalQueueProcessor
OutboundQueueProcessor
# Cache Performance
HistoryCacheGetOrCreate
EventsCacheGetEvent
EventsCachePutEvent
Matching Service Metrics
PollWorkflowTaskQueue
PollActivityTaskQueue
AddActivityTask
AddWorkflowTask
TaskQueueMgr
TaskQueuePartitionManager
Authorization Metrics
service_authorization_latency # Authorization check duration
service_errors_unauthorized # Rejected requests
service_errors_authorize_failed # Authorization system errors
Tagged with:
namespace - Target namespace
operation - API being authorized
Error Tracking
Error metrics by type:
service_errors_invalid_argument
service_errors_namespace_not_active
service_errors_resource_exhausted
service_errors_entity_not_found
service_errors_execution_already_started
service_errors_context_timeout
service_errors_retry_task
service_errors_incomplete_history
service_errors_nondeterministic
Resource Metrics
Lock and Semaphore Usage
lock_requests # Lock acquisition attempts
lock_latency # Time waiting for locks
semaphore_requests # Semaphore acquisition attempts
semaphore_failures # Failed semaphore acquisitions
semaphore_latency # Time waiting for semaphore
Cache Metrics
NamespaceCache
EventsCacheGetEvent
EventsCachePutEvent
EventsCacheGetFromStore
VersionMembershipCacheGet
VersionMembershipCachePut
Tagged with cache_type:
mutablestate
events
version_membership
routing_info
TLS Certificate Monitoring
certificates_expired # Number of expired certificates (gauge)
certificates_expiring # Number of certificates expiring soon (gauge)
Configure certificate monitoring:
global:
  tls:
    expirationChecks:
      warningWindow: "720h" # 30 days
      errorWindow: "168h" # 7 days
      checkInterval: "1h"
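How the two windows partition a certificate's remaining lifetime can be sketched as follows. This is an illustration of the windowing logic only: `classify_certificate` and its return labels are hypothetical, and how the server maps these states onto log levels and the gauges above is not shown here.

```python
from datetime import datetime, timedelta, timezone


def classify_certificate(not_after, now, warning_window, error_window):
    """Classify a certificate by the time remaining until it expires."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "expired"
    if remaining <= error_window:  # e.g. within 168h
        return "error"
    if remaining <= warning_window:  # e.g. within 720h
        return "warning"
    return "ok"


if __name__ == "__main__":
    now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    warning = timedelta(hours=720)  # matches warningWindow above
    error = timedelta(hours=168)  # matches errorWindow above
    for days in (-1, 3, 20, 90):
        state = classify_certificate(now + timedelta(days=days), now, warning, error)
        print(days, "days:", state)
```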
Alerting Guidelines
Critical Alerts
Set up alerts for:
Service Availability
sum(rate(service_errors[5m])) / sum(rate(service_requests[5m])) > 0.05 # 5% error rate
service_pending_requests > 1000
Persistence Layer
rate(UpdateShard_errors[1m]) > 0
histogram_quantile(0.99, sum by (le) (rate(GetWorkflowExecution_latency_bucket[5m]))) > 1000
Shard Health
rate(ShardController_errors[5m]) > 0
Certificate Expiration
certificates_expired > 0
certificates_expiring > 0
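An error-rate threshold compares the growth of the error counter with the growth of the request counter over the same window. The sketch below shows that arithmetic on two raw counter samples. It is a simplification: Prometheus's `rate()` also extrapolates to the window edges, and `counter_rate` and `error_ratio` are hypothetical helper names.

```python
def counter_rate(first, last):
    """Per-second increase between two (timestamp_seconds, value) counter samples."""
    (t0, v0), (t1, v1) = first, last
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    delta = v1 - v0
    if delta < 0:
        # The counter went backwards, so the process restarted: count from zero.
        delta = v1
    return delta / (t1 - t0)


def error_ratio(error_samples, request_samples):
    """Fraction of requests that failed over the sampled window."""
    request_rate = counter_rate(*request_samples)
    if request_rate == 0:
        return 0.0  # no traffic, so nothing to alert on
    return counter_rate(*error_samples) / request_rate


if __name__ == "__main__":
    # Hypothetical samples of service_errors and service_requests, 5 minutes apart.
    errors = ((0, 10), (300, 40))
    requests = ((0, 1000), (300, 1500))
    ratio = error_ratio(errors, requests)
    print("alert" if ratio > 0.05 else "ok", round(ratio, 4))
```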
Warning Alerts
High Latency
histogram_quantile(0.95, sum by (le) (rate(service_latency_bucket[5m]))) > 500
Resource Pressure
semaphore_failures > 100
lock_latency > 100
Cache Efficiency
rate(EventsCacheGetFromStore[5m]) / rate(EventsCacheGetEvent[5m]) > 0.5
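The latency alerts rely on `histogram_quantile`, which estimates a percentile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below approximates that estimate for classic histograms. It is a simplified illustration, not Prometheus's code, and the bucket counts in the usage example are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, prev_count = buckets[index - 1] if index > 0 else (0.0, 0.0)
    # Assume observations are spread evenly within the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


if __name__ == "__main__":
    # Hypothetical cumulative counts for latency buckets in milliseconds.
    buckets = [(100, 50), (200, 90), (500, 100), (math.inf, 100)]
    print(histogram_quantile(0.95, buckets))
```

Because the estimate interpolates within a bucket, its accuracy depends on the bucket boundaries, which is one reason to tune perUnitHistogramBoundaries around the thresholds you alert on.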
Logging Configuration
Configure structured logging:
log:
  stdout: true
  level: "info" # debug, info, warn, error
  outputFile: "/var/log/temporal/server.log"
  encoding: "json" # json or console
Log Levels:
debug - Detailed diagnostic information
info - General operational events
warn - Warning messages, degraded state
error - Error events, requires attention
Important Log Tags:
namespace - Namespace name
workflowID - Workflow execution ID
runID - Workflow run ID
operation - Operation being performed
error - Error details
shard-id - History shard ID
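With json encoding, each log line can be filtered by level and tag using a few lines of code. The sketch below assumes each line is a JSON object with a `level` field and uses the tag names listed above as keys. The exact field names in your server's output may differ, and `filter_logs` is a hypothetical helper.

```python
import json

LEVELS = {"debug": 0, "info": 1, "warn": 2, "error": 3}


def filter_logs(lines, min_level="warn", tags=None):
    """Return parsed log entries at or above min_level that match every tag."""
    tags = tags or {}
    threshold = LEVELS[min_level]
    matches = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON (e.g. console encoding)
        if not isinstance(entry, dict):
            continue
        level = LEVELS.get(str(entry.get("level", "")).lower())
        if level is None or level < threshold:
            continue
        if all(entry.get(key) == value for key, value in tags.items()):
            matches.append(entry)
    return matches


if __name__ == "__main__":
    # Hypothetical log lines; field names are assumptions.
    lines = [
        '{"level": "info", "msg": "started", "namespace": "default"}',
        '{"level": "error", "msg": "update failed", "namespace": "default", "shard-id": 7}',
        '{"level": "error", "msg": "update failed", "namespace": "other", "shard-id": 9}',
    ]
    for entry in filter_logs(lines, min_level="warn", tags={"namespace": "default"}):
        print(entry["msg"], entry["shard-id"])
```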
Health Checks
Temporal exposes health check endpoints:
# Frontend health
curl http://localhost:7233/health
# Service-specific health
curl http://localhost:7234/health # History
curl http://localhost:7235/health # Matching
Distributed Tracing
Enable OpenTelemetry tracing:
global:
  otel:
    enabled: true
    exporters:
      - type: "otlp"
        endpoint: "otel-collector:4317"
        insecure: false
        headers:
          api-key: "your-api-key"
Tracing captures:
Request flow across services
Persistence operation timing
Cross-namespace operations
Replication latency
Dashboard Recommendations
Service Overview Dashboard
Request rate by service and operation
Error rate and types
Latency percentiles (p50, p95, p99)
Active connections
Persistence Dashboard
Operation latency by type
Error rates by operation
Connection pool utilization
Query duration
Workflow Execution Dashboard
Workflow start rate
Workflow completion rate
Task queue backlog
Activity timeouts
Resource Usage Dashboard
CPU and memory per service
Lock contention
Cache hit rates
GC pause time