Monitoring and Observability

Temporal Server exposes metrics, structured logs, and distributed traces for observing cluster health and performance.

Metrics Collection

Temporal can emit metrics to either a Prometheus or a StatsD backend. For either backend, the framework setting selects the underlying metrics library: Tally or OpenTelemetry.

Prometheus Configuration

Configure Prometheus metrics in your config.yaml:

global:
  metrics:
    prometheus:
      framework: "opentelemetry"  # or "tally"
      listenAddress: "127.0.0.1:8000"
      handlerPath: "/metrics"
      loggerRPS: 0  # 0 means no limit

Framework Options:

tally - Legacy framework using uber-go/tally
opentelemetry - Modern OpenTelemetry-based metrics (recommended)

Configuration Fields:

listenAddress - Address the metrics endpoint listens on and that Prometheus scrapes
handlerPath - HTTP endpoint path (default: /metrics)
loggerRPS - Rate limit for metric logger (0 = unlimited)
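
To confirm that the configured endpoint is serving metrics, you can fetch it and list the metric names it exposes. The following is a minimal sketch rather than an official tool: metrics_url, metric_names, and fetch_metric_names are hypothetical helpers, the parser reads only the metric-name portion of the Prometheus text exposition format, and the defaults assume the listenAddress and handlerPath values from the example above.

```python
from urllib.request import urlopen


def metrics_url(listen_address, handler_path="/metrics"):
    """Build the scrape URL from listenAddress and handlerPath."""
    return f"http://{listen_address}{handler_path}"


def metric_names(exposition_text):
    """Collect metric names from Prometheus text exposition output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The name ends at the first "{" (labels) or space (sample value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names


def fetch_metric_names(listen_address, handler_path="/metrics", timeout=5.0):
    """Fetch the endpoint over HTTP and return the metric names it exposes."""
    with urlopen(metrics_url(listen_address, handler_path), timeout=timeout) as resp:
        return metric_names(resp.read().decode("utf-8"))
```

Calling fetch_metric_names("127.0.0.1:8000") against a server running with the configuration above should return a non-empty set; an exception or an empty set suggests the listener or the handler path is misconfigured.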

StatsD Configuration

For StatsD integration:

global:
  metrics:
    statsd:
      framework: "opentelemetry"
      hostPort: "127.0.0.1:8125"
      prefix: "temporal"
      flushInterval: "1s"
      flushBytes: 1432
      reporter:
        tagSeparator: ":"

Configuration Fields:

hostPort - StatsD server address
prefix - Metric name prefix
flushInterval - Batch flush interval (default: 1s)
flushBytes - Maximum UDP packet size (default: 1432)
tagSeparator - Character to separate tags (optional)

Common Metrics Configuration

global:
  metrics:
    clientConfig:
      tags:
        environment: "production"
        cluster: "us-east-1"
      excludeTags:
        namespace:
          - "system-namespace"  # allowlisted: reported as-is; every other namespace value becomes _tag_excluded_
      prefix: "temporal_"
      perUnitHistogramBoundaries:
        dimensionless: [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]
        milliseconds: [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000]
        bytes: [1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288]

Options:

tags - Global tags added to all metrics
excludeTags - Map of tag name to allowed values; any other value of that tag is replaced with _tag_excluded_
prefix - Prefix for all metric names
perUnitHistogramBoundaries - Custom histogram buckets by unit type
withoutUnitSuffix - Remove unit suffixes (OpenTelemetry only)
withoutCounterSuffix - Remove _total suffix from counters (OpenTelemetry only)
recordTimerInSeconds - Emit timers in seconds instead of milliseconds
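
The excludeTags option is easier to follow in code. The following is a minimal sketch, not Temporal's implementation: apply_exclude_tags is a hypothetical helper, and it assumes that values listed under a tag name are reported unchanged while every other value of that tag is replaced with _tag_excluded_.

```python
EXCLUDED_PLACEHOLDER = "_tag_excluded_"


def apply_exclude_tags(tags, exclude_tags):
    """Return a copy of tags with non-allowlisted values replaced.

    tags maps tag name to tag value; exclude_tags maps tag name to the
    list of values that may still be reported (assumed semantics).
    """
    result = {}
    for name, value in tags.items():
        allowed = exclude_tags.get(name)
        if allowed is not None and value not in allowed:
            result[name] = EXCLUDED_PLACEHOLDER
        else:
            result[name] = value
    return result


# Mirrors the excludeTags block in the example configuration.
exclude_tags = {"namespace": ["system-namespace"]}
print(apply_exclude_tags({"namespace": "customer-a", "operation": "StartWorkflowExecution"}, exclude_tags))
print(apply_exclude_tags({"namespace": "system-namespace"}, exclude_tags))
```

Tags that do not appear in excludeTags, such as operation here, pass through untouched, which keeps metric cardinality bounded for the listed tags only.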

Key Metrics by Service

Service Health Metrics

These metrics track overall service health:

service_requests              # Total RPC requests received
service_pending_requests      # Current pending requests (gauge)
service_errors                # Unexpected service errors
service_error_with_type       # Errors by error type
service_latency               # Request latency
service_latency_nouserlatency # Server-side latency only
service_latency_userlatency   # User workflow latency

Common Tags:

operation - API method name
service_role - Service type (frontend, history, matching, worker)

Persistence Layer Metrics

Track database operations:

# Shard Operations
GetOrCreateShard
UpdateShard
AssertShardOwnership

# Workflow Execution
CreateWorkflowExecution
GetWorkflowExecution
UpdateWorkflowExecution
DeleteWorkflowExecution

# Task Queue Operations
CreateTaskQueue
GetTaskQueue
UpdateTaskQueue
DeleteTaskQueue

# Task Operations
GetTransferTasks
CompleteTransferTask
GetTimerTasks
CompleteTimerTask
GetVisibilityTasks
GetReplicationTasks

# History Operations
AppendHistoryNodes
ReadHistoryBranch
DeleteHistoryBranch

Each persistence operation emits:

Request count
Error count
Latency histogram
db_kind tag (cassandra, mysql, postgres, sqlite)

History Service Metrics

# Core Operations
StartWorkflowExecution
RecordActivityTaskHeartbeat
RespondWorkflowTaskCompleted
RespondActivityTaskCompleted

# Shard Management
ShardController
ShardInfo

# Task Processing
TransferQueueProcessor
TimerQueueProcessor
VisibilityQueueProcessor
ArchivalQueueProcessor
OutboundQueueProcessor

# Cache Performance
HistoryCacheGetOrCreate
EventsCacheGetEvent
EventsCachePutEvent

Matching Service Metrics

PollWorkflowTaskQueue
PollActivityTaskQueue
AddActivityTask
AddWorkflowTask
TaskQueueMgr
TaskQueuePartitionManager

Authorization Metrics

service_authorization_latency      # Authorization check duration
service_errors_unauthorized        # Rejected requests
service_errors_authorize_failed    # Authorization system errors

Tagged with:

namespace - Target namespace
operation - API being authorized

Error Tracking

Error metrics by type:

service_errors_invalid_argument
service_errors_namespace_not_active
service_errors_resource_exhausted
service_errors_entity_not_found
service_errors_execution_already_started
service_errors_context_timeout
service_errors_retry_task
service_errors_incomplete_history
service_errors_nondeterministic

Resource Metrics

Lock and Semaphore Usage

lock_requests        # Lock acquisition attempts
lock_latency         # Time waiting for locks
semaphore_requests   # Semaphore acquisition attempts
semaphore_failures   # Failed semaphore acquisitions
semaphore_latency    # Time waiting for semaphore

Cache Metrics

NamespaceCache
EventsCacheGetEvent
EventsCachePutEvent
EventsCacheGetFromStore
VersionMembershipCacheGet
VersionMembershipCachePut

Tagged with cache_type:

mutablestate
events
version_membership
routing_info

TLS Certificate Monitoring

certificates_expired   # Number of expired certificates (gauge)
certificates_expiring  # Number of certificates expiring soon (gauge)

Configure certificate monitoring:

global:
  tls:
    expirationChecks:
      warningWindow: "720h"  # 30 days
      errorWindow: "168h"    # 7 days
      checkInterval: "1h"
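
One way to read the two windows is as thresholds on the time remaining before a certificate expires. The sketch below illustrates that reading only: classify_certificate is a hypothetical helper, and the assumption that a certificate inside errorWindow is reported more severely than one inside warningWindow is an interpretation of the setting names, not confirmed server behavior.

```python
from datetime import datetime, timedelta, timezone


def classify_certificate(not_after, now,
                         warning_window=timedelta(hours=720),
                         error_window=timedelta(hours=168)):
    """Classify a certificate by the time left until its expiry.

    The defaults mirror warningWindow (720h) and errorWindow (168h)
    from the example configuration.
    """
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "expired"
    if remaining <= error_window:
        return "error"
    if remaining <= warning_window:
        return "warning"
    return "ok"


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
for days in (-1, 3, 10, 60):
    print(days, classify_certificate(now + timedelta(days=days), now))
```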

Alerting Guidelines

Critical Alerts

Set up alerts for:

Service Availability

sum(rate(service_errors[5m])) / sum(rate(service_requests[5m])) > 0.05  # 5% error rate
service_pending_requests > 1000

Persistence Layer

rate(UpdateShard_errors[1m]) > 0
histogram_quantile(0.99, sum by (le) (rate(GetWorkflowExecution_latency_bucket[5m]))) > 1000  # p99 above 1000 ms, assuming timers are recorded in milliseconds

Shard Health

rate(ShardController_errors[5m]) > 0

Certificate Expiration

certificates_expired > 0
certificates_expiring > 0
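
A percentage threshold such as the 5% error rate in the Service Availability alert is a ratio of two counter rates: errors per second divided by requests per second. The sketch below works through that arithmetic on two hypothetical counter readings taken five minutes apart. per_second_rate and error_ratio are illustrative helpers; unlike Prometheus's rate(), they do not handle counter resets or extrapolation.

```python
def per_second_rate(first, last, window_seconds):
    """Average per-second increase of a monotonically increasing counter."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return (last - first) / window_seconds


def error_ratio(errors_first, errors_last, requests_first, requests_last, window_seconds):
    """Fraction of requests that failed during the window."""
    request_rate = per_second_rate(requests_first, requests_last, window_seconds)
    if request_rate == 0:
        return 0.0  # no traffic in the window, so nothing failed
    error_rate = per_second_rate(errors_first, errors_last, window_seconds)
    return error_rate / request_rate


# Made-up readings: 30 new errors and 500 new requests over 300 seconds.
ratio = error_ratio(10, 40, 1000, 1500, window_seconds=300)
print(f"error ratio: {ratio:.2%}, alert: {ratio > 0.05}")
```

Comparing the error counter's rate alone against 0.05 would instead fire at 0.05 errors per second regardless of traffic volume, which is why the ratio form is the one that expresses a percentage.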

Warning Alerts

High Latency

histogram_quantile(0.95, sum by (le) (rate(service_latency_bucket[5m]))) > 500  # p95 above 500 ms, assuming timers are recorded in milliseconds

Resource Pressure

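increase(semaphore_failures[5m]) > 100  # more than 100 failed acquisitions in 5 minutes
histogram_quantile(0.95, sum by (le) (rate(lock_latency_bucket[5m]))) > 100  # p95 lock wait above 100 ms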

Cache Efficiency

rate(EventsCacheGetFromStore[5m]) / rate(EventsCacheGetEvent[5m]) > 0.5
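
The latency alerts above use histogram_quantile, which estimates a percentile from cumulative bucket counts, so its precision depends on the bucket boundaries configured in perUnitHistogramBoundaries. The sketch below is a simplified Python rendition of that linear interpolation, assuming the lowest bucket starts at 0; estimate_quantile is a hypothetical helper and the bucket counts are made up for illustration.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound and ending with the +Inf bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up cumulative counts for latency buckets in milliseconds.
latency_buckets = [(100, 50), (200, 90), (500, 99), (1000, 100), (math.inf, 100)]
p95 = estimate_quantile(0.95, latency_buckets)
print(f"estimated p95: {p95:.1f} ms")
```

With these made-up counts the 95th percentile falls in the 200 to 500 ms bucket, so the estimate can be off by as much as the bucket width. Placing boundaries close to the alert threshold makes the alert more trustworthy.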

Logging Configuration

Configure structured logging:

log:
  stdout: true
  level: "info"  # debug, info, warn, error
  outputFile: "/var/log/temporal/server.log"
  encoding: "json"  # json or console

Log Levels:

debug - Detailed diagnostic information
info - General operational events
warn - Warning messages, degraded state
error - Error events, requires attention

Important Log Tags:

wf-namespace - Namespace name
wf-id - Workflow execution ID
wf-run-id - Workflow run ID
operation - Operation being performed
error - Error details
shard-id - History shard ID
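
With json encoding, each log line is a JSON object, so these tags can be used to filter logs programmatically. The sketch below is an illustration under assumptions: filter_logs is a hypothetical helper, the level field name ("level") is assumed and can be overridden, and the sample lines are made up rather than real server output.

```python
import json

LEVELS = {"debug": 0, "info": 1, "warn": 2, "error": 3}


def filter_logs(lines, min_level="warn", tags=None, level_key="level"):
    """Return parsed entries at or above min_level that match every tag."""
    tags = tags or {}
    threshold = LEVELS[min_level]
    matches = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON, e.g. console encoding
        if not isinstance(entry, dict):
            continue
        level = str(entry.get(level_key, "")).lower()
        if LEVELS.get(level, -1) < threshold:
            continue
        if all(entry.get(key) == value for key, value in tags.items()):
            matches.append(entry)
    return matches


# Made-up sample lines for illustration only.
sample_lines = [
    '{"level": "error", "msg": "example failure", "shard-id": 7}',
    '{"level": "info", "msg": "example event", "shard-id": 7}',
    '{"level": "error", "msg": "other shard", "shard-id": 9}',
    "not a json line",
]
for entry in filter_logs(sample_lines, min_level="warn", tags={"shard-id": 7}):
    print(entry["msg"])
```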

Health Checks

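Temporal services expose the standard gRPC health checking protocol on their gRPC ports rather than an HTTP /health path, so probe them with a gRPC client such as grpc-health-probe:

# Frontend health
grpc-health-probe -addr=localhost:7233 -service=temporal.api.workflowservice.v1.WorkflowService

# Service-specific health
grpc-health-probe -addr=localhost:7234 -service=temporal.api.workflowservice.v1.HistoryService  # History
grpc-health-probe -addr=localhost:7235 -service=temporal.api.workflowservice.v1.MatchingService  # Matching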

Distributed Tracing

Enable OpenTelemetry tracing:

global:
  otel:
    enabled: true
    exporters:
      - type: "otlp"
        endpoint: "otel-collector:4317"
        insecure: false
        headers:
          api-key: "your-api-key"

Tracing captures:

Request flow across services
Persistence operation timing
Cross-namespace operations
Replication latency

Dashboard Recommendations

Service Overview Dashboard

Request rate by service and operation
Error rate and types
Latency percentiles (p50, p95, p99)
Active connections

Persistence Dashboard

Operation latency by type
Error rates by operation
Connection pool utilization
Query duration

Workflow Execution Dashboard

Workflow start rate
Workflow completion rate
Task queue backlog
Activity timeouts

Resource Usage Dashboard

CPU and memory per service
Lock contention
Cache hit rates
GC pause time
