Service Overview

Cadence consists of four core services that work together to provide workflow orchestration capabilities. Each service is stateless, horizontally scalable, and has a specific set of responsibilities.

Frontend Service

The Frontend service acts as the API gateway for all client interactions with Cadence.

Responsibilities

  • API Gateway: Exposes public APIs for workflow and activity operations
  • Request Validation: Validates all incoming requests
  • Rate Limiting: Enforces per-domain rate limits
  • Authentication & Authorization: Handles security concerns
  • Cluster Redirection: Routes requests to appropriate cluster in multi-region setup
  • Domain Management: Handles domain registration and updates

Key Components

// Frontend service structure
type Service struct {
    Resource
    handler      *api.WorkflowHandler
    adminHandler admin.Handler
    config       *config.Config
}

API Layers

The frontend implements multiple decorator layers around the core API handler (a simplified sketch follows this list):
  1. Base Handler: Core API implementation
  2. Version Check: Ensures client compatibility
  3. Rate Limiter: Enforces quota limits
  4. Metrics: Captures telemetry data
  5. Cluster Redirection: Handles multi-cluster routing
  6. Access Control: Authorization enforcement
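
The listing below is a minimal sketch of how such layers can compose; the interface, types, and constructor names are illustrative, not the actual Cadence code:
// Illustrative sketch only, not the actual Cadence types
import (
    "context"
    "errors"

    "golang.org/x/time/rate"
)

// Handler is the API surface that each decorator layer wraps.
type Handler interface {
    StartWorkflowExecution(ctx context.Context, req any) (any, error)
}

var errServiceBusy = errors.New("service busy")

// rateLimitedHandler is one layer: reject when the quota is exhausted,
// otherwise delegate to the next layer in the chain.
type rateLimitedHandler struct {
    next    Handler
    limiter *rate.Limiter
}

func (h *rateLimitedHandler) StartWorkflowExecution(ctx context.Context, req any) (any, error) {
    if !h.limiter.Allow() {
        return nil, errServiceBusy
    }
    return h.next.StartWorkflowExecution(ctx, req)
}
Version check, metrics, cluster redirection, and access control follow the same pattern, each wrapping the next handler in the chain.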

Configuration

services:
  frontend:
    rpc:
      port: 7933
      grpcPort: 7833
      bindOnLocalHost: true
      grpcMaxMsgSize: 33554432
    metrics:
      statsd:
        hostPort: "127.0.0.1:8125"
        prefix: "cadence"

Rate Limiting

The frontend implements a multi-stage rate limiting system:
// Rate limiter configuration
userRateLimiter := quotas.NewMultiStageRateLimiter(
    quotas.NewDynamicRateLimiter(s.config.UserRPS.AsFloat64()),
    collections.user,
)
Rate Limit Types:
  • User RPS: Rate limit for client API calls
  • Worker RPS: Rate limit for worker poll requests
  • Visibility RPS: Rate limit for visibility queries
  • Async RPS: Rate limit for async workflow operations
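
A simplified sketch of how staged limits might compose, using golang.org/x/time/rate as a stand-in for Cadence's internal quotas package (names and ordering are illustrative):
// Illustrative sketch only
import "golang.org/x/time/rate"

// multiStageLimiter admits a request only when every stage has budget,
// e.g. a global user-RPS limit followed by a per-domain limit.
type multiStageLimiter struct {
    global    *rate.Limiter
    perDomain map[string]*rate.Limiter
}

func (m *multiStageLimiter) Allow(domain string) bool {
    if !m.global.Allow() {
        return false
    }
    if l, ok := m.perDomain[domain]; ok {
        // A production limiter would avoid consuming the global token
        // when the per-domain stage rejects the request.
        return l.Allow()
    }
    return true
}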

APIs Exposed

Workflow APIs

  • StartWorkflowExecution
  • SignalWorkflowExecution
  • TerminateWorkflowExecution
  • GetWorkflowExecutionHistory
  • DescribeWorkflowExecution
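
For example, a client might exercise StartWorkflowExecution and SignalWorkflowExecution through the Go client (go.uber.org/cadence/client); the workflow name, task list, timeout, and signal name below are illustrative, and constructing the client is elided:
import (
    "context"
    "time"

    "go.uber.org/cadence/client"
)

// c is an already-constructed client.Client bound to the target domain.
func startAndSignal(ctx context.Context, c client.Client, orderID string) error {
    // StartWorkflowExecution via the Go client.
    we, err := c.StartWorkflow(ctx, client.StartWorkflowOptions{
        ID:                           "order-" + orderID,
        TaskList:                     "order-processing",
        ExecutionStartToCloseTimeout: time.Hour,
    }, "OrderWorkflow", orderID)
    if err != nil {
        return err
    }
    // SignalWorkflowExecution: deliver data to the running workflow.
    return c.SignalWorkflow(ctx, we.ID, we.RunID, "payment-received", orderID)
}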

Domain APIs

  • RegisterDomain
  • DescribeDomain
  • UpdateDomain
  • ListDomains
  • DeprecateDomain

Task List APIs

  • PollForDecisionTask
  • PollForActivityTask
  • RespondDecisionTaskCompleted
  • RespondActivityTaskCompleted

History Service

The History service is the core workflow execution engine that maintains workflow state and makes execution decisions.

Responsibilities

  • Workflow State Management: Maintains mutable state for active workflows
  • Event History: Persists immutable workflow history events
  • Decision Processing: Processes decisions from workflow workers
  • Task Generation: Creates decision and activity tasks
  • Timer Management: Handles workflow and activity timeouts
  • Shard Ownership: Manages history shard ownership

Shard-Based Architecture

Key Components

// History service structure
type Service struct {
    Resource
    handler handler.Handler
    config  *config.Config
}

// History engine per shard
type Engine struct {
    shard            ShardContext
    executionManager persistence.ExecutionManager
    taskProcessor    processor.TaskProcessor
}

Workflow Execution State

The History service maintains two types of state (a simplified replay sketch follows this list):
  1. Mutable State: Current workflow execution state
    • Pending decision/activity tasks
    • Timers
    • Signals
    • Child workflows
    • Execution info (status, timeouts, etc.)
  2. Immutable History: Event log of all workflow actions
    • WorkflowExecutionStarted
    • DecisionTaskScheduled
    • ActivityTaskStarted
    • WorkflowExecutionCompleted
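
The two are related by replay: mutable state can always be rebuilt by applying the immutable event history in order. A minimal sketch, using illustrative types rather than the actual Cadence structures:
// Illustrative types only, not the real Cadence mutable state
type HistoryEvent struct {
    Type string // e.g. "WorkflowExecutionStarted"
}

type MutableState struct {
    Status            string
    PendingActivities int
}

// rebuild replays the immutable event log to reconstruct mutable state.
func rebuild(events []HistoryEvent) MutableState {
    var s MutableState
    for _, e := range events {
        switch e.Type {
        case "WorkflowExecutionStarted":
            s.Status = "RUNNING"
        case "ActivityTaskStarted":
            s.PendingActivities++
        case "WorkflowExecutionCompleted":
            s.Status = "COMPLETED"
        }
    }
    return s
}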

Task Queues

The History service manages multiple task queues per shard:
// Task types processed by the History service
type TaskType int

const (
    TransferTaskType    TaskType = iota // Immediate tasks (decisions, activities)
    TimerTaskType                       // Delayed tasks (timeouts, retries)
    ReplicationTaskType                 // Cross-DC replication tasks
)

Transfer Queue

  • Processes tasks that need immediate execution
  • Examples: Decision tasks, activity tasks, close execution
  • FIFO processing within each shard

Timer Queue

  • Processes time-based tasks
  • Examples: Workflow timeout, activity timeout, retry timer
  • Priority queue ordered by fire time
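
A minimal sketch of such a fire-time-ordered queue using Go's container/heap (illustrative only, not the actual timer queue implementation):
import (
    "container/heap"
    "time"
)

// timerTask is an illustrative stand-in for a timer queue entry.
type timerTask struct {
    FireTime   time.Time
    WorkflowID string
}

// timerQueue is a min-heap ordered by fire time (earliest first).
type timerQueue []*timerTask

func (q timerQueue) Len() int           { return len(q) }
func (q timerQueue) Less(i, j int) bool { return q[i].FireTime.Before(q[j].FireTime) }
func (q timerQueue) Swap(i, j int)      { q[i], q[j] = q[j], q[i] }
func (q *timerQueue) Push(x any)        { *q = append(*q, x.(*timerTask)) }
func (q *timerQueue) Pop() any {
    old := *q
    t := old[len(old)-1]
    *q = old[:len(old)-1]
    return t
}

// nextToFire pops the timer task with the earliest fire time.
func nextToFire(q *timerQueue) *timerTask {
    return heap.Pop(q).(*timerTask)
}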

Configuration

services:
  history:
    rpc:
      port: 7934
      grpcPort: 7834
      grpcMaxMsgSize: 33554432

Scalability Considerations

  • Maximum Scale: Limited by numHistoryShards, which is fixed when the cluster is provisioned and cannot be changed afterward
  • Shard Distribution: Automatic rebalancing when instances join/leave
  • Graceful Shutdown: Drains shards before stopping

Matching Service

The Matching service routes decision and activity tasks from the History service to application workers using task lists.

Responsibilities

  • Task List Management: Maintains task lists for decisions and activities
  • Task Routing: Delivers tasks to polling workers
  • Sync Match: Optimizes latency by matching tasks with waiting pollers
  • Task Persistence: Stores unmatched tasks in database
  • Load Balancing: Distributes tasks across available workers

Task List Architecture

Key Components

// Matching service structure
type Service struct {
    Resource
    handler handler.Handler
    config  *config.Config
}

// Task list manager
type taskListManager struct {
    taskListID   *tasklist.Identifier
    taskBuffer   chan *persistence.TaskInfo
    deliverBuffer chan *InternalTask
}

Sync Match Optimization

Sync match happens when a task arrives while a worker is already polling, so the task is handed to the poller directly instead of going through the database:

  1. A worker sends PollForDecisionTask or PollForActivityTask (long poll)
  2. The History service adds a task to the task list
  3. Matching checks for waiting pollers
  4. If a poller is waiting: deliver the task immediately (sync match)
  5. If no poller is waiting: persist the task to the database

Benefits:
  • Near-zero latency task delivery
  • Reduced database load
  • Better throughput
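
A minimal sketch of that decision using the deliverBuffer channel from the taskListManager shown above; persistTask is a hypothetical helper standing in for the async persistence path:
// offerTask attempts a sync match: hand the task to a waiting poller,
// otherwise fall back to persisting it for a later poll.
func (m *taskListManager) offerTask(ctx context.Context, task *InternalTask) error {
    select {
    case m.deliverBuffer <- task:
        return nil // sync match: delivered with no database write
    default:
        return m.persistTask(ctx, task) // no waiting poller: store the task
    }
}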

Task List Types

  1. Decision Task List: Routes decision tasks
    • Specified when the workflow is started
    • Workers poll it for workflow decisions
  2. Activity Task List: Routes activity tasks
    • Can differ from the decision task list (set per activity)
    • Workers poll it for activity execution

Configuration

services:
  matching:
    rpc:
      port: 7935
      grpcPort: 7835
      grpcMaxMsgSize: 33554432

Scalability

  • Task List Partitioning: High-throughput task lists can be partitioned
  • Isolation Groups: Route tasks to specific worker pools
  • Dynamic Partitioning: Automatic partition adjustment based on load

Worker Service

The Worker service handles internal background processing tasks for the Cadence system.

Responsibilities

  • Replication: Processes cross-datacenter replication tasks
  • Indexing: Indexes workflow data to Elasticsearch/Pinot for visibility
  • Archival: Archives old workflow histories to blob storage
  • System Workflows: Runs internal system workflows
  • Domain Replication: Replicates domain metadata across clusters

Note: the Worker service is not the same as the application workers that execute workflow and activity code; it is an internal Cadence component.

Key Components

// Worker service structure
type Service struct {
    Resource
    config *Config
}

// Background processors
type processors struct {
    replicator         *replicator.Replicator
    indexer            *indexer.Indexer
    archiver           *archiver.Archiver
    scanner            *scanner.Scanner
    esAnalyzer         *esanalyzer.Analyzer
    failoverManager    *failovermanager.Manager
}

Replicator

Handles cross-cluster replication:
// Replicator processes replication tasks from Kafka
type Replicator struct {
    kafkaClient         messaging.Client
    historyClient       history.Client
    domainReplicator    domain.Replicator
}

Flow:
  1. The History service writes replication tasks to Kafka
  2. The Worker service consumes them from Kafka
  3. Tasks are applied to the target cluster
  4. Conflicts between clusters are resolved during application
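
A rough sketch of that consume-and-apply loop, assuming hypothetical task and channel types (the real Replicator uses Cadence's internal messaging and history-client abstractions):
import "context"

// ReplicationTask is a placeholder for a batch of history events plus version metadata.
type ReplicationTask struct{}

// processReplicationTasks applies each consumed task to the target cluster;
// failed tasks are re-queued so they can be retried after conflict resolution.
func processReplicationTasks(
    ctx context.Context,
    tasks <-chan ReplicationTask,
    apply func(context.Context, ReplicationTask) error,
    retry chan<- ReplicationTask,
) {
    for {
        select {
        case <-ctx.Done():
            return
        case t := <-tasks:
            if err := apply(ctx, t); err != nil {
                retry <- t
            }
        }
    }
}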

Indexer

Indexes workflow data for advanced visibility:
// Indexer processes visibility events
type Indexer struct {
    kafkaClient    messaging.Client
    esClient       elasticsearch.Client
    bulkProcessor  es.BulkProcessor
}
Configuration:
dynamicconfig:
  WorkerIndexerConcurrency: 100
  WorkerESProcessorBulkActions: 500
  WorkerESProcessorBulkSize: 2097152  # 2MB
  WorkerESProcessorFlushInterval: 1s

Archiver

Archives workflow histories to long-term storage:
archival:
  history:
    status: "enabled"
    enableRead: true
    provider:
      filestore:
        fileMode: "0666"
        dirMode: "0766"
Supported Providers:
  • Local filesystem
  • AWS S3
  • Google Cloud Storage
  • Custom implementations

Scanner

Performs data consistency checks and cleanup:
  • Task List Scanner: Removes orphaned task list entries
  • History Scanner: Validates workflow history integrity
  • Timer Scanner: Checks for stuck timers
  • Execution Scanner: Identifies zombie workflows

Configuration

services:
  worker:
    rpc:
      port: 7939
    metrics:
      statsd:
        hostPort: "127.0.0.1:8125"
        prefix: "cadence"

System Domains

Worker service creates internal system domains:
  • cadence-system: Core system workflows
  • cadence-batcher: Batch operations
  • cadence-canary: Health checks

Inter-Service Communication

Communication Patterns

RPC Configuration

Protocols Supported:
  • gRPC (recommended)
  • TChannel (legacy)
Message Size Limits:
grpcMaxMsgSize: 33554432  # 32MB default

Service Discovery

Services discover each other using Ringpop:
ringpop:
  name: cadence
  bootstrapMode: hosts
  bootstrapHosts:
    - "127.0.0.1:7933"
    - "127.0.0.1:7934"
    - "127.0.0.1:7935"

Deployment Considerations

Resource Requirements

Service  | CPU        | Memory     | Disk    | Network
---------|------------|------------|---------|--------
Frontend | Low-Medium | Low        | Minimal | High
History  | High       | High       | Minimal | Medium
Matching | Low-Medium | Low-Medium | Minimal | Medium
Worker   | Medium     | Medium     | Low     | Medium

Scaling Guidelines

  1. Frontend: Scale based on RPS
    • Start with 2-3 instances
    • Add instances as traffic increases
  2. History: Scale based on shard count
    • Each instance should own roughly 100-200 shards (see the worked example after this list)
    • More instances = better distribution
  3. Matching: Scale based on task throughput
    • Start with 2-3 instances
    • Scale if sync match rate drops
  4. Worker: Scale based on background load
    • Replication lag
    • Indexing lag
    • Archival backlog
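
For example, a cluster provisioned with numHistoryShards: 4096 and a target of roughly 100-200 shards per instance would run on the order of 20-40 History instances; since each shard has exactly one owner, running more instances than shards provides no additional benefit.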

Next Steps

Persistence Layer

Learn about database design and configuration

Cross-DC Replication

Set up multi-region deployments
