Isolation Groups provide zone-aware routing of workflow tasks to workers, enabling fault isolation, disaster recovery, and efficient resource utilization across availability zones or data centers.
Overview
Isolation Groups enable:
- Zone Awareness: Route tasks to workers in specific zones
- Fault Isolation: Contain failures to specific zones
- Graceful Draining: Drain zones for maintenance without workflow impact
- Load Balancing: Distribute load across healthy zones
- Multi-Region Support: Route tasks across geographic regions
Concepts
Isolation Group
A logical grouping of workers, typically corresponding to:
- Availability zone (e.g., us-east-1a, us-east-1b)
- Data center
- Kubernetes cluster
- Worker deployment
Drain State
An isolation group (zone) can be in one of three states:
- Healthy: Actively receives new tasks
- Draining: No new tasks, existing workflows continue
- Drained: No tasks routed to this zone
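The three states can be modeled as a small enum. This is only an illustrative sketch of the semantics listed above; the names `DrainState`, `accepts_new_tasks`, and `serves_existing_workflows` are hypothetical and not part of any Cadence SDK.

```python
from enum import Enum


class DrainState(Enum):
    """Hypothetical model of the three drain states described above."""
    HEALTHY = "HEALTHY"
    DRAINING = "DRAINING"
    DRAINED = "DRAINED"


def accepts_new_tasks(state: DrainState) -> bool:
    # Only healthy groups receive tasks for new workflows.
    return state is DrainState.HEALTHY


def serves_existing_workflows(state: DrainState) -> bool:
    # Draining groups keep serving workflows already running there;
    # drained groups receive no tasks at all.
    return state in (DrainState.HEALTHY, DrainState.DRAINING)
```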
Task List Partitioning
Task lists are partitioned across isolation groups for parallel processing and isolation.
Configuration
Global Isolation Groups
Define isolation groups at the cluster level:
# Get current isolation groups
cadence admin cluster get-isolation-groups
# Set isolation groups (JSON format)
cadence admin cluster update-global-isolation-groups \
--json '{
"isolationGroups": [
{"name": "us-east-1a", "state": "HEALTHY"},
{"name": "us-east-1b", "state": "HEALTHY"},
{"name": "us-east-1c", "state": "HEALTHY"}
]
}'
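Hand-written JSON is easy to get wrong, so it can help to generate the `--json` payload from a simple mapping. The sketch below assumes the payload shape shown above; `build_isolation_groups_payload` and `VALID_STATES` are hypothetical helper names, not part of the Cadence CLI.

```python
import json

# States used in this document; assumed to be the full set the CLI accepts.
VALID_STATES = {"HEALTHY", "DRAINING", "DRAINED"}


def build_isolation_groups_payload(states: dict) -> str:
    """Return the JSON string for --json from a {group name: state} mapping."""
    if not states:
        raise ValueError("at least one isolation group is required")
    groups = []
    for name, state in states.items():
        if state not in VALID_STATES:
            raise ValueError(f"unknown state {state!r} for group {name!r}")
        groups.append({"name": name, "state": state})
    return json.dumps({"isolationGroups": groups})


payload = build_isolation_groups_payload({
    "us-east-1a": "HEALTHY",
    "us-east-1b": "HEALTHY",
    "us-east-1c": "HEALTHY",
})
print(payload)
```

The printed string can then be passed as the `--json` argument of the update command.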
Domain-Specific Isolation
Apply isolation to specific domains:
# Update domain with isolation groups
cadence admin domain update-isolation-groups \
--domain my-domain \
--json '{
"isolationGroups": [
{"name": "zone-1", "state": "HEALTHY"},
{"name": "zone-2", "state": "DRAINING"},
{"name": "zone-3", "state": "HEALTHY"}
]
}'
# Get domain isolation groups
cadence admin domain get-isolation-groups --domain my-domain
Worker Configuration
Workers must identify their isolation group.
Go SDK:
import (
"go.uber.org/cadence/client"
"go.uber.org/cadence/worker"
)
func main() {
// Create service client
c, _ := client.Dial(&client.Options{
HostPort: "cadence-frontend:7933",
})
// Create worker with isolation group
w := worker.New(c, "my-task-list", worker.Options{
Identity: "worker-1",
IsolationGroup: "us-east-1a", // Set zone
})
// Register and start worker
w.Start()
}
Java SDK:
WorkerOptions options = WorkerOptions.newBuilder()
.setIdentity("worker-1")
.setIsolationGroup("us-east-1a")
.build();
Worker worker = workerFactory.newWorker("my-task-list", options);
Use Cases
Availability Zone Isolation
Isolate workers by AWS availability zone:
# 3 availability zones
isolationGroups:
- name: "us-east-1a"
state: "HEALTHY"
- name: "us-east-1b"
state: "HEALTHY"
- name: "us-east-1c"
state: "HEALTHY"
Benefits:
- AZ failure contained to that zone’s workflows
- New workflows distributed to healthy AZs
- Ongoing workflows in failed AZ can be recovered
Graceful Zone Draining
Drain a zone for maintenance:
# Step 1: Mark zone as draining (us-east-1a stops receiving new tasks)
cadence admin cluster update-global-isolation-groups \
--json '{
"isolationGroups": [
{"name": "us-east-1a", "state": "DRAINING"},
{"name": "us-east-1b", "state": "HEALTHY"},
{"name": "us-east-1c", "state": "HEALTHY"}
]
}'
# Step 2: Wait for in-flight workflows to complete
# Monitor: cadence --do my-domain workflow list --open
# Step 3: Shut down workers in us-east-1a
kubectl scale deployment worker-us-east-1a --replicas=0
# Step 4: Perform maintenance
# Step 5: Restore zone
cadence admin cluster update-global-isolation-groups \
--json '{
"isolationGroups": [
{"name": "us-east-1a", "state": "HEALTHY"},
{"name": "us-east-1b", "state": "HEALTHY"},
{"name": "us-east-1c", "state": "HEALTHY"}
]
}'
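Step 2 of the procedure above can be automated with a simple polling loop. The sketch below is one possible shape, not a Cadence API: `wait_for_drain` is a hypothetical helper, and `count_open_workflows` is a caller-supplied function (for example, one that wraps the `workflow list --open` command) that returns the number of workflows still open in the draining zone.

```python
import time
from typing import Callable


def wait_for_drain(count_open_workflows: Callable[[], int],
                   poll_seconds: float = 30.0,
                   timeout_seconds: float = 3600.0,
                   sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until no open workflows remain or the timeout expires.

    Returns True if the zone drained in time, False otherwise.
    """
    waited = 0.0
    while True:
        if count_open_workflows() == 0:
            return True
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds
```

Only proceed to shutting down workers when the function returns `True`; a `False` result means workflows were still open at the timeout and needs a manual decision.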
Multi-Region Deployment
Route tasks to specific regions:
isolationGroups:
- name: "us-west-2"
state: "HEALTHY"
- name: "eu-west-1"
state: "HEALTHY"
- name: "ap-southeast-1"
state: "HEALTHY"
Worker deployment:
# US region workers
ISOLATION_GROUP=us-west-2 ./worker start
# EU region workers
ISOLATION_GROUP=eu-west-1 ./worker start
# Asia region workers
ISOLATION_GROUP=ap-southeast-1 ./worker start
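Inside the worker process, the environment variable has to be read and ideally validated before it is passed to the SDK. A minimal sketch follows, assuming the three regions above are the only valid groups; `resolve_isolation_group` and `KNOWN_GROUPS` are hypothetical names.

```python
import os

# Groups from the configuration above; assumed to be the complete set.
KNOWN_GROUPS = ("us-west-2", "eu-west-1", "ap-southeast-1")


def resolve_isolation_group(environ=os.environ) -> str:
    """Read ISOLATION_GROUP and reject values missing from KNOWN_GROUPS."""
    group = environ.get("ISOLATION_GROUP", "").strip()
    if not group:
        raise RuntimeError("ISOLATION_GROUP is not set")
    if group not in KNOWN_GROUPS:
        raise RuntimeError(f"unknown isolation group: {group!r}")
    return group
```

Failing fast at startup avoids a worker silently polling under a misspelled group name.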
Canary Deployments
Gradually roll out new worker versions:
# Initial state: all traffic to stable
isolationGroups:
- name: "stable"
state: "HEALTHY"
- name: "canary"
state: "DRAINING" # No traffic initially
# Step 1: Mark canary healthy so it starts receiving a share of traffic
isolationGroups:
- name: "stable"
state: "HEALTHY"
- name: "canary"
state: "HEALTHY" # Start receiving traffic
# Step 2: Monitor canary metrics
# If healthy, drain stable
# Step 3: Switch to canary
isolationGroups:
- name: "stable"
state: "DRAINING"
- name: "canary"
state: "HEALTHY"
# Step 4: Complete migration
isolationGroups:
- name: "canary"
state: "HEALTHY"
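The rollout phases above can also be kept as data, which makes it easier to apply them one at a time from a script. This is a sketch only; `group` and `canary_rollout_steps` are hypothetical helpers, and each returned dictionary mirrors one of the configurations shown above.

```python
def group(name: str, state: str) -> dict:
    return {"name": name, "state": state}


def canary_rollout_steps(stable: str = "stable", canary: str = "canary") -> list:
    """Return the isolation group config for each rollout phase, in order."""
    return [
        # Initial state: canary is registered but receives no new traffic.
        {"isolationGroups": [group(stable, "HEALTHY"), group(canary, "DRAINING")]},
        # Step 1: canary starts receiving traffic alongside stable.
        {"isolationGroups": [group(stable, "HEALTHY"), group(canary, "HEALTHY")]},
        # Step 3: stable stops receiving new traffic.
        {"isolationGroups": [group(stable, "DRAINING"), group(canary, "HEALTHY")]},
        # Step 4: stable is removed from the configuration.
        {"isolationGroups": [group(canary, "HEALTHY")]},
    ]
```

Step 2 (monitoring canary metrics) has no configuration of its own, so it is the pause between the second and third entries.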
Task Routing
Load Balancing Algorithm
Tasks are distributed across healthy isolation groups:
- Filter: Exclude draining/drained groups for new workflows
- Balance: Distribute tasks evenly across healthy groups
- Sticky: Existing workflows stay in their group if healthy
- Fallback: Redirect if group becomes unhealthy
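The four rules above can be illustrated with a toy router. This is a simplified sketch, not the algorithm the matching service actually runs: `Router` is a hypothetical class, balancing is plain round-robin, and a DRAINING group is assumed to remain usable for workflows already running there (per the drain state definitions).

```python
from typing import Dict, Optional


class Router:
    """Toy router applying sticky, filter, balance, and fallback rules."""

    def __init__(self, states: Dict[str, str]):
        self.states = states  # group name -> "HEALTHY" | "DRAINING" | "DRAINED"
        self._next = 0        # round-robin cursor

    def pick(self, current_group: Optional[str] = None) -> str:
        # Sticky: an existing workflow stays in its group unless it is drained.
        if current_group and self.states.get(current_group) in ("HEALTHY", "DRAINING"):
            return current_group
        # Filter: only healthy groups are candidates for new or redirected tasks.
        healthy = sorted(g for g, s in self.states.items() if s == "HEALTHY")
        if not healthy:
            raise RuntimeError("no healthy isolation group available")
        # Balance / fallback: rotate evenly through the healthy groups.
        chosen = healthy[self._next % len(healthy)]
        self._next += 1
        return chosen


router = Router({"a": "HEALTHY", "b": "DRAINING", "c": "HEALTHY"})
print([router.pick() for _ in range(4)])  # new workflows skip group "b"
```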
Partition Configuration
Control partitioning with dynamic config:
matching.isolationGroupPartitions:
- value: 3
constraints:
domainName: "my-domain"
taskListName: "my-task-list"
More partitions improve parallelism but add coordination overhead.
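To see how a constrained entry like the one above applies to a given task list, the sketch below resolves a value by matching constraints. It assumes an entry applies when all of its constraints equal the request's attributes and that the first matching entry wins; the actual resolution rules in Cadence's dynamic config may differ. `resolve_config` and `ENTRIES` are hypothetical names.

```python
# Mirrors the matching.isolationGroupPartitions entry shown above.
ENTRIES = [
    {
        "value": 3,
        "constraints": {"domainName": "my-domain", "taskListName": "my-task-list"},
    },
]


def resolve_config(entries: list, default, **filters):
    """Return the value of the first entry whose constraints all match."""
    for entry in entries:
        constraints = entry.get("constraints", {})
        if all(filters.get(key) == value for key, value in constraints.items()):
            return entry["value"]
    return default


print(resolve_config(ENTRIES, 1, domainName="my-domain", taskListName="my-task-list"))
```

Any task list that does not match the constraints falls back to the default passed by the caller.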
Sticky Task Lists
Isolation groups work with sticky task lists:
- Sticky task lists remain in their isolation group
- Group health affects sticky routing
- Draining groups stop receiving new sticky tasks
Monitoring
Key Metrics
Task Distribution:
# Tasks per isolation group
sum by (isolation_group) (
rate(cadence_matching_tasks_dispatched[5m])
)
# Isolation group health
cadence_isolation_group_health{isolation_group="us-east-1a"}
Drain Progress:
# Open workflows in draining group
sum(
cadence_workflows_open{
isolation_group="us-east-1a",
state="draining"
}
)
CLI Monitoring
# List isolation groups
cadence admin cluster get-isolation-groups
# Check domain-specific groups
cadence admin domain get-isolation-groups --domain my-domain
# View task list by isolation group
cadence --do my-domain tasklist describe \
--tl my-task-list \
--tlt decision
Best Practices
Deployment Strategy
- Start with 3 Zones: Balance availability and complexity
- Use Existing Infrastructure: Align with AZ/region boundaries
- Test Draining: Practice zone draining before incidents
- Monitor Skew: Watch for uneven task distribution
Configuration Management
- Version Control: Track isolation group configs in Git
- Gradual Rollout: Test changes on non-critical domains first
- Document Zones: Maintain mapping of zones to infrastructure
- Automate Updates: Use CI/CD for isolation group changes
Operational Procedures
- Drain Before Maintenance: Always drain before zone maintenance
- Monitor Completion: Ensure workflows complete before shutdown
- Staged Rollback: Re-enable zones gradually
- Alerting: Alert on zone failures or skewed distribution
Troubleshooting
Tasks Not Routing to Zone
Problem: Workers in a zone not receiving tasks
Solution:
# Check isolation group state
cadence admin cluster get-isolation-groups
# Verify worker configuration
# Check worker logs for isolation group setting
# Check task list partitioning
cadence --do my-domain tasklist describe --tl my-task-list
# Verify worker identity includes isolation group
cadence --do my-domain tasklist list-partition-workers --tl my-task-list
Zone Not Draining
Problem: Zone marked as draining but still receiving tasks
Solution:
- Existing workflows continue until completion (expected)
- Check for new workflow starts (should be zero)
- Verify drain state persisted:
cadence admin cluster get-isolation-groups
- Check for worker restarts (may revert identity)
Uneven Load Distribution
Problem: Tasks concentrated in one zone
Solution:
- Increase partition count for better distribution
- Verify all zones marked as HEALTHY
- Check worker polling rates in each zone
- Review sticky task list distribution
- Consider rebalancing by draining/re-enabling zones
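Before acting on any of these steps, it helps to quantify how uneven the distribution is. The sketch below computes a simple skew ratio from per-group task counts, such as the values returned by the tasks-per-isolation-group query; `distribution_skew` is a hypothetical helper, and any alerting threshold applied to it is a judgment call.

```python
def distribution_skew(tasks_per_group: dict) -> float:
    """Return the max/mean ratio of per-group task counts.

    1.0 means perfectly even; with N groups, N means all tasks hit one group.
    """
    counts = list(tasks_per_group.values())
    if not counts or sum(counts) == 0:
        return 1.0  # nothing dispatched, so nothing is skewed
    mean = sum(counts) / len(counts)
    return max(counts) / mean


print(distribution_skew({"us-east-1a": 300, "us-east-1b": 0, "us-east-1c": 0}))
```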
Advanced Topics
Dynamic Isolation Group Management
Automate isolation group updates based on metrics:
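import json
import subprocess

THRESHOLD = 0.9  # Assumed minimum acceptable health score; tune for your setup

def auto_drain_unhealthy_zone(get_zone_health, send_alert):
    """Drain us-east-1a if its health falls below THRESHOLD.

    get_zone_health and send_alert are hooks into your own monitoring
    and alerting systems.
    """
    # Monitor zone health
    health = get_zone_health("us-east-1a")
    if health >= THRESHOLD:
        return
    # Drain unhealthy zone via the CLI command shown earlier
    config = {
        "isolationGroups": [
            {"name": "us-east-1a", "state": "DRAINING"},
            {"name": "us-east-1b", "state": "HEALTHY"},
            {"name": "us-east-1c", "state": "HEALTHY"},
        ]
    }
    subprocess.run(
        ["cadence", "admin", "cluster", "update-global-isolation-groups",
         "--json", json.dumps(config)],
        check=True,
    )
    # Alert operations team
    send_alert("Zone us-east-1a drained due to health issues")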
Cross-Cluster Isolation
Use isolation groups for cross-cluster routing in active-active setups:
# Cluster 1 (US)
isolationGroups:
- name: "cluster-us"
state: "HEALTHY"
# Cluster 2 (EU)
isolationGroups:
- name: "cluster-eu"
state: "HEALTHY"
Workers connect to their local cluster but can handle cross-cluster tasks if needed.