
Overview

The orchestrator is the central coordination component that owns the poll tick, maintains in-memory runtime state, and decides which issues to dispatch, retry, stop, or release. It is the only component that mutates scheduling state.
All worker outcomes are reported back to the orchestrator and converted into explicit state transitions.

Issue Orchestration States

These are the service’s internal claim states, distinct from tracker states (Todo, In Progress, etc.).

Unclaimed

Issue is not running and has no retry scheduled.

Claimed

Orchestrator has reserved the issue to prevent duplicate dispatch. In practice, claimed issues are either Running or RetryQueued.

Running

A worker task exists and the issue is tracked in the running map.

RetryQueued

No worker is running, but a retry timer exists in retry_attempts.

Released

The claim has been removed because the issue is terminal, non-active, or missing, or because the retry path completed without re-dispatch.
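The five states above can be modeled as a simple enum. This is an illustrative sketch, not the actual source: the names mirror this page, and the `is_claimed` helper encodes the nuance that a claimed issue is, in practice, either running or retry-queued.

```python
from enum import Enum, auto

class ClaimState(Enum):
    """Internal orchestration claim states (names assumed from this doc)."""
    UNCLAIMED = auto()
    CLAIMED = auto()       # reserved; in practice refined to RUNNING or RETRY_QUEUED
    RUNNING = auto()
    RETRY_QUEUED = auto()
    RELEASED = auto()

def is_claimed(state: ClaimState) -> bool:
    # A claimed issue is either actively running or waiting on a retry timer.
    return state in (ClaimState.RUNNING, ClaimState.RETRY_QUEUED)
```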

Run Attempt Lifecycle

A run attempt transitions through these phases:

PreparingWorkspace

Creating or reusing workspace directory

BuildingPrompt

Rendering prompt template with issue context

LaunchingAgentProcess

Starting the Codex app-server subprocess

InitializingSession

Completing session startup handshake

StreamingTurn

Processing agent turn events

Finishing

Cleanup and final state determination

Succeeded

Normal completion

Failed

Failed with error

TimedOut

Exceeded turn timeout

Stalled

No activity within stall timeout

CanceledByReconciliation

Stopped due to state change

Distinct terminal reasons are important because retry logic and logs differ based on the failure mode.
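Because the terminal reasons drive different retry paths, the distinction can be sketched as a small mapping. This is a hypothetical illustration, assuming (per the trigger descriptions below) that a clean exit leads to a short continuation retry, a reconciliation cancel releases the claim, and other failures get exponential backoff:

```python
from enum import Enum, auto

class RunOutcome(Enum):
    """Terminal phases of a run attempt (names follow this doc)."""
    SUCCEEDED = auto()
    FAILED = auto()
    TIMED_OUT = auto()
    STALLED = auto()
    CANCELED_BY_RECONCILIATION = auto()

def retry_mode(outcome: RunOutcome) -> str:
    # Assumed mapping: clean exit -> fixed-delay continuation retry,
    # reconciliation cancel -> release the claim, everything else -> backoff retry.
    if outcome is RunOutcome.SUCCEEDED:
        return "continuation"
    if outcome is RunOutcome.CANCELED_BY_RECONCILIATION:
        return "release"
    return "backoff"
```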

Transition Triggers

The orchestrator responds to these event triggers:

Poll Tick

  • Reconcile active runs
  • Validate config
  • Fetch candidate issues
  • Dispatch until slots are exhausted

Worker Exit (Normal)

  • Remove running entry
  • Update aggregate runtime totals
  • Schedule a continuation retry (attempt 1) after the worker finishes or exhausts its in-process turn loop

Worker Exit (Abnormal)

  • Remove running entry
  • Update aggregate runtime totals
  • Schedule exponential-backoff retry

Codex Update Event

  • Update live session fields, token counters, and rate limits

Retry Timer Fired

  • Re-fetch active candidates and attempt re-dispatch, or release claim if no longer eligible

Reconciliation State Refresh

  • Stop runs whose issue states are terminal or no longer active

Stall Timeout

  • Kill worker and schedule retry

Polling and Scheduling

Poll Loop

At startup, the service validates config, performs startup cleanup, schedules an immediate tick, and then repeats every polling.interval_ms.
If per-tick validation fails, dispatch is skipped for that tick, but reconciliation still happens first.
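The tick ordering described above (reconcile first, then validate, then dispatch) can be sketched as follows. The `orch` object and its method names are stand-ins invented for illustration, not the real API:

```python
def poll_tick(orch) -> None:
    """One poll tick in the documented order (sketch; `orch` is a stand-in).

    Reconciliation always runs first; dispatch is skipped for the tick
    when per-tick config validation fails.
    """
    orch.reconcile_active_runs()
    if not orch.validate_config():
        return  # reconciliation already happened; only dispatch is skipped
    for issue in orch.fetch_candidates():
        if orch.available_slots() <= 0:
            break  # dispatch until slots are exhausted
        orch.dispatch(issue)
```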

Candidate Selection Rules

An issue is dispatch-eligible only if all of the following are true:
  • It has id, identifier, title, and state
  • Its state is in active_states and not in terminal_states
  • It is not already in running
  • It is not already in claimed
  • Global concurrency slots are available
  • Per-state concurrency slots are available
  • Blocker rule for Todo state: If the issue state is Todo, do not dispatch when any blocker is non-terminal
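A minimal sketch of these rules, with concurrency-slot checks deliberately left to the scheduler (they depend on live counts, not the issue alone). Field and key names are assumptions taken from this page, and `blocker_states` is a hypothetical list of the blockers' current states:

```python
def is_dispatch_eligible(issue, cfg, running, claimed, blocker_states):
    """Candidate-selection sketch; slot availability is checked separately."""
    # Required fields must all be present and non-empty.
    if not all(issue.get(field) for field in ("id", "identifier", "title", "state")):
        return False
    state = issue["state"]
    # State must be active and not terminal.
    if state not in cfg["active_states"] or state in cfg["terminal_states"]:
        return False
    # No duplicate dispatch: skip issues already running or claimed.
    if issue["id"] in running or issue["id"] in claimed:
        return False
    # Todo blocker rule: defer while any blocker is non-terminal.
    if state == "Todo" and any(s not in cfg["terminal_states"] for s in blocker_states):
        return False
    return True
```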

Sorting Order

Issues are sorted by:
  1. priority ascending (1..4 are preferred; null/unknown sorts last)
  2. created_at oldest first
  3. identifier lexicographic tie-breaker
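The three-level ordering can be expressed as a single sort key. This is an illustrative sketch assuming dict-shaped issues with `priority`, `created_at`, and `identifier` fields:

```python
def sort_key(issue):
    """Sort key: known priority 1..4 first (ascending), then oldest
    created_at, then identifier as a lexicographic tie-breaker."""
    priority = issue.get("priority")
    known = priority in (1, 2, 3, 4)
    return (
        0 if known else 1,            # null/unknown priority sorts last
        priority if known else 0,     # ascending among known priorities
        issue["created_at"],          # oldest first
        issue["identifier"],          # lexicographic tie-breaker
    )
```

Applied as `candidates.sort(key=sort_key)` before dispatch.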

Concurrency Control

Global Limit

available_slots = max(max_concurrent_agents - running_count, 0)

Per-State Limit

max_concurrent_agents_by_state[state] if present (state key normalized), otherwise fall back to the global limit.

The runtime counts issues by their current tracked state in the running map.
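Combining the two limits, the effective slot count for a state is the minimum of global and per-state headroom. A sketch under stated assumptions: config key names follow this page, and "normalized" is assumed to mean lowercased:

```python
def available_slots(cfg, running_by_state, state):
    """Effective free slots for `state` (sketch; key names assumed)."""
    running_total = sum(running_by_state.values())
    global_free = max(cfg["max_concurrent_agents"] - running_total, 0)
    # Per-state cap: normalized state key if configured, else the global limit.
    by_state = cfg.get("max_concurrent_agents_by_state", {})
    per_state_cap = by_state.get(state.lower(), cfg["max_concurrent_agents"])
    state_free = max(per_state_cap - running_by_state.get(state, 0), 0)
    return min(global_free, state_free)
```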

Retry and Backoff

Retry Entry Creation

  • Cancel any existing retry timer for the same issue
  • Store attempt, identifier, error, due_at_ms, and new timer handle

Backoff Formula

Continuation Retries

Normal continuation retries after a clean worker exit use a short fixed delay of 1000 ms.

Failure Retries

Failure-driven retries use:

delay = min(10000 * 2^(attempt - 1), agent.max_retry_backoff_ms)

The exponential term is capped by the configured max retry backoff (default 300000 ms, i.e. 5 minutes).
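Both delay rules fit in one small function. This is a sketch of the formulas above, with the constants taken from this page:

```python
CONTINUATION_DELAY_MS = 1_000       # fixed delay after a clean worker exit
BASE_DELAY_MS = 10_000              # first failure-driven retry
DEFAULT_MAX_BACKOFF_MS = 300_000    # default agent.max_retry_backoff_ms (5 min)

def retry_delay_ms(attempt: int, failure: bool,
                   max_backoff_ms: int = DEFAULT_MAX_BACKOFF_MS) -> int:
    """Return the retry delay for the given attempt number (1-based)."""
    if not failure:
        return CONTINUATION_DELAY_MS
    # delay = min(10000 * 2^(attempt - 1), agent.max_retry_backoff_ms)
    return min(BASE_DELAY_MS * 2 ** (attempt - 1), max_backoff_ms)
```

With the defaults, failure retries grow 10 s, 20 s, 40 s, ... until the 5-minute cap.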

Retry Handling Behavior

Terminal-state workspace cleanup is handled by startup cleanup and active-run reconciliation, not by retry handling itself.

Active Run Reconciliation

Reconciliation runs every tick and has two parts:

Part A: Stall Detection

  • For each running issue, compute elapsed_ms since:
    • last_codex_timestamp if any event has been seen, else
    • started_at
  • If elapsed_ms > codex.stall_timeout_ms, terminate the worker and queue a retry
  • If stall_timeout_ms <= 0, skip stall detection entirely
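The stall check above reduces to a single comparison. A sketch assuming a dict-shaped run record with the field names used on this page:

```python
def is_stalled(run, now_ms: int, stall_timeout_ms: int) -> bool:
    """Part A sketch: has this run gone quiet past the stall timeout?"""
    if stall_timeout_ms <= 0:
        return False  # stall detection disabled
    # Elapsed since last Codex event if any, else since the run started.
    last_activity = run.get("last_codex_timestamp") or run["started_at"]
    return now_ms - last_activity > stall_timeout_ms
```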

Part B: Tracker State Refresh

Startup Terminal Workspace Cleanup

When the service starts:
  1. Query tracker for issues in terminal states
  2. For each returned issue identifier, remove the corresponding workspace directory
  3. If the terminal-issues fetch fails, log a warning and continue startup
This prevents stale terminal workspaces from accumulating after restarts.
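The three startup-cleanup steps can be sketched as follows. The function name, the callable for fetching terminal issues, and the per-identifier workspace layout are all assumptions for illustration:

```python
import logging
import shutil
from pathlib import Path

def startup_cleanup(fetch_terminal_issues, workspace_root: Path) -> None:
    """Remove workspaces for terminal issues at startup (sketch)."""
    try:
        issues = fetch_terminal_issues()  # step 1: query tracker
    except Exception as exc:
        # Step 3: a failed fetch is a warning, never a startup blocker.
        logging.warning("terminal-issues fetch failed, continuing startup: %s", exc)
        return
    for issue in issues:
        # Step 2: workspace dir assumed to be named after the issue identifier.
        shutil.rmtree(workspace_root / issue["identifier"], ignore_errors=True)
```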

Idempotency and Recovery Rules

Single Authority

The orchestrator serializes state mutations through one authority to avoid duplicate dispatch.

Required Checks

claimed and running checks are required before launching any worker.

Reconciliation First

Reconciliation runs before dispatch on every tick.

Restart Recovery

Recovery is tracker-driven and filesystem-driven; no durable orchestrator database is required.

Startup Cleanup

Removes stale workspaces for issues already in terminal states.
