Overview
The orchestrator is the central coordination component that owns the poll tick, maintains in-memory runtime state, and decides which issues to dispatch, retry, stop, or release. It is the only component that mutates scheduling state. All worker outcomes are reported back to the orchestrator and converted into explicit state transitions.
Issue Orchestration States
These are the service’s internal claim states, distinct from tracker states (Todo, In Progress, etc.).
Unclaimed
Issue is not running and has no retry scheduled.
Claimed
Orchestrator has reserved the issue to prevent duplicate dispatch. In practice, claimed issues are either Running or RetryQueued.
Running
Worker task exists and the issue is tracked in the `running` map.
RetryQueued
Worker is not running, but a retry timer exists in `retry_attempts`.
Released
Claim removed because issue is terminal, non-active, missing, or retry path completed without re-dispatch.
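A minimal sketch of how these claim states could be derived from the two runtime maps described above (the map names `running` and `retry_attempts` come from this section; the helper functions themselves are illustrative, not the actual implementation):

```python
from enum import Enum, auto

class ClaimState(Enum):
    UNCLAIMED = auto()
    RUNNING = auto()
    RETRY_QUEUED = auto()

def claim_state(issue_id, running, retry_attempts):
    """Derive the orchestration state of an issue from the runtime maps."""
    if issue_id in running:
        return ClaimState.RUNNING
    if issue_id in retry_attempts:
        return ClaimState.RETRY_QUEUED
    return ClaimState.UNCLAIMED

def is_claimed(issue_id, running, retry_attempts):
    """An issue is claimed while it is either Running or RetryQueued."""
    return claim_state(issue_id, running, retry_attempts) is not ClaimState.UNCLAIMED
```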
Important Nuance
Claimed is not a separately stored state: an issue counts as claimed exactly while it is Running or RetryQueued, so releasing a claim means it appears in neither the `running` map nor `retry_attempts`.
Run Attempt Lifecycle
A run attempt transitions through these phases:
PreparingWorkspace
Creating or reusing workspace directory
BuildingPrompt
Rendering prompt template with issue context
LaunchingAgentProcess
Starting the Codex app-server subprocess
InitializingSession
Completing session startup handshake
StreamingTurn
Processing agent turn events
Finishing
Cleanup and final state determination
Succeeded
Normal completion
Failed
Failed with error
TimedOut
Exceeded turn timeout
Stalled
No activity within stall timeout
CanceledByReconciliation
Stopped due to state change
Distinct terminal reasons are important because retry logic and logs differ based on the failure mode.
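The phases and terminal reasons can be written down as enums (a sketch; the string values mirror the names listed above and are not a confirmed wire format):

```python
from enum import Enum

class RunPhase(Enum):
    PREPARING_WORKSPACE = "PreparingWorkspace"
    BUILDING_PROMPT = "BuildingPrompt"
    LAUNCHING_AGENT_PROCESS = "LaunchingAgentProcess"
    INITIALIZING_SESSION = "InitializingSession"
    STREAMING_TURN = "StreamingTurn"
    FINISHING = "Finishing"

class TerminalReason(Enum):
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"
    TIMED_OUT = "TimedOut"
    STALLED = "Stalled"
    CANCELED_BY_RECONCILIATION = "CanceledByReconciliation"
```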
Transition Triggers
The orchestrator responds to these event triggers:
Poll Tick
- Reconcile active runs
- Validate config
- Fetch candidate issues
- Dispatch until slots are exhausted
Worker Exit (Normal)
- Remove running entry
- Update aggregate runtime totals
- Schedule continuation retry (attempt 1) after the worker exhausts or finishes its in-process turn loop
Worker Exit (Abnormal)
- Remove running entry
- Update aggregate runtime totals
- Schedule exponential-backoff retry
Codex Update Event
- Update live session fields, token counters, and rate limits
Retry Timer Fired
- Re-fetch active candidates and attempt re-dispatch, or release claim if no longer eligible
Reconciliation State Refresh
- Stop runs whose issue states are terminal or no longer active
Stall Timeout
- Kill worker and schedule retry
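The trigger handling above can be sketched as one dispatch function. All method names, event kinds, and field names below are illustrative assumptions about the handler surface, not the actual API:

```python
def handle_event(orch, event):
    # Handler and field names are hypothetical; the mapping follows the
    # trigger list above.
    kind = event["kind"]
    if kind == "poll_tick":
        orch.reconcile_active_runs()            # reconciliation always first
        orch.dispatch_until_slots_exhausted()
    elif kind == "worker_exit":
        orch.remove_running(event["issue_id"])
        orch.update_runtime_totals(event["issue_id"])
        if event["clean"]:
            orch.schedule_continuation_retry(event["issue_id"])
        else:
            orch.schedule_backoff_retry(event["issue_id"])
    elif kind == "codex_update":
        orch.update_session_fields(event)       # tokens, rate limits, etc.
    elif kind == "retry_timer_fired":
        orch.redispatch_or_release(event["issue_id"])
    elif kind == "reconciliation_refresh":
        orch.stop_inactive_runs()
    elif kind == "stall_timeout":
        orch.kill_worker(event["issue_id"])
        orch.schedule_backoff_retry(event["issue_id"])
```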
Polling and Scheduling
Poll Loop
At startup, the service validates config, performs startup cleanup, schedules an immediate tick, and then repeats every `polling.interval_ms`.
If per-tick validation fails, dispatch is skipped for that tick, but reconciliation still happens first.
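A single tick can be sketched like this (the method names on `svc` are assumptions; only the ordering — reconciliation before validation-gated dispatch — comes from the description above):

```python
def poll_tick(svc):
    """One orchestrator tick: reconciliation always runs before dispatch."""
    svc.reconcile_active_runs()
    if not svc.validate_config():
        return 0  # per-tick validation failed: skip dispatch this tick
    dispatched = 0
    for issue in svc.fetch_candidates():
        if svc.available_slots() <= 0:
            break  # dispatch until slots are exhausted
        svc.dispatch(issue)
        dispatched += 1
    return dispatched
```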
Candidate Selection Rules
An issue is dispatch-eligible only if all of the following are true:
- It has `id`, `identifier`, `title`, and `state`
- Its state is in `active_states` and not in `terminal_states`
- It is not already in `running`
- It is not already in `claimed`
- Global concurrency slots are available
- Per-state concurrency slots are available
- Blocker rule for the `Todo` state: if the issue state is `Todo`, do not dispatch while any blocker is non-terminal
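The rules above can be collected into one predicate. The config keys, runtime-map names, and issue fields below are illustrative placeholders shaped after this section, not the actual schema:

```python
def is_dispatch_eligible(issue, cfg, runtime):
    # Field and key names are assumptions for illustration.
    if any(not issue.get(k) for k in ("id", "identifier", "title", "state")):
        return False
    state = issue["state"]
    if state not in cfg["active_states"] or state in cfg["terminal_states"]:
        return False
    if issue["id"] in runtime["running"] or issue["id"] in runtime["claimed"]:
        return False
    if runtime["available_slots"] <= 0 or runtime["state_slots"].get(state, 0) <= 0:
        return False
    if state == "Todo":
        # Blocker rule: any non-terminal blocker vetoes dispatch.
        blockers = issue.get("blocker_states", [])
        if any(b not in cfg["terminal_states"] for b in blockers):
            return False
    return True
```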
Sorting Order
Issues are sorted by:
- `priority` ascending (1..4 are preferred; null/unknown sorts last)
- `created_at` oldest first
- `identifier` as a lexicographic tie-breaker
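The three-level ordering can be expressed as a single sort key (field names assumed to match the bullets above):

```python
def candidate_sort_key(issue):
    # priority 1..4 sorts first (ascending); None/unknown sorts last
    p = issue.get("priority")
    priority_key = (0, p) if isinstance(p, int) and 1 <= p <= 4 else (1, 0)
    # then oldest created_at, then identifier as tie-breaker
    return (priority_key, issue.get("created_at", ""), issue.get("identifier", ""))
```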
Concurrency Control
Global Limit
available_slots = max(max_concurrent_agents - running_count, 0)
Per-State Limit
`max_concurrent_agents_by_state[state]` if present (state key normalized); otherwise fall back to the global limit. The runtime counts issues by their current tracked state in the `running` map.
Retry and Backoff
Retry Entry Creation
- Cancel any existing retry timer for the same issue
- Store `attempt`, `identifier`, `error`, `due_at_ms`, and the new timer handle
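A sketch of retry-entry creation using a `threading.Timer` as the timer handle; the entry fields mirror the list above, while the function signature and dict layout are assumptions:

```python
import threading
import time

def schedule_retry(retry_attempts, issue_id, *, attempt, identifier, error,
                   delay_ms, fire):
    """Replace any pending retry timer for this issue with a new one."""
    existing = retry_attempts.pop(issue_id, None)
    if existing:
        existing["timer"].cancel()  # cancel the prior timer for the same issue
    timer = threading.Timer(delay_ms / 1000.0, fire, args=(issue_id,))
    retry_attempts[issue_id] = {
        "attempt": attempt,
        "identifier": identifier,
        "error": error,
        "due_at_ms": int(time.time() * 1000) + delay_ms,
        "timer": timer,
    }
    timer.start()
```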
Backoff Formula
Continuation Retries
Normal continuation retries after a clean worker exit use a short fixed delay of 1000 ms.
Failure Retries
Failure-driven retries use:
delay = min(10000 * 2^(attempt - 1), agent.max_retry_backoff_ms)
The exponential term is capped by the configured max retry backoff (default 300000 ms, i.e. 5 minutes).
Retry Handling Behavior
Terminal-state workspace cleanup is handled by startup cleanup and active-run reconciliation, not by retry handling itself.
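The two delay rules above (fixed continuation delay vs. capped exponential backoff) can be combined into one helper; only the helper's name and signature are invented:

```python
def retry_delay_ms(clean_exit, attempt, max_retry_backoff_ms=300_000):
    """Delay before re-dispatch, per the rules above.

    clean_exit=True -> continuation retry: fixed 1000 ms.
    Otherwise       -> failure retry: 10s, 20s, 40s, ... capped at the max.
    """
    if clean_exit:
        return 1000
    return min(10_000 * 2 ** (attempt - 1), max_retry_backoff_ms)
```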
Active Run Reconciliation
Reconciliation runs every tick and has two parts:
Part A: Stall Detection
- For each running issue, compute `elapsed_ms` since `last_codex_timestamp` if any event has been seen, else since `started_at`
- If `elapsed_ms > codex.stall_timeout_ms`, terminate the worker and queue a retry
- If `stall_timeout_ms <= 0`, skip stall detection entirely
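The stall check above reduces to a small predicate (field names mirror the description but are assumptions about the run record):

```python
def is_stalled(run, now_ms, stall_timeout_ms):
    # Run-record field names are illustrative.
    if stall_timeout_ms <= 0:
        return False  # stall detection disabled
    # Prefer the last Codex event timestamp; fall back to the start time.
    reference = run.get("last_codex_timestamp") or run["started_at"]
    return (now_ms - reference) > stall_timeout_ms
```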
Part B: Tracker State Refresh
- Re-fetch tracker state for running issues and stop runs whose issue states are terminal or no longer active
Startup Terminal Workspace Cleanup
When the service starts:
- Query tracker for issues in terminal states
- For each returned issue identifier, remove the corresponding workspace directory
- If the terminal-issues fetch fails, log a warning and continue startup
This prevents stale terminal workspaces from accumulating after restarts.
Idempotency and Recovery Rules
Single Authority
The orchestrator serializes state mutations through one authority to avoid duplicate dispatch.
Required Checks
`claimed` and `running` checks are required before launching any worker.
Reconciliation First
Reconciliation runs before dispatch on every tick.
Restart Recovery
Tracker-driven and filesystem-driven (no durable orchestrator DB required).
Startup Cleanup
Removes stale workspaces for issues already in terminal states.