
Overview

The orchestrator is the central coordination component that owns the poll tick, maintains in-memory runtime state, and decides which issues to dispatch, retry, stop, or release. It is the only component that mutates scheduling state.
All worker outcomes are reported back to the orchestrator and converted into explicit state transitions.

Issue Orchestration States

These are the service’s internal claim states, distinct from tracker states (Todo, In Progress, etc.).

Unclaimed

Issue is not running and has no retry scheduled.

Claimed

Orchestrator has reserved the issue to prevent duplicate dispatch. In practice, claimed issues are either Running or RetryQueued.

Running

A worker task exists and the issue is tracked in the running map.

RetryQueued

No worker is running, but a retry timer exists in retry_attempts.

Released

The claim has been removed because the issue is terminal, non-active, or missing, or because the retry path completed without re-dispatch.
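The five states above can be modeled as a simple enum. This is an illustrative sketch, not the actual source: the names mirror this page, and the `is_claimed` helper encodes the nuance that a claimed issue is, in practice, either running or retry-queued.

```python
from enum import Enum, auto

class ClaimState(Enum):
    """Internal orchestration claim states (names assumed from this doc)."""
    UNCLAIMED = auto()
    CLAIMED = auto()       # reserved; in practice refined to RUNNING or RETRY_QUEUED
    RUNNING = auto()
    RETRY_QUEUED = auto()
    RELEASED = auto()

def is_claimed(state: ClaimState) -> bool:
    # A claimed issue is either actively running or waiting on a retry timer.
    return state in (ClaimState.RUNNING, ClaimState.RETRY_QUEUED)
```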

Run Attempt Lifecycle

A run attempt transitions through these phases:

PreparingWorkspace

Creating or reusing workspace directory

BuildingPrompt

Rendering prompt template with issue context

LaunchingAgentProcess

Starting the Codex app-server subprocess

InitializingSession

Completing session startup handshake

StreamingTurn

Processing agent turn events

Finishing

Cleanup and final state determination

Succeeded

Normal completion

Failed

Failed with error

TimedOut

Exceeded turn timeout

Stalled

No activity within stall timeout

CanceledByReconciliation

Stopped due to state change

Distinct terminal reasons are important because retry logic and logs differ based on the failure mode.
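Because the terminal reasons drive different retry paths, the distinction can be sketched as a small mapping. This is a hypothetical illustration, assuming (per the trigger descriptions below) that a clean exit leads to a short continuation retry, a reconciliation cancel releases the claim, and other failures get exponential backoff:

```python
from enum import Enum, auto

class RunOutcome(Enum):
    """Terminal phases of a run attempt (names follow this doc)."""
    SUCCEEDED = auto()
    FAILED = auto()
    TIMED_OUT = auto()
    STALLED = auto()
    CANCELED_BY_RECONCILIATION = auto()

def retry_mode(outcome: RunOutcome) -> str:
    # Assumed mapping: clean exit -> fixed-delay continuation retry,
    # reconciliation cancel -> release the claim, everything else -> backoff retry.
    if outcome is RunOutcome.SUCCEEDED:
        return "continuation"
    if outcome is RunOutcome.CANCELED_BY_RECONCILIATION:
        return "release"
    return "backoff"
```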

Transition Triggers

The orchestrator responds to these event triggers:

Poll Tick

  • Reconcile active runs
  • Validate config
  • Fetch candidate issues
  • Dispatch until slots are exhausted

Worker Exit (Normal)

  • Remove running entry
  • Update aggregate runtime totals
  • Schedule a continuation retry (attempt 1) after the worker finishes or exhausts its in-process turn loop

Worker Exit (Abnormal)

  • Remove running entry
  • Update aggregate runtime totals
  • Schedule exponential-backoff retry

Codex Update Event

  • Update live session fields, token counters, and rate limits

Retry Timer Fired

  • Re-fetch active candidates and attempt re-dispatch, or release claim if no longer eligible

Reconciliation State Refresh

  • Stop runs whose issue states are terminal or no longer active

Stall Timeout

  • Kill worker and schedule retry

Polling and Scheduling

Poll Loop

At startup, the service validates config, performs startup cleanup, schedules an immediate tick, and then repeats every polling.interval_ms.
If per-tick validation fails, dispatch is skipped for that tick, but reconciliation still happens first.
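The tick ordering described above (reconcile first, then validate, then dispatch) can be sketched as follows. The `orch` object and its method names are stand-ins invented for illustration, not the real API:

```python
def poll_tick(orch) -> None:
    """One poll tick in the documented order (sketch; `orch` is a stand-in).

    Reconciliation always runs first; dispatch is skipped for the tick
    when per-tick config validation fails.
    """
    orch.reconcile_active_runs()
    if not orch.validate_config():
        return  # reconciliation already happened; only dispatch is skipped
    for issue in orch.fetch_candidates():
        if orch.available_slots() <= 0:
            break  # dispatch until slots are exhausted
        orch.dispatch(issue)
```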

Candidate Selection Rules

An issue is dispatch-eligible only if all of the following are true:
  • It has id, identifier, title, and state
  • Its state is in active_states and not in terminal_states
  • It is not already in running
  • It is not already in claimed
  • Global concurrency slots are available
  • Per-state concurrency slots are available
  • Blocker rule for Todo state: If the issue state is Todo, do not dispatch when any blocker is non-terminal
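A minimal sketch of these rules, with concurrency-slot checks deliberately left to the scheduler (they depend on live counts, not the issue alone). Field and key names are assumptions taken from this page, and `blocker_states` is a hypothetical list of the blockers' current states:

```python
def is_dispatch_eligible(issue, cfg, running, claimed, blocker_states):
    """Candidate-selection sketch; slot availability is checked separately."""
    # Required fields must all be present and non-empty.
    if not all(issue.get(field) for field in ("id", "identifier", "title", "state")):
        return False
    state = issue["state"]
    # State must be active and not terminal.
    if state not in cfg["active_states"] or state in cfg["terminal_states"]:
        return False
    # No duplicate dispatch: skip issues already running or claimed.
    if issue["id"] in running or issue["id"] in claimed:
        return False
    # Todo blocker rule: defer while any blocker is non-terminal.
    if state == "Todo" and any(s not in cfg["terminal_states"] for s in blocker_states):
        return False
    return True
```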

Sorting Order

Issues are sorted by:
  1. priority ascending (1..4 are preferred; null/unknown sorts last)
  2. created_at oldest first
  3. identifier lexicographic tie-breaker
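The three-level ordering can be expressed as a single sort key. This is an illustrative sketch assuming dict-shaped issues with `priority`, `created_at`, and `identifier` fields:

```python
def sort_key(issue):
    """Sort key: known priority 1..4 first (ascending), then oldest
    created_at, then identifier as a lexicographic tie-breaker."""
    priority = issue.get("priority")
    known = priority in (1, 2, 3, 4)
    return (
        0 if known else 1,            # null/unknown priority sorts last
        priority if known else 0,     # ascending among known priorities
        issue["created_at"],          # oldest first
        issue["identifier"],          # lexicographic tie-breaker
    )
```

Applied as `candidates.sort(key=sort_key)` before dispatch.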

Concurrency Control

Global Limit

available_slots = max(max_concurrent_agents - running_count, 0)

Per-State Limit

max_concurrent_agents_by_state[state] if present (state key normalized), otherwise fall back to the global limit.

The runtime counts issues by their current tracked state in the running map.
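Combining the two limits, the effective slot count for a state is the minimum of global and per-state headroom. A sketch under stated assumptions: config key names follow this page, and "normalized" is assumed to mean lowercased:

```python
def available_slots(cfg, running_by_state, state):
    """Effective free slots for `state` (sketch; key names assumed)."""
    running_total = sum(running_by_state.values())
    global_free = max(cfg["max_concurrent_agents"] - running_total, 0)
    # Per-state cap: normalized state key if configured, else the global limit.
    by_state = cfg.get("max_concurrent_agents_by_state", {})
    per_state_cap = by_state.get(state.lower(), cfg["max_concurrent_agents"])
    state_free = max(per_state_cap - running_by_state.get(state, 0), 0)
    return min(global_free, state_free)
```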

Retry and Backoff

Retry Entry Creation

  • Cancel any existing retry timer for the same issue
  • Store attempt, identifier, error, due_at_ms, and new timer handle

Backoff Formula

Continuation Retries

Normal continuation retries after a clean worker exit use a short fixed delay of 1000 ms.

Failure Retries

Failure-driven retries use:

delay = min(10000 * 2^(attempt - 1), agent.max_retry_backoff_ms)

The exponential term is capped by the configured max retry backoff (default 300000 ms, i.e. 5 minutes).
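Both delay rules fit in one small function. This is a sketch of the formulas above, with the constants taken from this page:

```python
CONTINUATION_DELAY_MS = 1_000       # fixed delay after a clean worker exit
BASE_DELAY_MS = 10_000              # first failure-driven retry
DEFAULT_MAX_BACKOFF_MS = 300_000    # default agent.max_retry_backoff_ms (5 min)

def retry_delay_ms(attempt: int, failure: bool,
                   max_backoff_ms: int = DEFAULT_MAX_BACKOFF_MS) -> int:
    """Return the retry delay for the given attempt number (1-based)."""
    if not failure:
        return CONTINUATION_DELAY_MS
    # delay = min(10000 * 2^(attempt - 1), agent.max_retry_backoff_ms)
    return min(BASE_DELAY_MS * 2 ** (attempt - 1), max_backoff_ms)
```

With the defaults, failure retries grow 10 s, 20 s, 40 s, ... until the 5-minute cap.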

Retry Handling Behavior

Terminal-state workspace cleanup is handled by startup cleanup and active-run reconciliation, not by retry handling itself.

Active Run Reconciliation

Reconciliation runs every tick and has two parts:

Part A: Stall Detection

  • For each running issue, compute elapsed_ms since:
    • last_codex_timestamp if any event has been seen, else
    • started_at
  • If elapsed_ms > codex.stall_timeout_ms, terminate the worker and queue a retry
  • If stall_timeout_ms <= 0, skip stall detection entirely
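The stall check above reduces to a single comparison. A sketch assuming a dict-shaped run record with the field names used on this page:

```python
def is_stalled(run, now_ms: int, stall_timeout_ms: int) -> bool:
    """Part A sketch: has this run gone quiet past the stall timeout?"""
    if stall_timeout_ms <= 0:
        return False  # stall detection disabled
    # Elapsed since last Codex event if any, else since the run started.
    last_activity = run.get("last_codex_timestamp") or run["started_at"]
    return now_ms - last_activity > stall_timeout_ms
```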

Part B: Tracker State Refresh

Startup Terminal Workspace Cleanup

When the service starts:
  1. Query tracker for issues in terminal states
  2. For each returned issue identifier, remove the corresponding workspace directory
  3. If the terminal-issues fetch fails, log a warning and continue startup
This prevents stale terminal workspaces from accumulating after restarts.
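The three startup-cleanup steps can be sketched as follows. The function name, the callable for fetching terminal issues, and the per-identifier workspace layout are all assumptions for illustration:

```python
import logging
import shutil
from pathlib import Path

def startup_cleanup(fetch_terminal_issues, workspace_root: Path) -> None:
    """Remove workspaces for terminal issues at startup (sketch)."""
    try:
        issues = fetch_terminal_issues()  # step 1: query tracker
    except Exception as exc:
        # Step 3: a failed fetch is a warning, never a startup blocker.
        logging.warning("terminal-issues fetch failed, continuing startup: %s", exc)
        return
    for issue in issues:
        # Step 2: workspace dir assumed to be named after the issue identifier.
        shutil.rmtree(workspace_root / issue["identifier"], ignore_errors=True)
```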

Idempotency and Recovery Rules

Single Authority

The orchestrator serializes state mutations through one authority to avoid duplicate dispatch.

Required Checks

claimed and running checks are required before launching any worker.

Reconciliation First

Reconciliation runs before dispatch on every tick.

Restart Recovery

Recovery is tracker-driven and filesystem-driven; no durable orchestrator database is required.

Startup Cleanup

Removes stale workspaces for issues already in terminal states.
