Retry and Timeout Policies for Resilient Aether Tasks

Distributed workflows fail. Network partitions, transient service errors, resource exhaustion, and external API timeouts are facts of life in any production system. Aether provides two orthogonal mechanisms to handle these realities: retry policies that re-dispatch failed tasks automatically, and timeout policies that bound how long the engine will wait for a result. Both are declared directly on the workflow resource — no host-side polling, no wrapper scripts.

Retry Policy

Retry is configured on a DAG task node (call site) via the retry field. It applies only to leaf tasks (template type task). DAG and Loop container nodes do not support retry directly, because retry is handled by the leaf tasks they contain, each with its own policy.

Basic Configuration

{
  "dag": {
    "name": "main",
    "tasks": [
      {
        "name": "flaky",
        "template": "flaky-task",
        "retry": {
          "limit": 3,
          "expression": "tasks.flaky.phase != 'Succeeded'"
        }
      }
    ]
  }
}

| Field | Type | Description |
| --- | --- | --- |
| `limit` | int | Maximum number of retries. `0` disables retry. |
| `expression` | string | Boolean expression controlling whether to retry. If omitted, default rules apply. |

How Retry Works

Task completes with a non-success phase

The executor returns an ExecCode. The engine maps it to a Phase (e.g. PhaseError, PhaseFailed, PhaseTimeout).

Engine evaluates the retry policy

If retry.limit > 0 and the retry budget is not exhausted, the engine checks whether to retry. Default behavior: retry on PhaseError and PhaseTimeout only. PhaseFailed is not retried by default — it signals a known, deterministic failure.

Task is reset to PhaseCreated

If retry is warranted, the engine resets the task’s status to PhaseCreated, increments RetryCount, and re-dispatches it via the broker. The terminal state is never written for an in-progress retry.

Executor runs again with full original inputs

The task runs again with the same inputs. On the final attempt (when the budget is exhausted), the terminal phase is written and hooks fire.

The retry counter is exposed in Metrics.Retries on the completed task run.
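The decision flow in the steps above can be sketched in a few lines. This is a minimal illustration, not Aether's engine code: `RetryPolicy`, `TaskRun`, `should_retry`, and `on_attempt_finished` are hypothetical names, and the retry expression is modeled as a plain Python callable that receives the phase string.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Phase names follow the ones used in this guide; the classes and
# functions below are hypothetical illustrations.
DEFAULT_RETRY_PHASES = {"Error", "Timeout"}


@dataclass
class RetryPolicy:
    limit: int = 0
    # Stand-in for the retry expression; receives the phase string.
    expression: Optional[Callable[[str], bool]] = None


@dataclass
class TaskRun:
    phase: str = "Created"
    retry_count: int = 0


def should_retry(run: TaskRun, policy: RetryPolicy) -> bool:
    """Return True if the completed attempt should be re-dispatched."""
    if run.phase == "Succeeded":
        return False
    if policy.limit <= 0 or run.retry_count >= policy.limit:
        return False  # retry disabled or budget exhausted
    if policy.expression is not None:
        return policy.expression(run.phase)  # expression replaces the defaults
    return run.phase in DEFAULT_RETRY_PHASES


def on_attempt_finished(run: TaskRun, policy: RetryPolicy, phase: str) -> bool:
    """Apply one attempt's result; return True if the task was reset for retry."""
    run.phase = phase
    if should_retry(run, policy):
        run.phase = "Created"  # reset instead of keeping the terminal phase
        run.retry_count += 1
        return True
    return False  # the phase stays as the terminal result
```

With `RetryPolicy(limit=3)` and no expression, an attempt ending in `"Error"` resets the run to `"Created"`, while one ending in `"Failed"` is left as the terminal result.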

Default vs. Expression-Controlled Retry

Default behavior (no expression): the engine retries on PhaseError and PhaseTimeout. PhaseFailed is excluded because an executor explicitly returns Failed to signal a known, deterministic outcome, so retrying would likely produce the same result.

With an expression: the expression fully overrides the default phase filter, giving you complete control over which outcomes are retried:

{
  "retry": {
    "limit": 3,
    "expression": "tasks.flaky.phase != 'Succeeded'"
  }
}

This retries on any phase that is not Succeeded — including Failed. The expression context exposes:

| Variable | Description |
| --- | --- |
| `tasks.<name>.phase` | Phase string: `"Failed"`, `"Error"`, `"Timeout"`, etc. |
| `tasks.<name>.code` | Integer exit code from the executor |
| `tasks.<name>.msg` | Error message string |
| `tasks.<name>.outputs.parameters.<param>` | Output parameter value |
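To make the shape of the expression context concrete, the sketch below rebuilds it as nested Python namespaces. It only models the variable layout listed above. `build_context` is a hypothetical helper, the sample values are invented for illustration, and Aether's actual expression language is not Python.

```python
from types import SimpleNamespace


def build_context(name, phase, code, msg, output_params):
    """Build a nested namespace shaped like the documented variables."""
    task = SimpleNamespace(
        phase=phase,
        code=code,
        msg=msg,
        outputs=SimpleNamespace(parameters=SimpleNamespace(**output_params)),
    )
    return SimpleNamespace(tasks=SimpleNamespace(**{name: task}))


# Illustrative values only; they do not come from a real task run.
ctx = build_context("flaky", "Failed", 2, "connection reset", {"attempted": "true"})

# Python rendering of the expression "tasks.flaky.phase != 'Succeeded'".
retry_on_any_failure = ctx.tasks.flaky.phase != "Succeeded"
```

Here `retry_on_any_failure` evaluates to `True`, because the phase is `"Failed"` rather than `"Succeeded"`.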

Retry Exhausted

When RetryCount >= retry.limit, no further retry occurs. The task transitions to its terminal phase normally:

{
  "dag": {
    "name": "main",
    "tasks": [
      {
        "name": "doomed",
        "template": "always-fail",
        "retry": {
          "limit": 2,
          "expression": "tasks.doomed.phase != 'Succeeded'"
        }
      }
    ]
  }
}

A task that always fails runs three times in total (the initial attempt plus two retries) and then transitions to PhaseFailed, causing the DAG to fail unless continueOn.failed is set.
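The attempt arithmetic can be checked with a small simulation. This is a sketch under the assumption that every non-Succeeded phase is retried, as in the expression above. `run_with_retries` is a hypothetical helper, not an Aether API.

```python
def run_with_retries(attempt_fn, limit):
    """Run attempt_fn until it succeeds or `limit` retries are used up.

    Returns (final_phase, attempts). Mirrors the rule that retrying
    stops once retry_count >= limit.
    """
    retry_count = 0
    attempts = 0
    while True:
        attempts += 1
        phase = attempt_fn(attempts)
        if phase == "Succeeded" or retry_count >= limit:
            return phase, attempts
        retry_count += 1


# A task that always fails, with limit=2.
final_phase, attempts = run_with_retries(lambda n: "Failed", limit=2)
print(final_phase, attempts)  # Failed 3
```

A `limit` of N therefore allows at most N + 1 attempts: the initial run plus N retries.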

Per-Task Timeout

The timeout field on a DAG task node (or on a standalone task template) sets a deadline on the individual task run. The value is a duration string:

| Unit | Example | Duration |
| --- | --- | --- |
| `ms` | `"500ms"` | 500 milliseconds |
| `s` | `"30s"` | 30 seconds |
| `m` | `"5m"` | 5 minutes |
| `h` | `"2h"` | 2 hours |
| `d` | `"1d"` | 24 hours |

{
  "name": "wait-external",
  "dependencies": ["prepare"],
  "timeout": "1s",
  "continueOn": {
    "timeout": true
  },
  "inputs": {
    "parameters": [
      { "name": "suspend", "value": true }
    ]
  },
  "executor": { "type": "echo" }
}

In this example, the task suspends and waits for an external signal. If no Resume() arrives within one second, the timeout watchdog fires. Because continueOn.timeout is set, tasks that depend on wait-external (such as a downstream finalize task) still run.
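For tooling that needs to validate timeout values before submitting a workflow, the duration units listed in the table can be parsed as follows. This is a sketch, not Aether's parser: `parse_duration` is a hypothetical helper that accepts only a whole number followed by a single unit. Whether the engine also accepts compound or fractional values (for example `"1h30m"`) is not stated here, so the sketch rejects them.

```python
import re
from datetime import timedelta

# Units from the duration table; "d" is treated as exactly 24 hours.
_UNITS = {
    "ms": timedelta(milliseconds=1),
    "s": timedelta(seconds=1),
    "m": timedelta(minutes=1),
    "h": timedelta(hours=1),
    "d": timedelta(days=1),
}
# "ms" is listed first so it is tried before the single-letter units.
_PATTERN = re.compile(r"(\d+)(ms|s|m|h|d)")


def parse_duration(text: str) -> timedelta:
    """Parse a single-unit duration such as "500ms" or "2h"."""
    match = _PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"invalid duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNITS[unit]
```

For instance, `parse_duration("1s")` returns `timedelta(seconds=1)`, the deadline used by the wait-external task.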

What Happens on Task Timeout

Watchdog detects deadline exceeded

The timeout.Watcher fires Engine.OnTaskTimeout() for the expired task run ID.

Broker cancel is attempted

The engine calls broker.Cancel() as a best-effort fast-path to stop any in-progress executor work.

Task transitions to PhaseTimeout

OnTaskCompleted is invoked with ExecCodeTimeout. The engine writes PhaseTimeout to the task run.

Retry check runs

If the task has a retry policy that covers Timeout, a retry is triggered. Otherwise the timeout is terminal.

Scope advances

The engine re-evaluates the parent DAG. Tasks with continueOn.timeout unblock downstream nodes; others cause the DAG to fail.

Timeout handling is idempotent — if the task has already completed normally before the watchdog fires, OnTaskTimeout sees a terminal status and returns immediately without taking any action. In a multi-engine deployment, optimistic locking in the store ensures only one writer transitions the task.
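The idempotency guarantee can be illustrated with a version-checked write. The sketch below is an assumption-laden model, not Aether's store: `InMemoryStore`, `TaskRecord`, and `on_task_timeout` are hypothetical, the set of terminal phases is assumed, and the broker cancel step is omitted for brevity.

```python
from dataclasses import dataclass

# Assumed set of terminal phases for this illustration.
TERMINAL = {"Succeeded", "Failed", "Error", "Timeout"}


@dataclass
class TaskRecord:
    phase: str
    version: int = 0


class InMemoryStore:
    """Hypothetical store with compare-and-swap on a version counter."""

    def __init__(self):
        self._records = {}

    def put(self, run_id, record):
        self._records[run_id] = record

    def get(self, run_id):
        rec = self._records[run_id]
        return TaskRecord(rec.phase, rec.version)  # snapshot copy

    def update_phase(self, run_id, phase, expected_version):
        """Write `phase` only if nobody else updated the record first."""
        rec = self._records[run_id]
        if rec.version != expected_version:
            return False  # lost the race to another writer
        rec.phase = phase
        rec.version += 1
        return True


def on_task_timeout(store, run_id):
    """Return True if this call transitioned the task to Timeout."""
    snapshot = store.get(run_id)
    if snapshot.phase in TERMINAL:
        return False  # already finished; nothing to do
    return store.update_phase(run_id, "Timeout", snapshot.version)
```

Calling `on_task_timeout` twice for the same run transitions the task only once. The second call sees a terminal phase and returns without writing.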

Workflow-Level Timeout

spec.timeout sets a deadline on the entire workflow run. If the workflow does not reach a terminal state within the specified duration, the engine cancels all non-terminal tasks and marks the workflow PhaseTimeout.

{
  "spec": {
    "entrypoint": "main",
    "timeout": "30m",
    "templates": []
  }
}

Workflow Timeout vs. Task Timeout

Task timeout

Bounds a single task run. The rest of the workflow continues (subject to continueOn). Useful for tasks that poll external services or wait for human approval.

Workflow timeout

Bounds the entire workflow execution. When it expires, all non-terminal tasks are cancelled and the workflow-level onTimeout hook fires. Use it for end-to-end SLA enforcement.

Both timeouts can coexist. A task timeout fires first if the individual task exceeds its limit; the workflow timeout fires if the aggregate execution time exceeds its limit regardless of individual task status.
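Which deadline expires first depends only on the two start times and durations. The following sketch makes that comparison explicit. `first_deadline` is a hypothetical helper, and resolving an exact tie in favor of the task timeout is an arbitrary choice made here, not documented engine behavior.

```python
from datetime import datetime, timedelta


def first_deadline(workflow_start, workflow_timeout, task_start, task_timeout):
    """Return (which, when) for the deadline that expires first."""
    workflow_deadline = workflow_start + workflow_timeout
    task_deadline = task_start + task_timeout
    # Ties go to the task timeout; this is an assumption of the sketch.
    if task_deadline <= workflow_deadline:
        return "task", task_deadline
    return "workflow", workflow_deadline


start = datetime(2024, 1, 1, 12, 0, 0)
which, when = first_deadline(
    workflow_start=start,
    workflow_timeout=timedelta(minutes=30),
    task_start=start + timedelta(minutes=29, seconds=55),
    task_timeout=timedelta(seconds=10),
)
```

In this usage the task starts five seconds before the 30-minute workflow deadline, so `which` is `"workflow"` even though the task's own 10-second limit has not been reached.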

Retry and Timeout Together

Retry and timeout interact in a predictable way. If a task times out and has a retry policy that covers Timeout, the engine retries the task. Each retry attempt gets a fresh dispatch but reuses the original deadline — the deadline is set once at first dispatch and is not reset per retry attempt.

{
  "name": "flaky",
  "template": "flaky-task",
  "timeout": "10s",
  "retry": {
    "limit": 3
  }
}

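The default retry policy (no expression) retries on PhaseError and PhaseTimeout. With timeout: "10s" and retry.limit: 3, the task gets up to four attempts (the initial run plus three retries) before the retry budget is exhausted. All attempts share the deadline set at first dispatch, so a retry that follows an early Error has only the time remaining on that deadline.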
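Because the deadline is set once at first dispatch, the time available to a retry is whatever remains of the original window. A minimal sketch of that arithmetic follows. `remaining_budget` is a hypothetical helper, not an Aether API.

```python
from datetime import datetime, timedelta


def remaining_budget(first_dispatch, timeout, now):
    """Time left before the shared deadline; never negative."""
    deadline = first_dispatch + timeout
    return max(deadline - now, timedelta(0))


first_dispatch = datetime(2024, 1, 1, 9, 0, 0)
# The first attempt ends in Error four seconds after dispatch.
retry_start = first_dispatch + timedelta(seconds=4)
left = remaining_budget(first_dispatch, timedelta(seconds=10), retry_start)
print(left.total_seconds())  # 6.0
```

With timeout "10s", a retry that starts four seconds in has six seconds left rather than a fresh ten.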

Retry applies only to leaf tasks. If you need retry semantics for an entire DAG or Loop sub-tree, wrap the sub-tree in a task template and apply retry to the call site of that wrapper.
