Distributed workflows fail. Network partitions, transient service errors, resource exhaustion, and external API timeouts are facts of life in any production system. Aether provides two orthogonal mechanisms to handle these realities: retry policies that re-dispatch failed tasks automatically, and timeout policies that bound how long the engine will wait for a result. Both are declared directly on the workflow resource, with no host-side polling and no wrapper scripts.
Retry Policy
Retry is configured on a DAG task node (call site) via the retry field. It applies only to leaf tasks (template type task); DAG and Loop container nodes do not support retry directly because their children already have their own retry policies.
Basic Configuration
| Field | Type | Description |
|---|---|---|
| limit | int | Maximum number of retries. 0 disables retry. |
| expression | string | Boolean expression controlling whether to retry. If omitted, default rules apply. |
How Retry Works
1. Task completes with a non-success phase. The executor returns an ExecCode. The engine maps it to a Phase (e.g. PhaseError, PhaseFailed, PhaseTimeout).
2. Engine evaluates the retry policy. If retry.limit > 0 and the retry budget is not exhausted, the engine checks whether to retry. Default behavior: retry on PhaseError and PhaseTimeout only. PhaseFailed is not retried by default because it signals a known, deterministic failure.
3. Task is reset to PhaseCreated. If retry is warranted, the engine resets the task's status to PhaseCreated, increments RetryCount, and re-dispatches it via the broker. The terminal state is never written for an in-progress retry.

The number of retries is recorded in Metrics.Retries on the completed task run.
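The decision described above can be modelled in a few lines. This is a minimal sketch, not Aether's implementation: RetryPolicy, should_retry, and the plain-string phase names are illustrative assumptions, and the expression field is stood in for by a Python predicate over the phase.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Phase names follow the text; plain strings are an assumption of this sketch.
PHASE_SUCCEEDED = "Succeeded"
PHASE_ERROR = "Error"
PHASE_FAILED = "Failed"
PHASE_TIMEOUT = "Timeout"

# Phases retried when no expression is configured.
DEFAULT_RETRY_PHASES = {PHASE_ERROR, PHASE_TIMEOUT}


@dataclass
class RetryPolicy:
    limit: int = 0
    # Hypothetical stand-in for the expression field: a predicate over the phase.
    expression: Optional[Callable[[str], bool]] = None


def should_retry(policy: RetryPolicy, phase: str, retry_count: int) -> bool:
    """Return True if a task that ended in `phase` should be re-dispatched."""
    if phase == PHASE_SUCCEEDED:
        return False
    # limit <= 0 disables retry; retry_count >= limit means the budget is spent.
    if policy.limit <= 0 or retry_count >= policy.limit:
        return False
    # An expression, when present, replaces the default phase filter entirely.
    if policy.expression is not None:
        return policy.expression(phase)
    return phase in DEFAULT_RETRY_PHASES
```

With this model, a policy of limit 3 and no expression retries an Error result but not a Failed one, while a predicate such as `lambda p: p == "Failed"` reverses that choice.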
Default vs. Expression-Controlled Retry
Default behavior (no expression): the engine retries on PhaseError and PhaseTimeout. PhaseFailed is excluded because an executor explicitly returns Failed to signal a known, deterministic outcome, so retrying would likely produce the same result.
With an expression: the expression fully overrides the default phase filter, so you have complete control over which outcomes are retried. Any phase other than Succeeded can be retried, including Failed. The expression context exposes:
| Variable | Description |
|---|---|
| tasks.<name>.phase | Phase string: "Failed", "Error", "Timeout", etc. |
| tasks.<name>.code | Integer exit code from the executor |
| tasks.<name>.msg | Error message string |
| tasks.<name>.outputs.parameters.<param> | Output parameter value |
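To make the shape of the context in the table concrete, the sketch below builds it as a nested dictionary and resolves dotted paths against it. Everything here is illustrative: the task name fetch, the helper functions, and the sample values are assumptions, and the predicate is written in Python rather than in Aether's expression syntax, which is not shown in this section.

```python
def build_context(name, phase, code, msg, output_params):
    """Build a nested dict shaped like the variables in the table above."""
    return {
        "tasks": {
            name: {
                "phase": phase,
                "code": code,
                "msg": msg,
                "outputs": {"parameters": dict(output_params)},
            }
        }
    }


def lookup(context, path):
    """Resolve a dotted path such as 'tasks.fetch.phase' against the context."""
    node = context
    for key in path.split("."):
        node = node[key]
    return node


# Hypothetical task named "fetch" that failed with a transient upstream error.
ctx = build_context("fetch", "Failed", 2, "upstream returned 503", {"attempted": "true"})


def retry_on_503(context):
    """Example predicate: retry a Failed task only if the message mentions 503."""
    return (
        lookup(context, "tasks.fetch.phase") == "Failed"
        and "503" in lookup(context, "tasks.fetch.msg")
    )
```

A predicate like this illustrates why an expression can be useful: it can retry a Failed task for one specific cause while leaving other Failed outcomes terminal.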
Retry Exhausted
When RetryCount >= retry.limit, no further retry occurs. The task transitions to its terminal phase normally, for example PhaseFailed, causing the DAG to fail unless continueOn.failed is set.
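The budget check can be illustrated with a small simulation. This is a sketch under stated assumptions: run_with_retry and the attempt callback are hypothetical, phases are plain strings, and only the default filter (Error and Timeout) is modelled.

```python
RETRYABLE_BY_DEFAULT = ("Error", "Timeout")


def run_with_retry(attempt_fn, limit):
    """Run attempt_fn until it stops being retryable or the budget is spent.

    attempt_fn receives the current retry count and returns a phase string.
    Returns (final_phase, retry_count).
    """
    retry_count = 0
    while True:
        phase = attempt_fn(retry_count)
        # Stop on success, on a non-retryable phase, or once
        # retry_count >= limit (the exhaustion condition from the text).
        if phase not in RETRYABLE_BY_DEFAULT or retry_count >= limit:
            return phase, retry_count
        retry_count += 1


calls = []


def always_error(retry_count):
    calls.append(retry_count)
    return "Error"


final_phase, retries = run_with_retry(always_error, limit=3)
```

In this model a limit of 3 allows the initial attempt plus three retries, so always_error is invoked four times before the task settles in its terminal phase.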
Per-Task Timeout
The timeout field on a DAG task node (or on a standalone task template) sets a deadline on the individual task run. The value is a duration string:
| Unit | Example | Duration |
|---|---|---|
| ms | "500ms" | 500 milliseconds |
| s | "30s" | 30 seconds |
| m | "5m" | 5 minutes |
| h | "2h" | 2 hours |
| d | "1d" | 24 hours |
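The units in the table map directly onto a small parser. The sketch below is an illustration only: parse_duration is a hypothetical helper, and it assumes a single non-negative integer followed by one unit. Whether Aether accepts compound or fractional values such as "1h30m" or "1.5s" is not stated here, so the sketch rejects them.

```python
import re
from datetime import timedelta

# Map each unit from the table to the matching timedelta keyword.
_UNIT_TO_KWARG = {
    "ms": "milliseconds",
    "s": "seconds",
    "m": "minutes",
    "h": "hours",
    "d": "days",
}

# One integer followed by exactly one unit, e.g. "500ms" or "2h".
_DURATION_RE = re.compile(r"^(\d+)(ms|s|m|h|d)$")


def parse_duration(text: str) -> timedelta:
    """Parse a duration string such as '30s' into a timedelta."""
    match = _DURATION_RE.match(text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value = int(match.group(1))
    unit = match.group(2)
    return timedelta(**{_UNIT_TO_KWARG[unit]: value})
```

Returning a timedelta rather than a float keeps the millisecond case exact and lets callers add the result to a dispatch timestamp to obtain a deadline.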
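For example, consider a task with a timeout of "1s" that waits for an external Resume() call and is followed by a finalize task. If no Resume() arrives within one second, the timeout watchdog fires. Because continueOn.timeout is set, the downstream finalize task still runs.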
What Happens on Task Timeout
1. Watchdog detects deadline exceeded. The timeout.Watcher fires Engine.OnTaskTimeout() for the expired task run ID.
2. Broker cancel is attempted. The engine calls broker.Cancel() as a best-effort fast-path to stop any in-progress executor work.
3. Task transitions to PhaseTimeout. OnTaskCompleted is invoked with ExecCodeTimeout. The engine writes PhaseTimeout to the task run.
4. Retry check runs. If the task has a retry policy that covers Timeout, a retry is triggered. Otherwise the timeout is terminal.

If the task has already completed when the watchdog fires, OnTaskTimeout sees a terminal status and returns immediately without taking any action. In a multi-engine deployment, optimistic locking in the store ensures only one writer transitions the task.
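The idempotence and single-writer guarantees described above can be sketched with a version-checked store. This is a toy model, not Aether's store or engine API: TaskRun, InMemoryStore, compare_and_set, the "Running" phase name, and the set of terminal phases are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace

# Assumed set of terminal phases for this sketch.
TERMINAL_PHASES = {"Succeeded", "Failed", "Error", "Timeout"}


@dataclass(frozen=True)
class TaskRun:
    phase: str
    version: int = 0


class InMemoryStore:
    """Toy store with optimistic locking via a per-record version number."""

    def __init__(self):
        self._runs = {}

    def put(self, run_id, run):
        self._runs[run_id] = run

    def get(self, run_id):
        return self._runs[run_id]

    def compare_and_set(self, run_id, expected_version, new_phase):
        current = self._runs[run_id]
        if current.version != expected_version:
            return False  # another writer got there first
        self._runs[run_id] = replace(
            current, phase=new_phase, version=current.version + 1
        )
        return True


def on_task_timeout(store, run_id, cancel):
    """Handle an expired deadline; return True only if this call wrote Timeout."""
    run = store.get(run_id)
    if run.phase in TERMINAL_PHASES:
        return False  # already terminal: nothing to do
    try:
        cancel(run_id)  # best-effort; a failed cancel must not block the write
    except Exception:
        pass
    return store.compare_and_set(run_id, run.version, "Timeout")
```

Calling on_task_timeout twice for the same run writes Timeout once and does nothing the second time, and a writer holding a stale version loses the compare_and_set.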
Workflow-Level Timeout
spec.timeout sets a deadline on the entire workflow run. If the workflow does not reach a terminal state within the specified duration, the engine cancels all non-terminal tasks and marks the workflow PhaseTimeout.
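As a rough model of that behaviour, the sketch below walks a map of task phases when the workflow deadline expires. The function name, the "Running" and "Created" inputs, and in particular the "Cancelled" phase written to interrupted tasks are assumptions. The text above only states that non-terminal tasks are cancelled and the workflow is marked PhaseTimeout.

```python
# Assumed set of terminal task phases for this sketch.
TERMINAL_PHASES = {"Succeeded", "Failed", "Error", "Timeout"}


def on_workflow_timeout(task_phases):
    """Return (workflow_phase, updated_task_phases, cancelled_task_names)."""
    updated = {}
    cancelled = []
    for name, phase in task_phases.items():
        if phase in TERMINAL_PHASES:
            updated[name] = phase  # finished tasks keep their result
        else:
            updated[name] = "Cancelled"  # hypothetical phase name
            cancelled.append(name)
    return "Timeout", updated, cancelled
```

Tasks that already finished keep their recorded phase, so a workflow timeout does not rewrite earlier results.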
Workflow Timeout vs. Task Timeout
Task timeout
Bounds a single task run. The rest of the workflow continues (subject to continueOn). Useful for tasks that poll external services or wait for human approval.
Workflow timeout
Bounds the entire workflow execution. All running tasks are cancelled immediately. The workflow-level onTimeout hook fires. Use for end-to-end SLA enforcement.
Retry and Timeout Together
Retry and timeout interact in a predictable way. If a task times out and has a retry policy that covers Timeout, the engine retries the task. Each retry attempt gets a fresh dispatch but reuses the original deadline: the deadline is set once at first dispatch and is not reset per retry attempt.
The default policy (no expression) retries on PhaseError and PhaseTimeout. With timeout: "10s" and retry.limit: 3, the task gets up to three retries after the initial attempt, each bounded by its deadline, before the retry budget is exhausted.
Retry applies only to leaf tasks. If you need retry semantics for an entire DAG or Loop sub-tree, wrap the sub-tree in a task template and apply retry to the call site of that wrapper.