This specification defines the Site Reliability Engineering (SRE) governance layer for autonomous AI agents. Just as traditional SRE applies error budgets, SLOs, and incident management to software services, Agent SRE applies the same principles to AI agent operations, treating agent reliability as a measurable, enforceable, and continuously improvable property. All SDK implementations MUST conform to this specification. The key words "MUST", "MUST NOT", and "SHOULD" in this document are to be interpreted as described in RFC 2119.

Status: Draft

Date: 2025-07-28

Conformance Tests: 111 tests

Design Principles

  1. Measure everything. Every agent operation SHOULD produce SLI measurements that feed SLO evaluation.
  2. Fail closed. All enforcement and detection components MUST deny or isolate on internal error, never silently permit.
  3. Budget-driven decisions. Deployment, throttling, and circuit breaking decisions MUST be driven by error budget state.
  4. Deterministic replay. Traces MUST capture sufficient detail to allow deterministic re-execution and regression detection.
  5. Append-only audit. Trace hashes, alert histories, and incident records MUST be tamper-evident.

SLO Definition and Measurement

SLO Model

An SLO MUST combine one or more SLIs with an error budget to define agent reliability.

| Field | Type | Required | Default | Constraints |
| --- | --- | --- | --- | --- |
| name | string | Yes | | Non-empty, unique within a deployment |
| indicators | list[SLI] | Yes | | At least one SLI |
| error_budget | ErrorBudget | No | auto-derived | Derived from strictest indicator if omitted |
| description | string | No | "" | Free-form text |
| labels | dict[str, str] | No | {} | Arbitrary key-value metadata |
| alert_manager | AlertManager | No | null | If set, alerts fire on status transitions |
| agent_id | string | No | "" | DID or identifier of the owning agent |

SLO Health States

Implementations MUST define the following SLO health states in this severity order:

| Status | Value | Meaning |
| --- | --- | --- |
| HEALTHY | 0 | Within budget, no alerts firing |
| UNKNOWN | 1 | Insufficient data to evaluate |
| WARNING | 2 | Burn rate elevated, warning-level alert firing |
| CRITICAL | 3 | Burn rate critical, budget at risk |
| EXHAUSTED | 4 | Error budget fully consumed |

Alert transitions MUST fire only when severity increases (worsens) or when the SLO recovers to HEALTHY.

SLO Evaluation Precedence

The evaluate() method MUST determine status using:
  1. If error_budget.is_exhausted is true → EXHAUSTED
  2. If any firing alert has severity "critical" → CRITICAL
  3. If any firing alert has severity "warning" → WARNING
  4. If no indicator has a non-None current_value() → UNKNOWN
  5. Otherwise → HEALTHY
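The precedence rules above can be sketched as a single function in which the first matching rule wins. This is a minimal illustration, not the SDK's evaluate() implementation: the function name evaluate_status and its three parameters are hypothetical stand-ins for the SLO's budget, alert, and indicator state.

```python
from enum import IntEnum
from typing import Optional, Sequence


class SLOStatus(IntEnum):
    """Health states in the severity order given by the specification."""
    HEALTHY = 0
    UNKNOWN = 1
    WARNING = 2
    CRITICAL = 3
    EXHAUSTED = 4


def evaluate_status(
    budget_exhausted: bool,
    firing_severities: Sequence[str],
    indicator_values: Sequence[Optional[float]],
) -> SLOStatus:
    """Apply the five precedence rules in order (hypothetical helper)."""
    if budget_exhausted:
        return SLOStatus.EXHAUSTED
    if "critical" in firing_severities:
        return SLOStatus.CRITICAL
    if "warning" in firing_severities:
        return SLOStatus.WARNING
    # No indicator has produced a value yet: not enough data to judge.
    if all(value is None for value in indicator_values):
        return SLOStatus.UNKNOWN
    return SLOStatus.HEALTHY
```

Note that an exhausted budget outranks every alert, and a firing alert outranks missing data, so an SLO with no measurements but a firing warning alert still reports WARNING.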

Exhaustion Actions

When the error budget is fully consumed, the SLO engine MUST support:

| Action | Meaning |
| --- | --- |
| ALERT | Send an alert only |
| FREEZE_DEPLOYMENTS | Halt new agent deployments |
| CIRCUIT_BREAK | Open the agent’s circuit breaker |
| THROTTLE | Reduce the agent’s request rate |

Auto-Budget Derivation

When no explicit ErrorBudget is provided, the SLO MUST derive the total budget as:
error_budget.total = 1.0 - min(sli.target for sli in indicators)
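As a worked example of the derivation formula, the sketch below computes the total budget for two ratio-style SLIs. The helper name derive_total_budget is hypothetical, and the sketch assumes every target is a fraction in [0, 1]; how targets expressed in other units (milliseconds, USD) enter the derivation is not covered here.

```python
def derive_total_budget(targets):
    """Return 1.0 - min(targets), the auto-derived total error budget."""
    if not targets:
        raise ValueError("an SLO needs at least one SLI")
    return 1.0 - min(targets)


# Targets taken from the TaskSuccessRate and ToolCallAccuracy defaults.
total = derive_total_budget([0.995, 0.999])
print(f"{total:.3f}")  # prints 0.005
```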

Service Level Indicators

Built-in SLI Types

Implementations MUST provide these eight SLI types with their default targets:

| SLI Type | Metric Name | Default Target | Default Window | Semantics |
| --- | --- | --- | --- | --- |
| TaskSuccessRate | task_success_rate | 0.995 (99.5%) | 30d | Fraction of tasks completed successfully |
| ToolCallAccuracy | tool_call_accuracy | 0.999 (99.9%) | 7d | Fraction of tool calls selecting the correct tool |
| ResponseLatency | response_latency_p{N} | 5000 ms | 1h | Response latency at configurable percentile (default p95) |
| CostPerTask | cost_per_task | $0.50 | 24h | Average cost per task in USD |
| PolicyCompliance | policy_compliance | 1.0 (100%) | 24h | Adherence to Agent OS governance policies |
| DelegationChainDepth | scope_chain_depth | 3 (max depth) | 24h | Maximum delegation chain depth; lower is better |
| HallucinationRate | hallucination_rate | 0.05 (5%) | 24h | Hallucination rate via LLM-as-judge; lower is better |
| CalibrationDeltaSLI | calibration_delta | 0.05 | 30d | Gap between predicted confidence and actual success rate; lower is better |

Inverted SLIs

For DelegationChainDepth, HallucinationRate, and CalibrationDeltaSLI, the compliance() method MUST return the fraction of measurements where value <= target (not value >= target).
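The difference between regular and inverted compliance can be illustrated with a small helper. The function name, the inverted flag, and the choice to return None when there are no measurements are all assumptions made for this sketch.

```python
from typing import Optional, Sequence


def compliance(
    values: Sequence[float], target: float, inverted: bool = False
) -> Optional[float]:
    """Fraction of measurements that meet the target.

    Regular SLIs count value >= target; inverted ("lower is better")
    SLIs count value <= target. Returns None with no data (assumption).
    """
    if not values:
        return None
    if inverted:
        good = sum(1 for v in values if v <= target)
    else:
        good = sum(1 for v in values if v >= target)
    return good / len(values)


# Four hallucination-rate samples against the 0.05 default target:
# 0.01, 0.04 and 0.05 comply; 0.09 does not.
print(compliance([0.01, 0.04, 0.09, 0.05], 0.05, inverted=True))  # prints 0.75
```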

TimeWindow Enum

Implementations MUST support these standard time windows:

| Value | Label | Seconds |
| --- | --- | --- |
| HOUR_1 | "1h" | 3,600 |
| HOUR_6 | "6h" | 21,600 |
| DAY_1 | "24h" | 86,400 |
| DAY_7 | "7d" | 604,800 |
| DAY_30 | "30d" | 2,592,000 |

ResponseLatency Percentile

The ResponseLatency SLI MUST compute current_value() as the configured percentile of recorded latencies (not the mean). The percentile index MUST be:
idx = min(int(len(sorted_values) * percentile), len(sorted_values) - 1)
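The index formula can be exercised directly. In this sketch the function name percentile_value is hypothetical and the percentile is assumed to be expressed as a fraction (0.95 for p95); returning None for an empty sample set is also an assumption.

```python
from typing import Optional, Sequence


def percentile_value(
    latencies: Sequence[float], percentile: float = 0.95
) -> Optional[float]:
    """Return the latency at the configured percentile, not the mean."""
    if not latencies:
        return None
    sorted_values = sorted(latencies)
    # The min() clamp keeps percentile == 1.0 from indexing past the end.
    idx = min(int(len(sorted_values) * percentile), len(sorted_values) - 1)
    return sorted_values[idx]


samples_ms = [120, 80, 300, 150, 90, 200, 110, 95, 400, 130]
print(percentile_value(samples_ms, 0.5))   # prints 130
print(percentile_value(samples_ms, 0.95))  # prints 400
```

With ten samples and p95, the raw index is int(9.5) = 9, which already equals the last valid index; the clamp only takes effect when the percentile is 1.0.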

Error Budget Tracking

Error Budget Formula

remaining = total - consumed
is_exhausted = consumed >= total
When error_budget.total == 0, remaining MUST return 0.0 and is_exhausted MUST return True, since consumed >= total holds for any consumed >= 0. Implementations MUST NOT divide by zero.
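A minimal sketch of the budget arithmetic, including the zero-total case, follows. Only the names total, consumed, remaining, and is_exhausted come from the text; modelling ErrorBudget as a dataclass is an assumption for illustration.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Illustrative budget holder; the real class may carry more fields."""
    total: float
    consumed: float = 0.0

    @property
    def remaining(self) -> float:
        # A zero-total budget has nothing left by definition.
        if self.total == 0:
            return 0.0
        return self.total - self.consumed

    @property
    def is_exhausted(self) -> bool:
        # With total == 0 this is true for any non-negative consumption.
        return self.consumed >= self.total


budget = ErrorBudget(total=0.005, consumed=0.002)
print(budget.is_exhausted)             # prints False
print(ErrorBudget(0.0).is_exhausted)   # prints True
```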

Burn Rate Computation

burn_rate = consumed_in_window / expected_consumption_in_window
A burn rate of 1.0 means consuming at exactly the allowed pace. Burn rate > 1.0 means budget is depleting faster than planned. Edge cases:
  • If expected_consumption == 0.0, burn rate MUST return 0.0.
  • If total == 0.0, burn rate MUST return 0.0 when consumed == 0.0, or positive infinity otherwise.
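The burn rate formula and its edge cases might be combined as shown below. The function signature is hypothetical, and the order of the two guards is an assumption: the text does not say which rule wins when both total and expected consumption are zero, so this sketch checks the zero-total rule first.

```python
import math


def burn_rate(
    consumed_in_window: float,
    expected_consumption_in_window: float,
    total: float,
) -> float:
    """Ratio of actual to expected budget consumption, without dividing by zero."""
    if total == 0.0:
        # Zero budget: idle is fine, any consumption is unbounded burn.
        return 0.0 if consumed_in_window == 0.0 else math.inf
    if expected_consumption_in_window == 0.0:
        return 0.0
    return consumed_in_window / expected_consumption_in_window


print(burn_rate(4.0, 2.0, total=10.0))  # prints 2.0
print(burn_rate(1.0, 0.0, total=0.0))   # prints inf
```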

BurnRateAlert

Implementations MUST support burn rate alert thresholds:
alert fires when: current_burn_rate >= threshold.rate
alert resolves when: current_burn_rate < threshold.rate
Alerts are classified as warning (burn rate ≥ 2.0) or critical (burn rate ≥ 10.0) by default.
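The default thresholds translate into a simple classifier. The function name and the use of None for "no alert firing" are assumptions; only the 2.0 and 10.0 defaults and the >= comparison come from the text.

```python
from typing import Optional


def classify_burn_rate(
    rate: float, warning: float = 2.0, critical: float = 10.0
) -> Optional[str]:
    """Return the alert severity for a burn rate, or None if nothing fires."""
    # Check the critical threshold first so it takes priority over warning.
    if rate >= critical:
        return "critical"
    if rate >= warning:
        return "warning"
    return None


print(classify_burn_rate(1.5))   # prints None
print(classify_burn_rate(2.0))   # prints warning
print(classify_burn_rate(12.0))  # prints critical
```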

Circuit Breaker States and Transitions

CircuitState Enum

Implementations MUST define exactly three states:

| State | Meaning |
| --- | --- |
| CLOSED | Normal operation. Failures are counted. |
| OPEN | Failures exceeded threshold. All calls rejected. |
| HALF_OPEN | Testing recovery. One probe call allowed through. |

State Transition Rules

CLOSED → OPEN:      failure_count >= failure_threshold
OPEN → HALF_OPEN:   reset_timeout has elapsed
HALF_OPEN → CLOSED: probe call succeeds
HALF_OPEN → OPEN:   probe call fails
In Public Preview, HALF_OPEN is not auto-entered. Manual recovery via force_close() or reset() is required.

Circuit Breaker Configuration

| Field | Default | Description |
| --- | --- | --- |
| failure_threshold | 5 | Consecutive failures before opening |
| reset_timeout | 30s | Time before attempting recovery (HALF_OPEN) |
| half_open_max_calls | 1 | Probe calls allowed in HALF_OPEN state |

Special Transitions

  • force_open() called when already OPEN MUST be a no-op.
  • force_close() called when already CLOSED MUST be a no-op.
  • The circuit breaker feeds the Hypervisor kill switch: when the breaker opens, it SHOULD trigger the kill switch for the associated agent.
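The transition rules, defaults, and no-op requirements above can be sketched as a small state machine. This is an illustration under several assumptions: the method names allow_call, record_success, and record_failure and the injectable clock are hypothetical; the sketch implements the timed OPEN → HALF_OPEN step that the Public Preview does not perform automatically; and it omits half_open_max_calls accounting and the kill switch hook.

```python
import time
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds
        self._clock = clock
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self._opened_at = None

    def _open(self):
        self.state = CircuitState.OPEN
        self._opened_at = self._clock()

    def allow_call(self) -> bool:
        # OPEN -> HALF_OPEN once reset_timeout has elapsed.
        if (self.state is CircuitState.OPEN
                and self._clock() - self._opened_at >= self.reset_timeout):
            self.state = CircuitState.HALF_OPEN
        return self.state is not CircuitState.OPEN

    def record_success(self):
        self.failure_count = 0  # failures must be consecutive
        if self.state is CircuitState.HALF_OPEN:
            self.state = CircuitState.CLOSED  # probe succeeded

    def record_failure(self):
        if self.state is CircuitState.HALF_OPEN:
            self._open()  # probe failed
            return
        self.failure_count += 1
        if (self.state is CircuitState.CLOSED
                and self.failure_count >= self.failure_threshold):
            self._open()

    def force_open(self):
        if self.state is not CircuitState.OPEN:  # no-op when already OPEN
            self._open()

    def force_close(self):
        if self.state is not CircuitState.CLOSED:  # no-op when already CLOSED
            self.state = CircuitState.CLOSED
            self.failure_count = 0
```

Passing the clock as a parameter keeps the timeout logic testable without sleeping; a test can advance a fake clock past reset_timeout and observe the HALF_OPEN probe.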

Chaos Injection Types

Implementations MUST support all twelve fault types:

| FaultType | Description |
| --- | --- |
| LATENCY_INJECTION | Introduce configurable delay before action execution |
| ERROR_INJECTION | Inject a synthetic error/exception |
| TIMEOUT_INJECTION | Force a timeout during execution |
| RESOURCE_EXHAUSTION | Simulate memory or CPU limit reached |
| NETWORK_PARTITION | Block network access for the target |
| DATA_CORRUPTION | Return corrupted/malformed responses |
| CLOCK_SKEW | Advance or retard the agent’s perceived time |
| DEPENDENCY_FAILURE | Fail a specific downstream dependency |
| PARTIAL_FAILURE | Succeed for a fraction of calls |
| CASCADE_FAILURE | Trigger failures in dependent agents |
| ADVERSARIAL_INPUT | Inject adversarial prompts or inputs |
| RATE_LIMIT_EXCEEDED | Simulate rate limit responses from dependencies |

Blast Radius

The blast_radius field MUST be clamped to [0.0, 1.0]:
  • blast_radius = -0.5 MUST be clamped to 0.0.
  • blast_radius = 1.5 MUST be clamped to 1.0.
  • blast_radius = 0.0 means no traffic is affected.

Abort Conditions

Abort conditions halt experiments when triggered:
abort when: actual_metric_value {comparator} threshold
When check_abort(metrics_snapshot) detects a triggered condition, the experiment state MUST transition to ABORTED and abort_reason MUST be set.
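The abort check could look roughly like the sketch below. Only check_abort, metrics_snapshot, abort_reason, and the ABORTED state come from the text; the AbortCondition and Experiment classes, the RUNNING state name, the comparator set, and the decision to skip metrics missing from the snapshot are assumptions.

```python
import operator
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Comparator symbols accepted by this sketch (assumed set).
COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


@dataclass
class AbortCondition:
    metric: str
    comparator: str
    threshold: float


@dataclass
class Experiment:
    abort_conditions: List[AbortCondition] = field(default_factory=list)
    state: str = "RUNNING"
    abort_reason: Optional[str] = None

    def check_abort(self, metrics_snapshot: Dict[str, float]) -> bool:
        """Abort on the first condition whose comparison holds."""
        for cond in self.abort_conditions:
            value = metrics_snapshot.get(cond.metric)
            if value is None:
                continue  # metric absent from this snapshot (assumption: skip)
            if COMPARATORS[cond.comparator](value, cond.threshold):
                self.state = "ABORTED"
                self.abort_reason = (
                    f"{cond.metric} {cond.comparator} {cond.threshold} "
                    f"(actual {value})"
                )
                return True
        return False


exp = Experiment([AbortCondition("error_rate", ">", 0.2)])
print(exp.check_abort({"error_rate": 0.1}))  # prints False
print(exp.check_abort({"error_rate": 0.5}))  # prints True
```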

Progressive Delivery (Canary Releases)

The SRE specification includes progressive delivery semantics tied to error budget state:
  1. When error_budget.is_exhausted, the FREEZE_DEPLOYMENTS exhaustion action MUST halt new agent deployments.
  2. Canary rollouts MUST observe SLO status before advancing traffic percentage.
  3. If SLO status worsens during a canary rollout, the rollout MUST be paused or rolled back.
Traffic routing decisions MUST be driven by error budget state:

| Budget State | Recommended Action |
| --- | --- |
| > 50% remaining | Full velocity deployments permitted |
| 25–50% remaining | Require SLO review before deployment |
| < 25% remaining | Freeze deployments |
| Exhausted | Mandatory freeze; circuit break |
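The budget bands map naturally onto a lookup function. In this sketch the function name is hypothetical and remaining_fraction is assumed to be remaining divided by total, so that 0.25 and 0.50 fall inside the middle band as the ranges in the table imply.

```python
def deployment_action(remaining_fraction: float) -> str:
    """Map the fraction of error budget remaining to the recommended action."""
    if remaining_fraction <= 0.0:
        return "Mandatory freeze; circuit break"
    if remaining_fraction < 0.25:
        return "Freeze deployments"
    if remaining_fraction <= 0.50:
        return "Require SLO review before deployment"
    return "Full velocity deployments permitted"


print(deployment_action(0.8))  # prints Full velocity deployments permitted
print(deployment_action(0.1))  # prints Freeze deployments
```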

Replay Debugging Semantics

Trace Model

A Trace represents a complete agent execution captured for replay:

| Field | Type | Description |
| --- | --- | --- |
| trace_id | string | Unique trace identifier |
| agent_id | string | DID of the executing agent |
| task | string | Task description or prompt |
| spans | list[Span] | All execution spans |
| content_hash | string | SHA-256 of trace metadata (tamper detection) |
| finished_at | datetime or null | Completion timestamp |

Span Kinds

Implementations MUST define exactly six span kinds:

| SpanKind | Description |
| --- | --- |
| LLM_INFERENCE | LLM model call |
| TOOL_CALL | Tool invocation |
| POLICY_CHECK | Governance policy evaluation |
| DELEGATION | Agent-to-agent delegation |
| MEMORY_READ | Context or memory retrieval |
| MEMORY_WRITE | Context or memory storage |

Content Hash Invariant

Trace content_hash MUST be computed as SHA-256 of the trace metadata (excluding spans). Any modification to trace metadata MUST invalidate the hash, enabling tamper detection.
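One way to make the hash reproducible is to hash a canonical serialization of the metadata. The canonical form used below (JSON with sorted keys and compact separators) is an assumption; the text specifies SHA-256 and the exclusion of spans but not the byte encoding. Both function names are hypothetical.

```python
import hashlib
import hmac
import json


def compute_content_hash(metadata: dict) -> str:
    """SHA-256 hex digest over trace metadata, with spans excluded."""
    without_spans = {k: v for k, v in metadata.items() if k != "spans"}
    # Sorted keys make the digest independent of dict insertion order.
    canonical = json.dumps(
        without_spans, sort_keys=True, separators=(",", ":"), default=str
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_content_hash(metadata: dict, expected_hash: str) -> bool:
    """Return True only if the metadata still matches the stored hash."""
    return hmac.compare_digest(compute_content_hash(metadata), expected_hash)


meta = {"trace_id": "t-1", "agent_id": "did:example:agent-1", "task": "summarize"}
stored = compute_content_hash(meta)
meta["task"] = "exfiltrate"
print(verify_content_hash(meta, stored))  # prints False
```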

PII Redaction

PII redaction MUST be applied before trace persistence. Traces MUST NOT be written to disk or returned by the API with unredacted PII.

TraceStore Path Traversal

TraceStore MUST reject path traversal attempts. Any requested trace path containing .. MUST be rejected.
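A minimal sketch of the check follows. The function name and the base_dir parameter are hypothetical. The first guard is the rule stated above; the second guard, which confirms that the resolved path stays under the store's base directory, is an extra defensive step assumed here rather than required by the text.

```python
from pathlib import Path


def resolve_trace_path(base_dir: str, requested: str) -> Path:
    """Return the resolved trace path, or raise ValueError on traversal."""
    # Rule from the specification: any path containing ".." is rejected.
    if ".." in requested:
        raise ValueError(f"path traversal rejected: {requested!r}")
    base = Path(base_dir).resolve()
    candidate = (base / requested).resolve()
    # Extra guard (assumption): absolute paths and symlinks must not escape.
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"path escapes trace store: {requested!r}")
    return candidate
```

Raising instead of silently normalizing the path keeps the behavior consistent with the fail-closed design principle.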

Golden Traces

A golden trace is a captured execution marked as the expected-correct reference for regression testing. Golden trace comparisons MUST report deviations in: span count, span kinds, tool call sequence, cost, and latency.

Alerting

Supported Alert Channels

Implementations MUST support all six channel types:

| AlertChannel | Description |
| --- | --- |
| WEBHOOK | HTTP POST to a configured URL |
| EMAIL | Email dispatch |
| SLACK | Slack webhook |
| PAGERDUTY | PagerDuty Events API |
| TEAMS | Microsoft Teams webhook |
| NULL | Discards alerts (testing/dev) |

Alert Deduplication

Duplicate alerts with the same dedup_key within the dedup_window_seconds MUST be suppressed. A RESOLVED alert MUST clear the dedup cache entry even if the original alert was never deduplicated.
send(alert_1, dedup_key="agent-1:slo-1")  → dispatched
send(alert_2, dedup_key="agent-1:slo-1") at T+60s  → suppressed (within 300s window)
send(resolved, dedup_key="agent-1:slo-1")  → dispatched, clears cache
send(alert_3, dedup_key="agent-1:slo-1")  → dispatched (cache was cleared)
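The sequence above can be reproduced with a small in-memory cache. The class name, the should_dispatch method, and the injectable clock are hypothetical; the 300-second default mirrors the window used in the example.

```python
import time


class AlertDeduplicator:
    """Tracks the last dispatch time per dedup_key (illustrative only)."""

    def __init__(self, dedup_window_seconds=300.0, clock=time.monotonic):
        self.dedup_window_seconds = dedup_window_seconds
        self._clock = clock
        self._last_sent = {}

    def should_dispatch(self, dedup_key: str, resolved: bool = False) -> bool:
        now = self._clock()
        if resolved:
            # RESOLVED is always dispatched and always clears the entry,
            # even when no entry exists for this key.
            self._last_sent.pop(dedup_key, None)
            return True
        last = self._last_sent.get(dedup_key)
        if last is not None and now - last < self.dedup_window_seconds:
            return False  # duplicate inside the window: suppress
        self._last_sent[dedup_key] = now
        return True
```

In this sketch a suppressed duplicate does not refresh the cache timestamp, so the window is measured from the last dispatched alert; whether the window should slide on suppressed alerts is not stated in the text.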

Incident Detection

Signal Classification

The IncidentDetector creates incidents only for P1 and P2 signals:

| Signal Type | Default Severity |
| --- | --- |
| SLO_BREACH | P2 |
| COST_ANOMALY | P2 |
| POLICY_VIOLATION | P1 |
| TRUST_DEGRADATION | P2 |
| HALLUCINATION_SPIKE | P2 |
| KILL_SWITCH_ACTIVATED | P1 |

Incident Correlation

When multiple signals arrive from the same source within correlation_window_seconds, they MUST be correlated into a single incident:
  • Title: "Correlated: {signal_types} from {source}"
  • Severity: highest severity among correlated signals
  • Actions: union of response actions from all signal types
Signal deduplication and correlation windows MUST be enforced. Only the first P1/P2 signal for a given source within the dedup window creates an incident; subsequent duplicates MUST be suppressed.
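The correlation and suppression rules might be sketched as follows for signals from a single source. Several details are assumptions: the function name, the (signal_type, timestamp) tuple input, anchoring each window at the first signal of the incident, joining signal types with a comma in the title, and the title used for an uncorrelated incident. Response actions are omitted because their per-signal mapping is not defined in this section.

```python
# Default severities from the signal classification table.
DEFAULT_SEVERITY = {
    "SLO_BREACH": "P2",
    "COST_ANOMALY": "P2",
    "POLICY_VIOLATION": "P1",
    "TRUST_DEGRADATION": "P2",
    "HALLUCINATION_SPIKE": "P2",
    "KILL_SWITCH_ACTIVATED": "P1",
}


def correlate(source, signals, correlation_window_seconds):
    """Group one source's (signal_type, timestamp) pairs into incidents."""
    incidents = []
    window_start = None
    for signal_type, ts in sorted(signals, key=lambda s: s[1]):
        in_window = (
            window_start is not None
            and ts - window_start < correlation_window_seconds
        )
        if in_window:
            types = incidents[-1]["signal_types"]
            if signal_type not in types:  # repeats of a type are suppressed
                types.append(signal_type)
        else:
            incidents.append({"source": source, "signal_types": [signal_type]})
            window_start = ts
    for incident in incidents:
        types = incident["signal_types"]
        # "P1" sorts before "P2", so min() picks the highest severity.
        incident["severity"] = min(DEFAULT_SEVERITY[t] for t in types)
        if len(types) > 1:
            incident["title"] = f"Correlated: {', '.join(types)} from {source}"
        else:
            incident["title"] = f"{types[0]} from {source}"
    return incidents
```

For example, SLO_BREACH at t=0, POLICY_VIOLATION at t=30, and a repeated SLO_BREACH at t=45 with a 300-second window collapse into one P1 incident, while a COST_ANOMALY at t=1000 opens a second, separate P2 incident.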

OpenTelemetry Integration

All OTEL semantic conventions MUST use the agent.* namespace:

| Attribute | Description |
| --- | --- |
| agent.id | Agent DID |
| agent.ring | Execution ring (0–3) |
| agent.trust_score | Current trust score |
| agent.task | Task description |
| agent.tool_name | Tool name (for TOOL_CALL spans) |
| agent.policy_decision | Policy decision (for POLICY_CHECK spans) |
| agent.delegation_depth | Chain depth (for DELEGATION spans) |

All metric instruments MUST follow Prometheus naming conventions (snake_case, units in name, _total suffix for counters).

Artifact Signing

ArtifactSigner MUST use Ed25519 exclusively for signing agent build artifacts and SBOMs. A SignatureBundle MUST contain:

| Field | Description |
| --- | --- |
| signature | Base64-encoded Ed25519 signature |
| public_key | Base64-encoded public key |
| artifact_hash | SHA-256 of artifact file contents |
| timestamp | ISO 8601 UTC signing time |

sign_artifact(path) called on a non-existent path MUST raise a filesystem error. A bundle with empty or zero-length signature MUST NOT be returned.

Conformance Requirements

A conforming implementation MUST satisfy all of the following:
  1. SLO evaluation follows the specified precedence rules.
  2. SLOStatus enum contains all five states with correct ordering.
  3. ExhaustionAction enum contains all four values.
  4. All eight built-in SLI types are implemented with correct defaults.
  5. Inverted SLIs use value <= target for compliance.
  6. Error budget remaining computation matches the formula.
  7. Burn rate handles zero and infinite cases correctly.
  8. CircuitState enum contains CLOSED, OPEN, and HALF_OPEN.
  9. Circuit breaker transitions follow the specified rules.
  10. All twelve FaultType values are defined.
  11. Blast radius is clamped to [0.0, 1.0].
  12. Abort conditions halt experiments when triggered.
  13. All six AlertChannel types are supported.
  14. Alert deduplication respects the dedup window.
  15. RESOLVED alerts clear the dedup cache.
  16. All six SpanKind values are defined.
  17. Trace content hash uses SHA-256.
  18. TraceStore rejects path traversal.
  19. PII redaction is applied before trace persistence.
  20. ArtifactSigner uses Ed25519 exclusively.
  21. SignatureBundle supports round-trip serialization.
  22. IncidentDetector creates incidents only for P1/P2 signals.
  23. Signal deduplication and correlation windows are enforced.
  24. All OTEL semantic conventions use the agent.* namespace.
  25. All metric instruments follow Prometheus naming conventions.
  26. All components fail closed on internal error.
Conformance tests MUST cover: SLO evaluation across all five status values; error budget exhaustion and burn rate computation; circuit breaker state transitions; chaos experiment lifecycle; alert dispatch to all channel types; alert deduplication; trace capture, redaction, and storage; golden trace comparison; artifact signing round-trip; incident creation from P1/P2 signals; and signal correlation.
The reference conformance test suite contains 111 tests. Cross-language SDK compatibility is verified by sharing the same SLO configuration objects and asserting identical status transitions across all SDK implementations.
