This specification defines the Site Reliability Engineering (SRE) governance layer for autonomous AI agents. Just as traditional SRE applies error budgets, SLOs, and incident management to software services, Agent SRE applies the same principles to AI agent operations, treating agent reliability as a measurable, enforceable, and continuously improvable property. All SDK implementations MUST conform to this specification. The key words "MUST", "MUST NOT", and "SHOULD" in this document are interpreted as described in RFC 2119.
Conformance Tests 111 tests
Design Principles
Measure everything. Every agent operation SHOULD produce SLI measurements that feed SLO evaluation.
Fail closed. All enforcement and detection components MUST deny or isolate on internal error, never silently permit.
Budget-driven decisions. Deployment, throttling, and circuit breaking decisions MUST be driven by error budget state.
Deterministic replay. Traces MUST capture sufficient detail to allow deterministic re-execution and regression detection.
Append-only audit. Trace hashes, alert histories, and incident records MUST be tamper-evident.
SLO Definition and Measurement
SLO Model
An SLO MUST combine one or more SLIs with an error budget to define agent reliability.
| Field | Type | Required | Default | Constraints |
| --- | --- | --- | --- | --- |
| name | string | Yes | — | Non-empty, unique within a deployment |
| indicators | list[SLI] | Yes | — | At least one SLI |
| error_budget | ErrorBudget | No | auto-derived | Derived from strictest indicator if omitted |
| description | string | No | "" | Free-form text |
| labels | dict[str, str] | No | {} | Arbitrary key-value metadata |
| alert_manager | AlertManager | No | null | If set, alerts fire on status transitions |
| agent_id | string | No | "" | DID or identifier of the owning agent |
SLO Health States
Implementations MUST define the following SLO health states in this severity order:
| Status | Value | Meaning |
| --- | --- | --- |
| HEALTHY | 0 | Within budget, no alerts firing |
| UNKNOWN | 1 | Insufficient data to evaluate |
| WARNING | 2 | Burn rate elevated, warning-level alert firing |
| CRITICAL | 3 | Burn rate critical, budget at risk |
| EXHAUSTED | 4 | Error budget fully consumed |
Alert transitions MUST fire only when severity increases (worsens) or when the SLO recovers to HEALTHY.
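The transition rule can be sketched as a comparison on the severity ordering. This is a minimal illustration, not the SDK's API: the function name `should_fire_alert` is hypothetical, and the enum values are taken from the health state table above.

```python
from enum import IntEnum


class SLOStatus(IntEnum):
    """Health states in the severity order defined by the specification."""
    HEALTHY = 0
    UNKNOWN = 1
    WARNING = 2
    CRITICAL = 3
    EXHAUSTED = 4


def should_fire_alert(previous: SLOStatus, current: SLOStatus) -> bool:
    """Hypothetical helper: fire only when severity worsens or on recovery to HEALTHY."""
    if current > previous:
        return True  # severity increased
    # Recovery counts only when the SLO actually leaves a non-HEALTHY state.
    return current == SLOStatus.HEALTHY and previous != SLOStatus.HEALTHY
```

Under this reading, a drop from CRITICAL to WARNING fires nothing, because severity decreased without reaching HEALTHY.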
SLO Evaluation Precedence
The evaluate() method MUST determine status using:
If error_budget.is_exhausted is true → EXHAUSTED
If any firing alert has severity "critical" → CRITICAL
If any firing alert has severity "warning" → WARNING
If no indicator has a non-None current_value() → UNKNOWN
Otherwise → HEALTHY
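The precedence rules above can be sketched as an ordered chain of checks. The function name `evaluate_status` and its flat parameters are assumptions made for illustration; a real `evaluate()` would read these values from the SLO's budget, alerts, and indicators.

```python
from typing import Optional, Sequence


def evaluate_status(
    budget_exhausted: bool,
    firing_severities: Sequence[str],
    indicator_values: Sequence[Optional[float]],
) -> str:
    """Hypothetical helper applying the five precedence rules in order."""
    if budget_exhausted:
        return "EXHAUSTED"
    if "critical" in firing_severities:
        return "CRITICAL"
    if "warning" in firing_severities:
        return "WARNING"
    # No indicator has produced a value yet: not enough data to judge.
    if all(value is None for value in indicator_values):
        return "UNKNOWN"
    return "HEALTHY"
```

Because the checks run top to bottom, an exhausted budget wins even when no alert is firing, and a firing alert wins even when no indicator has data.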
Exhaustion Actions
When the error budget is fully consumed, the SLO engine MUST support:
| Action | Meaning |
| --- | --- |
| ALERT | Send an alert only |
| FREEZE_DEPLOYMENTS | Halt new agent deployments |
| CIRCUIT_BREAK | Open the agent's circuit breaker |
| THROTTLE | Reduce the agent's request rate |
Auto-Budget Derivation
When no explicit ErrorBudget is provided, the SLO MUST derive the total budget as:
error_budget.total = 1.0 - min(sli.target for sli in indicators)
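As a worked example of the formula, the sketch below assumes every target is a success fraction in [0, 1]; how implementations treat non-fractional targets such as latency or cost is not stated here. The function name `derive_error_budget_total` is hypothetical.

```python
from typing import Sequence


def derive_error_budget_total(targets: Sequence[float]) -> float:
    """Hypothetical helper: total budget = 1.0 - min(target) over all indicators."""
    if not targets:
        # An SLO requires at least one SLI, so an empty list is a caller error.
        raise ValueError("at least one SLI target is required")
    return 1.0 - min(targets)


# Targets of 99.5% and 99.9% give a total budget of roughly 0.005 (0.5%).
budget_total = derive_error_budget_total([0.995, 0.999])
```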
Service Level Indicators
Built-in SLI Types
Implementations MUST provide these eight SLI types with their default targets:
| SLI Type | Metric Name | Default Target | Default Window | Semantics |
| --- | --- | --- | --- | --- |
| TaskSuccessRate | task_success_rate | 0.995 (99.5%) | 30d | Fraction of tasks completed successfully |
| ToolCallAccuracy | tool_call_accuracy | 0.999 (99.9%) | 7d | Fraction of tool calls selecting the correct tool |
| ResponseLatency | response_latency_p{N} | 5000 ms | 1h | Response latency at configurable percentile (default p95) |
| CostPerTask | cost_per_task | $0.50 | 24h | Average cost per task in USD |
| PolicyCompliance | policy_compliance | 1.0 (100%) | 24h | Adherence to Agent OS governance policies |
| DelegationChainDepth | scope_chain_depth | 3 (max depth) | 24h | Maximum delegation chain depth; lower is better |
| HallucinationRate | hallucination_rate | 0.05 (5%) | 24h | Hallucination rate via LLM-as-judge; lower is better |
| CalibrationDeltaSLI | calibration_delta | 0.05 | 30d | Gap between predicted confidence and actual success rate; lower is better |
Inverted SLIs
For DelegationChainDepth, HallucinationRate, and CalibrationDeltaSLI, the compliance() method MUST return the fraction of measurements where value <= target (not value >= target).
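The difference between regular and inverted compliance is a single comparison. The sketch below is illustrative only: the free function `compliance`, its `inverted` flag, and the choice to return `None` when there are no measurements are all assumptions.

```python
from typing import Optional, Sequence


def compliance(values: Sequence[float], target: float, inverted: bool = False) -> Optional[float]:
    """Hypothetical helper: fraction of measurements that meet the target."""
    if not values:
        return None  # assumption: no data means no compliance figure
    if inverted:
        # Lower is better: a measurement is good when value <= target.
        good = sum(1 for value in values if value <= target)
    else:
        # Higher is better: a measurement is good when value >= target.
        good = sum(1 for value in values if value >= target)
    return good / len(values)
```

For delegation depths `[1, 2, 3, 4]` against a target of 3, inverted compliance is 0.75, because only the depth of 4 exceeds the limit.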
TimeWindow Enum
Implementations MUST support these standard time windows:
| Value | Label | Seconds |
| --- | --- | --- |
| HOUR_1 | "1h" | 3,600 |
| HOUR_6 | "6h" | 21,600 |
| DAY_1 | "24h" | 86,400 |
| DAY_7 | "7d" | 604,800 |
| DAY_30 | "30d" | 2,592,000 |
ResponseLatency Percentile
The ResponseLatency SLI MUST compute current_value() as the configured percentile of recorded latencies (not the mean). The percentile index MUST be:
idx = min(int(len(sorted_values) * percentile), len(sorted_values) - 1)
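The index formula can be exercised directly. In this sketch the function name `percentile_value` is hypothetical, `percentile` is assumed to be a fraction such as 0.95, and returning `None` for an empty sample is an assumption.

```python
from typing import Optional, Sequence


def percentile_value(latencies_ms: Sequence[float], percentile: float = 0.95) -> Optional[float]:
    """Hypothetical helper: pick the configured percentile using the specified index."""
    if not latencies_ms:
        return None  # assumption: no recorded latencies means no current value
    sorted_values = sorted(latencies_ms)
    # The min() clamp keeps percentile == 1.0 from indexing past the end.
    idx = min(int(len(sorted_values) * percentile), len(sorted_values) - 1)
    return sorted_values[idx]
```

With ten samples, a percentile of 0.5 selects index 5 and a percentile of 1.0 is clamped to index 9, the largest recorded latency.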
Error Budget Tracking
remaining = total - consumed
is_exhausted = consumed >= total
When error_budget.total == 0: remaining MUST return 0.0; is_exhausted MUST return True if consumed >= 0. Implementations MUST NOT divide by zero.
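A minimal sketch of these rules follows. The dataclass shape is an assumption; only the `total`, `consumed`, `remaining`, and `is_exhausted` names come from the text above.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Illustrative budget tracker; field layout is assumed, not normative."""
    total: float
    consumed: float = 0.0

    @property
    def remaining(self) -> float:
        if self.total == 0:
            return 0.0  # zero-budget case is pinned to 0.0 by the specification
        return self.total - self.consumed

    @property
    def is_exhausted(self) -> bool:
        # With total == 0 this reduces to consumed >= 0; no division is involved.
        return self.consumed >= self.total
```

Neither property divides by `total`, so a zero budget needs no special error handling.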
Burn Rate Computation
burn_rate = consumed_in_window / expected_consumption_in_window
A burn rate of 1.0 means consuming at exactly the allowed pace. Burn rate > 1.0 means budget is depleting faster than planned.
Edge cases:
If expected_consumption == 0.0, burn rate MUST return 0.0.
If total == 0.0, burn rate MUST return 0.0 when consumed == 0.0, or positive infinity otherwise.
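The edge cases can be sketched as guards ahead of the division. The function name `burn_rate` and its flat parameters are hypothetical. The two rules overlap when both `total` and the expected consumption are zero; checking `total` first is an assumption of this sketch, not something the text specifies.

```python
import math


def burn_rate(consumed_in_window: float, expected_consumption_in_window: float, total: float) -> float:
    """Hypothetical helper: consumed / expected, with the zero cases handled first."""
    if total == 0.0:
        # Assumption: the zero-total rule takes priority over the zero-expected rule.
        return 0.0 if consumed_in_window == 0.0 else math.inf
    if expected_consumption_in_window == 0.0:
        return 0.0
    return consumed_in_window / expected_consumption_in_window
```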
BurnRateAlert
Implementations MUST support burn rate alert thresholds:
alert fires when: current_burn_rate >= threshold.rate
alert resolves when: current_burn_rate < threshold.rate
Alerts are classified as warning (burn rate ≥ 2.0) or critical (burn rate ≥ 10.0) by default.
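The default classification can be sketched as two threshold checks, highest first. The function name `classify_burn_rate` and the `None` return for a non-firing rate are assumptions; the 2.0 and 10.0 defaults come from the text above.

```python
from typing import Optional


def classify_burn_rate(rate: float, warning: float = 2.0, critical: float = 10.0) -> Optional[str]:
    """Hypothetical helper: map a burn rate to the severity of the alert it fires."""
    if rate >= critical:
        return "critical"
    if rate >= warning:
        return "warning"
    return None  # below every threshold: any previously firing alert resolves
```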
Circuit Breaker States and Transitions
CircuitState Enum
Implementations MUST define exactly three states:
| State | Meaning |
| --- | --- |
| CLOSED | Normal operation. Failures are counted. |
| OPEN | Failures exceeded threshold. All calls rejected. |
| HALF_OPEN | Testing recovery. One probe call allowed through. |
State Transition Rules
CLOSED → OPEN: failure_count >= failure_threshold
OPEN → HALF_OPEN: reset_timeout has elapsed
HALF_OPEN → CLOSED: probe call succeeds
HALF_OPEN → OPEN: probe call fails
In Public Preview, HALF_OPEN is not auto-entered. Manual recovery via force_close() or reset() is required.
Circuit Breaker Configuration
| Field | Default | Description |
| --- | --- | --- |
| failure_threshold | 5 | Consecutive failures before opening |
| reset_timeout | 30s | Time before attempting recovery (HALF_OPEN) |
| half_open_max_calls | 1 | Probe calls allowed in HALF_OPEN state |
Special Transitions
force_open() called when already OPEN MUST be a no-op.
force_close() called when already CLOSED MUST be a no-op.
The circuit breaker feeds the Hypervisor kill switch: when the breaker opens, it SHOULD trigger the kill switch for the associated agent.
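The state machine described in this section can be sketched as follows. This is an illustration under several assumptions: the class shape, the `record_success`/`record_failure` method names, and the injectable `clock` are hypothetical; the sketch follows the four transition rules, including timed entry into HALF_OPEN, which the Public Preview does not do automatically; and the kill switch hook is not modeled.

```python
import time
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    """Illustrative breaker; only force_open/force_close are names from the spec."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._state = CircuitState.CLOSED
        self._failures = 0
        self._opened_at = 0.0

    @property
    def state(self):
        # OPEN -> HALF_OPEN once reset_timeout has elapsed.
        if (self._state is CircuitState.OPEN
                and self._clock() - self._opened_at >= self.reset_timeout):
            self._state = CircuitState.HALF_OPEN
        return self._state

    def record_success(self):
        if self.state is CircuitState.HALF_OPEN:
            self._state = CircuitState.CLOSED  # probe succeeded
        self._failures = 0  # failures must be consecutive to count

    def record_failure(self):
        state = self.state
        if state is CircuitState.OPEN:
            return  # calls are rejected while OPEN, nothing to count
        if state is CircuitState.HALF_OPEN:
            self._open()  # probe failed
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._open()

    def force_open(self):
        if self._state is not CircuitState.OPEN:  # no-op when already OPEN
            self._open()

    def force_close(self):
        if self._state is not CircuitState.CLOSED:  # no-op when already CLOSED
            self._state = CircuitState.CLOSED
            self._failures = 0

    def _open(self):
        self._state = CircuitState.OPEN
        self._opened_at = self._clock()
```

Injecting the clock keeps the timeout transition testable without sleeping.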
Chaos Injection Types
Implementations MUST support all twelve fault types:
| FaultType | Description |
| --- | --- |
| LATENCY_INJECTION | Introduce configurable delay before action execution |
| ERROR_INJECTION | Inject a synthetic error/exception |
| TIMEOUT_INJECTION | Force a timeout during execution |
| RESOURCE_EXHAUSTION | Simulate memory or CPU limit reached |
| NETWORK_PARTITION | Block network access for the target |
| DATA_CORRUPTION | Return corrupted/malformed responses |
| CLOCK_SKEW | Advance or retard the agent's perceived time |
| DEPENDENCY_FAILURE | Fail a specific downstream dependency |
| PARTIAL_FAILURE | Succeed for a fraction of calls |
| CASCADE_FAILURE | Trigger failures in dependent agents |
| ADVERSARIAL_INPUT | Inject adversarial prompts or inputs |
| RATE_LIMIT_EXCEEDED | Simulate rate limit responses from dependencies |
Blast Radius
The blast_radius field MUST be clamped to [0.0, 1.0]:
blast_radius = -0.5 MUST be clamped to 0.0.
blast_radius = 1.5 MUST be clamped to 1.0.
blast_radius = 0.0 means no traffic is affected.
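A minimal sketch of the clamping rule, assuming `blast_radius` is a plain float; the function name `clamp_blast_radius` is hypothetical.

```python
def clamp_blast_radius(value: float) -> float:
    """Hypothetical helper: force the blast radius into the closed range [0.0, 1.0]."""
    return max(0.0, min(1.0, value))
```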
Abort Conditions
Abort conditions halt experiments when triggered:
abort when: actual_metric_value {comparator} threshold
When check_abort(metrics_snapshot) detects a triggered condition, the experiment state MUST transition to ABORTED and abort_reason MUST be set.
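The abort check can be sketched as a loop over conditions. Only `check_abort`, `metrics_snapshot`, `abort_reason`, and the `ABORTED` state come from the text above; the `AbortCondition` and `Experiment` shapes, the `RUNNING` state name, the supported comparator strings, and the decision to skip metrics missing from the snapshot are assumptions of this sketch.

```python
import operator
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumed comparator vocabulary; the specification does not enumerate it here.
_COMPARATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass
class AbortCondition:
    metric: str
    comparator: str
    threshold: float


@dataclass
class Experiment:
    abort_conditions: List[AbortCondition]
    state: str = "RUNNING"
    abort_reason: Optional[str] = None

    def check_abort(self, metrics_snapshot: Dict[str, float]) -> bool:
        """Abort on the first condition whose comparison holds."""
        for cond in self.abort_conditions:
            value = metrics_snapshot.get(cond.metric)
            if value is None:
                continue  # assumption: a missing metric does not trigger an abort
            if _COMPARATORS[cond.comparator](value, cond.threshold):
                self.state = "ABORTED"
                self.abort_reason = (
                    f"{cond.metric} {cond.comparator} {cond.threshold} (actual: {value})"
                )
                return True
        return False
```

An implementation that takes the fail-closed principle strictly might instead treat a missing metric as a reason to abort.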
Progressive Delivery (Canary Releases)
The SRE specification includes progressive delivery semantics tied to error budget state:
When error_budget.is_exhausted, the FREEZE_DEPLOYMENTS exhaustion action MUST halt new agent deployments.
Canary rollouts MUST observe SLO status before advancing traffic percentage.
If SLO status worsens during a canary rollout, the rollout MUST be paused or rolled back.
Traffic routing decisions MUST be driven by error budget state:
| Budget State | Recommended Action |
| --- | --- |
| > 50% remaining | Full velocity deployments permitted |
| 25–50% remaining | Require SLO review before deployment |
| < 25% remaining | Freeze deployments |
| Exhausted | Mandatory freeze; circuit break |
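The budget bands can be sketched as a gate that a deployment pipeline consults before rolling out. The function name `deployment_action` and the returned labels are hypothetical, and treating a non-positive total as exhausted is an assumption.

```python
def deployment_action(total: float, consumed: float) -> str:
    """Hypothetical helper: map error budget state to the recommended action."""
    if total <= 0 or consumed >= total:
        return "mandatory_freeze"  # exhausted: freeze and circuit break
    remaining_fraction = (total - consumed) / total
    if remaining_fraction > 0.5:
        return "full_velocity"
    if remaining_fraction >= 0.25:
        return "require_slo_review"
    return "freeze"
```

The exhaustion check runs before the division, so a zero budget never causes a divide-by-zero.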
Replay Debugging Semantics
Trace Model
A Trace represents a complete agent execution captured for replay:
| Field | Type | Description |
| --- | --- | --- |
| trace_id | string | Unique trace identifier |
| agent_id | string | DID of the executing agent |
| task | string | Task description or prompt |
| spans | list[Span] | All execution spans |
| content_hash | string | SHA-256 of trace metadata (tamper detection) |
| finished_at | datetime or null | Completion timestamp |
Span Kinds
Implementations MUST define exactly six span kinds:
| SpanKind | Description |
| --- | --- |
| LLM_INFERENCE | LLM model call |
| TOOL_CALL | Tool invocation |
| POLICY_CHECK | Governance policy evaluation |
| DELEGATION | Agent-to-agent delegation |
| MEMORY_READ | Context or memory retrieval |
| MEMORY_WRITE | Context or memory storage |
Content Hash Invariant
Trace content_hash MUST be computed as SHA-256 of the trace metadata (excluding spans). Any modification to trace metadata MUST invalidate the hash, enabling tamper detection.
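The invariant can be illustrated with the standard library. The specification does not state how metadata is serialized before hashing, so the canonical JSON form used here (sorted keys, compact separators) is an assumption, as are the function names.

```python
import hashlib
import hmac
import json


def compute_content_hash(metadata: dict) -> str:
    """Hypothetical helper: SHA-256 over an assumed canonical JSON form of the metadata."""
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_content_hash(metadata: dict, content_hash: str) -> bool:
    """Recompute the hash and compare in constant time."""
    return hmac.compare_digest(compute_content_hash(metadata), content_hash)


# Spans are deliberately left out of the hashed metadata.
metadata = {"trace_id": "t-1", "agent_id": "agent-1", "task": "summarize"}
stored_hash = compute_content_hash(metadata)
```

Sorting keys makes the hash independent of dictionary insertion order, so only a real change to a metadata value invalidates it.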
PII Redaction
PII redaction MUST be applied before trace persistence. Traces MUST NOT be written to disk or returned by the API with unredacted PII.
TraceStore Path Traversal
TraceStore MUST reject path traversal attempts. Any requested trace path containing .. MUST be rejected.
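A sketch of the rejection rule follows. The `..` substring check is the rule stated above; the additional containment check against a base directory is an assumed defense-in-depth step, and the function name `resolve_trace_path` is hypothetical. `Path.is_relative_to` requires Python 3.9 or later.

```python
from pathlib import Path


def resolve_trace_path(base_dir: str, requested: str) -> Path:
    """Hypothetical helper: reject traversal, then resolve inside base_dir."""
    if ".." in requested:
        raise ValueError("path traversal rejected")
    base = Path(base_dir).resolve()
    candidate = (base / requested).resolve()
    # Assumed extra guard: absolute paths or symlinks must not escape base_dir.
    if not candidate.is_relative_to(base):
        raise ValueError("path escapes the trace store root")
    return candidate
```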
Golden Traces
A golden trace is a captured execution marked as the expected-correct reference for regression testing. Golden trace comparisons MUST report deviations in: span count, span kinds, tool call sequence, cost, and latency.
Alerting
Supported Alert Channels
Implementations MUST support all six channel types:
| AlertChannel | Description |
| --- | --- |
| WEBHOOK | HTTP POST to a configured URL |
| EMAIL | Email dispatch |
| SLACK | Slack webhook |
| PAGERDUTY | PagerDuty Events API |
| TEAMS | Microsoft Teams webhook |
| NULL | Discards alerts (testing/dev) |
Alert Deduplication
Duplicate alerts with the same dedup_key within the dedup_window_seconds MUST be suppressed. A RESOLVED alert MUST clear the dedup cache entry even if the original alert was never deduplicated.
send(alert_1, dedup_key="agent-1:slo-1") → dispatched
send(alert_2, dedup_key="agent-1:slo-1") at T+60s → suppressed (within 300s window)
send(resolved, dedup_key="agent-1:slo-1") → dispatched, clears cache
send(alert_3, dedup_key="agent-1:slo-1") → dispatched (cache was cleared)
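The sequence above can be reproduced with a small in-memory cache. The class name `AlertDeduplicator`, the `should_dispatch` method, and the injectable clock are hypothetical; the 300-second window matches the example.

```python
import time
from typing import Callable, Dict


class AlertDeduplicator:
    """Illustrative dedup cache keyed by dedup_key; not the SDK's AlertManager."""

    def __init__(self, dedup_window_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._window = dedup_window_seconds
        self._clock = clock
        self._last_sent: Dict[str, float] = {}

    def should_dispatch(self, dedup_key: str, resolved: bool = False) -> bool:
        now = self._clock()
        if resolved:
            # RESOLVED always goes out and clears the entry, even if none exists.
            self._last_sent.pop(dedup_key, None)
            return True
        last = self._last_sent.get(dedup_key)
        if last is not None and now - last < self._window:
            return False  # duplicate inside the window: suppress
        self._last_sent[dedup_key] = now
        return True
```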
Incident Detection
Signal Classification
The IncidentDetector creates incidents only for P1 and P2 signals:
| Signal Type | Default Severity |
| --- | --- |
| SLO_BREACH | P2 |
| COST_ANOMALY | P2 |
| POLICY_VIOLATION | P1 |
| TRUST_DEGRADATION | P2 |
| HALLUCINATION_SPIKE | P2 |
| KILL_SWITCH_ACTIVATED | P1 |
Incident Correlation
When multiple signals arrive from the same source within correlation_window_seconds, they MUST be correlated into a single incident:
Title: "Correlated: {signal_types} from {source}"
Severity: highest severity among correlated signals
Actions: union of response actions from all signal types
Signal deduplication and correlation windows MUST be enforced. Only the first P1/P2 signal for a given source within the dedup window creates an incident; subsequent duplicates MUST be suppressed.
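Correlation and deduplication can be sketched together as a grouping pass over signals. Much of this is assumed: the `Signal` dataclass, the dictionary incident shape, the 300-second default window, measuring the window from the first signal of an incident, joining signal types with commas, and the title used for a single-signal incident. Response actions are omitted because they are not defined in this section.

```python
from dataclasses import dataclass
from typing import Dict, List

DEFAULT_SEVERITY = {
    "SLO_BREACH": "P2", "COST_ANOMALY": "P2", "POLICY_VIOLATION": "P1",
    "TRUST_DEGRADATION": "P2", "HALLUCINATION_SPIKE": "P2", "KILL_SWITCH_ACTIVATED": "P1",
}


@dataclass
class Signal:
    signal_type: str
    source: str
    timestamp: float


def correlate(signals: List[Signal], correlation_window_seconds: float = 300.0) -> List[dict]:
    """Hypothetical helper: one incident per source per window, duplicates suppressed."""
    incidents: List[dict] = []
    open_by_source: Dict[str, dict] = {}
    for sig in sorted(signals, key=lambda s: s.timestamp):
        severity = DEFAULT_SEVERITY.get(sig.signal_type)
        if severity not in ("P1", "P2"):
            continue  # only P1 and P2 signals create incidents
        inc = open_by_source.get(sig.source)
        if inc is None or sig.timestamp - inc["started_at"] > correlation_window_seconds:
            inc = {"source": sig.source, "started_at": sig.timestamp,
                   "signal_types": [sig.signal_type], "severity": severity}
            open_by_source[sig.source] = inc
            incidents.append(inc)
            continue
        if sig.signal_type not in inc["signal_types"]:
            inc["signal_types"].append(sig.signal_type)  # duplicates are dropped
        inc["severity"] = min(inc["severity"], severity)  # "P1" sorts before "P2"
    for inc in incidents:
        types = inc["signal_types"]
        prefix = "Correlated: " if len(types) > 1 else ""
        inc["title"] = f"{prefix}{', '.join(types)} from {inc['source']}"
    return incidents
```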
OpenTelemetry Integration
All OTEL semantic conventions MUST use the agent.* namespace:
| Attribute | Description |
| --- | --- |
| agent.id | Agent DID |
| agent.ring | Execution ring (0–3) |
| agent.trust_score | Current trust score |
| agent.task | Task description |
| agent.tool_name | Tool name (for TOOL_CALL spans) |
| agent.policy_decision | Policy decision (for POLICY_CHECK spans) |
| agent.delegation_depth | Chain depth (for DELEGATION spans) |
All metric instruments MUST follow Prometheus naming conventions (snake_case, units in name, _total suffix for counters).
Artifact Signing
ArtifactSigner MUST use Ed25519 exclusively for signing agent build artifacts and SBOMs.
A SignatureBundle MUST contain:
| Field | Description |
| --- | --- |
| signature | Base64-encoded Ed25519 signature |
| public_key | Base64-encoded public key |
| artifact_hash | SHA-256 of artifact file contents |
| timestamp | ISO 8601 UTC signing time |
sign_artifact(path) called on a non-existent path MUST raise a filesystem error. A bundle with empty or zero-length signature MUST NOT be returned.
A conforming implementation MUST satisfy all of the following:
SLO evaluation follows the specified precedence rules.
SLOStatus enum contains all five states with correct ordering.
ExhaustionAction enum contains all four values.
All eight built-in SLI types are implemented with correct defaults.
Inverted SLIs use value <= target for compliance.
Error budget remaining computation matches the formula.
Burn rate handles zero and infinite cases correctly.
CircuitState enum contains CLOSED, OPEN, and HALF_OPEN.
Circuit breaker transitions follow the specified rules.
All twelve FaultType values are defined.
Blast radius is clamped to [0.0, 1.0].
Abort conditions halt experiments when triggered.
All six AlertChannel types are supported.
Alert deduplication respects the dedup window.
RESOLVED alerts clear the dedup cache.
All six SpanKind values are defined.
Trace content hash uses SHA-256.
TraceStore rejects path traversal.
PII redaction is applied before trace persistence.
ArtifactSigner uses Ed25519 exclusively.
SignatureBundle supports round-trip serialization.
IncidentDetector creates incidents only for P1/P2 signals.
Signal deduplication and correlation windows are enforced.
All OTEL semantic conventions use the agent.* namespace.
All metric instruments follow Prometheus naming conventions.
All components fail closed on internal error.
Conformance tests MUST cover: SLO evaluation across all five status values; error budget exhaustion and burn rate computation; circuit breaker state transitions; chaos experiment lifecycle; alert dispatch to all channel types; alert deduplication; trace capture, redaction, and storage; golden trace comparison; artifact signing round-trip; incident creation from P1/P2 signals; and signal correlation.
The reference conformance test suite contains 111 tests. Cross-language SDK compatibility is verified by sharing the same SLO configuration objects and asserting identical status transitions across all SDK implementations.