Agent SRE brings Site Reliability Engineering disciplines to autonomous AI agents — treating agents like services with measurable, enforceable reliability targets. Traditional services fail in predictable ways (timeouts, crashes). Agents add new failure modes: infinite tool-call loops that burn tokens, silent behavioral drift after model updates, runaway cost from a single task, and cascading failures when one agent’s retry storm overwhelms downstream agents. The agent-sre package gives you the building blocks to detect, contain, and recover from all of these.
Why SRE for AI Agents?
| Failure Mode | Traditional Service | AI Agent |
| --- | --- | --- |
| Runaway loops | Process hangs | Agent calls tools in an infinite loop, burning tokens |
| Behavioral drift | Bug in new deploy | Model update changes decision patterns silently |
| Cost explosion | Resource leak | Single task consumes $500 in API calls |
| Rogue behavior | Compromised service | Agent uses unauthorized tools or exfiltrates data |
| Cascading failure | Service dependency down | Agent A fails → Agent B retries → Agent C overloaded |
Combine Agent SRE monitoring with the AGT governance audit trail for complete observability: SRE tells you when an agent is failing, the audit log tells you what it was doing and which policy applied at the time of each failure.
SLO Engine
Define what “reliable” means for your agents with Service Level Objectives backed by error budgets and burn rate alerts.
Available SLI Types
| SLI Type | What It Measures | Example Target |
| --- | --- | --- |
| Latency | Task completion time | p99 < 10s |
| Error rate | Fraction of failed tasks | < 1% |
| Cost | Per-task spend | < $0.50/task |
| Token usage | Tokens per completion | < 4096 |
| Hallucination | Factual accuracy score | > 95% |
| Tool success | Tool call success rate | > 99% |
| Human feedback | User satisfaction score | > 4.0/5.0 |
Define an SLI and SLO
from agent_sre.slo import SLI, SLIValue, SLO, ErrorBudget
from agent_sre.slo.objectives import BurnRateAlert, ExhaustionAction


class TaskSuccessRateSLI(SLI):
    """Tracks task success rate."""

    def collect(self) -> SLIValue:
        values = self.values_in_window()
        if not values:
            # No samples in the window yet: report a perfect score
            return self.record(1.0)
        good = sum(1 for v in values if v.is_good)
        return self.record(good / len(values))


# 99.5% success rate target over a 24h window
success_sli = TaskSuccessRateSLI(
    name="task_success_rate",
    target=0.995,
    window="24h",
)

slo = SLO(
    name="code-reviewer-reliability",
    indicators=[success_sli],
    error_budget=ErrorBudget(
        total=0.005,                  # 0.5% error budget (1 - 0.995)
        window_seconds=2_592_000,     # 30-day window
        burn_rate_alert=2.0,          # Warn at 2× burn rate
        burn_rate_critical=10.0,      # Critical at 10× burn rate
        exhaustion_action=ExhaustionAction.THROTTLE,  # Auto-throttle on exhaustion
    ),
    agent_id="code-reviewer",
)
Record Events and Check Status
# Record outcomes as they happen
for _ in range(95):
    slo.error_budget.record_event(good=True)
for _ in range(5):
    slo.error_budget.record_event(good=False)

# Check error budget
budget = slo.error_budget
print(f"Budget remaining: {budget.remaining_percent:.1f}%")
print(f"Exhausted: {budget.is_exhausted}")

# Check burn rate (are we burning budget too fast?)
burn_rate = budget.burn_rate(window_seconds=3600)  # Last hour
print(f"1h burn rate: {burn_rate:.2f}x")

# Check for firing alerts
for alert in budget.firing_alerts():
    print(f"🔥 {alert.name}: burn rate {burn_rate:.1f}x (threshold: {alert.rate}x)")
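To make the burn-rate thresholds above concrete, here is a minimal, library-independent sketch of the arithmetic as it is usually defined in SRE practice: burn rate is the observed error rate divided by the error budget. The helper names `burn_rate` and `hours_to_exhaustion` are hypothetical, and agent_sre may compute its own figure differently (for example, by weighting recent events).

```python
def burn_rate(good: int, bad: int, budget: float) -> float:
    """Observed error rate divided by the allowed error rate (the budget).

    A value of 1.0 means the budget is consumed exactly at the end of
    the SLO window; 10.0 means it is consumed ten times faster.
    """
    total = good + bad
    if total == 0:
        return 0.0  # no traffic, nothing burned
    return (bad / total) / budget


def hours_to_exhaustion(rate: float, window_seconds: float) -> float:
    """Hours until a full budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_seconds / rate / 3600


# Same numbers as the example above: 95 good events, 5 bad, 0.5% budget
rate = burn_rate(good=95, bad=5, budget=0.005)
hours = hours_to_exhaustion(rate, window_seconds=2_592_000)
print(f"burn rate: {rate:.1f}x, budget gone in about {hours:.0f} hours")
```

Under this definition, a 5% failure rate against a 0.5% budget burns roughly ten times faster than sustainable, which would sit at the `burn_rate_critical = 10.0` threshold configured earlier.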
Circuit Breakers
When an agent starts failing, you don’t want it to keep hammering downstream services. The circuit breaker isolates failing agents automatically.
CLOSED ──(failures >= threshold)──→ OPEN
OPEN ──(timeout elapsed)──→ HALF_OPEN
HALF_OPEN ──(success)──→ CLOSED
HALF_OPEN ──(failure)──→ OPEN
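The transitions above can be sketched as a small standalone state machine. This is an illustration of the pattern only, not the agent_sre implementation: `TinyBreaker`, `State`, and the injectable `clock` argument are hypothetical names, and the sketch assumes the threshold counts consecutive failures.

```python
import time
from enum import Enum
from typing import Callable


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class TinyBreaker:
    """Illustrative circuit breaker; counts consecutive failures."""

    def __init__(
        self,
        failure_threshold: int = 3,
        recovery_timeout_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._timeout = recovery_timeout_seconds
        self._clock = clock
        self._state = State.CLOSED
        self._failures = 0
        self._opened_at = 0.0

    @property
    def state(self) -> State:
        # OPEN becomes HALF_OPEN lazily, once the recovery timeout has elapsed
        if self._state is State.OPEN and self._clock() - self._opened_at >= self._timeout:
            self._state = State.HALF_OPEN
        return self._state

    def allow(self) -> bool:
        """Calls are rejected only while the circuit is OPEN."""
        return self.state is not State.OPEN

    def record_success(self) -> None:
        # Any success closes the circuit and clears the failure streak
        self._failures = 0
        self._state = State.CLOSED

    def record_failure(self) -> None:
        current = self.state
        if current is State.OPEN:
            return  # already isolated; do not extend the timeout
        if current is State.HALF_OPEN:
            self._trip()  # the probe call failed, go straight back to OPEN
            return
        self._failures += 1
        if self._failures >= self._threshold:
            self._trip()

    def _trip(self) -> None:
        self._state = State.OPEN
        self._opened_at = self._clock()
        self._failures = 0
```

Passing a fake `clock` makes the timeout transition testable without sleeping, which is the main reason the sketch takes it as a parameter.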
Setup and Usage
from agent_sre.cascade.circuit_breaker import (
    CircuitBreaker,
    CircuitBreakerConfig,
    CircuitOpenError,
)

# Open after 3 failures, test recovery after 30s
config = CircuitBreakerConfig(
    failure_threshold=3,
    recovery_timeout_seconds=30.0,
    half_open_max_calls=1,
)
breaker = CircuitBreaker(agent_id="data-analyst", config=config)


def run_agent_task(task: dict) -> str:
    """Your agent's main function."""
    # ... agent logic ...
    return "result"


# The circuit breaker wraps the call
try:
    result = breaker.call(run_agent_task, {"query": "revenue Q3"})
    print(f"Result: {result}")
except CircuitOpenError as e:
    print(f"Agent isolated: {e}")
    print(f"Retry after: {e.retry_after:.0f}s")
Manual Control
# Manual success/failure recording
try:
    result = run_agent_task(task)
    breaker.record_success()
except Exception:
    breaker.record_failure()
    raise

# Check state
print(f"State: {breaker.state}")  # CLOSED, OPEN, or HALF_OPEN
print(f"Failures: {breaker.failure_count}")

# Manual reset after deploying a fix
breaker.reset()
Fleet Management
class AgentFleetBreakers:
    """Manage circuit breakers for a fleet of agents."""

    def __init__(self, config: CircuitBreakerConfig | None = None):
        self._config = config or CircuitBreakerConfig()
        self._breakers: dict[str, CircuitBreaker] = {}

    def get(self, agent_id: str) -> CircuitBreaker:
        # Create each agent's breaker lazily on first access
        if agent_id not in self._breakers:
            self._breakers[agent_id] = CircuitBreaker(agent_id, self._config)
        return self._breakers[agent_id]

    def open_circuits(self) -> list[str]:
        return [aid for aid, cb in self._breakers.items() if cb.state == "OPEN"]


fleet = AgentFleetBreakers(
    config=CircuitBreakerConfig(failure_threshold=5, recovery_timeout_seconds=60.0),
)
Rogue Agent Detection
The RogueAgentDetector (OWASP ASI-10) combines three signals to flag compromised or malfunctioning agents:
Tool-call frequency — z-score spike detection over a sliding window
Action entropy — flags both suspiciously repetitive and chaotic behavior
Capability violations — tools used outside the agent’s allowed profile
import time

from agent_sre.anomaly import RogueAgentDetector, RogueDetectorConfig, RiskLevel

config = RogueDetectorConfig(
    frequency_window_seconds=60.0,
    frequency_z_threshold=2.5,
    entropy_low_threshold=0.3,    # Too repetitive (possible loop)
    entropy_high_threshold=3.5,   # Too chaotic (possible compromise)
    quarantine_risk_level=RiskLevel.HIGH,
)
detector = RogueAgentDetector(config=config)

# Register allowed tools per agent
detector.register_capability_profile(
    agent_id="support-agent",
    allowed_tools=["search_kb", "create_ticket", "send_email"],
)

# Agent starts using unauthorized tools rapidly
for i in range(50):
    detector.record_action(
        agent_id="support-agent",
        action="exfiltrate",
        tool_name="shell_exec",  # Not in allowed tools!
        timestamp=time.time() + i * 0.5,
    )

assessment = detector.assess("support-agent")
print(f"Risk: {assessment.risk_level.value}")  # "high" or "critical"
print(f"Quarantine? {assessment.quarantine_recommended}")  # True
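For intuition about the first two signals, the sketch below computes Shannon entropy over tool names and a z-score for call frequency using only the standard library. It assumes entropy is measured in bits and that the z-score compares the current window against a population standard deviation of past windows; the detector's actual formulas are not shown in this page, and the function names here are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev
from typing import Sequence


def action_entropy(tool_names: Sequence[str]) -> float:
    """Shannon entropy (in bits) of the tool-name distribution."""
    counts = Counter(tool_names)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # sum of p * log2(1/p); 0.0 when a single tool dominates completely
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def frequency_z_score(history: Sequence[float], current: float) -> float:
    """How many standard deviations `current` sits above the historical mean."""
    if len(history) < 2:
        return 0.0  # not enough data for a baseline
    sigma = pstdev(history)
    if sigma == 0:
        return 0.0  # flat baseline; avoid dividing by zero
    return (current - mean(history)) / sigma


# 50 identical calls: entropy 0.0, below a low threshold such as 0.3
looping = action_entropy(["shell_exec"] * 50)

# 16 different tools used equally often: 4.0 bits, above a high threshold such as 3.5
chaotic = action_entropy([f"tool_{i}" for i in range(16)])

# Calls per window jump from about 10 to 50: a large positive z-score
spike = frequency_z_score([10, 12, 11, 9, 10], current=50)
print(f"looping={looping:.2f} chaotic={chaotic:.2f} spike_z={spike:.1f}")
```

Read this way, the low and high entropy thresholds bracket a band of normal variety: an agent stuck on one tool falls below it, and an agent spraying calls across many tools rises above it.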
Chaos Testing
Inject faults into your agent pipeline to verify resilience before production incidents find the gaps.
from agent_sre.chaos import (
    ChaosExperiment,
    Fault,
    FaultType,
    AbortCondition,
)

# Create faults to inject
faults = [
    Fault.latency_injection("openai-api", delay_ms=5000, rate=0.3),
    Fault.error_injection("search_tool", error="timeout", rate=0.1),
    Fault.timeout_injection("database", delay_ms=30000, rate=0.05),
]

# Safety: abort if success rate drops to 50% or below
abort_conditions = [
    AbortCondition(metric="success_rate", threshold=0.5, comparator="lte"),
]

experiment = ChaosExperiment(
    name="llm-latency-resilience",
    target_agent="code-reviewer",
    faults=faults,
    duration_seconds=1800,  # 30 minutes
    abort_conditions=abort_conditions,
    blast_radius=0.3,       # Affect 30% of traffic
    description="Verify code-reviewer handles LLM latency gracefully",
)
experiment.start()

# Periodically check abort conditions
metrics = {"success_rate": 0.85, "latency_p99": 8500}
if experiment.check_abort(metrics):
    print(f"Aborted: {experiment.abort_reason}")
else:
    score = experiment.calculate_resilience(
        baseline_success_rate=0.99,
        experiment_success_rate=0.85,
    )
    experiment.complete(resilience=score)
    print(f"Resilience: {experiment.resilience.overall:.0f}/100")
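To show what a rate-based fault and an abort check amount to, here is a self-contained sketch that wraps a function so a fraction of calls fail, then measures the resulting success rate. It does not use agent_sre; `with_error_injection`, `InjectedFault`, `measure_success_rate`, and `should_abort` are hypothetical names, and the abort rule simply mirrors the `"lte"` comparator from the example above.

```python
import random
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(RuntimeError):
    """Raised instead of running the wrapped call when a fault fires."""


def with_error_injection(
    fn: Callable[..., T],
    rate: float,
    rng: Optional[random.Random] = None,
) -> Callable[..., T]:
    """Return a wrapper that fails roughly `rate` of all calls."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    source = rng if rng is not None else random.Random()

    def wrapper(*args, **kwargs):
        # random() is in [0, 1), so rate=0 never fires and rate=1 always fires
        if source.random() < rate:
            raise InjectedFault(f"injected fault (rate={rate})")
        return fn(*args, **kwargs)

    return wrapper


def measure_success_rate(fn: Callable[[], object], calls: int) -> float:
    """Call `fn` repeatedly and return the fraction that did not fault."""
    ok = 0
    for _ in range(calls):
        try:
            fn()
            ok += 1
        except InjectedFault:
            pass
    return ok / calls


def should_abort(success_rate: float, threshold: float = 0.5) -> bool:
    """Abort when the success rate is at or below the threshold ("lte")."""
    return success_rate <= threshold


flaky_search = with_error_injection(lambda: "hit", rate=0.1, rng=random.Random(42))
observed = measure_success_rate(flaky_search, calls=1000)
print(f"success rate under fault: {observed:.2f}, abort: {should_abort(observed)}")
```

Seeding the random source keeps a chaos run reproducible, which helps when comparing results before and after a fix.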
Adversarial Chaos Testing
Test security boundaries with security-focused fault types:
security_faults = [
    Fault.prompt_injection("code-reviewer", technique="direct_override"),
    Fault.privilege_escalation("code-reviewer", target_role="admin"),
    Fault.tool_abuse("code-reviewer", tool_name="shell_exec"),
]

security_experiment = ChaosExperiment(
    name="security-boundary-test",
    target_agent="code-reviewer",
    faults=security_faults,
    duration_seconds=600,
    description="Verify agent rejects adversarial inputs",
)
Cost Controls
Prevent runaway spending with per-task budgets, auto-throttle, and a kill-switch.
from agent_sre.cost import CostGuard, BudgetAction

guard = CostGuard(
    per_task_limit=2.00,           # Max $2 per task
    per_agent_daily_limit=50.00,   # Max $50/day per agent
    org_monthly_budget=5000.00,    # Org-wide cap
    auto_throttle=True,            # Throttle at 85% daily budget
    kill_switch_threshold=0.95,    # Kill agent at 95% daily budget
    anomaly_detection=True,        # Detect cost spikes
)

# Pre-flight check
allowed, reason = guard.check_task("research-agent", estimated_cost=1.50)

# Record actual cost after task
alerts = guard.record_cost(
    agent_id="research-agent",
    task_id="task-001",
    cost_usd=0.45,
    breakdown={"input_tokens": 0.15, "output_tokens": 0.25, "tool_calls": 0.05},
)

for alert in alerts:
    if alert.action == BudgetAction.KILL:
        print("🛑 Agent killed: stop all tasks immediately")
    elif alert.action == BudgetAction.THROTTLE:
        print("⚠ Agent throttled: reduce task rate")
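The throttle and kill thresholds above boil down to comparing the fraction of the daily budget already spent against two cut-offs. The following sketch shows that decision in isolation; `budget_action` is a hypothetical helper, the 85% and 95% defaults are taken from the comments in the example above, and CostGuard's real logic (including anomaly detection) is likely more involved.

```python
def budget_action(
    spent_today: float,
    daily_limit: float,
    throttle_at: float = 0.85,
    kill_at: float = 0.95,
) -> str:
    """Return "allow", "throttle", or "kill" for the current daily spend."""
    if daily_limit <= 0:
        raise ValueError("daily_limit must be positive")
    used = spent_today / daily_limit
    # Check the stricter threshold first so "kill" wins over "throttle"
    if used >= kill_at:
        return "kill"
    if used >= throttle_at:
        return "throttle"
    return "allow"


# With a $50 daily limit: $40 is fine, $43 throttles, $48 kills
for spent in (40.0, 43.0, 48.0):
    print(f"${spent:.2f} spent -> {budget_action(spent, daily_limit=50.0)}")
```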
Production SRE Pipeline
Combine all components into a production-ready pipeline:
"""Production SRE pipeline for AI agents."""
from agent_sre.anomaly import RogueAgentDetector, RogueDetectorConfig, RiskLevel
from agent_sre.cascade.circuit_breaker import CircuitBreaker, CircuitBreakerConfig, CircuitOpenError
from agent_sre.slo import SLI, SLIValue, SLO, ErrorBudget
from agent_sre.slo.objectives import ExhaustionAction
from agent_sre.cost import CostGuard, BudgetAction

AGENT_ID = "production-agent"

# Configure all components
rogue_detector = RogueAgentDetector(
    config=RogueDetectorConfig(
        frequency_z_threshold=3.0,
        quarantine_risk_level=RiskLevel.HIGH,
    ),
)
rogue_detector.register_capability_profile(
    AGENT_ID,
    allowed_tools=["search", "read_file", "write_file", "run_tests"],
)

breaker = CircuitBreaker(
    agent_id=AGENT_ID,
    config=CircuitBreakerConfig(
        failure_threshold=5,
        recovery_timeout_seconds=60.0,
    ),
)


class SuccessRateSLI(SLI):
    def collect(self) -> SLIValue:
        values = self.values_in_window()
        if not values:
            return self.record(1.0)
        good = sum(1 for v in values if v.is_good)
        return self.record(good / len(values))


slo = SLO(
    name=f"{AGENT_ID}-reliability",
    indicators=[SuccessRateSLI(name="success_rate", target=0.995, window="24h")],
    error_budget=ErrorBudget(
        total=0.005,
        exhaustion_action=ExhaustionAction.CIRCUIT_BREAK,
    ),
    agent_id=AGENT_ID,
)

cost_guard = CostGuard(
    per_task_limit=2.00,
    per_agent_daily_limit=100.00,
    auto_throttle=True,
    kill_switch_threshold=0.95,
)
The pipeline gives you:
| Layer | Component | Protection |
| --- | --- | --- |
| Pre-flight | `CostGuard.check_task` | Blocks tasks that would exceed budget |
| Pre-flight | `RogueAgentDetector.assess` | Quarantines compromised agents |
| Execution | `CircuitBreaker.call` | Isolates failing agents |
| Post-flight | `ErrorBudget.record_event` | Tracks reliability over time |
| Post-flight | `CostGuard.record_cost` | Detects cost anomalies, auto-throttles |
| Post-flight | `RogueAgentDetector.record_action` | Builds behavioral baseline |
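One way the layers in this table might be wired around a single task is sketched below. The function is duck-typed so that it runs without agent_sre installed: it only calls methods with the names and keyword arguments shown earlier on this page. `run_guarded` itself, the `(result, cost_usd)` return convention for `run_task`, and omitting `breakdown` from `record_cost` are all assumptions made for illustration.

```python
from typing import Any, Callable, Tuple


def run_guarded(
    task: dict,
    run_task: Callable[[dict], Tuple[Any, float]],
    *,
    agent_id: str,
    task_id: str,
    estimated_cost: float,
    cost_guard: Any,
    detector: Any,
    breaker: Any,
    slo: Any,
) -> Tuple[Any, list]:
    """Run one task through pre-flight, execution, and post-flight checks."""
    # Pre-flight: refuse work that would exceed the budget
    allowed, reason = cost_guard.check_task(agent_id, estimated_cost=estimated_cost)
    if not allowed:
        raise RuntimeError(f"blocked by cost guard: {reason}")

    # Pre-flight: refuse work from an agent flagged for quarantine
    if detector.assess(agent_id).quarantine_recommended:
        raise RuntimeError(f"agent {agent_id} is quarantined")

    # Execution: the breaker rejects the call while the circuit is open.
    # Note that this sketch counts any exception, including a rejected
    # call, against the error budget; you may prefer to exclude those.
    try:
        result, cost_usd = breaker.call(run_task, task)
    except Exception:
        slo.error_budget.record_event(good=False)
        raise

    # Post-flight: record the success and the actual spend
    slo.error_budget.record_event(good=True)
    alerts = cost_guard.record_cost(
        agent_id=agent_id,
        task_id=task_id,
        cost_usd=cost_usd,
    )
    return result, alerts
```

With the objects configured above, a call could look like `run_guarded(task, my_task_fn, agent_id=AGENT_ID, task_id="task-001", estimated_cost=1.50, cost_guard=cost_guard, detector=rogue_detector, breaker=breaker, slo=slo)`, where `my_task_fn` is your own function. Recording tool usage with `record_action` is left to the task function, since only it knows which tools were called.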
Progressive Delivery and Replay Debugging
Progressive Delivery
Use agent_sre.delivery.BlueGreenManager to safely roll out new agent versions with validation and auto-rollback. Canary releases let you test policy updates on a slice of traffic before full deployment.

Replay Debugging
agent_sre includes deterministic replay of agent sessions for post-incident analysis. Traces capture sufficient detail to allow re-execution and regression detection against golden traces.

Alerting
Connect agent_sre.alerts.AlertManager to your notification system (PagerDuty, Slack, Teams) for burn rate alerts, circuit breaker state transitions, and cost anomalies.

Scheduled Chaos
Use agent_sre.chaos.ChaosScheduler for recurring resilience tests with blackout windows. Run weekly to catch regressions before production traffic does.