Agent SRE brings Site Reliability Engineering disciplines to autonomous AI agents — treating agents like services with measurable, enforceable reliability targets. Traditional services fail in predictable ways (timeouts, crashes). Agents add new failure modes: infinite tool-call loops that burn tokens, silent behavioral drift after model updates, runaway cost from a single task, and cascading failures when one agent’s retry storm overwhelms downstream agents. The agent-sre package gives you the building blocks to detect, contain, and recover from all of these.
pip install agent-sre

Why SRE for AI Agents?

| Failure Mode | Traditional Service | AI Agent |
|---|---|---|
| Runaway loops | Process hangs | Agent calls tools in an infinite loop, burning tokens |
| Behavioral drift | Bug in new deploy | Model update changes decision patterns silently |
| Cost explosion | Resource leak | Single task consumes $500 in API calls |
| Rogue behavior | Compromised service | Agent uses unauthorized tools or exfiltrates data |
| Cascading failure | Service dependency down | Agent A fails → Agent B retries → Agent C overloaded |

Combine Agent SRE monitoring with the AGT governance audit trail for complete observability: SRE tells you when an agent is failing, and the audit log tells you what it was doing and which policy applied at the time of each failure.

SLO Engine

Define what “reliable” means for your agents with Service Level Objectives backed by error budgets and burn rate alerts.

Available SLI Types

| SLI Type | What It Measures | Example Target |
|---|---|---|
| Latency | Task completion time | p99 < 10s |
| Error rate | Fraction of failed tasks | < 1% |
| Cost | Per-task spend | < $0.50/task |
| Token usage | Tokens per completion | < 4096 |
| Hallucination | Factual accuracy score | > 95% |
| Tool success | Tool call success rate | > 99% |
| Human feedback | User satisfaction score | > 4.0/5.0 |

Define an SLI and SLO

from agent_sre.slo import SLI, SLIValue, SLO, ErrorBudget
from agent_sre.slo.objectives import BurnRateAlert, ExhaustionAction

class TaskSuccessRateSLI(SLI):
    """Tracks task success rate."""

    def collect(self) -> SLIValue:
        values = self.values_in_window()
        if not values:
            return self.record(1.0)
        good = sum(1 for v in values if v.is_good)
        return self.record(good / len(values))

# 99.5% success rate target over a 24h window
success_sli = TaskSuccessRateSLI(
    name="task_success_rate",
    target=0.995,
    window="24h",
)

slo = SLO(
    name="code-reviewer-reliability",
    indicators=[success_sli],
    error_budget=ErrorBudget(
        total=0.005,                               # 0.5% error budget (1 - 0.995)
        window_seconds=2_592_000,                   # 30-day window
        burn_rate_alert=2.0,                        # Warn at 2× burn rate
        burn_rate_critical=10.0,                    # Critical at 10× burn rate
        exhaustion_action=ExhaustionAction.THROTTLE, # Auto-throttle on exhaustion
    ),
    agent_id="code-reviewer",
)

Record Events and Check Status

# Record outcomes as they happen
for _ in range(95):
    slo.error_budget.record_event(good=True)

for _ in range(5):
    slo.error_budget.record_event(good=False)

# Check error budget
budget = slo.error_budget
print(f"Budget remaining: {budget.remaining_percent:.1f}%")
print(f"Exhausted: {budget.is_exhausted}")

# Check burn rate (are we burning budget too fast?)
burn_rate = budget.burn_rate(window_seconds=3600)  # Last hour
print(f"1h burn rate: {burn_rate:.2f}x")

# Check for firing alerts
for alert in budget.firing_alerts():
    print(f"🔥 {alert.name}: burn rate {burn_rate:.1f}x (threshold: {alert.rate}x)")
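
For intuition, the numbers in the example above can be worked by hand. The following is a minimal sketch using the conventional SRE definitions, where burn rate is the observed error rate divided by the budgeted error rate. The function names `burn_rate` and `budget_consumed` are hypothetical, and agent-sre may compute `remaining_percent` and `burn_rate` differently (for example, over time windows rather than raw event counts).

```python
def burn_rate(good: int, bad: int, budget: float) -> float:
    """Observed error rate divided by the budgeted error rate (assumed definition)."""
    total = good + bad
    if total == 0:
        return 0.0  # No traffic, so nothing has been burned
    return (bad / total) / budget


def budget_consumed(good: int, bad: int, budget: float) -> float:
    """Fraction of the error budget used: bad events / allowed bad events."""
    allowed_bad = (good + bad) * budget
    if allowed_bad == 0:
        return 0.0
    return bad / allowed_bad


# Same inputs as the example above: 95 good events, 5 bad, 0.5% budget
rate = burn_rate(good=95, bad=5, budget=0.005)
consumed = budget_consumed(good=95, bad=5, budget=0.005)
print(f"burn rate: {rate:.1f}x")
print(f"budget consumed: {consumed:.0%}")
```

Under these definitions, a 5% failure rate against a 0.5% budget burns ten times faster than the budget allows, which is why the sample events would be expected to trip a critical threshold of 10.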

Circuit Breakers

When an agent starts failing, you don’t want it to keep hammering downstream services. The circuit breaker isolates failing agents automatically.
CLOSED ──(failures >= threshold)──→ OPEN ──(timeout elapsed)──→ HALF_OPEN
  ↑                                                                │
  └──────────(success)────────────────────────────────────────────←─┘
                                             (failure) ──→ OPEN
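
The transitions in the diagram can be sketched as a small state machine. This is an illustrative stand-in, not the agent-sre implementation: `TinyBreaker` and its injectable `clock` are hypothetical, it raises a plain `RuntimeError` instead of `CircuitOpenError`, and it does not model `half_open_max_calls`.

```python
import time
from typing import Callable


class TinyBreaker:
    """Illustrative three-state circuit breaker (hypothetical, for explanation only)."""

    def __init__(
        self,
        failure_threshold: int = 3,
        recovery_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._state = "CLOSED"
        self._failures = 0
        self._opened_at = 0.0

    @property
    def state(self) -> str:
        # OPEN becomes HALF_OPEN once the recovery timeout has elapsed
        if self._state == "OPEN" and self._clock() - self._opened_at >= self.recovery_timeout:
            self._state = "HALF_OPEN"
        return self._state

    def record_success(self) -> None:
        # Any success closes the circuit and clears the failure count
        self._state = "CLOSED"
        self._failures = 0

    def record_failure(self) -> None:
        state = self.state
        if state == "OPEN":
            return  # Already isolated
        self._failures += 1
        # A failed probe in HALF_OPEN reopens immediately
        if state == "HALF_OPEN" or self._failures >= self.failure_threshold:
            self._state = "OPEN"
            self._opened_at = self._clock()

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            raise RuntimeError("circuit open")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.record_failure()
            raise
        self.record_success()
        return result


def flaky():
    raise ValueError("boom")


now = [0.0]  # Fake clock so the demo does not need to sleep
breaker = TinyBreaker(failure_threshold=3, recovery_timeout=30.0, clock=lambda: now[0])

for _ in range(3):
    try:
        breaker.call(flaky)
    except ValueError:
        pass

states = [breaker.state]       # After three consecutive failures
now[0] = 31.0
states.append(breaker.state)   # After the recovery timeout
breaker.call(lambda: "ok")
states.append(breaker.state)   # After a successful probe
print(" -> ".join(states))
```

Injecting the clock keeps the sketch testable without real waiting; a production breaker would also need locking if it is shared across threads.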

Setup and Usage

from agent_sre.cascade.circuit_breaker import (
    CircuitBreaker,
    CircuitBreakerConfig,
    CircuitOpenError,
)

# Open after 3 failures, test recovery after 30s
config = CircuitBreakerConfig(
    failure_threshold=3,
    recovery_timeout_seconds=30.0,
    half_open_max_calls=1,
)

breaker = CircuitBreaker(agent_id="data-analyst", config=config)

def run_agent_task(task: dict) -> str:
    """Your agent's main function."""
    # ... agent logic ...
    return "result"

# The circuit breaker wraps the call
try:
    result = breaker.call(run_agent_task, {"query": "revenue Q3"})
    print(f"Result: {result}")
except CircuitOpenError as e:
    print(f"Agent isolated: {e}")
    print(f"Retry after: {e.retry_after:.0f}s")

Manual Control

# Manual success/failure recording
task = {"query": "revenue Q3"}
try:
    result = run_agent_task(task)
    breaker.record_success()
except Exception:
    breaker.record_failure()
    raise

# Check state
print(f"State: {breaker.state}")            # CLOSED, OPEN, or HALF_OPEN
print(f"Failures: {breaker.failure_count}")

# Manual reset after deploying a fix
breaker.reset()

Fleet Management

class AgentFleetBreakers:
    """Manage circuit breakers for a fleet of agents."""

    def __init__(self, config: CircuitBreakerConfig | None = None):
        self._config = config or CircuitBreakerConfig()
        self._breakers: dict[str, CircuitBreaker] = {}

    def get(self, agent_id: str) -> CircuitBreaker:
        if agent_id not in self._breakers:
            self._breakers[agent_id] = CircuitBreaker(agent_id, self._config)
        return self._breakers[agent_id]

    def open_circuits(self) -> list[str]:
        return [aid for aid, cb in self._breakers.items() if cb.state == "OPEN"]

fleet = AgentFleetBreakers(
    config=CircuitBreakerConfig(failure_threshold=5, recovery_timeout_seconds=60.0),
)

Rogue Agent Detection

The RogueAgentDetector (OWASP ASI-10) combines three signals to flag compromised or malfunctioning agents:
  1. Tool-call frequency — z-score spike detection over a sliding window
  2. Action entropy — flags both suspiciously repetitive and chaotic behavior
  3. Capability violations — tools used outside the agent’s allowed profile
import time

from agent_sre.anomaly import RogueAgentDetector, RogueDetectorConfig, RiskLevel

config = RogueDetectorConfig(
    frequency_window_seconds=60.0,
    frequency_z_threshold=2.5,
    entropy_low_threshold=0.3,     # Too repetitive (possible loop)
    entropy_high_threshold=3.5,    # Too chaotic (possible compromise)
    quarantine_risk_level=RiskLevel.HIGH,
)

detector = RogueAgentDetector(config=config)

# Register allowed tools per agent
detector.register_capability_profile(
    agent_id="support-agent",
    allowed_tools=["search_kb", "create_ticket", "send_email"],
)

# Agent starts using unauthorized tools rapidly
for i in range(50):
    detector.record_action(
        agent_id="support-agent",
        action="exfiltrate",
        tool_name="shell_exec",       # Not in allowed tools!
        timestamp=time.time() + i * 0.5,
    )

assessment = detector.assess("support-agent")
print(f"Risk: {assessment.risk_level.value}")        # "high" or "critical"
print(f"Quarantine? {assessment.quarantine_recommended}")  # True
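
To make the first two signals concrete, here is a minimal sketch of a z-score over per-window call counts and Shannon entropy over action names. The helper names `z_score` and `action_entropy` are hypothetical, and the source does not state which formulas or units `RogueAgentDetector` actually uses, so treat the numbers as illustrative rather than comparable to the thresholds in `RogueDetectorConfig`.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def z_score(current: float, history: list) -> float:
    """How many standard deviations `current` sits above the historical mean."""
    sigma = pstdev(history)
    if sigma == 0:
        return 0.0 if current == mean(history) else math.inf
    return (current - mean(history)) / sigma


def action_entropy(actions: list) -> float:
    """Shannon entropy, in bits, of the distribution of action names."""
    counts = Counter(actions)
    total = len(actions)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


# Calls per window during normal operation, then a sudden spike
baseline = [4, 5, 6, 5, 4, 6, 5]
spike = z_score(50, baseline)

looping = action_entropy(["exfiltrate"] * 50)                     # One action repeated
varied = action_entropy(["search", "read", "write", "test"] * 5)  # Four actions, evenly used

print(f"z-score of spike: {spike:.1f}")
print(f"entropy (looping): {looping:.2f} bits")
print(f"entropy (varied): {varied:.2f} bits")
```

A single repeated action has zero entropy, while four evenly used actions give 2 bits. If the library measures entropy the same way, the 50 identical `exfiltrate` calls above would fall under a low threshold of 0.3.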

Chaos Testing

Inject faults into your agent pipeline to verify resilience before production incidents find the gaps.
from agent_sre.chaos import (
    ChaosExperiment,
    Fault,
    FaultType,
    AbortCondition,
)

# Create faults to inject
faults = [
    Fault.latency_injection("openai-api", delay_ms=5000, rate=0.3),
    Fault.error_injection("search_tool", error="timeout", rate=0.1),
    Fault.timeout_injection("database", delay_ms=30000, rate=0.05),
]

# Safety: abort if success rate drops below 50%
abort_conditions = [
    AbortCondition(metric="success_rate", threshold=0.5, comparator="lte"),
]

experiment = ChaosExperiment(
    name="llm-latency-resilience",
    target_agent="code-reviewer",
    faults=faults,
    duration_seconds=1800,       # 30 minutes
    abort_conditions=abort_conditions,
    blast_radius=0.3,            # Affect 30% of traffic
    description="Verify code-reviewer handles LLM latency gracefully",
)

experiment.start()

# Periodically check abort conditions
metrics = {"success_rate": 0.85, "latency_p99": 8500}
if experiment.check_abort(metrics):
    print(f"Aborted: {experiment.abort_reason}")
else:
    score = experiment.calculate_resilience(
        baseline_success_rate=0.99,
        experiment_success_rate=0.85,
    )
    experiment.complete(resilience=score)
    # Only report a score when the experiment ran to completion
    print(f"Resilience: {experiment.resilience.overall:.0f}/100")
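
The abort check itself reduces to comparing reported metrics against thresholds. The sketch below shows one plausible way to evaluate conditions; `Condition`, `COMPARATORS`, and `first_tripped` are hypothetical names, and only the `"lte"` comparator appears in the example above, so the other comparator strings are assumptions.

```python
import operator
from dataclasses import dataclass
from typing import Optional

# Assumed comparator names; only "lte" is confirmed by the example above
COMPARATORS = {
    "lt": operator.lt,
    "lte": operator.le,
    "gt": operator.gt,
    "gte": operator.ge,
}


@dataclass(frozen=True)
class Condition:
    """Hypothetical stand-in for AbortCondition."""
    metric: str
    threshold: float
    comparator: str


def first_tripped(metrics: dict, conditions: list) -> Optional[Condition]:
    """Return the first condition whose comparison holds, or None."""
    for cond in conditions:
        value = metrics.get(cond.metric)
        if value is None:
            continue  # Metric not reported this cycle; skip rather than abort
        if COMPARATORS[cond.comparator](value, cond.threshold):
            return cond
    return None


conditions = [Condition("success_rate", 0.5, "lte")]
healthy = first_tripped({"success_rate": 0.85, "latency_p99": 8500}, conditions)
degraded = first_tripped({"success_rate": 0.42}, conditions)

print(f"healthy run tripped: {healthy}")
print(f"degraded run tripped: {degraded.metric if degraded else None}")
```

Note that `"lte"` is inclusive: in this sketch a success rate of exactly 0.5 also aborts, which is slightly stricter than "drops below 50%".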

Adversarial Chaos Testing

Test security boundaries with security-focused fault types:
security_faults = [
    Fault.prompt_injection("code-reviewer", technique="direct_override"),
    Fault.privilege_escalation("code-reviewer", target_role="admin"),
    Fault.tool_abuse("code-reviewer", tool_name="shell_exec"),
]

security_experiment = ChaosExperiment(
    name="security-boundary-test",
    target_agent="code-reviewer",
    faults=security_faults,
    duration_seconds=600,
    description="Verify agent rejects adversarial inputs",
)

Cost Controls

Prevent runaway spending with per-task budgets, auto-throttle, and a kill-switch.
from agent_sre.cost import CostGuard, BudgetAction

guard = CostGuard(
    per_task_limit=2.00,          # Max $2 per task
    per_agent_daily_limit=50.00,  # Max $50/day per agent
    org_monthly_budget=5000.00,   # Org-wide cap
    auto_throttle=True,           # Throttle at 85% daily budget
    kill_switch_threshold=0.95,   # Kill agent at 95% daily budget
    anomaly_detection=True,       # Detect cost spikes
)

# Pre-flight check
allowed, reason = guard.check_task("research-agent", estimated_cost=1.50)

# Record actual cost after task
alerts = guard.record_cost(
    agent_id="research-agent",
    task_id="task-001",
    cost_usd=0.45,
    breakdown={"input_tokens": 0.15, "output_tokens": 0.25, "tool_calls": 0.05},
)

for alert in alerts:
    if alert.action == BudgetAction.KILL:
        print("🛑 Agent killed — stop all tasks immediately")
    elif alert.action == BudgetAction.THROTTLE:
        print("⚠ Agent throttled — reduce task rate")
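
The throttle and kill thresholds can be read as bands of daily budget utilisation. This is a minimal sketch under stated assumptions: `Action` and `budget_action` are hypothetical, the 85% throttle point is taken from the comment in the example above, and whether agent-sre treats the thresholds as inclusive is not stated in the source.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical stand-in for BudgetAction."""
    ALLOW = "allow"
    THROTTLE = "throttle"
    KILL = "kill"


def budget_action(
    spent_today: float,
    daily_limit: float,
    throttle_at: float = 0.85,
    kill_at: float = 0.95,
) -> Action:
    """Map daily budget utilisation to an action, checking the most severe first."""
    utilisation = spent_today / daily_limit  # Assumes daily_limit > 0
    if utilisation >= kill_at:
        return Action.KILL
    if utilisation >= throttle_at:
        return Action.THROTTLE
    return Action.ALLOW


# Three spend levels against the $50/day limit from the example above
for spent in (10.00, 43.00, 48.00):
    print(f"${spent:.2f} of $50.00 -> {budget_action(spent, 50.00).value}")
```

Checking the kill threshold before the throttle threshold ensures an agent at 96% utilisation is stopped rather than merely slowed.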

Production SRE Pipeline

Combine all components into a production-ready pipeline:
"""Production SRE pipeline for AI agents."""

from agent_sre.anomaly import RogueAgentDetector, RogueDetectorConfig, RiskLevel
from agent_sre.cascade.circuit_breaker import CircuitBreaker, CircuitBreakerConfig, CircuitOpenError
from agent_sre.slo import SLI, SLIValue, SLO, ErrorBudget
from agent_sre.slo.objectives import ExhaustionAction
from agent_sre.cost import CostGuard, BudgetAction

AGENT_ID = "production-agent"

# Configure all components
rogue_detector = RogueAgentDetector(
    config=RogueDetectorConfig(
        frequency_z_threshold=3.0,
        quarantine_risk_level=RiskLevel.HIGH,
    ),
)
rogue_detector.register_capability_profile(
    AGENT_ID,
    allowed_tools=["search", "read_file", "write_file", "run_tests"],
)

breaker = CircuitBreaker(
    agent_id=AGENT_ID,
    config=CircuitBreakerConfig(
        failure_threshold=5,
        recovery_timeout_seconds=60.0,
    ),
)

class SuccessRateSLI(SLI):
    def collect(self) -> SLIValue:
        values = self.values_in_window()
        if not values:
            return self.record(1.0)
        good = sum(1 for v in values if v.is_good)
        return self.record(good / len(values))

slo = SLO(
    name=f"{AGENT_ID}-reliability",
    indicators=[SuccessRateSLI(name="success_rate", target=0.995, window="24h")],
    error_budget=ErrorBudget(
        total=0.005,
        exhaustion_action=ExhaustionAction.CIRCUIT_BREAK,
    ),
    agent_id=AGENT_ID,
)

cost_guard = CostGuard(
    per_task_limit=2.00,
    per_agent_daily_limit=100.00,
    auto_throttle=True,
    kill_switch_threshold=0.95,
)

The pipeline gives you:

| Layer | Component | Protection |
|---|---|---|
| Pre-flight | CostGuard.check_task | Blocks tasks that would exceed budget |
| Pre-flight | RogueAgentDetector.assess | Quarantines compromised agents |
| Execution | CircuitBreaker.call | Isolates failing agents |
| Post-flight | ErrorBudget.record_event | Tracks reliability over time |
| Post-flight | CostGuard.record_cost | Detects cost anomalies, auto-throttles |
| Post-flight | RogueAgentDetector.record_action | Builds behavioral baseline |
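
The three layers imply an ordering: checks first, then execution, then bookkeeping regardless of outcome. The sketch below shows that control flow with plain callables so it runs without agent-sre installed. `run_guarded` and the stub functions are hypothetical; in a real pipeline the checks would wrap `CostGuard.check_task` and `RogueAgentDetector.assess`, `execute` would go through `CircuitBreaker.call`, and the hooks would wrap the post-flight recorders.

```python
from typing import Callable, Optional


def run_guarded(
    task: dict,
    preflight: list,
    execute: Callable[[dict], str],
    postflight: list,
) -> Optional[str]:
    """Run pre-flight checks, then the task, then post-flight hooks."""
    for check in preflight:
        allowed, reason = check(task)
        if not allowed:
            print(f"blocked: {reason}")
            return None  # Blocked tasks never execute, so no hooks fire

    succeeded = False
    try:
        result = execute(task)
        succeeded = True
        return result
    finally:
        # Hooks run on success and on failure so budgets and baselines stay accurate
        for hook in postflight:
            hook(task, succeeded)


events = []


def under_budget(task: dict):
    """Stub pre-flight check returning (allowed, reason)."""
    return task["estimated_cost"] <= 2.00, "per-task limit exceeded"


def record_outcome(task: dict, succeeded: bool) -> None:
    """Stub post-flight hook that logs the outcome."""
    events.append(f"{task['id']}:{'ok' if succeeded else 'failed'}")


ok = run_guarded({"id": "t1", "estimated_cost": 1.50}, [under_budget], lambda t: "done", [record_outcome])
blocked = run_guarded({"id": "t2", "estimated_cost": 9.00}, [under_budget], lambda t: "done", [record_outcome])
print(ok, blocked, events)
```

Running the hooks in a `finally` block is a deliberate choice: a task that raises still counts against the error budget and still contributes to the behavioral baseline.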

Progressive Delivery and Replay Debugging

Progressive Delivery

Use agent_sre.delivery.BlueGreenManager to safely roll out new agent versions with validation and auto-rollback. Canary releases let you test policy updates on a slice of traffic before full deployment.
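
One common way to route "a slice of traffic" to a canary is deterministic hash bucketing, sketched below. This illustrates the general technique only: `in_canary` is a hypothetical helper, and the source does not describe how `BlueGreenManager` actually splits traffic.

```python
import hashlib


def in_canary(key: str, percent: float) -> bool:
    """Deterministically assign `key` to the canary slice (percent in 0..100)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # Bucket in 0..9999
    return bucket < percent * 100


# Route roughly 10% of sessions to the new agent version
sessions = [f"session-{i}" for i in range(10_000)]
share = sum(in_canary(s, 10.0) for s in sessions) / len(sessions)
print(f"canary share: {share:.1%}")
```

Hashing a stable key such as a session ID means the same session always lands on the same version, which keeps multi-step agent tasks from switching versions midway.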

Replay Debugging

agent_sre includes deterministic replay of agent sessions for post-incident analysis. Traces capture sufficient detail to allow re-execution and regression detection against golden traces.

Alerting

Connect agent_sre.alerts.AlertManager to your notification system (PagerDuty, Slack, Teams) for burn rate alerts, circuit breaker state transitions, and cost anomalies.

Scheduled Chaos

Use agent_sre.chaos.ChaosScheduler for recurring resilience tests with blackout windows. Run weekly to catch regressions before production traffic does.
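
A blackout window is a period during which scheduled experiments are skipped. The sketch below shows one way such a check could work for daily windows, including windows that wrap past midnight. `in_blackout` and the window format are hypothetical; the source does not document how `ChaosScheduler` represents blackout windows.

```python
from datetime import datetime, time


def in_blackout(now: datetime, windows: list) -> bool:
    """True if `now` falls inside any daily (start, end) blackout window."""
    t = now.time()
    for start, end in windows:
        if start <= end:
            if start <= t < end:
                return True
        elif t >= start or t < end:  # Window wraps past midnight
            return True
    return False


# Hypothetical policy: no chaos during business hours or around midnight
windows = [(time(9, 0), time(17, 0)), (time(23, 0), time(1, 0))]

for stamp in ("2024-06-03 10:30", "2024-06-03 20:00", "2024-06-04 00:15"):
    now = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
    print(stamp, "skip" if in_blackout(now, windows) else "run")
```

The sketch uses naive datetimes for brevity; a real scheduler would need to fix a time zone so that windows do not shift with the host's locale.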
