Agent SRE: SLOs, Circuit Breakers, and Chaos Testing

Agent SRE brings Site Reliability Engineering disciplines to autonomous AI agents — treating agents like services with measurable, enforceable reliability targets. Traditional services fail in predictable ways (timeouts, crashes). Agents add new failure modes: infinite tool-call loops that burn tokens, silent behavioral drift after model updates, runaway cost from a single task, and cascading failures when one agent’s retry storm overwhelms downstream agents. The agent-sre package gives you the building blocks to detect, contain, and recover from all of these.

pip install agent-sre

Why SRE for AI Agents?

Failure Mode	Traditional Service	AI Agent
Runaway loops	Process hangs	Agent calls tools in an infinite loop, burning tokens
Behavioral drift	Bug in new deploy	Model update changes decision patterns silently
Cost explosion	Resource leak	Single task consumes $500 in API calls
Rogue behavior	Compromised service	Agent uses unauthorized tools or exfiltrates data
Cascading failure	Service dependency down	Agent A fails → Agent B retries → Agent C overloaded

Combine Agent SRE monitoring with the AGT governance audit trail for complete observability. SRE tells you when an agent is failing; the audit log tells you what the agent was doing and which policy applied at the time of each failure.

SLO Engine

Define what “reliable” means for your agents with Service Level Objectives backed by error budgets and burn rate alerts.

Available SLI Types

SLI Type	What It Measures	Example Target
Latency	Task completion time	p99 < 10s
Error rate	Fraction of failed tasks	< 1%
Cost	Per-task spend	< $0.50/task
Token usage	Tokens per completion	< 4096
Hallucination	Factual accuracy score	> 95%
Tool success	Tool call success rate	> 99%
Human feedback	User satisfaction score	> 4.0/5.0

Define an SLI and SLO

from agent_sre.slo import SLI, SLIValue, SLO, ErrorBudget
from agent_sre.slo.objectives import BurnRateAlert, ExhaustionAction

class TaskSuccessRateSLI(SLI):
    """Tracks task success rate."""

    def collect(self) -> SLIValue:
        values = self.values_in_window()
        if not values:
            return self.record(1.0)
        good = sum(1 for v in values if v.is_good)
        return self.record(good / len(values))

# 99.5% success rate target over a 24h window
success_sli = TaskSuccessRateSLI(
    name="task_success_rate",
    target=0.995,
    window="24h",
)

slo = SLO(
    name="code-reviewer-reliability",
    indicators=[success_sli],
    error_budget=ErrorBudget(
        total=0.005,                               # 0.5% error budget (1 - 0.995)
        window_seconds=2_592_000,                   # 30-day window
        burn_rate_alert=2.0,                        # Warn at 2× burn rate
        burn_rate_critical=10.0,                    # Critical at 10× burn rate
        exhaustion_action=ExhaustionAction.THROTTLE, # Auto-throttle on exhaustion
    ),
    agent_id="code-reviewer",
)

Record Events and Check Status

# Record outcomes as they happen
for _ in range(95):
    slo.error_budget.record_event(good=True)

for _ in range(5):
    slo.error_budget.record_event(good=False)

# Check error budget
budget = slo.error_budget
print(f"Budget remaining: {budget.remaining_percent:.1f}%")
print(f"Exhausted: {budget.is_exhausted}")

# Check burn rate (are we burning budget too fast?)
burn_rate = budget.burn_rate(window_seconds=3600)  # Last hour
print(f"1h burn rate: {burn_rate:.2f}x")

# Check for firing alerts
for alert in budget.firing_alerts():
    print(f"🔥 {alert.name}: burn rate {burn_rate:.1f}x (threshold: {alert.rate}x)")
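To make the numbers above concrete, here is a minimal sketch of the conventional SRE arithmetic for error budgets and burn rate. It assumes agent_sre computes these values in the standard way, which the text above does not confirm. The helper names `burn_rate` and `budget_remaining` are hypothetical and not part of the package.

```python
def burn_rate(good: int, bad: int, budget_fraction: float) -> float:
    """Observed error rate divided by the error rate the budget allows.

    1.0 means the budget is being spent exactly on schedule; anything above
    1.0 means it will run out before the window ends.
    """
    total = good + bad
    if total == 0:
        return 0.0
    return (bad / total) / budget_fraction


def budget_remaining(good: int, bad: int, budget_fraction: float) -> float:
    """Fraction of the error budget still unspent, floored at 0.0."""
    total = good + bad
    if total == 0:
        return 1.0
    allowed_failures = total * budget_fraction
    return max(0.0, 1.0 - bad / allowed_failures)


# Same numbers as the example above: 95 good events, 5 bad, 0.5% budget.
rate = burn_rate(good=95, bad=5, budget_fraction=0.005)
remaining = budget_remaining(good=95, bad=5, budget_fraction=0.005)
print(f"burn rate: {rate:.1f}x, budget remaining: {remaining:.0%}")
```

Under this definition, five failures in 100 events against a 0.5% budget is a 10x burn rate, which reaches the level configured as critical in the SLO above and leaves no budget remaining.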

Circuit Breakers

When an agent starts failing, you don’t want it to keep hammering downstream services. The circuit breaker isolates failing agents automatically.

CLOSED ──(failures >= threshold)──→ OPEN ──(timeout elapsed)──→ HALF_OPEN
  ↑                                                                │
  └──────────(success)────────────────────────────────────────────←─┘
                                             (failure) ──→ OPEN
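The transitions in the diagram can be sketched as a small state machine. This is an illustration of the general pattern, not the agent_sre implementation: the class `TinyBreaker`, its injectable `clock` parameter, and the lazy OPEN to HALF_OPEN transition are all assumptions made for the example.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class TinyBreaker:
    """Illustrative three-state breaker (hypothetical, not the library class)."""

    def __init__(self, failure_threshold: int = 3,
                 recovery_timeout: float = 30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock  # injectable so the demo does not need to sleep
        self._failures = 0
        self._opened_at = 0.0
        self._state = State.CLOSED

    @property
    def state(self) -> State:
        # OPEN becomes HALF_OPEN lazily, once the recovery timeout has elapsed.
        if (self._state is State.OPEN
                and self._clock() - self._opened_at >= self.recovery_timeout):
            self._state = State.HALF_OPEN
        return self._state

    def record_success(self) -> None:
        # Any success, including a HALF_OPEN probe, closes the circuit.
        self._failures = 0
        self._state = State.CLOSED

    def record_failure(self) -> None:
        current = self.state
        if current is State.OPEN:
            return  # calls are rejected while open, so there is nothing to count
        self._failures += 1
        # A failed HALF_OPEN probe reopens at once; otherwise wait for the threshold.
        if current is State.HALF_OPEN or self._failures >= self.failure_threshold:
            self._state = State.OPEN
            self._opened_at = self._clock()


# Drive the breaker with a fake clock to walk through the transitions.
now = [0.0]
b = TinyBreaker(failure_threshold=3, recovery_timeout=30.0, clock=lambda: now[0])
for _ in range(3):
    b.record_failure()
print(b.state.name)   # OPEN
now[0] = 30.0
print(b.state.name)   # HALF_OPEN
b.record_success()
print(b.state.name)   # CLOSED
```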

Setup and Usage

from agent_sre.cascade.circuit_breaker import (
    CircuitBreaker,
    CircuitBreakerConfig,
    CircuitOpenError,
)

# Open after 3 failures, test recovery after 30s
config = CircuitBreakerConfig(
    failure_threshold=3,
    recovery_timeout_seconds=30.0,
    half_open_max_calls=1,
)

breaker = CircuitBreaker(agent_id="data-analyst", config=config)

def run_agent_task(task: dict) -> str:
    """Your agent's main function."""
    # ... agent logic ...
    return "result"

# The circuit breaker wraps the call
try:
    result = breaker.call(run_agent_task, {"query": "revenue Q3"})
    print(f"Result: {result}")
except CircuitOpenError as e:
    print(f"Agent isolated: {e}")
    print(f"Retry after: {e.retry_after:.0f}s")

Manual Control

# Manual success/failure recording
task = {"query": "revenue Q3"}
try:
    result = run_agent_task(task)
    breaker.record_success()
except Exception:
    breaker.record_failure()
    raise

# Check state
print(f"State: {breaker.state}")            # CLOSED, OPEN, or HALF_OPEN
print(f"Failures: {breaker.failure_count}")

# Manual reset after deploying a fix
breaker.reset()

Fleet Management

class AgentFleetBreakers:
    """Manage circuit breakers for a fleet of agents."""

    def __init__(self, config: CircuitBreakerConfig | None = None):
        self._config = config or CircuitBreakerConfig()
        self._breakers: dict[str, CircuitBreaker] = {}

    def get(self, agent_id: str) -> CircuitBreaker:
        if agent_id not in self._breakers:
            # Keyword arguments match the constructor call shown in Setup and Usage
            self._breakers[agent_id] = CircuitBreaker(agent_id=agent_id, config=self._config)
        return self._breakers[agent_id]

    def open_circuits(self) -> list[str]:
        return [aid for aid, cb in self._breakers.items() if cb.state == "OPEN"]

fleet = AgentFleetBreakers(
    config=CircuitBreakerConfig(failure_threshold=5, recovery_timeout_seconds=60.0),
)

Rogue Agent Detection

The RogueAgentDetector (OWASP ASI-10) combines three signals to flag compromised or malfunctioning agents:

Tool-call frequency — z-score spike detection over a sliding window
Action entropy — flags both suspiciously repetitive and chaotic behavior
Capability violations — tools used outside the agent’s allowed profile

import time

from agent_sre.anomaly import RogueAgentDetector, RogueDetectorConfig, RiskLevel

config = RogueDetectorConfig(
    frequency_window_seconds=60.0,
    frequency_z_threshold=2.5,
    entropy_low_threshold=0.3,     # Too repetitive (possible loop)
    entropy_high_threshold=3.5,    # Too chaotic (possible compromise)
    quarantine_risk_level=RiskLevel.HIGH,
)

detector = RogueAgentDetector(config=config)

# Register allowed tools per agent
detector.register_capability_profile(
    agent_id="support-agent",
    allowed_tools=["search_kb", "create_ticket", "send_email"],
)

# Agent starts using unauthorized tools rapidly
for i in range(50):
    detector.record_action(
        agent_id="support-agent",
        action="exfiltrate",
        tool_name="shell_exec",       # Not in allowed tools!
        timestamp=time.time() + i * 0.5,
    )

assessment = detector.assess("support-agent")
print(f"Risk: {assessment.risk_level.value}")        # "high" or "critical"
print(f"Quarantine? {assessment.quarantine_recommended}")  # True
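The frequency and entropy signals described above can be illustrated with standard formulas. The sketch below assumes Shannon entropy in bits over action names and a z-score against a population standard deviation; whether RogueAgentDetector uses exactly these definitions is not stated here. The helper names `action_entropy` and `z_score` are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def action_entropy(actions: list[str]) -> float:
    """Shannon entropy (in bits) of the distribution of action names."""
    if not actions:
        return 0.0
    n = len(actions)
    counts = Counter(actions)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def z_score(history: list[float], current: float) -> float:
    """Standard deviations between `current` and the historical mean."""
    if len(history) < 2:
        return 0.0
    sigma = pstdev(history)
    if sigma == 0:
        return 0.0  # flat history gives no scale to compare against
    return (current - mean(history)) / sigma


# One tool called over and over (a possible loop) vs. a normal mix of tools.
looping = ["shell_exec"] * 50
varied = ["search_kb", "create_ticket", "send_email", "search_kb"] * 10
print(f"looping entropy: {action_entropy(looping):.2f} bits")
print(f"varied entropy:  {action_entropy(varied):.2f} bits")

# Tool calls per minute over recent windows, then a sudden spike to 50.
calls_per_minute = [10, 12, 11, 9, 10]
print(f"z-score for 50 calls/min: {z_score(calls_per_minute, 50):.1f}")
```

A single repeated action has zero entropy, which falls under a low threshold such as 0.3, while the spike to 50 calls per minute sits far above a z-score threshold of 2.5.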

Chaos Testing

Inject faults into your agent pipeline to verify resilience before production incidents find the gaps.

from agent_sre.chaos import (
    ChaosExperiment,
    Fault,
    FaultType,
    AbortCondition,
)

# Create faults to inject
faults = [
    Fault.latency_injection("openai-api", delay_ms=5000, rate=0.3),
    Fault.error_injection("search_tool", error="timeout", rate=0.1),
    Fault.timeout_injection("database", delay_ms=30000, rate=0.05),
]

# Safety: abort if success rate drops to 50% or lower
abort_conditions = [
    AbortCondition(metric="success_rate", threshold=0.5, comparator="lte"),
]

experiment = ChaosExperiment(
    name="llm-latency-resilience",
    target_agent="code-reviewer",
    faults=faults,
    duration_seconds=1800,       # 30 minutes
    abort_conditions=abort_conditions,
    blast_radius=0.3,            # Affect 30% of traffic
    description="Verify code-reviewer handles LLM latency gracefully",
)

experiment.start()

# Periodically check abort conditions
metrics = {"success_rate": 0.85, "latency_p99": 8500}
if experiment.check_abort(metrics):
    print(f"Aborted: {experiment.abort_reason}")
else:
    score = experiment.calculate_resilience(
        baseline_success_rate=0.99,
        experiment_success_rate=0.85,
    )
    experiment.complete(resilience=score)
    # Report the score only for runs that completed without aborting
    print(f"Resilience: {experiment.resilience.overall:.0f}/100")
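To show what a `rate` parameter on a fault typically means, here is a minimal sketch of probabilistic error injection around a tool call. It does not use agent_sre: the wrapper `with_error_injection`, the exception `InjectedFault`, and the stand-in `search_tool` are hypothetical names, and how the package injects faults internally is not covered here.

```python
import random
from typing import Callable


class InjectedFault(RuntimeError):
    """Raised in place of the real call when a fault is injected."""


def with_error_injection(fn: Callable[..., str], rate: float,
                         rng: random.Random) -> Callable[..., str]:
    """Wrap `fn` so that roughly `rate` of calls fail before reaching it."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")

    def wrapper(*args, **kwargs):
        # rng.random() is uniform in [0, 1), so this fires with probability `rate`.
        if rng.random() < rate:
            raise InjectedFault("injected timeout")
        return fn(*args, **kwargs)

    return wrapper


def search_tool(query: str) -> str:
    """Stand-in for a real tool call."""
    return f"results for {query}"


# A seeded generator keeps the experiment reproducible between runs.
flaky_search = with_error_injection(search_tool, rate=0.1, rng=random.Random(42))

failures = 0
for _ in range(1000):
    try:
        flaky_search("revenue Q3")
    except InjectedFault:
        failures += 1
print(f"observed success rate: {1 - failures / 1000:.1%}")
```

The observed success rate from a loop like this is the kind of value you would pass to `check_abort` and compare against a baseline.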

Adversarial Chaos Testing

Test security boundaries with security-focused fault types:

security_faults = [
    Fault.prompt_injection("code-reviewer", technique="direct_override"),
    Fault.privilege_escalation("code-reviewer", target_role="admin"),
    Fault.tool_abuse("code-reviewer", tool_name="shell_exec"),
]

security_experiment = ChaosExperiment(
    name="security-boundary-test",
    target_agent="code-reviewer",
    faults=security_faults,
    duration_seconds=600,
    description="Verify agent rejects adversarial inputs",
)

Cost Controls

Prevent runaway spending with per-task budgets, auto-throttle, and a kill-switch.

from agent_sre.cost import CostGuard, BudgetAction

guard = CostGuard(
    per_task_limit=2.00,          # Max $2 per task
    per_agent_daily_limit=50.00,  # Max $50/day per agent
    org_monthly_budget=5000.00,   # Org-wide cap
    auto_throttle=True,           # Throttle at 85% daily budget
    kill_switch_threshold=0.95,   # Kill agent at 95% daily budget
    anomaly_detection=True,       # Detect cost spikes
)

# Pre-flight check: skip the task if it would exceed a budget
allowed, reason = guard.check_task("research-agent", estimated_cost=1.50)
if not allowed:
    print(f"Task blocked: {reason}")

# Record actual cost after task
alerts = guard.record_cost(
    agent_id="research-agent",
    task_id="task-001",
    cost_usd=0.45,
    breakdown={"input_tokens": 0.15, "output_tokens": 0.25, "tool_calls": 0.05},
)

for alert in alerts:
    if alert.action == BudgetAction.KILL:
        print("🛑 Agent killed — stop all tasks immediately")
    elif alert.action == BudgetAction.THROTTLE:
        print("⚠ Agent throttled — reduce task rate")
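The throttle and kill thresholds can be read as a simple mapping from the share of the daily budget already spent to an action. The sketch below is an assumption about that logic, using the 85% and 95% figures from the comments above; the function `budget_action` and the `Action` enum are hypothetical and separate from the package's `BudgetAction`.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    THROTTLE = "throttle"
    KILL = "kill"


def budget_action(spent_today: float, daily_limit: float,
                  throttle_at: float = 0.85, kill_at: float = 0.95) -> Action:
    """Map the fraction of the daily budget already spent to an action."""
    if daily_limit <= 0:
        raise ValueError("daily_limit must be positive")
    used = spent_today / daily_limit
    # Check the stricter threshold first so KILL wins over THROTTLE.
    if used >= kill_at:
        return Action.KILL
    if used >= throttle_at:
        return Action.THROTTLE
    return Action.ALLOW


# With a $50 daily limit: 20% spent, 86% spent, 96% spent.
for spent in (10.00, 43.00, 48.00):
    print(f"${spent:.2f} of $50.00 -> {budget_action(spent, 50.00).value}")
```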

Production SRE Pipeline

The snippet below configures every component for a single production agent. Call them in the order listed in the table that follows it:

"""Production SRE pipeline for AI agents."""

from agent_sre.anomaly import RogueAgentDetector, RogueDetectorConfig, RiskLevel
from agent_sre.cascade.circuit_breaker import CircuitBreaker, CircuitBreakerConfig, CircuitOpenError
from agent_sre.slo import SLI, SLIValue, SLO, ErrorBudget
from agent_sre.slo.objectives import ExhaustionAction
from agent_sre.cost import CostGuard, BudgetAction

AGENT_ID = "production-agent"

# Configure all components
rogue_detector = RogueAgentDetector(
    config=RogueDetectorConfig(
        frequency_z_threshold=3.0,
        quarantine_risk_level=RiskLevel.HIGH,
    ),
)
rogue_detector.register_capability_profile(
    AGENT_ID,
    allowed_tools=["search", "read_file", "write_file", "run_tests"],
)

breaker = CircuitBreaker(
    agent_id=AGENT_ID,
    config=CircuitBreakerConfig(
        failure_threshold=5,
        recovery_timeout_seconds=60.0,
    ),
)

class SuccessRateSLI(SLI):
    def collect(self) -> SLIValue:
        values = self.values_in_window()
        if not values:
            return self.record(1.0)
        good = sum(1 for v in values if v.is_good)
        return self.record(good / len(values))

slo = SLO(
    name=f"{AGENT_ID}-reliability",
    indicators=[SuccessRateSLI(name="success_rate", target=0.995, window="24h")],
    error_budget=ErrorBudget(
        total=0.005,
        exhaustion_action=ExhaustionAction.CIRCUIT_BREAK,
    ),
    agent_id=AGENT_ID,
)

cost_guard = CostGuard(
    per_task_limit=2.00,
    per_agent_daily_limit=100.00,
    auto_throttle=True,
    kill_switch_threshold=0.95,
)

The pipeline gives you:

Layer	Component	Protection
Pre-flight	`CostGuard.check_task`	Blocks tasks that would exceed budget
Pre-flight	`RogueAgentDetector.assess`	Quarantines compromised agents
Execution	`CircuitBreaker.call`	Isolates failing agents
Post-flight	`ErrorBudget.record_event`	Tracks reliability over time
Post-flight	`CostGuard.record_cost`	Detects cost anomalies, auto-throttles
Post-flight	`RogueAgentDetector.record_action`	Builds behavioral baseline
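One way to sequence the layers in the table is a single wrapper function around each task. The sketch below uses only the method names shown earlier (`check_task`, `assess`, `call`, `record_event`, `record_cost`, `record_action`), but runs against small stand-in classes so it is self-contained. The function `run_with_sre`, the stub classes, and the shape of the agent's result dictionary are assumptions for illustration.

```python
import time
from types import SimpleNamespace


def run_with_sre(task, *, agent_id, agent_fn, cost_guard, detector, breaker,
                 budget, estimated_cost):
    """Run one task through pre-flight, execution, and post-flight steps."""
    # Pre-flight: refuse work that would exceed budget or comes from a flagged agent.
    allowed, reason = cost_guard.check_task(agent_id, estimated_cost=estimated_cost)
    if not allowed:
        return {"status": "blocked", "reason": reason}
    if detector.assess(agent_id).quarantine_recommended:
        return {"status": "quarantined", "reason": "rogue-agent assessment"}

    # Execution: any exception, including an open circuit, counts as a bad event.
    try:
        result = breaker.call(agent_fn, task)
    except Exception as exc:
        budget.record_event(good=False)
        return {"status": "failed", "reason": str(exc)}

    # Post-flight: feed reliability, cost, and behavior signals back in.
    budget.record_event(good=True)
    alerts = cost_guard.record_cost(agent_id=agent_id, task_id=task["id"],
                                    cost_usd=result["cost_usd"],
                                    breakdown=result.get("breakdown", {}))
    for tool in result["tools_used"]:
        detector.record_action(agent_id=agent_id, action="tool_call",
                               tool_name=tool, timestamp=time.time())
    return {"status": "ok", "output": result["output"], "alerts": alerts}


# Stand-ins that mimic the method names above (not the real agent_sre classes).
class StubGuard:
    def check_task(self, agent_id, estimated_cost):
        ok = estimated_cost <= 2.00
        return ok, "" if ok else "over per-task limit"

    def record_cost(self, **kwargs):
        return []


class StubDetector:
    def assess(self, agent_id):
        return SimpleNamespace(quarantine_recommended=False)

    def record_action(self, **kwargs):
        pass


class StubBreaker:
    def call(self, fn, *args):
        return fn(*args)


class StubBudget:
    def __init__(self):
        self.events = []

    def record_event(self, good):
        self.events.append(good)


def agent_fn(task):
    """Hypothetical agent returning output, cost, and the tools it used."""
    return {"output": "done", "cost_usd": 0.45, "tools_used": ["run_tests"]}


budget = StubBudget()
outcome = run_with_sre({"id": "task-001"}, agent_id="production-agent",
                       agent_fn=agent_fn, cost_guard=StubGuard(),
                       detector=StubDetector(), breaker=StubBreaker(),
                       budget=budget, estimated_cost=1.50)
print(outcome["status"])
```

Counting an open circuit against the error budget is a design choice. You may prefer to catch `CircuitOpenError` separately so that rejected calls do not consume budget.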

Progressive Delivery and Replay Debugging

Progressive Delivery

Use agent_sre.delivery.BlueGreenManager to safely roll out new agent versions with validation and auto-rollback. Canary releases let you test policy updates on a slice of traffic before full deployment.
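A canary slice is commonly chosen by hashing a stable identifier so that the same task always lands on the same version. The sketch below illustrates that idea with the standard library only. It is not how BlueGreenManager is documented to work, and the function name `routes_to_canary` is hypothetical.

```python
import hashlib


def routes_to_canary(task_id: str, canary_fraction: float) -> bool:
    """Deterministically assign a task to the canary based on its ID."""
    if not 0.0 <= canary_fraction <= 1.0:
        raise ValueError("canary_fraction must be between 0 and 1")
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < canary_fraction


# Roughly 10% of task IDs should route to the new version.
ids = [f"task-{i}" for i in range(10_000)]
share = sum(routes_to_canary(t, 0.10) for t in ids) / len(ids)
print(f"canary share: {share:.1%}")
```

Because the assignment depends only on the ID, retries of the same task stay on the same version, which keeps comparisons between the old and new agent clean.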

Replay Debugging

agent_sre includes deterministic replay of agent sessions for post-incident analysis. Traces capture sufficient detail to allow re-execution and regression detection against golden traces.

Alerting

Connect agent_sre.alerts.AlertManager to your notification system (PagerDuty, Slack, Teams) for burn rate alerts, circuit breaker state transitions, and cost anomalies.

Scheduled Chaos

Use agent_sre.chaos.ChaosScheduler for recurring resilience tests with blackout windows. Run weekly to catch regressions before production traffic does.
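A blackout window is a period during which scheduled experiments must not start. As a rough sketch of that check, independent of the ChaosScheduler API (which is not shown here), the hypothetical helper `in_blackout` tests a timestamp against daily windows, including windows that wrap past midnight.

```python
from datetime import datetime, time as dtime


def in_blackout(now: datetime, windows: list[tuple[dtime, dtime]]) -> bool:
    """True if `now` falls inside any daily blackout window.

    A window whose end is earlier than its start wraps past midnight.
    """
    t = now.time()
    for start, end in windows:
        if start <= end:
            if start <= t < end:
                return True
        elif t >= start or t < end:
            return True
    return False


# Hypothetical policy: no chaos during business hours or a late-night batch run.
BLACKOUTS = [(dtime(9, 0), dtime(17, 0)), (dtime(23, 0), dtime(1, 0))]

print(in_blackout(datetime(2025, 1, 6, 10, 30), BLACKOUTS))  # True
print(in_blackout(datetime(2025, 1, 6, 20, 0), BLACKOUTS))   # False
print(in_blackout(datetime(2025, 1, 6, 0, 30), BLACKOUTS))   # True
```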
