Sentinel is composed of five tightly integrated layers: a Prometheus-based observability stack that detects incidents, a Loki log-collection pipeline that gathers evidence, a FastAPI backend that orchestrates everything, a LangGraph multi-agent engine that performs the actual triage reasoning, and a React dashboard that surfaces results and captures human decisions in real time. This page explains how those layers connect, what each component does, and where you can find the code that implements each part.

High-level flow

The sequence below shows every hop an incident takes from initial detection through post-mortem generation.
Docker / Podman / Kubernetes / PostgreSQL incident occurs
        |
cAdvisor / postgres-exporter detects it
        |
Prometheus fires alert -> Alertmanager -> POST /api/alerts
        |
Backend fetches logs from Loki, creates incident in Supabase
        |
LangGraph multi-agent engine (background):
  Guardrail Input  — sanitize logs, detect prompt injection
  Lab 1            — Alert Intake: classify incident type (GPT-4o-mini)
  Lab 2            — Investigation: specialist agent + RAG runbooks + tools
  Guardrail Output — validate analysis stays in DevOps domain
  Lab 3            — Decision & Planning: propose safe corrective action
        |
Engineer approves action in Dashboard
        |
  Lab 4            — Action & Verification: execute + verify recovery
  Lab 5            — Post-Incident: generate post-mortem, update episodic memory
        |
React Dashboard updates in real time via Supabase Realtime

Component breakdown

Observability layer

cAdvisor collects container metrics through the Docker socket and exposes them for Prometheus to scrape; postgres-exporter does the same for PostgreSQL databases. Prometheus evaluates the alert rules defined in prometheus/ and, when a threshold is breached, fires an alert to Alertmanager, which delivers it to the backend as a POST /api/alerts webhook. The Alertmanager configuration in alertmanager/alertmanager.yml carries a shared ALERT_WEBHOOK_SECRET so the backend can authenticate inbound alerts.
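The shared-secret check on inbound alerts could look roughly like the sketch below. It assumes the secret arrives as a bearer token in the Authorization header; the source does not say which mechanism Sentinel actually uses, and the function name is hypothetical.

```python
import hmac
from typing import Optional


def is_authorized(auth_header: Optional[str], expected_secret: str) -> bool:
    """Return True only if the header carries the expected bearer secret.

    Assumption: the secret is sent as "Authorization: Bearer <secret>".
    """
    if not auth_header or not expected_secret:
        return False
    scheme, _, token = auth_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest runs in constant time, so the comparison does not leak
    # how many leading characters of the token were correct.
    return hmac.compare_digest(token.encode(), expected_secret.encode())
```

A constant-time comparison is the usual choice for secrets like this, since a plain `==` can leak information through timing.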

Log collection

Promtail tails Docker container logs using the Docker socket and ships them to Loki using the configuration in promtail/. When the backend receives an alert, the alert_processor service queries Loki for the relevant container’s recent log lines. Those logs are attached to the Supabase incident record and forwarded to the guardrail pipeline before any LLM ever sees them.
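As a rough illustration of the Loki lookup, the sketch below builds a request URL for Loki's `query_range` HTTP endpoint. The `container` label name, the base URL, and the function name are assumptions for illustration; the real alert_processor may select logs differently.

```python
from urllib.parse import urlencode


def build_loki_query_url(base_url, container, start_ns, end_ns, limit=200):
    """Build a query_range URL for one container's recent log lines.

    Assumption: Promtail attaches a `container` label to each log stream.
    """
    # Escape double quotes so the container name cannot break out of the
    # LogQL label matcher.
    safe_name = container.replace('"', '\\"')
    params = {
        "query": '{container="%s"}' % safe_name,
        "start": str(start_ns),   # nanosecond Unix timestamps
        "end": str(end_ns),
        "limit": str(limit),
        "direction": "backward",  # newest lines first
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"
```

The function only constructs the URL; issuing the request and parsing the response are left out to keep the sketch free of network dependencies.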

Backend (FastAPI)

The FastAPI application in Backend/main.py exposes four routers:
| Router | Prefix | Responsibility |
| --- | --- | --- |
| incidents | /api/incidents | CRUD, status updates, export, post-mortem generation |
| alerts | /api/alerts | Alertmanager webhook: creates incident and spawns background triage |
| actions | /api/actions | Execute, reject, or postpone a proposed corrective action |
| health | /health | Integration health check for all downstream services |
On startup, a lifespan hook starts a recovery loop that rescues any incident records left in a non-terminal state by a previous crash, so no alert is silently dropped across backend restarts.
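One pass of such a recovery loop might select its candidates as in the sketch below. The set of in-flight statuses, the `updated_at` field, and the staleness threshold are all assumptions made for illustration; the source does not define which states count as recoverable.

```python
from datetime import datetime, timedelta, timezone

# Assumption: statuses in which the backend itself is doing work, so finding
# one after a restart suggests the run was interrupted.
IN_FLIGHT_STATUSES = frozenset({"detecting", "investigating"})


def find_stuck_incidents(incidents, now, stale_after=timedelta(minutes=5)):
    """Return incidents that look abandoned by a previous backend process."""
    return [
        inc
        for inc in incidents
        if inc["status"] in IN_FLIGHT_STATUSES
        and now - inc["updated_at"] >= stale_after  # hypothetical column
    ]
```

An incident in awaiting_approval is deliberately not treated as stuck here, because it is waiting on an engineer rather than on the backend.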

Agent pipeline (LangGraph)

The triage logic lives in Backend/services/agents/ and is orchestrated by supervisor.py. The supervisor implements the LangGraph pipeline as an explicit sequence of nodes, each mapped to a LangFuse span so every decision is fully traceable.
| Stage | Code location | What it does |
| --- | --- | --- |
| Guardrail Input | guardrail_graph.py, guardrails.check_input | Truncates logs to 4,000 characters, strips prompt-injection patterns, then passes the text to the LLM judge |
| Lab 1: Alert Intake | supervisor._classify | GPT-4o-mini classifies the incident into one of ten types: app_crash, oom, config_error, dependency_failure, memory_pressure, cpu_throttling, restart_loop, network_error, disk_pressure, or unknown |
| Routing | registry.find_agent_for | Selects the specialist agent (Docker / Podman / Kubernetes / PostgreSQL) based on incident labels |
| Lab 2: Investigation | agent.investigate | Specialist agent calls domain-specific tools, queries the runbooks-{domain} and incidents-{domain} ChromaDB collections, and produces a structured analysis |
| Guardrail Output | guardrail_graph.py, guardrails.check_analysis_output | Validates that the analysis is non-empty and on-topic for DevOps, then passes it to the LLM judge to catch semantic drift |
| Lab 3: Decision & Planning | supervisor._build_proposed_action | Generates a single safe, whitelisted command for human approval; the action is then re-validated against the _ALLOWED_ACTION_PATTERNS regex whitelist in guardrails.check_proposed_action |
| Lab 4: Action & Verification | routers/actions.py + services/verification.py | On engineer approval, executes the command and verifies recovery |
| Lab 5: Post-Incident | services/postmortem/ | Generates a structured post-mortem document; agent.remember_incident writes a summary to ChromaDB episodic memory |
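Because Lab 1 relies on an LLM to emit one of ten fixed labels, the raw output usually needs normalising before it is stored. The sketch below shows one way to do that; the function name and the normalisation rules are hypothetical and not taken from supervisor._classify.

```python
# The ten incident types listed for Lab 1.
INCIDENT_TYPES = frozenset({
    "app_crash", "oom", "config_error", "dependency_failure",
    "memory_pressure", "cpu_throttling", "restart_loop",
    "network_error", "disk_pressure", "unknown",
})


def normalize_incident_type(raw: str) -> str:
    """Map free-form model output onto a known type, else 'unknown'."""
    # Tolerate casing, surrounding whitespace, and space or hyphen separators.
    label = raw.strip().lower().replace("-", "_").replace(" ", "_")
    return label if label in INCIDENT_TYPES else "unknown"
```

Falling back to `unknown` keeps an unexpected model answer from introducing an eleventh, unrouted category.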

Guardrail design

Both guardrail stages run as compiled LangGraph state graphs (_build_graph() in guardrail_graph.py). Each graph has two nodes — a fast deterministic rules node and an LLM semantic judge node — connected by a conditional edge. If the rules node neutralises an obvious injection, the text still passes to the LLM judge in case a paraphrased attack slipped through. The action guardrail uses a strict regex whitelist that explicitly allows only:
  • docker restart <name> / docker logs <name>
  • podman restart <name> / podman logs <name>
  • kubectl rollout restart deployment/<name> [-n <namespace>]
  • kubectl delete pod <name> [-n <namespace>]
  • kubectl scale deployment/<name> --replicas=<0–10> [-n <namespace>]
  • pg_stat_activity, pg_cancel_backend, pg_terminate_backend against a named database
Any command containing shell metacharacters (;, &&, |, `, $(, >, <) is blocked unconditionally before the whitelist check.
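The two-step check described above (metacharacter block first, whitelist second) can be sketched as follows. The patterns are illustrative approximations written for this example, not the project's actual _ALLOWED_ACTION_PATTERNS, and the PostgreSQL entries are omitted for brevity.

```python
import re

# Tokens that are rejected before any pattern is consulted.
_SHELL_METACHARS = (";", "&&", "|", "`", "$(", ">", "<")

# Assumption: resource names are limited to these characters.
_NAME = r"[A-Za-z0-9][A-Za-z0-9_.-]*"
_NS = rf"(?: -n {_NAME})?"  # optional namespace flag

_ILLUSTRATIVE_PATTERNS = [
    re.compile(rf"(?:docker|podman) (?:restart|logs) {_NAME}"),
    re.compile(rf"kubectl rollout restart deployment/{_NAME}{_NS}"),
    re.compile(rf"kubectl delete pod {_NAME}{_NS}"),
    # Replica count is limited to the range 0 to 10.
    re.compile(rf"kubectl scale deployment/{_NAME} --replicas=(?:10|[0-9]){_NS}"),
]


def is_action_allowed(command: str) -> bool:
    """Block shell metacharacters first, then require a full whitelist match."""
    if any(token in command for token in _SHELL_METACHARS):
        return False
    # fullmatch ensures nothing may precede or follow the allowed command.
    return any(p.fullmatch(command) for p in _ILLUSTRATIVE_PATTERNS)
```

For example, `is_action_allowed("docker restart web")` passes, while `docker restart web; rm -rf /` is rejected at the metacharacter step and `rm -rf /` is rejected because it matches no pattern.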

Memory layer

ChromaDB stores two types of collections per domain:
  • runbooks-{domain} — curated human-written runbooks seeded via scripts/seed_*.py. The specialist agent retrieves the top-k most relevant runbooks (default k=3) for every investigation using memory/runbooks.py::query_runbooks.
  • incidents-{domain} — episodic memory written by agent.remember_incident after each resolved incident. Similar past incidents are retrieved during Lab 2 to surface recurring patterns.
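To make the top-k retrieval idea concrete, here is a plain-Python stand-in that ranks documents by cosine similarity. It is not ChromaDB's API and not the code in memory/runbooks.py; the helper names and the toy two-dimensional embeddings are assumptions for illustration only.

```python
import math


def collection_names(domain):
    """Per-domain collection names, following the naming scheme above."""
    return f"runbooks-{domain}", f"incidents-{domain}"


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, docs, k=3):
    """Return the ids of the k documents most similar to the query.

    `docs` is a list of (doc_id, embedding) pairs.
    """
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

A vector store performs the same ranking over real embeddings, typically with an index so that it does not need to score every document.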

Database

Supabase hosts two primary tables:
  • incidents — one row per alert, carrying status, severity, agent_reasoning, proposed_action, incident_type, and all timestamps.
  • incident_events — append-only event log tracking every status transition (e.g. detecting → investigating → analyzed → awaiting_approval → resolved).
Supabase Realtime pushes row-level changes to connected dashboard clients over WebSockets so the UI reflects pipeline progress without polling.
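The relationship between the two tables can be sketched in memory as below: each status change is validated, appended to an event list, and only then applied to the incident. The transition map covers just the example path shown above, and the event field names are hypothetical rather than the actual incident_events columns.

```python
from datetime import datetime, timezone

# Assumption: only the happy path from the example; rejection, postponement,
# and failure states are not modelled here.
ALLOWED_TRANSITIONS = {
    "detecting": {"investigating"},
    "investigating": {"analyzed"},
    "analyzed": {"awaiting_approval"},
    "awaiting_approval": {"resolved"},
    "resolved": set(),
}


def transition(incident, events, new_status):
    """Validate a status change, append an event, then update the incident."""
    current = incident["status"]
    if new_status not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {new_status}")
    # Events are only ever appended, never edited or removed.
    events.append({
        "incident_id": incident["id"],
        "from_status": current,
        "to_status": new_status,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    incident["status"] = new_status
```

Keeping the event log append-only means the full history of an incident can be replayed even though the incidents row holds only the latest status.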

Agent observability (LangFuse)

Every run_triage call creates a LangFuse trace. Each pipeline stage (guardrails, Lab 1–3) is recorded as a child span, and each GPT-4o-mini call is recorded as a generation with token usage and latency. If LangFuse is unreachable, the supervisor continues without tracing — it logs a warning and does not block triage.

Frontend (React 19)

The React dashboard subscribes to Supabase Realtime for live incident updates. Key UI components include:
  • IncidentCard — displays status, severity, and incident type at a glance.
  • ApprovalBanner — surfaces the proposed action with Approve / Reject / Postpone controls.
  • AgentReasoningPanel — renders the full Markdown reasoning produced by the supervisor, including tool calls and similar incidents.
  • AuthContext — manages Supabase JWT sessions; all API calls attach the JWT for FastAPI’s auth.get_current_user dependency.

Project structure

Sentinel-SoftServe/
├── docker-compose.yml
├── prometheus/                     # Alert rules + scraping config
├── alertmanager/                   # Webhook routing to backend
├── loki/ & promtail/               # Log collection config
├── Backend/
│   ├── main.py                     # FastAPI app + CORS + routers
│   ├── auth.py                     # JWT validation (ES256/HS256)
│   ├── routers/
│   │   ├── incidents.py            # Incident CRUD + export + post-mortem
│   │   ├── alerts.py               # Alertmanager webhook
│   │   ├── actions.py              # Execute / reject / postpone actions
│   │   └── health.py               # Integration health check
│   ├── services/
│   │   ├── alert_processor.py      # Alert -> Supabase incident
│   │   ├── verification.py         # Post-action verification
│   │   ├── incident_events.py      # Status transition events
│   │   ├── postmortem/             # Post-mortem generation
│   │   └── agents/
│   │       ├── supervisor.py       # Orchestrates classify -> route -> investigate -> persist
│   │       ├── guardrails.py       # Deterministic safety rules
│   │       ├── guardrail_graph.py  # LangGraph guardrail pipeline
│   │       ├── llm_guardrail.py    # LLM semantic judge
│   │       ├── docker/             # DockerAgent + tools
│   │       ├── podman/             # PodmanAgent + tools
│   │       ├── kubernetes/         # KubernetesAgent + tools
│   │       ├── postgres/           # PostgresAgent + tools
│   │       └── memory/             # ChromaDB runbooks + episodic memory
│   └── scripts/
│       └── seed_*.py               # ChromaDB runbook loaders
└── Frontend/
    └── src/
        ├── pages/                  # Login, Dashboard, Setup
        ├── components/             # IncidentCard, ApprovalBanner, AgentReasoningPanel, ...
        ├── services/               # incidentActions, incidentExports
        └── contexts/               # AuthContext

Explore further

Core Concepts: Incident Lifecycle

Understand every status transition an incident passes through, from detection to post-mortem.

Multi-Agent Pipeline

A deep-dive into each Lab stage, the guardrail graph design, and how LangFuse traces the pipeline.

Supported Runtimes

Tool palettes, runbook collections, and agent behaviour for Docker, Podman, Kubernetes, and PostgreSQL.

API Reference

Full OpenAPI reference for incidents, alerts, actions, and health endpoints.
