Sentinel is composed of five tightly integrated layers: a Prometheus-based observability stack that detects incidents, a Loki log-collection pipeline that gathers evidence, a FastAPI backend that orchestrates everything, a LangGraph multi-agent engine that performs the actual triage reasoning, and a React dashboard that surfaces results and captures human decisions in real time. This page explains how those layers connect, what each component does, and where you can find the code that implements each part.
High-level flow
The sequence below shows every hop an incident takes from initial detection through post-mortem generation.
Component breakdown
Observability layer
cAdvisor collects container metrics from the Docker socket and exposes them to Prometheus. The postgres-exporter does the same for PostgreSQL databases. Prometheus evaluates alert rules defined in prometheus/ and, when a threshold is breached, fires the alert through Alertmanager, which sends a POST /api/alerts webhook to the backend. The Alertmanager configuration in alertmanager/alertmanager.yml carries a shared ALERT_WEBHOOK_SECRET so the backend can authenticate inbound alerts.
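The shared-secret check on inbound alerts can be sketched as follows. This is a minimal illustration, not the backend's actual code: it assumes Alertmanager sends the secret as a Bearer token in the Authorization header, and the function name is_authorized_alert is hypothetical.

```python
import hmac
from typing import Optional


def is_authorized_alert(authorization_header: Optional[str], expected_secret: str) -> bool:
    """Return True when the header carries the expected shared secret.

    Assumption: the secret arrives as "Bearer <secret>"; the real backend
    may transport it differently.
    """
    if not authorization_header or not expected_secret:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest runs in constant time, so the comparison does not
    # leak how many leading characters matched.
    return hmac.compare_digest(token.encode(), expected_secret.encode())
```

A webhook handler would call this before creating an incident and respond with 401 when it returns False.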
Log collection
Promtail tails Docker container logs using the Docker socket and ships them to Loki using the configuration in promtail/. When the backend receives an alert, the alert_processor service queries Loki for the relevant container's recent log lines. Those logs are attached to the Supabase incident record and forwarded to the guardrail pipeline before any LLM sees them.
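As a rough sketch of the Loki lookup, the helper below builds a request URL for Loki's query_range HTTP endpoint. The helper name, the container label, and the default window and limit are assumptions for illustration; the labels actually available depend on the Promtail relabelling in promtail/.

```python
import time
from urllib.parse import urlencode


def build_loki_query_url(base_url, container, minutes=15, limit=200, now_ns=None):
    """Build a query_range URL for the most recent log lines of one container.

    Assumption: Promtail attaches a `container` label to each stream.
    """
    now_ns = time.time_ns() if now_ns is None else now_ns
    start_ns = now_ns - minutes * 60 * 1_000_000_000
    params = {
        # LogQL stream selector; the container name is not escaped here,
        # so callers should pass trusted names only.
        "query": f'{{container="{container}"}}',
        "start": str(start_ns),  # nanosecond Unix timestamps
        "end": str(now_ns),
        "limit": str(limit),
        "direction": "backward",  # newest lines first
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"
```

The resulting URL can be fetched with any HTTP client; the response body carries the matching streams under data.result.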
Backend (FastAPI)
The FastAPI application in Backend/main.py exposes four routers:
| Router | Prefix | Responsibility |
|---|---|---|
| incidents | /api/incidents | CRUD, status updates, export, post-mortem generation |
| alerts | /api/alerts | Alertmanager webhook; creates the incident and spawns background triage |
| actions | /api/actions | Execute, reject, or postpone a proposed corrective action |
| health | /health | Integration health check for all downstream services |
A lifespan hook starts a recovery loop at startup that rescues any incident record left in a non-terminal state by a previous crash, so that no alert is silently dropped across backend restarts.
Agent pipeline (LangGraph)
The triage logic lives in Backend/services/agents/ and is orchestrated by supervisor.py. The supervisor implements the LangGraph pipeline as an explicit sequence of nodes, each mapped to a LangFuse span so every decision is fully traceable.
| Stage | Code location | What it does |
|---|---|---|
| Guardrail Input | guardrail_graph.py → guardrails.check_input | Truncates logs to 4 000 chars, strips prompt-injection patterns, then passes to the LLM judge |
| Lab 1 — Alert Intake | supervisor._classify | GPT-4o-mini classifies the incident into one of ten types: app_crash, oom, config_error, dependency_failure, memory_pressure, cpu_throttling, restart_loop, network_error, disk_pressure, or unknown |
| Routing | registry.find_agent_for | Selects the specialist agent (Docker / Podman / Kubernetes / PostgreSQL) based on incident labels |
| Lab 2 — Investigation | agent.investigate | Specialist agent calls domain-specific tools, queries runbooks-{domain} and incidents-{domain} ChromaDB collections, and produces a structured analysis |
| Guardrail Output | guardrail_graph.py → guardrails.check_analysis_output | Validates the analysis is non-empty, on-topic for DevOps, and then passes to the LLM judge to catch semantic drift |
| Lab 3 — Decision & Planning | supervisor._build_proposed_action | Generates a single safe, whitelisted command for human approval; the action is then re-validated against the _ALLOWED_ACTION_PATTERNS regex whitelist in guardrails.check_proposed_action |
| Lab 4 — Action & Verification | routers/actions.py + services/verification.py | On engineer approval, executes the command and verifies recovery |
| Lab 5 — Post-Incident | services/postmortem/ | Generates a structured post-mortem document; agent.remember_incident writes a summary to ChromaDB episodic memory |
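The deterministic part of the Guardrail Input stage in the table above can be illustrated with a short sketch. Only the 4 000-character limit comes from this page; the regex patterns, the replacement marker, the choice to keep the first 4 000 characters, and the name sanitize_logs are assumptions, and the LLM judge step is omitted.

```python
import re
from typing import Tuple

MAX_LOG_CHARS = 4000  # limit stated in the pipeline table

# Hypothetical patterns; the real list used by guardrails.check_input
# is not shown on this page.
_INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"you are now [^.\n]*", re.IGNORECASE),
    re.compile(r"system prompt", re.IGNORECASE),
]


def sanitize_logs(raw_logs: str) -> Tuple[str, bool]:
    """Truncate logs, strip injection-like phrases, and report whether any were found."""
    text = raw_logs[:MAX_LOG_CHARS]  # assumption: keep the first 4 000 characters
    flagged = False
    for pattern in _INJECTION_PATTERNS:
        text, hits = pattern.subn("[removed]", text)
        flagged = flagged or hits > 0
    return text, flagged
```

In a pipeline like the one described, the sanitised text would still go to the LLM judge even when flagged is True, since a paraphrased attack can survive pattern matching.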
Guardrail design
Both guardrail stages run as compiled LangGraph state graphs (_build_graph() in guardrail_graph.py). Each graph has two nodes — a fast deterministic rules node and an LLM semantic judge node — connected by a conditional edge. If the rules node neutralises an obvious injection, the text still passes to the LLM judge in case a paraphrased attack slipped through. The action guardrail uses a strict regex whitelist that explicitly allows only:
- docker restart <name> / docker logs <name>
- podman restart <name> / podman logs <name>
- kubectl rollout restart deployment/<name> [-n <namespace>]
- kubectl delete pod <name> [-n <namespace>]
- kubectl scale deployment/<name> --replicas=<0–10> [-n <namespace>]
- pg_stat_activity, pg_cancel_backend, pg_terminate_backend against a named database
Any command containing a shell metacharacter (;, &&, |, `, $(, >, <) is blocked unconditionally before the whitelist check.
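A two-step check of this kind, metacharacter block first and whitelist second, might look like the sketch below. The regexes are illustrative and are not the contents of _ALLOWED_ACTION_PATTERNS: the name grammar is an assumption, and the PostgreSQL entries are left out because their exact command form is not given here.

```python
import re

# Metacharacters named in the text; any occurrence rejects the command.
_SHELL_METACHARS = (";", "&&", "|", "`", "$(", ">", "<")

_NAME = r"[A-Za-z0-9][A-Za-z0-9_.-]*"  # hypothetical grammar for resource names
_NS = rf"(?: -n {_NAME})?"             # optional namespace flag

# Illustrative whitelist covering the container and Kubernetes commands only.
ALLOWED_PATTERNS_SKETCH = [
    re.compile(rf"docker (?:restart|logs) {_NAME}"),
    re.compile(rf"podman (?:restart|logs) {_NAME}"),
    re.compile(rf"kubectl rollout restart deployment/{_NAME}{_NS}"),
    re.compile(rf"kubectl delete pod {_NAME}{_NS}"),
    re.compile(rf"kubectl scale deployment/{_NAME} --replicas=(?:10|[0-9]){_NS}"),
]


def is_action_allowed(command: str) -> bool:
    """Reject shell metacharacters first, then require a full whitelist match."""
    if any(meta in command for meta in _SHELL_METACHARS):
        return False
    candidate = command.strip()
    # fullmatch ensures nothing can be appended after an allowed prefix.
    return any(p.fullmatch(candidate) for p in ALLOWED_PATTERNS_SKETCH)
```

Using fullmatch rather than match is the important design choice: an allowed prefix followed by extra arguments does not pass.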
Memory layer
ChromaDB stores two types of collections per domain:
- runbooks-{domain}: curated human-written runbooks seeded via scripts/seed_*.py. The specialist agent retrieves the top-k most relevant runbooks (default k=3) for every investigation using memory/runbooks.py::query_runbooks.
- incidents-{domain}: episodic memory written by agent.remember_incident after each resolved incident. Similar past incidents are retrieved during Lab 2 to surface recurring patterns.
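To make the per-domain naming and top-k retrieval concrete, here is a dependency-free stand-in. It does not use ChromaDB: it ranks toy vectors by cosine similarity in plain Python, and the function names and vectors are hypothetical.

```python
import math


def collection_names(domain):
    """Return the (runbooks, incidents) collection names for one domain."""
    return f"runbooks-{domain}", f"incidents-{domain}"


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, documents, k=3):
    """Return the ids of the k documents most similar to the query.

    documents is a list of (doc_id, vector) pairs.
    """
    ranked = sorted(documents, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

In the real system the embedding and nearest-neighbour search are delegated to ChromaDB; the sketch only shows the shape of the lookup.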
Database
Supabase hosts two primary tables:
- incidents: one row per alert, carrying status, severity, agent_reasoning, proposed_action, incident_type, and all timestamps.
- incident_events: append-only event log tracking every status transition (e.g. detecting → investigating → analyzed → awaiting_approval → resolved).
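The relationship between the two tables can be sketched in memory as a status update paired with an appended event. Only the example path quoted above is modelled; the field names (from_status, to_status, at) and the transition function are hypothetical, and other branches such as rejected or postponed actions are not covered.

```python
from datetime import datetime, timezone

# Example path from the text; other branches are deliberately not modelled.
_NEXT_STATUS = {
    "detecting": "investigating",
    "investigating": "analyzed",
    "analyzed": "awaiting_approval",
    "awaiting_approval": "resolved",
}


def transition(incident, new_status, events):
    """Return the updated incident and append one event describing the change."""
    current = incident["status"]
    if _NEXT_STATUS.get(current) != new_status:
        raise ValueError(f"illegal transition {current} -> {new_status}")
    # Events are only ever appended, never edited or removed.
    events.append({
        "incident_id": incident["id"],
        "from_status": current,
        "to_status": new_status,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return {**incident, "status": new_status}
```

Because the event is appended only after validation succeeds, a rejected transition leaves both the incident and the log unchanged.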
Agent observability (LangFuse)
Every run_triage call creates a LangFuse trace. Each pipeline stage (guardrails, Lab 1–3) is recorded as a child span, and each GPT-4o-mini call is recorded as a generation with token usage and latency. If LangFuse is unreachable, the supervisor continues without tracing: it logs a warning and does not block triage.
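The fail-open behaviour can be illustrated with a small context manager. This sketch does not use the LangFuse SDK: the tracer object and its start_span and end methods are hypothetical placeholders for whatever client the supervisor wraps.

```python
import logging
from contextlib import contextmanager

logger = logging.getLogger("sentinel.tracing")


@contextmanager
def optional_span(tracer, name):
    """Yield a span when tracing works, or None when the tracer is unavailable.

    Errors raised by the tracer are logged and swallowed; errors raised by
    the traced code itself still propagate to the caller.
    """
    span = None
    try:
        span = tracer.start_span(name)  # hypothetical tracer API
    except Exception as exc:
        logger.warning("tracing unavailable for %s: %s", name, exc)
    try:
        yield span
    finally:
        if span is not None:
            try:
                span.end()
            except Exception as exc:
                logger.warning("failed to close span %s: %s", name, exc)
```

A stage wrapped as `with optional_span(tracer, "lab1-classify") as span:` runs the same way whether or not the tracing backend is reachable.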
Frontend (React 19)
The React dashboard subscribes to Supabase Realtime for live incident updates. Key UI components include:
- IncidentCard: displays status, severity, and incident type at a glance.
- ApprovalBanner: surfaces the proposed action with Approve / Reject / Postpone controls.
- AgentReasoningPanel: renders the full Markdown reasoning produced by the supervisor, including tool calls and similar incidents.
- AuthContext: manages Supabase JWT sessions; all API calls attach the JWT for FastAPI's auth.get_current_user dependency.
Project structure
Explore further
Core Concepts: Incident Lifecycle
Understand every status transition an incident passes through, from detection to post-mortem.
Multi-Agent Pipeline
A deep-dive into each Lab stage, the guardrail graph design, and how LangFuse traces the pipeline.
Supported Runtimes
Tool palettes, runbook collections, and agent behaviour for Docker, Podman, Kubernetes, and PostgreSQL.
API Reference
Full OpenAPI reference for incidents, alerts, actions, and health endpoints.