When Sentinel receives an alert, it does not hand it to a single monolithic LLM. Instead, it routes the incident through a structured multi-agent pipeline built on LangGraph: each stage has a dedicated responsibility, defined inputs and outputs, and its own guardrail checkpoint. The result is a traceable, auditable triage process where every decision can be explained and every LLM call is logged to LangFuse.
Pipeline Overview
Data Contracts
Before exploring each stage, it helps to understand the two core data structures that flow through the pipeline.
IncidentContext
Constructed once in run_langgraph_engine and passed immutably to every downstream stage and agent:
InvestigationResult
Returned by every DomainAgent.investigate() call:
ToolCall records the tool name, args, and a result_preview (first ~500 chars) that is persisted in Supabase alongside the incident.
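A minimal sketch of how these contracts might look, assuming frozen dataclasses. Only the names IncidentContext, InvestigationResult, ToolCall, analysis, args, and result_preview come from the text; every other field name, the from_result helper, and the exact 500-character cutoff are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

PREVIEW_CHARS = 500  # the text says "first ~500 chars"; the exact cutoff is an assumption


@dataclass(frozen=True)
class IncidentContext:
    """Immutable snapshot of the alert. Field names here are assumptions."""
    incident_id: str
    title: str
    target: str
    severity: str
    source_type: str
    logs: str
    container_runtime: Optional[str] = None


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation made by an agent during investigation."""
    name: str
    args: Dict[str, Any]
    result_preview: str

    @classmethod
    def from_result(cls, name: str, args: Dict[str, Any], result: str) -> "ToolCall":
        # Hypothetical helper: keep only the head of the raw tool output.
        return cls(name=name, args=args, result_preview=result[:PREVIEW_CHARS])


@dataclass
class InvestigationResult:
    """What an agent hands back to the supervisor."""
    analysis: str
    tool_calls: List[ToolCall] = field(default_factory=list)
```

Freezing IncidentContext means an accidental write inside a stage raises an error instead of silently changing what later stages see.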
Pipeline Stages
Before any LLM sees the alert, the input guardrail runs as a two-node LangGraph subgraph defined in guardrail_graph.py:

- The first node neutralizes suspicious lines in place (each is replaced with [LÍNEA NEUTRALIZADA POR GUARDRAIL — posible inyección], "line neutralized by guardrail, possible injection") rather than aborting the triage, because the incident still needs to be investigated.
- The second node calls gpt-4o-mini with a fixed system prompt.
- It returns { safe, on_topic, reason } in JSON mode.
- If the semantic check itself fails, the result defaults to safe=true, so a downed guardrail never blocks a real incident.
- Content rejected by the semantic check is replaced with [CONTENIDO BLOQUEADO POR GUARDRAIL SEMÁNTICO] ("content blocked by semantic guardrail").

The classification stage uses gpt-4o-mini in JSON mode with temperature=0 to map the incident to exactly one of the ten canonical incident types:

    model_kwargs={"response_format": {"type": "json_object"}}
    # Prompt instructs: respond ONLY with valid JSON:
    # {"incident_type": "<category>", "reasoning": "<max 2 sentences>"}
The LLM receives: the incident title, target, severity, and the first 800 characters of logs as a preview.
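The input and output of this call can be sketched without touching the LLM itself. The following is an illustration only: the function names, the message layout, and the fallback to unknown when a key is missing are assumptions, and "oom" in the usage note is a hypothetical category. The 800-character preview and the two JSON keys come from the text above.

```python
import json
from typing import Dict

LOG_PREVIEW_CHARS = 800  # preview length stated in the text


def build_classifier_input(title: str, target: str, severity: str, logs: str) -> str:
    """Assemble the user message: metadata plus a truncated log preview."""
    preview = logs[:LOG_PREVIEW_CHARS]
    return (
        f"Title: {title}\n"
        f"Target: {target}\n"
        f"Severity: {severity}\n"
        f"Log preview:\n{preview}"
    )


def parse_classification(raw: str) -> Dict[str, str]:
    """Parse the JSON-mode reply into the two expected keys."""
    data = json.loads(raw)
    return {
        # Assumption: a missing key falls back to "unknown" / empty reasoning.
        "incident_type": str(data.get("incident_type", "unknown")),
        "reasoning": str(data.get("reasoning", "")),
    }
```

For example, parse_classification('{"incident_type": "oom", "reasoning": "Memory limit hit."}') yields a dict with those two keys, ready for the validation step.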
After the LLM responds, the classification guardrail (guardrails.check_incident_type) validates the returned type against the allowed set. If the LLM hallucinated a category outside the ten valid types, it is silently forced to unknown, which prevents downstream frontend breakage.

The incident status advances to investigating, and incident_type plus the initial agent_reasoning are written to Supabase.

The supervisor calls find_agent_for(ctx), which iterates over the registered agents and returns the first whose matches(ctx) returns True. The registry currently supports four specialist agents:

DockerAgent
Handles container_runtime: docker. Tools: inspect, logs, stats, top, diff.

PodmanAgent
Handles container_runtime: podman. Tools: inspect, logs, pod ps, stats.

KubernetesAgent
Handles container_runtime: kubernetes. Tools: kubectl describe, logs, get events, top pods.

PostgresAgent
Handles source_type: database. Tools: pg_stat_activity, pg_stat_bgwriter, query analysis.

During the investigation:

- recall_runbooks(query="{incident_type} {title}", k=5) fetches the most relevant procedure documents from the runbooks-{domain} ChromaDB collection.
- recall_similar_incidents(query, k=6) retrieves past incidents with cosine distance ≤ 1.5 from the incidents-{domain} collection.
- The resulting analysis is stored in InvestigationResult.

After the agent returns, the output guardrail (guardrail_graph.run_output_guardrail) runs the same two-node graph to check that the analysis stays within the DevOps domain and is not suspiciously short (< 20 characters). Off-topic content gets a warning banner prepended rather than being blocked.

The supervisor deterministically generates a safe remediation command based on incident_type, source_type, container_runtime, and target. No LLM is used here. The logic is a pure decision tree that produces only whitelisted commands:

    # Examples of what _build_proposed_action produces:
    "docker restart api-gateway"
    "podman logs payment-service"
    "kubectl rollout restart deployment/auth-service -n production"
    "kubectl delete pod crashlooping-pod-7f9b -n default"
    "pg_terminate_backend mydb"
    "pg_stat_activity mydb"
The proposed action is then independently re-validated by guardrails.check_proposed_action(). This is defense-in-depth: even if _build_proposed_action were modified to produce something unexpected, or a future agent proposed a custom action, the guardrail catches it before it reaches the engineer. Shell metacharacters (;, &&, ||, |, `, $(, >, <) trigger an immediate block regardless of pattern matching.

If the action passes, the incident advances to awaiting_approval and proposed_action is stored in Supabase.

The incident then moves through the following statuses:

- executing_solution: this prevents double-execution if the engineer clicks twice or two browser tabs are open simultaneously
- verifying
- resolved, with the resolved_at timestamp set
- failed, with the error stored in action_error

- incident_events timestamps
- detected and resolved events

The Agent Registry Pattern
Agents self-register at import time. Adding a new domain requires only creating a DomainAgent subclass and calling register_agent(), with no changes to the supervisor or engine:
| Function | Description |
|---|---|
| find_agent_for(ctx) | Returns the first agent whose matches(ctx) is True; None if none found |
| list_agents() | Returns all registered agents (used in error messages) |
FastAPI Background Execution
The entire pipeline runs as a FastAPI BackgroundTask. The HTTP response to the alert webhook returns immediately with 201 Created, and triage runs asynchronously:
Every stage of the pipeline creates a named span in LangFuse, linked to the incident_id as the session ID. The trace hierarchy is: [severity] title (root trace) → Guardrail Input → Lab 1 — Alert Intake → classify-llm (generation) → Lab 2 — Investigation ({agent.name}) → Guardrail Output → Lab 3 — Decision & Planning. Each span records its inputs, outputs, and whether guardrails passed. Use the LangFuse dashboard at LANGFUSE_HOST to inspect any incident’s full reasoning chain.