Every incident in Sentinel moves through a well-defined sequence of statuses, from the moment an alert arrives to the final post-mortem stored in memory. Each transition is driven by real system events — a classification completing, an engineer approving an action, a verification check passing — and every step is written to the incident_events table with a precise timestamp so you always have a full audit trail.

Status Flow

The eight statuses form a directed graph. The happy path ends at resolved; any unrecoverable failure lands at failed.
detected
   │
   ▼
investigating        ← supervisor begins classification (Lab 1)
   │
   ▼
analyzed             ← investigation complete; reasoning stored
   │
   ├──(no action proposed)──────────────────────────────► (terminal: stays analyzed)
   │
   ▼
awaiting_approval    ← safe whitelisted action is ready for engineer review
   │
   ▼
executing_solution   ← engineer approves; atomic DB claim prevents double-execution
   │
   ▼
verifying            ← post-execution health check running
   │
   ├──(healthy)──► resolved  ← post-mortem generated & stored in ChromaDB
   │
   └──(unhealthy/error)──► failed
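
The atomic claim on the executing_solution step can be illustrated with a conditional UPDATE. This is a minimal sketch, not Sentinel's implementation: it uses SQLite as a stand-in for the real database, and the incidents table layout and the claim_for_execution helper are assumptions made for illustration.

```python
import sqlite3


def claim_for_execution(conn: sqlite3.Connection, incident_id: int) -> bool:
    """Try to move an incident from awaiting_approval to executing_solution.

    The WHERE clause makes the check and the write a single statement, so
    only one caller can ever see rowcount == 1 for a given incident.
    """
    cur = conn.execute(
        "UPDATE incidents SET status = 'executing_solution' "
        "WHERE id = ? AND status = 'awaiting_approval'",
        (incident_id,),
    )
    conn.commit()
    return cur.rowcount == 1


# Hypothetical schema, reduced to the two columns the claim needs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE incidents (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO incidents VALUES (1, 'awaiting_approval')")
conn.commit()

first = claim_for_execution(conn, 1)   # wins the claim
second = claim_for_execution(conn, 1)  # a repeated Approve click finds nothing to claim
```

Because the second call matches zero rows, a duplicate approval never reaches the execution step.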

Status Reference

| Status | Trigger | Valid Next States |
| --- | --- | --- |
| detected | Incident created via alert webhook or manual dashboard entry | investigating, failed |
| investigating | Supervisor starts; classification LLM call completes and incident_type is written to DB | analyzed, failed |
| analyzed | Agent investigation finishes; full agent_reasoning markdown persisted | awaiting_approval, failed (stays analyzed when no action is proposed) |
| awaiting_approval | A safe whitelisted proposed_action passed all guardrails | executing_solution, failed |
| executing_solution | Engineer clicks Approve; action claimed atomically in DB | verifying, failed |
| verifying | Execution succeeded; health verification is running | resolved, failed |
| resolved | Verification confirms recovery; post-mortem auto-generated | none (terminal) |
| failed | Any unhandled exception, guardrail hard-block, or failed verification | none (terminal) |
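
The table above can be encoded as a lookup that rejects illegal transitions. The following is a sketch under the assumption that transitions are validated in application code; VALID_NEXT and can_transition are hypothetical names, not part of Sentinel's documented API.

```python
# Transition map transcribed from the Status Reference table.
VALID_NEXT = {
    "detected": {"investigating", "failed"},
    "investigating": {"analyzed", "failed"},
    "analyzed": {"awaiting_approval", "failed"},
    "awaiting_approval": {"executing_solution", "failed"},
    "executing_solution": {"verifying", "failed"},
    "verifying": {"resolved", "failed"},
    "resolved": set(),  # terminal
    "failed": set(),    # terminal
}


def can_transition(current: str, new: str) -> bool:
    """Return True if the table allows moving from `current` to `new`."""
    return new in VALID_NEXT.get(current, set())
```

For example, can_transition("detected", "investigating") is allowed, while can_transition("resolved", "failed") is not, because resolved is terminal.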
The REST API accepts status=active as a filter alias. It expands to all non-terminal statuses: detected, investigating, analyzed, awaiting_approval, executing_solution, and verifying. Use it to fetch the live triage queue without listing every intermediate state.
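
As a rough illustration of how the alias expands, the sketch below maps a filter value to the statuses it covers. The function name expand_status_filter is an assumption; only the list of statuses comes from the description above.

```python
# The six non-terminal statuses that status=active stands for.
ACTIVE_STATUSES = (
    "detected",
    "investigating",
    "analyzed",
    "awaiting_approval",
    "executing_solution",
    "verifying",
)


def expand_status_filter(value: str) -> tuple[str, ...]:
    """Expand the 'active' alias; pass any other status through unchanged."""
    return ACTIVE_STATUSES if value == "active" else (value,)
```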

Event Sourcing with incident_events

Every status transition appends a row to the incident_events table. The schema captures incident_id, status, and a UTC created_at timestamp. This gives you:
  • MTTR calculation: subtract the detected timestamp from the resolved timestamp
  • Bottleneck analysis: find how long incidents spend in awaiting_approval
  • Audit trail: an immutable log that cannot be overwritten, even if the incident row is updated
The supervisor calls record_event(incident_id, status) after every UPDATE to the incidents table, so the events log trails the stored status by a few milliseconds at most.
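
To make the MTTR and bottleneck calculations concrete, here is a small sketch over in-memory rows shaped like incident_events (incident_id, status, created_at). The sample timestamps and the helper names are invented for illustration; a real deployment would query the table instead.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical event rows for one incident: (incident_id, status, created_at).
t0 = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
events = [
    (1, "detected", t0),
    (1, "investigating", t0 + timedelta(seconds=5)),
    (1, "analyzed", t0 + timedelta(seconds=65)),
    (1, "awaiting_approval", t0 + timedelta(seconds=70)),
    (1, "executing_solution", t0 + timedelta(seconds=370)),
    (1, "verifying", t0 + timedelta(seconds=380)),
    (1, "resolved", t0 + timedelta(seconds=400)),
]


def first_seen(rows, status):
    """Timestamp of the first event with the given status (StopIteration if absent)."""
    return next(ts for _, s, ts in rows if s == status)


def mttr(rows) -> timedelta:
    """Resolved timestamp minus detected timestamp."""
    return first_seen(rows, "resolved") - first_seen(rows, "detected")


def time_in_status(rows, status) -> timedelta:
    """Time between entering `status` and the next recorded event."""
    ordered = sorted(rows, key=lambda r: r[2])
    for (_, s, entered), (_, _, left) in zip(ordered, ordered[1:]):
        if s == status:
            return left - entered
    raise ValueError(f"no completed stay in status {status!r}")
```

With the sample rows, mttr(events) is 400 seconds and time_in_status(events, "awaiting_approval") is 5 minutes, which is the kind of figure the bottleneck analysis looks for.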

Incident Metadata

Source Types

| source_type | Description |
| --- | --- |
| container | Alert originates from a running container (Docker, Podman, or Kubernetes workload) |
| database | Alert originates from a PostgreSQL instance |
| manual | Engineer created the incident directly from the dashboard |

Container Runtimes

When source_type is container, the container_runtime field determines which specialist agent handles triage:
| container_runtime | Agent |
| --- | --- |
| docker | DockerAgent |
| podman | PodmanAgent |
| kubernetes | KubernetesAgent |

Severity Levels

| Severity | Typical Use |
| --- | --- |
| critical | Service completely down, data loss risk |
| high | Severe degradation, SLA breach imminent |
| medium | Partial failure, impact contained |
| low | Warning-level signal, no immediate user impact |

Incident Types

The classification model maps every incident into one of ten canonical types. These are the only values the frontend and guardrails accept:
| incident_type | Description |
| --- | --- |
| app_crash | Process exited unexpectedly |
| oom | Container or process killed by OOM killer |
| config_error | Misconfiguration detected in env vars, mounts, or init |
| dependency_failure | Upstream service or database unreachable |
| memory_pressure | High memory utilization approaching limits |
| cpu_throttling | CPU quota exceeded; process being throttled |
| restart_loop | Container restarting repeatedly (CrashLoopBackOff equivalent) |
| network_error | DNS failure, port unreachable, or timeout |
| disk_pressure | Node or volume approaching capacity |
| unknown | Classifier could not determine type, or guardrail forced fallback |
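
One way to picture the forced fallback is a check that coerces anything outside the canonical set to unknown. This is a sketch only: normalize_incident_type is a hypothetical name, and the trimming and lowercasing are assumptions about how raw classifier output might be cleaned up.

```python
# The ten canonical incident types from the table above.
INCIDENT_TYPES = frozenset({
    "app_crash",
    "oom",
    "config_error",
    "dependency_failure",
    "memory_pressure",
    "cpu_throttling",
    "restart_loop",
    "network_error",
    "disk_pressure",
    "unknown",
})


def normalize_incident_type(raw: str) -> str:
    """Return a canonical incident type, falling back to 'unknown'."""
    candidate = raw.strip().lower()
    return candidate if candidate in INCIDENT_TYPES else "unknown"
```

A classifier answer such as "kernel_panic" would therefore be stored as unknown rather than leaking a value the frontend cannot render.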

Manual Incident Creation

Incidents with source_type: manual can be created directly from the Sentinel dashboard. Fill in the title, target, severity, and optionally paste logs. The incident enters at detected and immediately triggers the same full triage pipeline — classification, investigation, and action proposal — as any automated alert. This is useful for escalating issues discovered through monitoring dashboards that do not yet have webhook integrations.

The resolved Path and Post-Mortem Generation

When an incident reaches resolved, the system automatically triggers post-mortem generation. The LLM synthesizes:
  • Root cause: derived from the agent’s investigation analysis
  • Timeline: reconstructed from incident_events timestamps
  • MTTR: computed from the detected → resolved event delta
  • Lessons learned: recommendations based on the incident type and actions taken
The post-mortem is stored as a document in the incidents-{domain} ChromaDB collection (episodic memory), making it retrievable for future incidents with similar signatures. It is also surfaced in the dashboard under the incident detail view.
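
The deterministic parts of a post-mortem (collection name, timeline, MTTR) can be assembled before any LLM call. The sketch below shows only that assembly step; build_post_mortem, the dictionary layout, and the example domain value "container" are assumptions, and the root cause and lessons are placeholders for text the LLM would produce.

```python
from datetime import datetime, timedelta, timezone


def build_post_mortem(domain, root_cause, events, lessons):
    """Assemble a post-mortem record from (status, created_at) event pairs."""
    ordered = sorted(events, key=lambda e: e[1])
    by_status = {status: ts for status, ts in ordered}
    mttr = by_status["resolved"] - by_status["detected"]
    return {
        "collection": f"incidents-{domain}",  # naming pattern from the text above
        "root_cause": root_cause,
        "timeline": [f"{ts.isoformat()} {status}" for status, ts in ordered],
        "mttr_seconds": mttr.total_seconds(),
        "lessons_learned": lessons,
    }


t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
doc = build_post_mortem(
    domain="container",  # hypothetical domain value
    root_cause="(placeholder: taken from the agent's analysis)",
    events=[("resolved", t0 + timedelta(seconds=90)), ("detected", t0)],
    lessons=["(placeholder: written by the LLM)"],
)
```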

The failed Path

An incident transitions to failed in four situations:
  1. Action rejected by guardrail: the proposed command did not match any allowed pattern, so proposed_action is cleared and the incident cannot proceed past analyzed.
  2. Execution error: the command ran but returned a non-zero exit code or threw an exception.
  3. Failed verification: the post-execution health check reports the target as unhealthy or errors out.
  4. Pipeline exception: any unhandled error in the supervisor (network timeout, OpenAI outage, Supabase connectivity issue) is caught by the top-level except block, which writes status: failed and stores the error message in action_error for debugging.
A failed incident is terminal — Sentinel will not retry automatically. Engineers can manually re-trigger triage from the dashboard after addressing the underlying cause.
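
The top-level except behaviour described above follows a common wrapper pattern, sketched here with a plain dictionary standing in for the incident row. The names run_pipeline and flaky_triage are hypothetical; only the status and action_error fields come from the description.

```python
def run_pipeline(incident: dict, triage) -> dict:
    """Run triage; on any unhandled error, mark the incident failed."""
    try:
        triage(incident)
    except Exception as exc:
        # Catch-all: record the failure instead of letting it propagate.
        incident["status"] = "failed"
        incident["action_error"] = str(exc)
    return incident


def flaky_triage(incident: dict) -> None:
    """Stand-in for a triage step that hits an upstream timeout."""
    raise TimeoutError("LLM request timed out")


result = run_pipeline(
    {"id": 1, "status": "investigating", "action_error": None},
    flaky_triage,
)
```

After the call, result carries status failed and the timeout message in action_error, and nothing retries it.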
