Sentinel SoftServe: AI Copilot for DevOps Incident Triage

Sentinel SoftServe is an agentic AI copilot for DevOps and SRE engineers who need to move fast during production incidents. It watches your infrastructure continuously and detects crashes, resource exhaustion, and service degradations. It then orchestrates a full triage pipeline, from log collection and root-cause analysis to proposing a safe corrective action, so you do not have to dig through dashboards manually. Every AI decision passes through a human-in-the-loop approval gate before any remediation command is executed on your infrastructure. The project is an academic-industry collaboration between Universidad EAFIT and SoftServe, deployed live at sentinel-softserve-1.onrender.com.

The problem Sentinel solves

Modern containerised workloads generate thousands of metrics and log lines per minute. When something goes wrong at 2 a.m., engineers waste critical minutes correlating Prometheus alerts with Loki logs, reading runbooks, and deciding whether to restart a container or roll back a deployment. Sentinel eliminates that toil by:

Automatically ingesting alerts from Alertmanager the moment Prometheus fires a rule.
Fetching and analysing logs from Loki against a ChromaDB runbook knowledge base.
Classifying the incident type and routing it to the right specialist agent (Docker, Podman, Kubernetes, or PostgreSQL).
Proposing one safe, whitelisted command for human approval — never executing anything autonomously.
Generating a post-mortem and writing the incident into episodic memory so future triage gets smarter.
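
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not Sentinel's actual implementation: it assumes the standard Alertmanager webhook payload shape (an `alerts` list whose entries carry `status` and `labels`), while the `runtime` label, the `ALLOWED_COMMANDS` whitelist, and the `Proposal`, `route`, `propose`, and `execute` names are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from a "runtime" alert label to a specialist agent.
AGENT_BY_RUNTIME = {
    "docker": "DockerAgent",
    "podman": "PodmanAgent",
    "kubernetes": "KubernetesAgent",
    "postgresql": "PostgresAgent",
}

# Hypothetical whitelist of (binary, subcommand) pairs that may be proposed.
ALLOWED_COMMANDS = {
    ("docker", "restart"),
    ("podman", "restart"),
    ("kubectl", "rollout"),
}


@dataclass
class Proposal:
    agent: str
    argv: list[str]
    approved: bool = False  # flipped only by a human reviewer


def firing_alerts(payload: dict) -> list[dict]:
    """Return only the alerts that are currently firing."""
    return [a for a in payload.get("alerts", []) if a.get("status") == "firing"]


def route(alert: dict) -> Optional[str]:
    """Pick a specialist agent from the (assumed) runtime label."""
    runtime = alert.get("labels", {}).get("runtime", "").lower()
    return AGENT_BY_RUNTIME.get(runtime)


def propose(agent: str, argv: list[str]) -> Proposal:
    """Build a proposal, rejecting any command outside the whitelist."""
    if tuple(argv[:2]) not in ALLOWED_COMMANDS:
        raise ValueError(f"command not whitelisted: {' '.join(argv)}")
    return Proposal(agent=agent, argv=argv)


def execute(proposal: Proposal) -> str:
    """Refuse to act until a human has approved the proposal."""
    if not proposal.approved:
        raise PermissionError("proposal has not been approved by a human")
    # A real system would run the command here; this sketch only reports it.
    return "would run: " + " ".join(proposal.argv)
```

Representing the command as an argument list rather than a shell string keeps the whitelist check simple: only the first two tokens decide whether a proposal is allowed, and nothing is interpreted by a shell.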

Tech stack

Sentinel is built from purpose-chosen components across every layer of the stack.

Layer	Technology
Frontend	React 19 + Vite 7 + Tailwind CSS v4 + shadcn/ui
Backend	FastAPI + Uvicorn
AI Orchestration	LangGraph + LangChain
LLM	OpenAI gpt-4o-mini
Knowledge Base	ChromaDB (runbooks RAG + episodic memory)
Auth & DB	Supabase (email/password, JWT, Realtime)
Agent Observability	LangFuse v2 (self-hosted)
Incident Detection	cAdvisor + Prometheus + Alertmanager
Logs	Loki + Promtail
Dashboards	Grafana

Supported runtimes

Sentinel ships a dedicated specialist agent for each supported runtime. Each agent carries its own tool palette and ChromaDB runbook collection so investigations stay tightly scoped.

Runtime	Agent	Tools
Docker	DockerAgent	`docker_inspect`, `docker_logs`, `docker_stats`, `docker_ps`
Podman	PodmanAgent	`podman_inspect`, `podman_logs`, `podman_stats`, `podman_ps`
Kubernetes	KubernetesAgent	`get_pod_status`, `describe_pod`, `get_pod_logs`, `get_pod_events`, `get_deployment_status`, `list_failing_pods`
PostgreSQL	PostgresAgent	`pg_stat_activity`, `pg_stat_database`, `pg_stat_replication`, `pg_locks`
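
One way to picture the scoping described above is a registry that maps each agent to its tool palette. The sketch below is illustrative only: the agent and tool names come from the table, but the `TOOL_PALETTES`, `can_use`, and `owner_of` names are hypothetical and do not describe how Sentinel stores this information.

```python
from typing import Optional

# Tool palettes copied from the table above; the registry itself is hypothetical.
TOOL_PALETTES: dict[str, frozenset[str]] = {
    "DockerAgent": frozenset(
        {"docker_inspect", "docker_logs", "docker_stats", "docker_ps"}
    ),
    "PodmanAgent": frozenset(
        {"podman_inspect", "podman_logs", "podman_stats", "podman_ps"}
    ),
    "KubernetesAgent": frozenset(
        {
            "get_pod_status",
            "describe_pod",
            "get_pod_logs",
            "get_pod_events",
            "get_deployment_status",
            "list_failing_pods",
        }
    ),
    "PostgresAgent": frozenset(
        {"pg_stat_activity", "pg_stat_database", "pg_stat_replication", "pg_locks"}
    ),
}


def can_use(agent: str, tool: str) -> bool:
    """True only when the tool belongs to the agent's own palette."""
    return tool in TOOL_PALETTES.get(agent, frozenset())


def owner_of(tool: str) -> Optional[str]:
    """Return the agent whose palette contains the tool, if any."""
    for agent, tools in TOOL_PALETTES.items():
        if tool in tools:
            return agent
    return None
```

With a registry like this, a call such as `can_use("DockerAgent", "pg_locks")` returns `False`, which is the kind of boundary that keeps a Docker investigation from wandering into database tooling.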

Where to go next

Quickstart

Run Sentinel locally in under 10 minutes with Docker Compose.

Architecture

Understand the LangGraph agent pipeline and observability stack.

Supported Runtimes

Deep-dive into each specialist agent and its tool set.

API Reference

Explore the FastAPI endpoints that power the dashboard and webhooks.

Prerequisites before you begin:

Docker Desktop installed and running
Node.js 20+
Python 3.9+
A Supabase project with a URL, service-role key, anon key, and JWT secret
An OpenAI API key (gpt-4o-mini access required)
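
A small preflight check can confirm that the credentials listed above are present before you start. This is a sketch under assumptions: the environment variable names below are hypothetical placeholders (only `OPENAI_API_KEY` is the name the OpenAI SDK reads by default), so substitute whatever names your Sentinel configuration actually uses.

```python
import os
from typing import Mapping

# Hypothetical variable names for the credentials listed in the prerequisites.
REQUIRED_ENV_VARS = (
    "SUPABASE_URL",
    "SUPABASE_SERVICE_ROLE_KEY",
    "SUPABASE_ANON_KEY",
    "SUPABASE_JWT_SECRET",
    "OPENAI_API_KEY",
)


def missing_settings(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or blank."""
    return [name for name in REQUIRED_ENV_VARS if not env.get(name, "").strip()]
```

Calling `missing_settings()` with no arguments inspects the current process environment; an empty list means every assumed variable is set.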
