Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/nicolas344/Sentinel-SoftServe/llms.txt

Use this file to discover all available pages before exploring further.

The PostgresAgent is Sentinel’s specialist for database incidents. It connects to PostgreSQL via psycopg2, queries the pg_stat_* system views, and produces a structured diagnosis of connection exhaustion, deadlocks, long-running transactions, replication lag, and cache hit ratio degradation. Every action it proposes is a safe, targeted operation — from read-only observation (pg_stat_activity) to graceful query cancellation (pg_cancel_backend) or full connection termination (pg_terminate_backend) — and all require explicit human approval before execution.

Runtime Detection

The PostgresAgent claims an alert when either of the following is true:
  1. The alert label source_type=database is set.
  2. The target field starts with postgres/.
# services/agents/postgres/agent.py — PostgresAgent.matches()
def matches(self, ctx: IncidentContext) -> bool:
    source = (ctx.labels.get("source_type") or "").lower()
    if source == "database":
        return True
    return ctx.target.lower().startswith("postgres/")

Target Format

PostgreSQL incidents must use the postgres/<datname> target format, where <datname> is the PostgreSQL database name:
Example targetDatabase name extracted
postgres/sentinel_demosentinel_demo
postgres/myapp_prodmyapp_prod
The agent extracts the database name from this path to filter all queries in its tools.

Prerequisites

Connection Configuration

The tools build a connection DSN in the following priority order:
1

POSTGRES_DSN (preferred)

Set the full connection string directly:
POSTGRES_DSN=postgresql://sentinel:sentinel123@localhost:5432/sentinel_demo
2

Individual connection variables

If POSTGRES_DSN is not set, the agent assembles the DSN from individual variables:
VariableDefaultDescription
POSTGRES_HOSTlocalhostPostgreSQL host
POSTGRES_PORT5432PostgreSQL port
POSTGRES_USERsentinelDatabase user
POSTGRES_PASSWORD(empty)Password
POSTGRES_DBsentinel_demoDatabase name for the connection
The demo PostgreSQL instance is provided by the postgres-demo service in docker-compose.yml, pre-configured with the sentinel user, sentinel123 password, and sentinel_demo database. Its metrics are scraped by the postgres-exporter service:
postgres-exporter:
  image: prometheuscommunity/postgres-exporter:latest
  environment:
    DATA_SOURCE_NAME: "postgresql://sentinel:sentinel123@postgres-demo:5432/sentinel_demo?sslmode=disable"
This exporter enables Alertmanager rules such as PostgresConnectionsExhausted and PostgresDeadlockDetected to fire from real Prometheus metrics.

Tools

All four tools are read-only — they use SELECT queries and never issue INSERT, UPDATE, DELETE, DROP, or any DDL. The tools connect with a short timeout (5 seconds) and close the connection after each call to avoid contributing to connection pressure during an incident.

pg_stat_activity

Queries pg_stat_activity grouped by (datname, state, wait_event_type, wait_event). Returns connection counts per state (active, idle, idle in transaction), max query duration, and total connections vs. max_connections. Use this to diagnose connection pool exhaustion or blocked queries.

pg_stat_database

Queries pg_stat_database for accumulated stats: transaction commits/rollbacks, deadlock count, disk reads vs. cache hits (used to compute cache hit ratio %), and table mutation rates (inserts/updates/deletes). Use this to detect deadlocks, poor cache hit ratio, or abnormal data churn.

pg_stat_replication

Queries pg_stat_replication for all connected standbys: write lag, flush lag, and replay lag in both seconds and bytes. Only returns data when the connected server is a primary with at least one replica. Use this for replication lag alerts.

pg_locks

Joins pg_locks with pg_stat_activity to show active locks, waiting queries, lock modes, table names, and query text. Returns both granted and pending locks, ranked by duration. Use this to diagnose deadlocks or long-running transactions blocking others.

Tool Parameters

ToolParameterTypeDefaultDescription
pg_stat_activitydatnamestring""Filter by database name; empty returns all non-template databases
pg_stat_databasedatnamestring""Filter by database name; empty returns all non-template databases
pg_stat_replication(none)Returns all replica connections
pg_locksdatnamestring""Filter by database name; empty returns all databases

Action Proposals

The Supervisor maps each incident_type to one of three safe database actions. The action operates on the <datname> extracted from the incident target (postgres/<datname>).
Terminates all connections to the named database. This is the most aggressive action and is reserved for incidents where the database is effectively unreachable due to saturation or cascading failures.
Incident typeProposed action
app_crashpg_terminate_backend <datname>
dependency_failurepg_terminate_backend <datname>
restart_looppg_terminate_backend <datname>
memory_pressurepg_terminate_backend <datname>
pg_terminate_backend disconnects all active clients from the specified database. Applications will immediately lose their connections and must reconnect. Use this action only when the database is already non-functional for those clients and a clean slate is necessary. Connection poolers (PgBouncer, pgpool) will automatically reconnect on the next query.

Action Execution Details

When the proposed action is approved, routers/actions.py executes it via _run_pg_command:
ActionSQL executedSide effects
pg_stat_activity <datname>SELECT pid, usename, state, ... FROM pg_stat_activity WHERE datname = %sNone — read-only
pg_cancel_backend <datname>SELECT pid, pg_cancel_backend(pid) FROM pg_stat_activity WHERE datname = %s AND state = 'active'Cancels active queries; connections remain alive
pg_terminate_backend <datname>SELECT pid, pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = %sTerminates all connections to the database
All three commands run with autocommit=True and exclude the backend’s own PID (pid <> pg_backend_pid()) to avoid self-termination.

Simulating an Incident

The demo_postgres.sh.example script demonstrates a PostgresConnectionsExhausted incident against the sentinel_demo database. Here is the condensed simulation flow:
1

Ensure the demo database is running

docker compose up -d postgres-demo postgres-exporter
# Verify connectivity
docker exec sentinel-postgres-demo psql \
  -U sentinel -d sentinel_demo -c "SELECT version();"
2

Fire the alert

curl -s -X POST http://localhost:8000/api/alerts \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "status": "firing",
    "alerts": [{
      "status": "firing",
      "labels": {
        "alertname": "PostgresConnectionsExhausted",
        "severity": "critical",
        "datname": "sentinel_demo",
        "source_type": "database"
      },
      "annotations": {
        "summary": "DB sentinel_demo has no available connections",
        "description": "The connection pool for sentinel_demo is at 100%. New queries are being blocked."
      },
      "startsAt": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"
    }]
  }'
3

Review the PostgresAgent analysis

The agent calls pg_stat_activity sentinel_demo and pg_stat_database sentinel_demo to retrieve live connection counts, query states, and cache hit ratio. It cross-references the runbooks-postgres collection and any similar past incidents before proposing pg_terminate_backend sentinel_demo (classified as dependency_failure or app_crash).
4

Approve the action

In the dashboard at http://localhost:5173, review the structured analysis and approve the proposed action. The backend executes pg_terminate_backend and transitions the incident to Verifying, then Resolved.

Investigation Flow

Alert received (source_type=database or target starts with postgres/)


Supervisor classifies incident_type


PostgresAgent.investigate(ctx)
      ├── recall_runbooks("connections exhausted ...", k=3)  → runbooks-postgres
      ├── recall_similar_incidents("...", k=3)              → incidents-postgres


ReAct loop (max 4 iterations)
      ├── pg_stat_activity <datname>      ← connection states, blocking queries
      ├── pg_stat_database <datname>      ← deadlocks, cache hit ratio
      ├── pg_locks <datname>              ← lock chains (if deadlock suspected)
      └── pg_stat_replication             ← lag (if replication alert)


Final analysis (markdown with Causa Raíz / Evidencia / Acciones)


Supervisor._build_proposed_action()
      │  ├── memory_pressure   → pg_terminate_backend sentinel_demo
      │  ├── cpu_throttling    → pg_cancel_backend sentinel_demo
      │  └── unknown           → pg_stat_activity sentinel_demo

Status: awaiting_approval → human approves → verifying → resolved

Build docs developers (and LLMs) love