The PostgresAgent is Sentinel’s specialist for database incidents. It connects to PostgreSQL viaDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/nicolas344/Sentinel-SoftServe/llms.txt
Use this file to discover all available pages before exploring further.
psycopg2, queries the pg_stat_* system views, and produces a structured diagnosis of connection exhaustion, deadlocks, long-running transactions, replication lag, and cache hit ratio degradation. Every action it proposes is a safe, targeted operation — from read-only observation (pg_stat_activity) to graceful query cancellation (pg_cancel_backend) or full connection termination (pg_terminate_backend) — and all require explicit human approval before execution.
Runtime Detection
The PostgresAgent claims an alert when either of the following is true:- The alert label
source_type=databaseis set. - The
targetfield starts withpostgres/.
Target Format
PostgreSQL incidents must use thepostgres/<datname> target format, where <datname> is the PostgreSQL database name:
| Example target | Database name extracted |
|---|---|
postgres/sentinel_demo | sentinel_demo |
postgres/myapp_prod | myapp_prod |
Prerequisites
Connection Configuration
The tools build a connection DSN in the following priority order:Individual connection variables
If
POSTGRES_DSN is not set, the agent assembles the DSN from individual variables:| Variable | Default | Description |
|---|---|---|
POSTGRES_HOST | localhost | PostgreSQL host |
POSTGRES_PORT | 5432 | PostgreSQL port |
POSTGRES_USER | sentinel | Database user |
POSTGRES_PASSWORD | (empty) | Password |
POSTGRES_DB | sentinel_demo | Database name for the connection |
postgres-demo service in docker-compose.yml, pre-configured with the sentinel user, sentinel123 password, and sentinel_demo database. Its metrics are scraped by the postgres-exporter service:
PostgresConnectionsExhausted and PostgresDeadlockDetected to fire from real Prometheus metrics.
Tools
All four tools are read-only — they useSELECT queries and never issue INSERT, UPDATE, DELETE, DROP, or any DDL. The tools connect with a short timeout (5 seconds) and close the connection after each call to avoid contributing to connection pressure during an incident.
pg_stat_activity
Queries
pg_stat_activity grouped by (datname, state, wait_event_type, wait_event). Returns connection counts per state (active, idle, idle in transaction), max query duration, and total connections vs. max_connections. Use this to diagnose connection pool exhaustion or blocked queries.pg_stat_database
Queries
pg_stat_database for accumulated stats: transaction commits/rollbacks, deadlock count, disk reads vs. cache hits (used to compute cache hit ratio %), and table mutation rates (inserts/updates/deletes). Use this to detect deadlocks, poor cache hit ratio, or abnormal data churn.pg_stat_replication
Queries
pg_stat_replication for all connected standbys: write lag, flush lag, and replay lag in both seconds and bytes. Only returns data when the connected server is a primary with at least one replica. Use this for replication lag alerts.pg_locks
Joins
pg_locks with pg_stat_activity to show active locks, waiting queries, lock modes, table names, and query text. Returns both granted and pending locks, ranked by duration. Use this to diagnose deadlocks or long-running transactions blocking others.Tool Parameters
| Tool | Parameter | Type | Default | Description |
|---|---|---|---|---|
pg_stat_activity | datname | string | "" | Filter by database name; empty returns all non-template databases |
pg_stat_database | datname | string | "" | Filter by database name; empty returns all non-template databases |
pg_stat_replication | (none) | — | — | Returns all replica connections |
pg_locks | datname | string | "" | Filter by database name; empty returns all databases |
Action Proposals
The Supervisor maps eachincident_type to one of three safe database actions. The action operates on the <datname> extracted from the incident target (postgres/<datname>).
- pg_terminate_backend
- pg_cancel_backend
- pg_stat_activity (read-only)
Terminates all connections to the named database. This is the most aggressive action and is reserved for incidents where the database is effectively unreachable due to saturation or cascading failures.
| Incident type | Proposed action |
|---|---|
app_crash | pg_terminate_backend <datname> |
dependency_failure | pg_terminate_backend <datname> |
restart_loop | pg_terminate_backend <datname> |
memory_pressure | pg_terminate_backend <datname> |
Action Execution Details
When the proposed action is approved,routers/actions.py executes it via _run_pg_command:
| Action | SQL executed | Side effects |
|---|---|---|
pg_stat_activity <datname> | SELECT pid, usename, state, ... FROM pg_stat_activity WHERE datname = %s | None — read-only |
pg_cancel_backend <datname> | SELECT pid, pg_cancel_backend(pid) FROM pg_stat_activity WHERE datname = %s AND state = 'active' | Cancels active queries; connections remain alive |
pg_terminate_backend <datname> | SELECT pid, pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = %s | Terminates all connections to the database |
autocommit=True and exclude the backend’s own PID (pid <> pg_backend_pid()) to avoid self-termination.
Simulating an Incident
Thedemo_postgres.sh.example script demonstrates a PostgresConnectionsExhausted incident against the sentinel_demo database. Here is the condensed simulation flow:
Review the PostgresAgent analysis
The agent calls
pg_stat_activity sentinel_demo and pg_stat_database sentinel_demo to retrieve live connection counts, query states, and cache hit ratio. It cross-references the runbooks-postgres collection and any similar past incidents before proposing pg_terminate_backend sentinel_demo (classified as dependency_failure or app_crash).