The KubernetesAgent investigates Kubernetes workload incidents using the official Python kubernetes SDK. It covers the common pod and deployment failure modes (CrashLoopBackOff, OOMKilled, ImagePullBackOff, Pending due to resource constraints, and deployment replica mismatches) through six read-only tools that query the cluster's API server without modifying any state.

Runtime Detection

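The KubernetesAgent never matches alerts labeled source_type=database. For all other alerts it activates when either of the following is true (both checks are case-insensitive):
  1. The alert label container_runtime=kubernetes is set.
  2. The target field starts with pod/ or deployment/, regardless of runtime labels.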
# services/agents/kubernetes/agent.py — KubernetesAgent.matches()
def matches(self, ctx: IncidentContext) -> bool:
    source = (ctx.labels.get("source_type") or "").lower()
    if source == "database":
        return False
    runtime = (ctx.labels.get("container_runtime") or "").lower()
    if runtime == "kubernetes":
        return True
    target = (ctx.target or "").lower()
    return target.startswith("pod/") or target.startswith("deployment/")
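
To see how these rules combine, the following sketch re-implements the same decision order as a standalone function. `Ctx` is a hypothetical stand-in for `IncidentContext`; it is assumed to expose only the `labels` and `target` fields that `matches()` reads.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Ctx:
    """Hypothetical stand-in for IncidentContext (only the fields matches() reads)."""
    target: Optional[str] = None
    labels: dict = field(default_factory=dict)


def matches(ctx: Ctx) -> bool:
    # Same decision order as KubernetesAgent.matches() shown above.
    if (ctx.labels.get("source_type") or "").lower() == "database":
        return False
    if (ctx.labels.get("container_runtime") or "").lower() == "kubernetes":
        return True
    target = (ctx.target or "").lower()
    return target.startswith(("pod/", "deployment/"))


print(matches(Ctx(labels={"container_runtime": "kubernetes"})))              # True
print(matches(Ctx(target="pod/crash-test")))                                 # True
print(matches(Ctx(target="pod/pg-0", labels={"source_type": "database"})))   # False
print(matches(Ctx(target="web-1")))                                          # False
```

Note that the database check runs first, so a `pod/` target is still rejected when the alert carries `source_type=database`.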

Target Format

Kubernetes incidents must use a prefixed target that identifies both the resource type and the name:
| Resource | Target format | Example |
| --- | --- | --- |
| Pod | pod/<name> | pod/crash-test |
| Deployment | deployment/<name> | deployment/api-server |

The namespace is read from the alert label namespace (defaults to default).
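
As an illustration of the format, a target and its labels could be resolved into a resource kind, name, and namespace roughly as follows. `parse_target` is a hypothetical helper written for this page, not a function from the Sentinel codebase.

```python
from typing import Tuple


def parse_target(target: str, labels: dict) -> Tuple[str, str, str]:
    """Hypothetical helper: split '<kind>/<name>' and resolve the namespace."""
    kind, sep, name = target.partition("/")
    if not sep or kind not in ("pod", "deployment") or not name:
        raise ValueError(f"expected pod/<name> or deployment/<name>, got {target!r}")
    # The namespace comes from the alert label and falls back to "default".
    namespace = labels.get("namespace") or "default"
    return kind, name, namespace


print(parse_target("pod/crash-test", {}))
# ('pod', 'crash-test', 'default')
print(parse_target("deployment/api-server", {"namespace": "prod"}))
# ('deployment', 'api-server', 'prod')
```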

Kubeconfig Loading

The tools initialize the Kubernetes SDK once and cache the result. Configuration sources are checked in the following order, and the first one that applies is used:
1. K8S_PROXY_URL (highest priority)

If K8S_PROXY_URL is set, the SDK connects to that HTTP endpoint directly with TLS verification disabled. This is the recommended approach for Docker Desktop environments where 127.0.0.1:6443 is not reachable from inside the backend container.
# Start the proxy on your LAN IP before running Sentinel
kubectl proxy --port=8555 --address=<LAN_IP> --accept-hosts='.*' &
Then set in docker-compose.yml:
environment:
  K8S_PROXY_URL: "http://192.168.1.42:8555"

2. In-cluster config

When the backend runs as a pod inside the cluster, the SDK picks up the service account token and CA bundle automatically via load_incluster_config().

3. KUBECONFIG / ~/.kube/config

Falls back to the standard kubeconfig file — either from the KUBECONFIG environment variable or ~/.kube/config. This is the typical path for production deployments where the backend runs on a node with cluster credentials.

Note: Docker Desktop on macOS/Windows runs Kubernetes inside a VM. The API server listens on 127.0.0.1:6443 inside the VM, which is not the same 127.0.0.1 that Sentinel's backend container sees. Run kubectl proxy on the LAN IP of your machine (not localhost) and set K8S_PROXY_URL to that address. The --accept-hosts='.*' flag is required so the proxy accepts connections from outside localhost.
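
The loading order can be sketched with the Python SDK as shown below. This is a minimal sketch, not Sentinel's actual initializer: the function names are hypothetical, and in-cluster detection via the `KUBERNETES_SERVICE_HOST` / `KUBERNETES_SERVICE_PORT` environment variables is an assumption (the real code may instead call `load_incluster_config()` and catch `ConfigException`).

```python
import os
from functools import lru_cache
from typing import Mapping


def resolve_config_source(env: Mapping[str, str]) -> str:
    """Return which source would win: 'proxy', 'in-cluster', or 'kubeconfig'."""
    if env.get("K8S_PROXY_URL"):
        return "proxy"
    # Assumption: detect in-cluster mode via the env vars Kubernetes injects into pods.
    if env.get("KUBERNETES_SERVICE_HOST") and env.get("KUBERNETES_SERVICE_PORT"):
        return "in-cluster"
    return "kubeconfig"


@lru_cache(maxsize=1)  # initialize once, reuse the result on later calls
def init_kubernetes() -> str:
    # Imported lazily so the selection logic above works without the SDK installed.
    from kubernetes import client, config

    source = resolve_config_source(os.environ)
    if source == "proxy":
        cfg = client.Configuration()
        cfg.host = os.environ["K8S_PROXY_URL"]
        cfg.verify_ssl = False  # kubectl proxy serves plain HTTP
        client.Configuration.set_default(cfg)
    elif source == "in-cluster":
        config.load_incluster_config()
    else:
        config.load_kube_config()  # honors KUBECONFIG, else ~/.kube/config
    return source


print(resolve_config_source({"K8S_PROXY_URL": "http://192.168.1.42:8555"}))  # proxy
```

Keeping the selection logic in a pure function makes the precedence easy to check without a cluster; only `init_kubernetes()` touches the SDK.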

Tools

All six tools are read-only and use the CoreV1Api and AppsV1Api clients from the official Kubernetes Python SDK. They never create, patch, or delete any cluster resource.

get_pod_status

Pod phase (Running/Pending/Failed), container states (including CrashLoopBackOff, OOMKilled, ImagePullBackOff), restart count per container, assigned node, and pod IP. Start here for any pod incident.

describe_pod

Full pod spec: CPU/memory requests and limits, liveness and readiness probe configuration, volume names, init containers, environment variable count, service account, and node selector.

get_pod_logs

Last N lines of pod logs (default 100, max 500). Automatically falls back to previous=True when the current container is not running — essential for CrashLoopBackOff diagnosis.
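
The tail limit and the previous-container fallback might look roughly like this. It is a sketch under assumptions: `fetch_pod_logs` is a hypothetical name, `api` is expected to behave like `kubernetes.client.CoreV1Api`, and `FakeApi` exists only so the example runs without a cluster.

```python
from types import SimpleNamespace

MAX_TAIL = 500


def fetch_pod_logs(api, pod_name: str, namespace: str = "default", tail: int = 100) -> str:
    """Sketch: `api` is expected to behave like kubernetes.client.CoreV1Api."""
    tail = max(1, min(int(tail), MAX_TAIL))  # clamp to 1..500
    pod = api.read_namespaced_pod(pod_name, namespace)
    statuses = pod.status.container_statuses or []
    running = any(s.state and s.state.running for s in statuses)
    # If containers exist but none is running, read the previous (crashed) instance.
    previous = bool(statuses) and not running
    return api.read_namespaced_pod_log(
        pod_name, namespace, tail_lines=tail, previous=previous
    )


class FakeApi:
    """Stand-in so the sketch runs without a cluster: one container, not running."""

    def read_namespaced_pod(self, name, namespace):
        waiting = SimpleNamespace(running=None)
        return SimpleNamespace(
            status=SimpleNamespace(container_statuses=[SimpleNamespace(state=waiting)])
        )

    def read_namespaced_pod_log(self, name, namespace, tail_lines, previous):
        return f"tail_lines={tail_lines} previous={previous}"


print(fetch_pod_logs(FakeApi(), "crash-test", tail=9999))
# tail_lines=500 previous=True
```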

get_pod_events

Kubernetes events for a specific pod, sorted by last_timestamp descending (up to 20). Captures OOMKill, BackOff, FailedMount, FailedScheduling, and Pulling/Failed image events.

get_deployment_status

Desired, ready, available, and updated replica counts; update strategy type; deployment conditions (Available, Progressing); and pod selector labels. Use for replica mismatch or stuck rollout incidents.
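
As a rough illustration of the replica fields, the summary below is computed from an object shaped like `kubernetes.client.V1Deployment` (as returned by `AppsV1Api.read_namespaced_deployment`). `replica_summary` and its output keys are hypothetical, not the tool's actual return format.

```python
from types import SimpleNamespace


def replica_summary(deployment) -> dict:
    """`deployment` is expected to look like a kubernetes.client.V1Deployment."""
    status = deployment.status
    desired = deployment.spec.replicas or 0
    # The API omits zero-valued counts, so the SDK reports them as None.
    ready = status.ready_replicas or 0
    return {
        "desired": desired,
        "ready": ready,
        "available": status.available_replicas or 0,
        "updated": status.updated_replicas or 0,
        "mismatch": ready < desired,
    }


demo = SimpleNamespace(
    spec=SimpleNamespace(replicas=3),
    status=SimpleNamespace(ready_replicas=1, available_replicas=None, updated_replicas=3),
)
print(replica_summary(demo))
# {'desired': 3, 'ready': 1, 'available': 0, 'updated': 3, 'mismatch': True}
```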

list_failing_pods

All pods in the namespace that are not Running or Succeeded. Includes phase, problem description (CrashLoopBackOff, ImagePullBackOff, NotReady), total restart count, and assigned node.

Tool Parameters

| Tool | Parameter | Type | Default | Description |
| --- | --- | --- | --- | --- |
| get_pod_status | pod_name | string | | Pod name |
| get_pod_status | namespace | string | "default" | Kubernetes namespace |
| describe_pod | pod_name | string | | Pod name |
| describe_pod | namespace | string | "default" | Kubernetes namespace |
| get_pod_logs | pod_name | string | | Pod name |
| get_pod_logs | namespace | string | "default" | Kubernetes namespace |
| get_pod_logs | tail | integer | 100 | Lines to return (max 500) |
| get_pod_events | pod_name | string | | Pod name |
| get_pod_events | namespace | string | "default" | Kubernetes namespace |
| get_deployment_status | deployment_name | string | | Deployment name |
| get_deployment_status | namespace | string | "default" | Kubernetes namespace |
| list_failing_pods | namespace | string | "default" | Namespace to inspect |

Failure Modes the Agent Recognizes

The KubernetesAgent's system prompt instructs it to diagnose the following Kubernetes-specific failure patterns:

CrashLoopBackOff

state.waiting.reason = "CrashLoopBackOff", high restart count, non-zero exit code. Caused by bad config, missing env vars, immediate OOMKill, or aggressive liveness probes.

OOMKilled

state.terminated.reason = "OOMKilled", exit code 137. Memory limit too low or application memory leak. Remediation: increase limit or investigate heap.

ImagePullBackOff

state.waiting.reason = "ImagePullBackOff" or "ErrImagePull". Missing tag, wrong registry, expired ImagePullSecret, or network/DNS issue.

Pending — Unschedulable

phase = "Pending" with events listing Insufficient cpu/memory, Unschedulable, unmatched nodeSelector, or missing toleration for a taint.

Pod NotReady

conditions[Ready].status = "False" while the pod is Running. Application still starting, readiness probe too strict, or external dependency unavailable.

Replica Mismatch

ready_replicas < spec.replicas. Caused by pods in CrashLoopBackOff, node unavailability, or a blocking PodDisruptionBudget.
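
The pod-level signals above can be expressed as a simple rule-based classifier. In Sentinel this diagnosis is made by the agent following its system prompt, so the code below is only an illustrative sketch; `classify_container` is hypothetical, and `state` is assumed to be a dict shaped like the Kubernetes `containerStatus.state` object.

```python
from typing import Optional


def classify_container(phase: str, ready: bool, state: dict) -> Optional[str]:
    """Map one container's state to a failure mode listed above, or None if healthy.

    `state` is a hypothetical dict such as {"waiting": {"reason": "CrashLoopBackOff"}}.
    """
    waiting = (state.get("waiting") or {}).get("reason")
    terminated = state.get("terminated") or {}
    if terminated.get("reason") == "OOMKilled":
        return "OOMKilled"
    if waiting == "CrashLoopBackOff":
        return "CrashLoopBackOff"
    if waiting in ("ImagePullBackOff", "ErrImagePull"):
        return "ImagePullBackOff"
    if phase == "Pending":
        # Pod events are still needed to confirm why the pod is unschedulable.
        return "Pending"
    if phase == "Running" and not ready:
        return "NotReady"
    return None


print(classify_container("Running", False, {"waiting": {"reason": "CrashLoopBackOff"}}))
# CrashLoopBackOff
print(classify_container("Running", False, {"terminated": {"reason": "OOMKilled", "exitCode": 137}}))
# OOMKilled
print(classify_container("Running", True, {"running": {}}))
# None
```

Replica mismatch is a deployment-level condition and is therefore not covered by this per-container check.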

Action Proposals

The Supervisor’s _build_proposed_action function maps resource type and incident type to one of three whitelisted kubectl commands. All commands require human approval before execution.
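For pod/<name> targets with a crash-related incident type, the Supervisor proposes a pod deletion. If the pod is managed by a controller such as a ReplicaSet, Kubernetes recreates it; a standalone pod (for example, one created with kubectl run) is not recreated.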
| Incident type | Proposed action |
| --- | --- |
| app_crash | kubectl delete pod <name> -n <namespace> |
| oom | kubectl delete pod <name> -n <namespace> |
| restart_loop | kubectl delete pod <name> -n <namespace> |
| dependency_failure | kubectl delete pod <name> -n <namespace> |
| config_error | kubectl delete pod <name> -n <namespace> |

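
The pod branch of this mapping could be sketched as follows. `propose_action` is a hypothetical illustration and not the Supervisor's `_build_proposed_action`; it covers only the pod rows listed in the table and returns None for everything else.

```python
from typing import Optional

# Incident types taken from the table above.
POD_RESTART_TYPES = {"app_crash", "oom", "restart_loop", "dependency_failure", "config_error"}


def propose_action(target: str, incident_type: str, namespace: str = "default") -> Optional[str]:
    """Hypothetical sketch of the pod branch; returns None when nothing is proposed."""
    kind, _, name = target.partition("/")
    if kind == "pod" and name and incident_type in POD_RESTART_TYPES:
        return f"kubectl delete pod {name} -n {namespace}"
    return None  # other resource types are outside the scope of this sketch


print(propose_action("pod/crash-test", "restart_loop"))
# kubectl delete pod crash-test -n default
```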
The action executor in routers/actions.py validates commands with _validate_kubernetes_command, which rejects any token containing shell metacharacters (;, &&, |, `, $(, >, <, \n) and enforces RFC 1123 name patterns via ^[a-z0-9][a-z0-9\-.]{0,251}[a-z0-9]$.
The action validator also accepts kubectl scale deployment/<name> --replicas=<0-10> -n <namespace> as a valid command format (replicas must be an integer from 0 to 10). The Supervisor never auto-generates a scale proposal — this format is available only if an operator manually sets it as the proposed_action value on an incident record before approval.
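
A validator along these lines might look like the sketch below. It is not the code of `_validate_kubernetes_command`: `is_allowed` is a hypothetical name, the exact token layout (six whitespace-separated tokens with `-n <namespace>` last) is an assumption, and only the delete-pod and scale formats described above are covered. The metacharacter list and the name pattern are taken from the text.

```python
import re

NAME_RE = re.compile(r"^[a-z0-9][a-z0-9\-.]{0,251}[a-z0-9]$")
FORBIDDEN = (";", "&&", "|", "`", "$(", ">", "<", "\n")


def is_allowed(command: str) -> bool:
    """Hypothetical validator for the delete-pod and scale command formats."""
    if any(bad in command for bad in FORBIDDEN):
        return False
    tokens = command.split()
    if len(tokens) != 6 or tokens[0] != "kubectl" or tokens[4] != "-n":
        return False
    if not NAME_RE.match(tokens[5]):  # namespace
        return False
    if tokens[1:3] == ["delete", "pod"]:
        return bool(NAME_RE.match(tokens[3]))
    if tokens[1] == "scale":
        kind, _, name = tokens[2].partition("/")
        replicas = re.fullmatch(r"--replicas=([0-9]+)", tokens[3])
        return bool(
            kind == "deployment"
            and NAME_RE.match(name)
            and replicas
            and 0 <= int(replicas.group(1)) <= 10
        )
    return False


print(is_allowed("kubectl delete pod crash-test -n default"))                      # True
print(is_allowed("kubectl delete pod crash-test -n default; rm -rf /"))            # False
print(is_allowed("kubectl scale deployment/api-server --replicas=3 -n default"))   # True
print(is_allowed("kubectl scale deployment/api-server --replicas=11 -n default"))  # False
```

Because the name pattern requires a leading and a trailing alphanumeric character, single-character names are rejected by this regex as written.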

Simulating an Incident

The demo_kubernetes.sh.example script automates the full simulation. Here is the core sequence:
1. Deploy a pod that crashes in a loop

kubectl run crash-test \
  --image=busybox:latest \
  --restart=Always \
  --namespace=default \
  -- /bin/sh -c "echo 'sentinel kubernetes crash test'; sleep 2; exit 1"

# Wait for CrashLoopBackOff
sleep 15
kubectl get pod crash-test -n default
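Once the pod has restarted a few times you should see output similar to the following. The restart count and age will vary; if STATUS still shows Error, wait a little longer and run the command again: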
NAME         READY   STATUS             RESTARTS   AGE
crash-test   0/1     CrashLoopBackOff   3          40s

2. Send the Alertmanager-format payload

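# Read the current restart count of the first container for the alert description
RESTARTS=$(kubectl get pod crash-test -n default -o jsonpath='{.status.containerStatuses[0].restartCount}')
# TOKEN must already hold a valid bearer token for the Sentinel API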

curl -s -X POST http://localhost:8000/api/alerts \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "status": "firing",
    "alerts": [{
      "status": "firing",
      "labels": {
        "alertname": "KubePodCrashLoopBackOff",
        "severity": "critical",
        "name": "crash-test",
        "pod": "crash-test",
        "namespace": "default",
        "container": "crash-test",
        "container_runtime": "kubernetes",
        "source_type": "container"
      },
      "annotations": {
        "summary": "CrashLoopBackOff: crash-test (default)",
        "description": "Container crash-test is in CrashLoopBackOff. Restarts: '"$RESTARTS"'. Process exits with code 1 continuously."
      },
      "startsAt": "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"
    }]
  }'

3. Review the analysis

The KubernetesAgent calls get_pod_status, get_pod_events, and get_pod_logs in sequence. It identifies CrashLoopBackOff, the non-zero exit code, and the restart count. The proposed action will be kubectl delete pod crash-test -n default.

4. Approve and clean up

Approve the action in the Sentinel dashboard (http://localhost:5173). After resolution:
kubectl delete pod crash-test -n default --ignore-not-found=true --grace-period=0
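
For scripted tests, the alert from step 2 can also be built and sent from Python using only the standard library. This is a sketch: `build_alert` and `send_alert` are hypothetical helpers, and the URL and bearer-token header are simply copied from the curl command in step 2.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_alert(pod: str, namespace: str, restarts: int) -> dict:
    """Build the same Alertmanager-style body as the curl command in step 2."""
    return {
        "status": "firing",
        "alerts": [{
            "status": "firing",
            "labels": {
                "alertname": "KubePodCrashLoopBackOff",
                "severity": "critical",
                "name": pod,
                "pod": pod,
                "namespace": namespace,
                "container": pod,
                "container_runtime": "kubernetes",
                "source_type": "container",
            },
            "annotations": {
                "summary": f"CrashLoopBackOff: {pod} ({namespace})",
                "description": (
                    f"Container {pod} is in CrashLoopBackOff. Restarts: {restarts}. "
                    "Process exits with code 1 continuously."
                ),
            },
            "startsAt": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        }],
    }


def send_alert(payload: dict, token: str, url: str = "http://localhost:8000/api/alerts") -> int:
    """POST the payload to the alerts endpoint and return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {token}"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


# Print the payload for inspection; call send_alert(payload, token) to submit it.
print(json.dumps(build_alert("crash-test", "default", 3), indent=2))
```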
