The KubernetesAgent investigates Kubernetes workload incidents using the official Python kubernetes SDK. It supports the full range of pod failure modes (CrashLoopBackOff, OOMKilled, ImagePullBackOff, Pending due to resource constraints, and deployment replica mismatches) through six read-only tools that query the cluster’s API server without modifying any state.
Runtime Detection
The KubernetesAgent activates when either of the following is true:
- The alert label container_runtime=kubernetes is set.
- The target field starts with pod/ or deployment/, regardless of runtime labels.
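The two activation rules can be expressed as a single predicate. The following is a minimal sketch, not Sentinel's actual routing code: the function name is_kubernetes_incident and the shape of the labels argument (a plain string-to-string mapping) are assumptions made for illustration.

```python
from typing import Mapping


def is_kubernetes_incident(labels: Mapping[str, str], target: str) -> bool:
    """Return True when either activation rule holds (hypothetical helper)."""
    # Rule 1: the alert explicitly declares a Kubernetes runtime.
    if labels.get("container_runtime") == "kubernetes":
        return True
    # Rule 2: the target prefix alone is enough, whatever the labels say.
    return target.startswith(("pod/", "deployment/"))
```

Because the second rule ignores labels entirely, an alert labelled with another runtime still reaches the KubernetesAgent when its target is, for example, pod/crash-test.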
Target Format
Kubernetes incidents must use a prefixed target that identifies both the resource type and the name:
| Resource | Target format | Example |
|---|---|---|
| Pod | pod/<name> | pod/crash-test |
| Deployment | deployment/<name> | deployment/api-server |
The namespace is supplied separately as namespace (defaults to default).
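A prefixed target can be split into its parts with a few lines of parsing. This sketch assumes a hypothetical parse_target helper and K8sTarget tuple; neither name comes from Sentinel's code, and the namespace is simply passed through with the documented default.

```python
from typing import NamedTuple


class K8sTarget(NamedTuple):
    kind: str       # "pod" or "deployment"
    name: str       # resource name after the slash
    namespace: str  # falls back to "default"


def parse_target(target: str, namespace: str = "default") -> K8sTarget:
    """Split 'pod/<name>' or 'deployment/<name>' into its parts (hypothetical helper)."""
    kind, sep, name = target.partition("/")
    if not sep or kind not in ("pod", "deployment") or not name:
        raise ValueError(f"expected pod/<name> or deployment/<name>, got {target!r}")
    return K8sTarget(kind, name, namespace)
```

For instance, parse_target("pod/crash-test") yields the kind pod, the name crash-test, and the namespace default.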
Kubeconfig Loading
The tools initialize the Kubernetes SDK once and cache the result. The loading order mirrors standard kubectl behavior:
K8S_PROXY_URL (highest priority)
If K8S_PROXY_URL is set, the SDK connects to that HTTP endpoint directly with TLS verification disabled. This is the recommended approach for Docker Desktop environments where 127.0.0.1:6443 is not reachable from inside the backend container. Set K8S_PROXY_URL in docker-compose.yml.
In-cluster config
When the backend runs as a pod inside the cluster, the SDK picks up the service account token and CA bundle automatically via load_incluster_config().
Docker Desktop on macOS/Windows runs Kubernetes inside a VM. The API server listens on 127.0.0.1:6443 inside the VM, which is not the same 127.0.0.1 that Sentinel’s backend container sees. Run kubectl proxy on the LAN IP of your machine (not localhost) and set K8S_PROXY_URL to that address. The --accept-hosts='.*' flag is required so the proxy accepts connections from outside localhost.
Tools
All six tools are read-only and use the CoreV1Api and AppsV1Api clients from the official Kubernetes Python SDK. They never create, patch, or delete any cluster resource.
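As a rough illustration, cached CoreV1Api and AppsV1Api clients could be built following the priority order described under Kubeconfig Loading. The helper names pick_config_source and get_clients are hypothetical. Detecting the in-cluster case through KUBERNETES_SERVICE_HOST and falling back to a local kubeconfig via load_kube_config() are assumptions of this sketch, not confirmed details of Sentinel's implementation.

```python
import os
from functools import lru_cache
from typing import Mapping


def pick_config_source(env: Mapping[str, str]) -> str:
    """Decide which configuration source wins, highest priority first."""
    if env.get("K8S_PROXY_URL"):
        return "proxy"
    # Assumption: pods get this variable injected by the kubelet.
    if env.get("KUBERNETES_SERVICE_HOST"):
        return "in-cluster"
    # Assumption: otherwise fall back to the local kubeconfig file.
    return "kubeconfig"


@lru_cache(maxsize=1)  # initialize once, then reuse the same clients
def get_clients():
    # Imported lazily so the pure helper above works without the SDK installed.
    from kubernetes import client, config

    source = pick_config_source(os.environ)
    if source == "proxy":
        cfg = client.Configuration()
        cfg.host = os.environ["K8S_PROXY_URL"]
        cfg.verify_ssl = False  # the proxy endpoint is plain HTTP
        api_client = client.ApiClient(cfg)
    elif source == "in-cluster":
        config.load_incluster_config()
        api_client = client.ApiClient()
    else:
        config.load_kube_config()
        api_client = client.ApiClient()
    return client.CoreV1Api(api_client), client.AppsV1Api(api_client)
```

Keeping the decision in a pure function makes the priority order easy to test without a cluster, while lru_cache gives the "initialize once" behavior.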
get_pod_status
Pod phase (Running/Pending/Failed), container states (including CrashLoopBackOff, OOMKilled, ImagePullBackOff), restart count per container, assigned node, and pod IP. Start here for any pod incident.
describe_pod
Full pod spec: CPU/memory requests and limits, liveness and readiness probe configuration, volume names, init containers, environment variable count, service account, and node selector.
get_pod_logs
Last N lines of pod logs (default 100, max 500). Automatically falls back to previous=True when the current container is not running, which is essential for CrashLoopBackOff diagnosis.
get_pod_events
Kubernetes events for a specific pod, sorted by last_timestamp descending (up to 20). Captures OOMKill, BackOff, FailedMount, FailedScheduling, and Pulling/Failed image events.
get_deployment_status
Desired, ready, available, and updated replica counts; update strategy type; deployment conditions (Available, Progressing); and pod selector labels. Use for replica mismatch or stuck rollout incidents.
list_failing_pods
All pods in the namespace that are not Running or Succeeded. Includes phase, problem description (CrashLoopBackOff, ImagePullBackOff, NotReady), total restart count, and assigned node.
Tool Parameters
| Tool | Parameter | Type | Default | Description |
|---|---|---|---|---|
| get_pod_status | pod_name | string | (none) | Pod name |
| get_pod_status | namespace | string | "default" | Kubernetes namespace |
| describe_pod | pod_name | string | (none) | Pod name |
| describe_pod | namespace | string | "default" | Kubernetes namespace |
| get_pod_logs | pod_name | string | (none) | Pod name |
| get_pod_logs | namespace | string | "default" | Kubernetes namespace |
| get_pod_logs | tail | integer | 100 | Lines to return (max 500) |
| get_pod_events | pod_name | string | (none) | Pod name |
| get_pod_events | namespace | string | "default" | Kubernetes namespace |
| get_deployment_status | deployment_name | string | (none) | Deployment name |
| get_deployment_status | namespace | string | "default" | Kubernetes namespace |
| list_failing_pods | namespace | string | "default" | Namespace to inspect |
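To make the get_pod_logs parameters concrete, here is a sketch of how the tail limit and the previous=True fallback might be combined. It is not the tool's real source: clamp_tail and fetch_pod_logs are hypothetical names, the lower bound of 1 is an assumption, and core is expected to be a CoreV1Api instance (read_namespaced_pod and read_namespaced_pod_log are real SDK methods).

```python
DEFAULT_TAIL = 100
MAX_TAIL = 500


def clamp_tail(tail=None) -> int:
    """Apply the documented default (100) and ceiling (500)."""
    if tail is None:
        return DEFAULT_TAIL
    # Assumption: values below 1 are raised to 1 rather than rejected.
    return max(1, min(int(tail), MAX_TAIL))


def fetch_pod_logs(core, pod_name, namespace="default", tail=DEFAULT_TAIL):
    """Read logs, switching to the previous container when none is running."""
    pod = core.read_namespaced_pod(name=pod_name, namespace=namespace)
    statuses = pod.status.container_statuses or []
    any_running = any(
        s.state is not None and s.state.running is not None for s in statuses
    )
    # Only ask for previous logs when a container exists but is not running,
    # which is the typical CrashLoopBackOff situation.
    previous = bool(statuses) and not any_running
    return core.read_namespaced_pod_log(
        name=pod_name,
        namespace=namespace,
        tail_lines=clamp_tail(tail),
        previous=previous,
    )
```

Passing the client in as an argument keeps the helper independent of how the SDK was configured and lets it be exercised with a stub.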
Failure Modes the Agent Recognizes
The KubernetesAgent’s system prompt instructs it to diagnose the following Kubernetes-specific failure patterns:
CrashLoopBackOff
state.waiting.reason = "CrashLoopBackOff", high restart count, non-zero exit code. Caused by bad config, missing env vars, immediate OOMKill, or aggressive liveness probes.
OOMKilled
state.terminated.reason = "OOMKilled", exit code 137. Memory limit too low or application memory leak. Remediation: increase limit or investigate heap.
ImagePullBackOff
state.waiting.reason = "ImagePullBackOff" or "ErrImagePull". Missing tag, wrong registry, expired ImagePullSecret, or network/DNS issue.
Pending (Unschedulable)
phase = "Pending" with events listing Insufficient cpu/memory, Unschedulable, unmatched nodeSelector, or missing toleration for a taint.
Pod NotReady
conditions[Ready].status = "False" while the pod is Running. Application still starting, readiness probe too strict, or external dependency unavailable.
Replica Mismatch
ready_replicas < spec.replicas. Caused by pods in CrashLoopBackOff, node unavailability, or a blocking PodDisruptionBudget.
Action Proposals
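Since the proposed action depends on the diagnosed incident, it helps to see how the patterns listed under Failure Modes the Agent Recognizes reduce to a few field checks. The sketch below works on plain dictionaries shaped like a container state; the helper names and the returned labels are illustrative assumptions, and the real agent reasons over tool output through its system prompt rather than through code like this.

```python
from typing import Optional

IMAGE_PULL_REASONS = {"ImagePullBackOff", "ErrImagePull"}


def classify_container_state(state: dict) -> Optional[str]:
    """Map a container state dict onto one of the documented failure modes."""
    waiting = state.get("waiting") or {}
    terminated = state.get("terminated") or {}
    if waiting.get("reason") == "CrashLoopBackOff":
        return "CrashLoopBackOff"
    if waiting.get("reason") in IMAGE_PULL_REASONS:
        return "ImagePullBackOff"
    if terminated.get("reason") == "OOMKilled":
        return "OOMKilled"
    return None  # no container-level failure pattern matched


def has_replica_mismatch(ready_replicas: Optional[int], spec_replicas: int) -> bool:
    """ready_replicas is None in the API when no replica is ready."""
    return (ready_replicas or 0) < spec_replicas
```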
The Supervisor’s _build_proposed_action function maps resource type and incident type to one of three whitelisted kubectl commands. All commands require human approval before execution.
- Pod incidents
- Deployment incidents
For pod/<name> targets with a crash-related incident type, the agent proposes a pod deletion (Kubernetes recreates it from the ReplicaSet, provided the pod is managed by one):
| Incident type | Proposed action |
|---|---|
| app_crash | kubectl delete pod <name> -n <namespace> |
| oom | kubectl delete pod <name> -n <namespace> |
| restart_loop | kubectl delete pod <name> -n <namespace> |
| dependency_failure | kubectl delete pod <name> -n <namespace> |
| config_error | kubectl delete pod <name> -n <namespace> |
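The pod table amounts to a small lookup. The following sketch only illustrates that mapping; propose_pod_action is a hypothetical stand-in and does not reproduce the Supervisor's _build_proposed_action, whose signature is not shown here.

```python
from typing import Optional

# Incident types from the table that lead to a pod deletion proposal.
POD_DELETE_INCIDENT_TYPES = frozenset(
    {"app_crash", "oom", "restart_loop", "dependency_failure", "config_error"}
)


def propose_pod_action(name: str, namespace: str, incident_type: str) -> Optional[str]:
    """Return the proposed kubectl command, or None when no action applies."""
    if incident_type not in POD_DELETE_INCIDENT_TYPES:
        return None
    return f"kubectl delete pod {name} -n {namespace}"
```

The returned string is only a proposal: as stated above, every command still requires human approval before execution.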
routers/actions.py validates commands with _validate_kubernetes_command, which rejects any token containing shell metacharacters (;, &&, |, `, $(, >, <, \n) and enforces RFC 1123 name patterns via ^[a-z0-9][a-z0-9\-.]{0,251}[a-z0-9]$.
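A validator along these lines could look like the sketch below. It reuses the metacharacter list and the name pattern quoted above, but the function name validate_delete_pod and the exact token layout check are assumptions; the real _validate_kubernetes_command may be structured differently and covers more command formats.

```python
import re

# Shell metacharacters listed in the text; any occurrence rejects the command.
FORBIDDEN = (";", "&&", "|", "`", "$(", ">", "<", "\n")

# RFC 1123 style name pattern, copied from the text.
NAME_RE = re.compile(r"^[a-z0-9][a-z0-9\-.]{0,251}[a-z0-9]$")


def validate_delete_pod(command: str) -> bool:
    """Accept only 'kubectl delete pod <name> -n <namespace>' with safe names."""
    # Check the raw string first, since split() would swallow newlines.
    if any(meta in command for meta in FORBIDDEN):
        return False
    tokens = command.split()
    if len(tokens) != 6:
        return False
    if tokens[:3] != ["kubectl", "delete", "pod"] or tokens[4] != "-n":
        return False
    return bool(NAME_RE.match(tokens[3]) and NAME_RE.match(tokens[5]))
```

Note that the pattern as written requires at least two characters, so a single-character pod or namespace name would be rejected.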
The action validator also accepts kubectl scale deployment/<name> --replicas=<0-10> -n <namespace> as a valid command format (replicas must be an integer from 0 to 10). The Supervisor never auto-generates a scale proposal. This format is available only if an operator manually sets it as the proposed_action value on an incident record before approval.
Simulating an Incident
The demo_kubernetes.sh.example script automates the full simulation. Here is the core sequence:
Review the analysis
The KubernetesAgent calls get_pod_status, get_pod_events, and get_pod_logs in sequence. It identifies CrashLoopBackOff, the non-zero exit code, and the restart count. The proposed action will be kubectl delete pod crash-test -n default.