The RCA Agent treats every observability platform as a pluggable data source. During an analysis run, enabled integrations are queried in parallel, each fetching its signals for the configured time window, and their results are merged into a unified evidence payload before the LLM reasoning step. Credentials for each integration are supplied through environment variables; which integrations are actually active at runtime is controlled separately by the analysis.sources list in agent.yaml. This separation means you can add credentials for an integration without activating it until you’re ready.
To enable or disable specific integrations at runtime without removing credentials, edit the analysis.sources list in agent.yaml. Only sources listed there are queried during analysis. See Agent Settings for the full configuration reference.
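The parallel query-and-merge step described above can be sketched roughly as follows. This is a minimal illustration, not the agent's actual implementation: the `collect_evidence` function, the fetcher signature, and the shape of the merged payload are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# Hypothetical fetcher signature: (window_start, window_end) -> list of signals.
Fetcher = Callable[[float, float], List[dict]]


def collect_evidence(
    enabled_sources: List[str],
    fetchers: Dict[str, Fetcher],
    window: Tuple[float, float],
) -> Dict[str, dict]:
    """Query every enabled source in parallel and merge the results.

    A source that has a fetcher (credentials present) but is not listed in
    enabled_sources is skipped, mirroring the credentials/activation split.
    """
    start, end = window
    active = [name for name in enabled_sources if name in fetchers]
    evidence: Dict[str, dict] = {}
    if not active:
        return evidence

    with ThreadPoolExecutor(max_workers=len(active)) as pool:
        # Submit all fetches first so they run concurrently.
        futures = {name: pool.submit(fetchers[name], start, end) for name in active}
        for name, future in futures.items():
            try:
                evidence[name] = {"ok": True, "signals": future.result()}
            except Exception as exc:
                # Assumption: one failing source should not abort the whole run.
                evidence[name] = {"ok": False, "error": str(exc), "signals": []}
    return evidence
```

Recording a per-source error instead of raising is a design choice made for this sketch; it lets the reasoning step proceed with partial evidence.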
Metrics
Logs
Traces
Alerting
The agent supports two metrics backends. Configure one or both; the agent queries whichever are listed under analysis.sources.
Prometheus
Prometheus is queried via its HTTP API. The agent executes PromQL expressions for CPU, memory, error-rate, and latency metrics over the analysis window and identifies time-series anomalies before passing them to the LLM.
PROMETHEUS_URL=http://prometheus:9090
PROMETHEUS_BEARER_TOKEN=
PROMETHEUS_URL
Base URL of your Prometheus server, without a trailing slash. For Prometheus deployed inside a Kubernetes cluster, use the in-cluster service address (e.g. http://prometheus-operated.monitoring.svc.cluster.local:9090).
PROMETHEUS_BEARER_TOKEN
Bearer token for Prometheus endpoints protected by token-based authentication (common on managed Prometheus services). Leave empty for unauthenticated instances. The token is sent as Authorization: Bearer <token> on every request.
To verify connectivity, run a quick instant query against the Prometheus HTTP API:
curl "${PROMETHEUS_URL}/api/v1/query" \
  --data-urlencode 'query=up' \
  -H "Authorization: Bearer ${PROMETHEUS_BEARER_TOKEN}"
A healthy response returns "status": "success" with a result array of scrape targets.
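For queries over the analysis window, Prometheus exposes a range-query endpoint (/api/v1/query_range) that takes query, start, end, and step parameters. The sketch below only assembles such a request; the `build_range_query` helper and its defaults are illustrative assumptions, and the PromQL the agent actually runs is not shown in this page.

```python
from typing import Dict, Optional, Tuple


def build_range_query(
    base_url: str,
    promql: str,
    start: float,
    end: float,
    step_seconds: int = 60,
    bearer_token: Optional[str] = None,
) -> Tuple[str, Dict[str, str], Dict[str, str]]:
    """Return (url, params, headers) for a Prometheus range query."""
    # Tolerate a trailing slash even though PROMETHEUS_URL should not have one.
    url = base_url.rstrip("/") + "/api/v1/query_range"
    params = {
        "query": promql,
        "start": f"{start:.3f}",  # Unix timestamp in seconds
        "end": f"{end:.3f}",
        "step": f"{step_seconds}s",  # resolution of the returned series
    }
    headers: Dict[str, str] = {}
    if bearer_token:
        # Only send the header when a token is configured.
        headers["Authorization"] = f"Bearer {bearer_token}"
    return url, params, headers
```

The returned tuple can be passed to any HTTP client; leaving out the Authorization header when the token is empty matches the "leave empty for unauthenticated instances" behaviour described above.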
Datadog
When Datadog is enabled, the agent uses the Datadog Metrics API to retrieve timeseries data and the Events API to pull correlated Datadog monitors that fired during the analysis window.
DATADOG_API_KEY=dd_api_...
DATADOG_APP_KEY=dd_app_...
DATADOG_SITE=datadoghq.com
DATADOG_API_KEY
Datadog API key, found under Organization Settings → API Keys. This key authorizes read access to metrics, logs, and events.
DATADOG_APP_KEY
Datadog Application key, found under Organization Settings → Application Keys. Required in addition to the API key to query metric data programmatically.
DATADOG_SITE (string, default: "datadoghq.com")
The Datadog regional site your account is hosted on. Use datadoghq.eu for EU accounts, us3.datadoghq.com for US3, or ddog-gov.com for GovCloud. This value determines the base URL for all API calls.
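To illustrate how the site value shapes each call, the sketch below derives the API base URL from DATADOG_SITE and assembles a timeseries query against Datadog's v1 metrics query endpoint, authenticated with the DD-API-KEY and DD-APPLICATION-KEY headers. The `build_datadog_metrics_query` helper is hypothetical, and the metric query in the usage note is only a placeholder.

```python
from typing import Dict, Tuple


def build_datadog_metrics_query(
    site: str,
    api_key: str,
    app_key: str,
    query: str,
    start_epoch_s: int,
    end_epoch_s: int,
) -> Tuple[str, Dict[str, str], Dict[str, str]]:
    """Return (url, params, headers) for a Datadog timeseries query."""
    # The regional site is prefixed with "api." to form the API host.
    url = f"https://api.{site}/api/v1/query"
    params = {
        "from": str(start_epoch_s),  # window start, Unix seconds
        "to": str(end_epoch_s),      # window end, Unix seconds
        "query": query,
    }
    headers = {
        "DD-API-KEY": api_key,
        "DD-APPLICATION-KEY": app_key,
    }
    return url, params, headers
```

For example, passing site="datadoghq.eu" and a placeholder query such as "avg:system.cpu.user{*}" yields a request against https://api.datadoghq.eu/api/v1/query.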
Log data is one of the most signal-rich inputs for root cause analysis. The agent searches for error patterns, stack traces, and anomalous log rates across the analysis window using keyword and structured-field queries.
Elasticsearch
Elasticsearch (and OpenSearch, which exposes a largely compatible query API) is the primary log backend. The agent queries the specified index with a time-bounded search for ERROR and WARN level entries and extracts the top recurring message patterns.
ELASTICSEARCH_URL=http://elasticsearch:9200
ELASTICSEARCH_USERNAME=elastic
ELASTICSEARCH_PASSWORD=changeme
ELASTICSEARCH_INDEX=logs-*
ELASTICSEARCH_URL
Base URL of the Elasticsearch cluster. For Elastic Cloud, use the HTTPS endpoint from your deployment dashboard (e.g. https://my-deployment.es.us-east-1.aws.elastic-cloud.com:443).
ELASTICSEARCH_USERNAME
Username for HTTP Basic authentication. For fine-grained access control, create a dedicated read-only role with access to the log indices rather than using the superuser elastic account.
ELASTICSEARCH_PASSWORD
Password for the Elasticsearch user. Rotate this credential regularly and avoid reusing it across environments.
ELASTICSEARCH_INDEX
The index pattern to search. Accepts wildcards. Use a data stream pattern like logs-*-* for Fleet-managed Elastic Agent deployments, or a custom pattern matching your Logstash/Filebeat index naming convention.
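A time-bounded search for ERROR and WARN entries with a "top recurring messages" aggregation could look roughly like the query body below. This is a sketch, not the agent's real query: the field names (@timestamp, log.level, message.keyword) are assumptions that depend on your index mapping, and `build_error_log_search` is a hypothetical helper.

```python
from typing import Any, Dict


def build_error_log_search(
    start_iso: str,
    end_iso: str,
    top_n: int = 10,
    timestamp_field: str = "@timestamp",    # assumed field name
    level_field: str = "log.level",         # assumed field name
    message_field: str = "message.keyword",  # assumed keyword sub-field
) -> Dict[str, Any]:
    """Build a search body to POST to <ELASTICSEARCH_URL>/<index>/_search."""
    return {
        "size": 0,  # only the aggregation is needed, not individual hits
        "query": {
            "bool": {
                "filter": [
                    # Restrict to the analysis window.
                    {"range": {timestamp_field: {"gte": start_iso, "lte": end_iso}}},
                    # Keep only ERROR and WARN level entries.
                    {"terms": {level_field: ["ERROR", "WARN"]}},
                ]
            }
        },
        "aggs": {
            # Most frequent exact messages within the window.
            "top_messages": {"terms": {"field": message_field, "size": top_n}}
        },
    }
```

A terms aggregation on the exact message only groups identical strings; messages that embed request IDs or timestamps would need normalisation before they cluster into patterns.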
Loki
Grafana Loki is queried via its HTTP API using LogQL. The agent issues a range query for {level="error"} log streams over the analysis window and parses returned log lines for recurring error signatures.
LOKI_URL=http://loki:3100
LOKI_TENANT_ID=
LOKI_URL
Base URL of your Loki instance. For Grafana Cloud Loki, use the cluster-specific URL from your Grafana Cloud portal (e.g. https://logs-prod-us-central1.grafana.net).
LOKI_TENANT_ID
The X-Scope-OrgID header value for multi-tenant Loki deployments. Required for Grafana Cloud Loki: use the numeric org ID shown in the Grafana Cloud portal. Leave empty for single-tenant (OSS) Loki.
If your Loki instance requires HTTP Basic authentication (common on Grafana Cloud), set LOKI_URL to include credentials in the URL itself: https://user:password@logs-prod-us-central1.grafana.net.
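As a rough illustration of the Loki range query, the sketch below assembles a request for the /loki/api/v1/query_range endpoint with nanosecond timestamps and attaches X-Scope-OrgID only when a tenant ID is set. The `build_loki_range_query` helper, the limit default, and the backward direction are assumptions made for the example.

```python
from typing import Dict, Optional, Tuple


def build_loki_range_query(
    base_url: str,
    start_epoch_s: int,
    end_epoch_s: int,
    tenant_id: Optional[str] = None,
    logql: str = '{level="error"}',
    limit: int = 1000,
) -> Tuple[str, Dict[str, str], Dict[str, str]]:
    """Return (url, params, headers) for a Loki range query."""
    url = base_url.rstrip("/") + "/loki/api/v1/query_range"
    params = {
        "query": logql,
        # Loki accepts Unix epoch timestamps in nanoseconds.
        "start": str(start_epoch_s * 1_000_000_000),
        "end": str(end_epoch_s * 1_000_000_000),
        "limit": str(limit),
        "direction": "backward",  # newest entries first
    }
    headers: Dict[str, str] = {}
    if tenant_id:
        # Only multi-tenant deployments need the tenant header.
        headers["X-Scope-OrgID"] = tenant_id
    return url, params, headers
```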
Distributed traces reveal which service calls were slow or failing during an incident. The agent retrieves traces from the analysis window, identifies spans with high error rates or elevated latency, and correlates them with log and metric anomalies.
Jaeger
Jaeger is queried via its Query Service HTTP API. The agent retrieves traces for all services that appear in the log and metric signals already collected, focusing on spans with error=true tags.
JAEGER_QUERY_URL=http://jaeger:16686
JAEGER_QUERY_URL
Base URL of the Jaeger Query service. The agent calls the /api/traces and /api/services endpoints on this host. For Jaeger deployed on Kubernetes, use the in-cluster service address (e.g. http://jaeger-query.observability.svc.cluster.local:16686).
To verify the Jaeger connection and list available services:
curl "${JAEGER_QUERY_URL}/api/services"
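A per-service trace lookup filtered to error spans might be assembled as in the sketch below. It assumes the parameter conventions used by the Jaeger UI's /api/traces endpoint (microsecond timestamps and a JSON-encoded tags map); that HTTP API is not formally versioned, so treat the parameter names as assumptions to verify against your Jaeger release. `build_jaeger_trace_query` is a hypothetical helper.

```python
import json
from typing import Dict, Tuple


def build_jaeger_trace_query(
    base_url: str,
    service: str,
    start_epoch_s: int,
    end_epoch_s: int,
    limit: int = 20,
) -> Tuple[str, Dict[str, str]]:
    """Return (url, params) for fetching error traces of one service."""
    url = base_url.rstrip("/") + "/api/traces"
    params = {
        "service": service,
        # Assumed convention: timestamps in microseconds since the epoch.
        "start": str(start_epoch_s * 1_000_000),
        "end": str(end_epoch_s * 1_000_000),
        # Assumed convention: tag filters as a JSON-encoded string map.
        "tags": json.dumps({"error": "true"}),
        "limit": str(limit),
    }
    return url, params
```

The agent would issue one such query per service name surfaced by the log and metric signals.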
OpenTelemetry Collector
If you use an OpenTelemetry Collector as a central telemetry pipeline, you can configure the agent to export its own internal telemetry spans to it via the OTLP HTTP protocol. This allows the agent’s analysis pipeline activity to be captured alongside your application traces in your existing observability backend.
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_EXPORTER_OTLP_ENDPOINT
The OTLP HTTP endpoint of your OpenTelemetry Collector. The agent exports its own internal spans to this address, making the analysis pipeline observable via your existing tracing backend. Use port 4317 for gRPC OTLP and 4318 for HTTP/protobuf OTLP. Leave unset to disable agent telemetry export.
For collecting application traces as input to the RCA analysis, configure Jaeger (above) as your traces source rather than relying on the OTLP endpoint. OTEL_EXPORTER_OTLP_ENDPOINT is for exporting the agent’s own operational spans, not for querying your application’s trace data.
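With OTLP over HTTP, exporters following the OpenTelemetry specification append a signal-specific path (/v1/traces for spans) to the base endpoint. The sketch below shows that resolution, treating an unset or empty variable as "export disabled" as described above; the `resolve_traces_url` helper is hypothetical and does not reflect how the agent reads its configuration.

```python
from typing import Mapping, Optional


def resolve_traces_url(env: Mapping[str, str]) -> Optional[str]:
    """Return the OTLP/HTTP traces URL, or None when export is disabled."""
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip()
    if not endpoint:
        # Unset or empty: the agent's telemetry export stays off.
        return None
    # OTLP/HTTP appends the per-signal path to the base endpoint.
    return endpoint.rstrip("/") + "/v1/traces"
```

Passing os.environ as the mapping resolves the URL from the process environment.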
After completing an analysis, the agent can push a structured summary, including ranked hypotheses and supporting evidence links, to notification destinations. This enables on-call engineers to receive actionable RCA results directly in their incident channels without navigating to the Streamlit UI.
Slack
The agent posts analysis summaries to Slack using an Incoming Webhook URL. The message includes the top hypothesis, a confidence score, and a direct link to the full analysis in the Streamlit UI.
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T.../B.../...
SLACK_WEBHOOK_URL
The Incoming Webhook URL generated for your Slack app. To create one, go to api.slack.com/apps, select your app (or create one), navigate to Incoming Webhooks, and activate it for the target channel. The agent sends a payload structured like this:
{
  "text": "🔍 RCA Complete: *CassandraLatencySpike* — 2 hypotheses found",
  "blocks": [
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": "*Top Hypothesis (confidence: 0.87):*\nCompaction storm on cassandra-node-3 caused latency p99 to spike from 12ms to 340ms. Correlated with 94 ERROR log lines and a 6× increase in GC pause duration."
      }
    },
    {
      "type": "actions",
      "elements": [
        {
          "type": "button",
          "text": { "type": "plain_text", "text": "View Full Analysis" },
          "url": "http://rca-agent:8501/analysis/abc123"
        }
      ]
    }
  ]
}
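A payload of the shape shown above could be assembled and posted along these lines. This is a sketch using only the standard library: the `build_slack_payload` and `post_to_slack` helpers and their parameters are hypothetical and not the agent's own functions.

```python
import json
import urllib.request
from typing import Any, Dict


def build_slack_payload(
    alert_name: str,
    hypothesis_count: int,
    top_hypothesis: str,
    confidence: float,
    analysis_url: str,
) -> Dict[str, Any]:
    """Build a Block Kit message mirroring the sample payload."""
    return {
        # Fallback text shown in notifications (escapes: magnifier emoji, dash).
        "text": (
            f"\U0001F50D RCA Complete: *{alert_name}* "
            f"\u2014 {hypothesis_count} hypotheses found"
        ),
        "blocks": [
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"*Top Hypothesis (confidence: {confidence:.2f}):*\n{top_hypothesis}",
                },
            },
            {
                "type": "actions",
                "elements": [
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "View Full Analysis"},
                        "url": analysis_url,
                    }
                ],
            },
        ],
    }


def post_to_slack(webhook_url: str, payload: Dict[str, Any], timeout: float = 10.0) -> int:
    """POST the payload as JSON to the webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping payload construction separate from the network call makes the message format easy to check without contacting Slack.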
PagerDuty
When a high-confidence root cause is identified, the agent can acknowledge or annotate open PagerDuty incidents using the Events API v2. This enriches the incident timeline with RCA results without requiring a human to copy-paste findings.
PAGERDUTY_ROUTING_KEY=R...
PAGERDUTY_ROUTING_KEY
The integration key (routing key) for a PagerDuty Events API v2 integration. Find or create this under Services → Integrations → Add Integration → Events API v2 in the PagerDuty console. This key routes the RCA annotation to the correct service’s incident timeline.
The agent uses the PagerDuty Events API v2 (https://events.pagerduty.com/v2/enqueue) to send trigger and resolve event actions. It does not use the older Events API v1 format. Ensure the routing key comes from an Events API v2 integration, not a legacy Email or Events API v1 integration.
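An Events API v2 request body carries the routing key, an event action, and, for trigger events, a payload with summary, source, and severity. The sketch below builds such a body; the `build_pagerduty_event` helper, the default severity, and the use of a dedup key to tie events to one incident are illustrative assumptions about how the agent might call the API.

```python
from typing import Any, Dict

# Event actions accepted by the Events API v2.
VALID_ACTIONS = {"trigger", "acknowledge", "resolve"}


def build_pagerduty_event(
    routing_key: str,
    dedup_key: str,
    event_action: str = "trigger",
    summary: str = "",
    source: str = "rca-agent",  # hypothetical source name
    severity: str = "info",
) -> Dict[str, Any]:
    """Build the JSON body to POST to https://events.pagerduty.com/v2/enqueue."""
    if event_action not in VALID_ACTIONS:
        raise ValueError(f"unsupported event_action: {event_action!r}")
    event: Dict[str, Any] = {
        "routing_key": routing_key,
        "event_action": event_action,
        # The same dedup_key links later events to the same incident.
        "dedup_key": dedup_key,
    }
    if event_action == "trigger":
        # Trigger events must describe what happened.
        event["payload"] = {
            "summary": summary,
            "source": source,
            "severity": severity,
        }
    return event
```

A later call with the same dedup key and event_action="resolve" would close the alert that the trigger opened.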
Adding Custom Connectors
The integrations above cover the most common observability stacks, but your environment may use different tools. See the Custom Data Sources guide to learn how to implement the DataSource interface and register a new connector with the agent’s plugin system.