Connecting Observability Integrations to the RCA Agent

The RCA Agent treats every observability platform as a pluggable data source. During an analysis run, the enabled integrations are queried in parallel, each fetching its signals for the configured time window, and their results are merged into a unified evidence payload before the LLM reasoning step. Credentials for each integration are supplied through environment variables; which integrations are active at runtime is controlled separately by the analysis.sources list in agent.yaml. This separation lets you add credentials for an integration without activating it until you’re ready.

To enable or disable specific integrations at runtime without removing credentials, edit the analysis.sources list in agent.yaml. Only sources listed there are queried during analysis. See Agent Settings for the full configuration reference.
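As a rough illustration of the fan-out described above, the sketch below queries only the sources named in an enabled list, runs them in parallel, and merges the results into one dictionary keyed by source name. The function name `collect_evidence`, the fetcher callables, and the payload shape are all hypothetical; the agent's actual internals are not documented here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical fetcher signature: (start, end) in epoch seconds -> signals dict.
Fetcher = Callable[[int, int], dict]


def collect_evidence(
    enabled: List[str],
    fetchers: Dict[str, Fetcher],
    start: int,
    end: int,
) -> Dict[str, dict]:
    """Query every enabled source in parallel and merge results by source name."""
    # Sources that have a fetcher (credentials) but are not enabled are skipped.
    active = [name for name in enabled if name in fetchers]
    if not active:
        return {}
    evidence: Dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=len(active)) as pool:
        futures = {name: pool.submit(fetchers[name], start, end) for name in active}
        for name, future in futures.items():
            try:
                evidence[name] = future.result()
            except Exception as exc:
                # Assumption: one failing source should not abort the whole run.
                evidence[name] = {"error": str(exc)}
    return evidence
```

Recording a per-source error instead of raising is a design assumption made for this sketch; it keeps a single unreachable backend from discarding the evidence gathered from the others.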

Metrics

The agent supports two metrics backends. Configure one or both; the agent queries whichever are listed under analysis.sources.

Prometheus

Prometheus is queried via its HTTP API. The agent executes PromQL expressions for CPU, memory, error-rate, and latency metrics over the analysis window and identifies time-series anomalies before passing them to the LLM.
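This page does not say how the agent detects anomalies in a series. As one simple possibility, the sketch below flags samples whose z-score exceeds a threshold. The function name `flag_anomalies` and the default threshold of 3.0 are assumptions chosen for illustration, not the agent's documented method.

```python
from statistics import mean, pstdev
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp, metric value)


def flag_anomalies(samples: List[Sample], threshold: float = 3.0) -> List[Sample]:
    """Return samples lying more than `threshold` standard deviations from the mean."""
    values = [value for _, value in samples]
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # a perfectly flat series has no outliers
    return [(ts, value) for ts, value in samples if abs(value - mu) / sigma > threshold]
```

A global z-score is easy to reason about but is skewed by the outliers themselves; a rolling window or a median-based score would be more robust on real latency data.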

.env

PROMETHEUS_URL=http://prometheus:9090
PROMETHEUS_BEARER_TOKEN=

PROMETHEUS_URL (string, required)

Base URL of your Prometheus server, without a trailing slash. For Prometheus deployed inside a Kubernetes cluster, use the in-cluster service address (e.g. http://prometheus-operated.monitoring.svc.cluster.local:9090).

PROMETHEUS_BEARER_TOKEN (string)

Bearer token for Prometheus endpoints protected by token-based authentication (common on managed Prometheus services). Leave empty for unauthenticated instances. The token is sent as Authorization: Bearer <token> on every request.

To verify connectivity, run an instant query against the Prometheus HTTP API (omit the Authorization header for unauthenticated instances):

curl "${PROMETHEUS_URL}/api/v1/query" \
  --data-urlencode 'query=up' \
  -H "Authorization: Bearer ${PROMETHEUS_BEARER_TOKEN}"

A healthy response returns "status": "success" with a data.result array containing one up series per scrape target.
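For queries over the analysis window, Prometheus exposes the /api/v1/query_range endpoint, which takes query, start, end, and step parameters. The sketch below builds such a request and attaches the bearer token only when one is set. The function name `build_range_query` and the 30s default step are assumptions; the expressions the agent actually runs are not listed on this page.

```python
import urllib.parse
import urllib.request
from typing import Optional


def build_range_query(
    base_url: str,
    promql: str,
    start: int,
    end: int,
    step: str = "30s",
    token: Optional[str] = None,
) -> urllib.request.Request:
    """Build a GET request for Prometheus's /api/v1/query_range endpoint."""
    params = urllib.parse.urlencode(
        {"query": promql, "start": start, "end": end, "step": step}
    )
    # Strip any trailing slash so the path is not doubled.
    url = f"{base_url.rstrip('/')}/api/v1/query_range?{params}"
    request = urllib.request.Request(url)
    # Send the Authorization header only when a token is configured.
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    return request
```

Passing the result to `urllib.request.urlopen` would execute the query; the builder itself performs no network I/O, which makes it easy to inspect or test.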

Datadog

When Datadog is enabled, the agent uses the Datadog Metrics API to retrieve timeseries data and the Events API to pull events from Datadog monitors that fired during the analysis window, so they can be correlated with the metrics.

.env

DATADOG_API_KEY=dd_api_...
DATADOG_APP_KEY=dd_app_...
DATADOG_SITE=datadoghq.com

DATADOG_API_KEY (string, required)

Datadog API key, found under Organization Settings → API Keys. This key authenticates the agent's requests to your Datadog organization; reading metrics and events also requires the Application key below.

DATADOG_APP_KEY (string, required)

Datadog Application key, found under Organization Settings → Application Keys. Required in addition to the API key to query metric data programmatically.

DATADOG_SITE (string, default: "datadoghq.com")

The Datadog regional site your account is hosted on. Use datadoghq.eu for EU accounts, us3.datadoghq.com for US3, or ddog-gov.com for GovCloud. This value determines the base URL for all API calls.
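To make the relationship between DATADOG_SITE and the API base URL concrete, the sketch below builds a request for Datadog's v1 timeseries query endpoint, which lives at https://api.<site>/api/v1/query and authenticates with the DD-API-KEY and DD-APPLICATION-KEY headers. The function name `build_datadog_metrics_query` is hypothetical, and the exact endpoints the agent calls are an assumption.

```python
import urllib.parse
import urllib.request


def build_datadog_metrics_query(
    site: str,
    api_key: str,
    app_key: str,
    query: str,
    start: int,
    end: int,
) -> urllib.request.Request:
    """Build a GET request for Datadog's v1 timeseries query endpoint."""
    # `from` and `to` are epoch seconds bounding the analysis window.
    params = urllib.parse.urlencode({"from": start, "to": end, "query": query})
    request = urllib.request.Request(f"https://api.{site}/api/v1/query?{params}")
    # Both keys are required to read metric data.
    request.add_header("DD-API-KEY", api_key)
    request.add_header("DD-APPLICATION-KEY", app_key)
    return request
```

For example, a site of datadoghq.eu yields requests to api.datadoghq.eu, while us3.datadoghq.com yields api.us3.datadoghq.com.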

Logs

Log data is one of the most signal-rich inputs for root cause analysis. The agent searches for error patterns, stack traces, and anomalous log rates across the analysis window using keyword and structured-field queries.

Elasticsearch

Elasticsearch (and OpenSearch, whose search API is largely compatible) is the primary log backend. The agent queries the specified index with a time-bounded search for ERROR and WARN level entries and extracts the top recurring message patterns.

.env

ELASTICSEARCH_URL=http://elasticsearch:9200
ELASTICSEARCH_USERNAME=elastic
ELASTICSEARCH_PASSWORD=changeme
ELASTICSEARCH_INDEX=logs-*

ELASTICSEARCH_URL (string, required)

Base URL of the Elasticsearch cluster. For Elastic Cloud, use the HTTPS endpoint from your deployment dashboard (e.g. https://my-deployment.es.us-east-1.aws.elastic-cloud.com:443).

ELASTICSEARCH_USERNAME (string, default: "elastic")

Username for HTTP Basic authentication. For fine-grained access control, create a dedicated user with a read-only role scoped to the log indices rather than using the elastic superuser account.

ELASTICSEARCH_PASSWORD (string, required)

Password for the Elasticsearch user. Rotate this credential regularly and avoid reusing it across environments.

ELASTICSEARCH_INDEX (string, default: "logs-*")

The index pattern to search. Accepts wildcards. Use a data stream pattern like logs-*-* for Fleet-managed Elastic Agent deployments, or a custom pattern matching your Logstash/Filebeat index naming convention.
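A time-bounded search for ERROR and WARN entries with a "top messages" aggregation could look roughly like the body built below, which would be sent to <ELASTICSEARCH_URL>/<ELASTICSEARCH_INDEX>/_search. The field names @timestamp, log.level, and message.keyword follow common ECS-style mappings and are assumptions, as is the function name `build_log_search`; adjust them to your own index mapping.

```python
from typing import Any, Dict


def build_log_search(start_iso: str, end_iso: str, top_n: int = 10) -> Dict[str, Any]:
    """Build a search body for ERROR/WARN entries inside a time window."""
    return {
        "size": 0,  # return only the aggregation, not individual hits
        "query": {
            "bool": {
                # Filters do not affect scoring and are cacheable.
                "filter": [
                    {"range": {"@timestamp": {"gte": start_iso, "lte": end_iso}}},
                    {"terms": {"log.level": ["ERROR", "WARN"]}},
                ]
            }
        },
        "aggs": {
            # Most frequent exact messages; assumes a keyword sub-field exists.
            "top_messages": {"terms": {"field": "message.keyword", "size": top_n}}
        },
    }
```

A terms aggregation only groups identical messages. Grouping messages that differ in IDs or timestamps would need extra normalization, which this sketch does not attempt.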

Loki

Grafana Loki is queried via its HTTP API using LogQL. The agent issues a range query for {level="error"} log streams over the analysis window and parses returned log lines for recurring error signatures.

.env

LOKI_URL=http://loki:3100
LOKI_TENANT_ID=

LOKI_URL (string, required)

Base URL of your Loki instance. For Grafana Cloud Loki, use the cluster-specific URL from your Grafana Cloud portal (e.g. https://logs-prod-us-central1.grafana.net).

LOKI_TENANT_ID (string)

The X-Scope-OrgID header value for multi-tenant Loki deployments. Required for Grafana Cloud Loki — use the numeric org ID shown in the Grafana Cloud portal. Leave empty for single-tenant (OSS) Loki.

If your Loki instance requires HTTP Basic authentication (common on Grafana Cloud), embed the credentials in LOKI_URL itself: https://user:password@logs-prod-us-central1.grafana.net. Percent-encode any reserved characters in the username or password.
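The range query and tenant header described in this section can be sketched as follows. Loki's /loki/api/v1/query_range endpoint accepts start and end as Unix timestamps in nanoseconds, and the tenant is passed in the X-Scope-OrgID header. The function name `build_loki_range_query` and the default limit of 1000 are assumptions made for this example.

```python
import urllib.parse
import urllib.request
from typing import Optional


def build_loki_range_query(
    base_url: str,
    logql: str,
    start_s: int,
    end_s: int,
    tenant_id: Optional[str] = None,
    limit: int = 1000,
) -> urllib.request.Request:
    """Build a GET request for Loki's /loki/api/v1/query_range endpoint."""
    params = urllib.parse.urlencode(
        {
            "query": logql,
            # Convert epoch seconds to the nanosecond timestamps Loki accepts.
            "start": start_s * 1_000_000_000,
            "end": end_s * 1_000_000_000,
            "limit": limit,
        }
    )
    url = f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"
    request = urllib.request.Request(url)
    # Single-tenant deployments leave the tenant ID empty, so no header is sent.
    if tenant_id:
        request.add_header("X-Scope-OrgID", tenant_id)
    return request
```

Calling it with the selector `{level="error"}` produces the same kind of query the agent is described as issuing.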

Traces

Distributed traces reveal which service calls were slow or failing during an incident. The agent retrieves traces from the analysis window, identifies spans with high error rates or elevated latency, and correlates them with log and metric anomalies.

Jaeger

Jaeger is queried via its Query Service HTTP API. The agent retrieves traces for all services that appear in the log and metric signals already collected, focusing on spans with error=true tags.

.env

JAEGER_QUERY_URL=http://jaeger:16686

JAEGER_QUERY_URL (string, required)

Base URL of the Jaeger Query service. The agent calls the /api/traces and /api/services endpoints on this host. For Jaeger deployed on Kubernetes, use the in-cluster service address (e.g. http://jaeger-query.observability.svc.cluster.local:16686).

To verify the Jaeger connection and list available services:

curl "${JAEGER_QUERY_URL}/api/services"
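Fetching error traces for one service might look like the sketch below. It targets the /api/traces endpoint mentioned above, which is the HTTP JSON API used by the Jaeger UI and is not a formally stable interface, so treat the parameter names as subject to change. The function name `build_jaeger_trace_url` and the service name in the usage note are hypothetical.

```python
import json
import urllib.parse


def build_jaeger_trace_url(
    base_url: str,
    service: str,
    start_s: int,
    end_s: int,
    limit: int = 20,
) -> str:
    """Build a /api/traces URL that selects traces with error=true spans."""
    params = urllib.parse.urlencode(
        {
            "service": service,
            # Jaeger's query API expects microseconds since the epoch.
            "start": start_s * 1_000_000,
            "end": end_s * 1_000_000,
            "limit": limit,
            # Tag filters are passed as a JSON object of string values.
            "tags": json.dumps({"error": "true"}, separators=(",", ":")),
        }
    )
    return f"{base_url.rstrip('/')}/api/traces?{params}"
```

For instance, `build_jaeger_trace_url("http://jaeger:16686", "checkout", 1700000000, 1700003600)` yields a URL that can be fetched with curl in the same way as the services check.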

OpenTelemetry Collector

If you use an OpenTelemetry Collector as a central telemetry pipeline, you can configure the agent to export its own internal telemetry spans to it via the OTLP HTTP protocol. This allows the agent’s analysis pipeline activity to be captured alongside your application traces in your existing observability backend.

.env

OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318

OTEL_EXPORTER_OTLP_ENDPOINT (string)

The OTLP HTTP endpoint of your OpenTelemetry Collector. The agent exports its own internal spans to this address, making the analysis pipeline observable via your existing tracing backend. Collectors conventionally listen on port 4318 for OTLP over HTTP/protobuf and on port 4317 for OTLP over gRPC; because the agent exports over HTTP, point it at the HTTP listener. Leave unset to disable agent telemetry export.

For collecting application traces as input to the RCA analysis, configure Jaeger (above) as your traces source rather than relying on the OTLP endpoint. OTEL_EXPORTER_OTLP_ENDPOINT is for exporting the agent’s own operational spans, not for querying your application’s trace data.
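For OTLP over HTTP, the OpenTelemetry specification has exporters append a per-signal path (v1/traces for spans) to the base endpoint. The sketch below resolves the effective traces URL and returns None when the variable is unset, mirroring the "leave unset to disable" behaviour. The function name `otlp_traces_url` is hypothetical, and whether the agent resolves the URL exactly this way is an assumption.

```python
import os
from typing import Mapping, Optional


def otlp_traces_url(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the OTLP/HTTP traces URL, or None when export is disabled."""
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip()
    if not endpoint:
        return None  # unset or empty: telemetry export stays off
    # The base endpoint gets the traces signal path appended.
    return endpoint.rstrip("/") + "/v1/traces"
```

With the example value from the .env block, this resolves to http://otel-collector:4318/v1/traces.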

Alerting

After completing an analysis, the agent can push a structured summary, including ranked hypotheses and supporting evidence links, to notification destinations. This lets on-call engineers receive actionable RCA results directly in their incident channels without navigating to the Streamlit UI.

Slack

The agent posts analysis summaries to Slack using an Incoming Webhook URL. The message includes the top hypothesis, a confidence score, and a direct link to the full analysis in the Streamlit UI.

.env

SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T.../B.../...

SLACK_WEBHOOK_URL (string)

The Incoming Webhook URL generated for your Slack app. To create one, go to api.slack.com/apps, select your app (or create one), navigate to Incoming Webhooks, and activate it for the target channel.

The agent sends a payload structured like this:

{
  "text": "🔍 RCA Complete: *CassandraLatencySpike* — 2 hypotheses found",
  "blocks": [
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": "*Top Hypothesis (confidence: 0.87):*\nCompaction storm on cassandra-node-3 caused latency p99 to spike from 12ms to 340ms. Correlated with 94 ERROR log lines and a 6× increase in GC pause duration."
      }
    },
    {
      "type": "actions",
      "elements": [
        {
          "type": "button",
          "text": { "type": "plain_text", "text": "View Full Analysis" },
          "url": "http://rca-agent:8501/analysis/abc123"
        }
      ]
    }
  ]
}
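To test your webhook with a message of the same shape, a minimal sketch like the one below can build and post the payload. The function names `build_slack_payload` and `post_to_slack` are hypothetical and the headline wording is simplified; this is not the agent's own notification code.

```python
import json
import urllib.request


def build_slack_payload(
    alert_name: str,
    hypothesis: str,
    confidence: float,
    hypothesis_count: int,
    analysis_url: str,
) -> dict:
    """Assemble a Block Kit message with a summary section and a link button."""
    return {
        # Fallback text shown in notifications and clients without block support.
        "text": f"RCA Complete: *{alert_name}* ({hypothesis_count} hypotheses found)",
        "blocks": [
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"*Top Hypothesis (confidence: {confidence:.2f}):*\n{hypothesis}",
                },
            },
            {
                "type": "actions",
                "elements": [
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "View Full Analysis"},
                        "url": analysis_url,
                    }
                ],
            },
        ],
    }


def post_to_slack(webhook_url: str, payload: dict) -> int:
    """POST the payload as JSON to an Incoming Webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping payload construction separate from the network call lets you inspect the message before sending it to a real channel.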

PagerDuty

When a high-confidence root cause is identified, the agent can acknowledge or annotate open PagerDuty incidents using the Events API v2. This enriches the incident timeline with RCA results without requiring a human to copy-paste findings.

.env

PAGERDUTY_ROUTING_KEY=R...

PAGERDUTY_ROUTING_KEY (string)

The integration key (routing key) for a PagerDuty Events API v2 integration. Find or create this under Services → Integrations → Add Integration → Events API v2 in the PagerDuty console. This key routes the RCA annotation to the correct service’s incident timeline.

The agent uses the PagerDuty Events API v2 (https://events.pagerduty.com/v2/enqueue) to send trigger and resolve event actions. It does not use the legacy Events API v1 format. Ensure the routing key comes from an Events API v2 integration, not a legacy Email or Events API v1 integration.
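An Events API v2 trigger event is a JSON object carrying the routing key, an event_action, an optional dedup_key, and a payload with summary, source, and severity. The sketch below builds one; the function name `build_pagerduty_event`, the default severity, and the choice of dedup key are assumptions, since this page does not specify what the agent sends in each field.

```python
from typing import Any, Dict

# Severities accepted by the Events API v2 payload.
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_pagerduty_event(
    routing_key: str,
    summary: str,
    source: str,
    dedup_key: str,
    severity: str = "info",
) -> Dict[str, Any]:
    """Build a trigger event body for POSTing to /v2/enqueue."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity!r}")
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Events sharing a dedup_key are grouped into the same alert.
        "dedup_key": dedup_key,
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }
```

Sending a later event with the same dedup_key and an event_action of resolve closes the corresponding alert.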

Adding Custom Connectors

The integrations above cover the most common observability stacks, but your environment may use different tools. See the Custom Data Sources guide to learn how to implement the DataSource interface and register a new connector with the agent’s plugin system.
