Trigger and Run a Root Cause Analysis with the RCA Agent

The DevOps RCA Agent gives you two ways to kick off an analysis: an interactive Streamlit dashboard designed for on-call engineers who need answers fast, and a Python API for teams that want to embed analysis into automated runbooks, CI pipelines, or custom tooling. Both paths queue a Celery task, query your configured data sources, and return a ranked list of root cause hypotheses — the only difference is how you invoke them.

Streamlit UI

The Streamlit dashboard runs on port 8501 by default and provides a point-and-click interface for incident investigation. No code required.

Open the dashboard

Navigate to http://localhost:8501 in your browser. If the application is running inside Docker, ensure port 8501 is mapped to your host. You should see the DevOps RCA Agent home screen with an empty analysis form.

Set the incident time window

Enter the Incident Start and Incident End timestamps using the datetime pickers. Times are interpreted in UTC by default. For an incident with a known onset, a 30-minute window is a good starting point; you can widen it later if results are inconclusive.

Add incident context (optional)

Paste any relevant alert text, Slack thread summary, or a brief description of the symptom into the Context field — for example, "High 5xx error rate on payment-service starting at 14:00 UTC". The agent uses this text to bias its signal scoring toward the affected service or component.

Select data sources

Check the boxes next to each data source you want to include in the analysis. The sources listed correspond to the connectors configured in agent.yaml. Enabling only relevant sources (e.g. Prometheus and Jaeger for a latency issue) produces more focused hypotheses and runs faster.

Trigger the analysis

Click Analyze. The agent serializes your request and dispatches it to the Celery worker queue backed by Redis. A task ID appears in the UI confirming that the job has been accepted.

Review results

A progress indicator tracks the analysis pipeline in real time. When the Celery task completes, the results panel renders automatically — no page refresh needed. You’ll see a ranked list of root cause hypotheses, each with a confidence score and expandable evidence excerpts drawn from your selected data sources.

Python API

Use the RCAAgent class directly when you need to integrate analysis into scripts, automated runbooks, or post-deployment checks.

from datetime import datetime, timedelta

from app.agent import RCAAgent

agent = RCAAgent()

# Analyze a 30-minute window starting at the first sign of the incident.
incident_start = datetime(2024, 3, 15, 14, 0, 0)

result = agent.analyze(
    start_time=incident_start,
    end_time=incident_start + timedelta(minutes=30),
    context="High error rate on payment-service",
    sources=["prometheus", "elasticsearch", "jaeger"],
)

# Hypotheses are ranked; print each with its confidence and supporting evidence.
for hypothesis in result.hypotheses:
    print(f"[{hypothesis.confidence:.0%}] {hypothesis.title}")
    print(f"  Evidence: {hypothesis.evidence_summary}")

agent.analyze() is a blocking call: it runs the full pipeline synchronously and returns an AnalysisResult object once all sources have been queried and the LLM has ranked the hypotheses. If you want non-blocking behaviour, use the Celery task directly via tasks.run_analysis.delay(...) and poll for the result with the returned AsyncResult.

The sources parameter accepts a list of connector names matching the keys defined in CONNECTOR_REGISTRY (see app/connectors/__init__.py). Omitting it runs all connectors enabled in agent.yaml.
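The non-blocking path mentioned above can be sketched as a small polling helper. This is a minimal sketch, not the project's own code: wait_for_result is a hypothetical helper, and the default interval and timeout are arbitrary. It relies only on the ready() and get() methods that Celery's AsyncResult provides. The arguments that tasks.run_analysis accepts and its import path are not shown in this guide, so the dispatch call appears only as a comment.

```python
import time


def wait_for_result(async_result, poll_interval=2.0, timeout=300.0):
    """Poll a Celery-style result handle until it finishes or the timeout expires.

    `async_result` only needs the `ready()` and `get()` methods that
    celery.result.AsyncResult provides.
    """
    deadline = time.monotonic() + timeout
    while not async_result.ready():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"analysis did not finish within {timeout} seconds")
        time.sleep(poll_interval)
    # The task has finished; get() returns its value or re-raises its exception.
    return async_result.get()


# Hypothetical usage (task arguments are not documented here, so they stay elided):
#   handle = tasks.run_analysis.delay(...)
#   result = wait_for_result(handle)
```

Polling with a deadline keeps a runbook from hanging indefinitely if a worker is down; tune the interval and timeout to the analysis times you observe.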

Analysis time scales with the volume of signals retrieved and the number of sources queried. A 30-minute window across three sources typically completes in 20–60 seconds. Very wide windows (several hours) or high-cardinality metrics indices can take several minutes.
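For the automated runbooks and post-deployment checks mentioned earlier, a script might act only on hypotheses above a confidence threshold. The sketch below is illustrative: Hypothesis is a stand-in dataclass carrying only the title and confidence fields used in the example above, the sample entries are made up, and the 0.7 threshold is an arbitrary assumption. It also assumes confidence is a fraction between 0 and 1, which the :.0% formatting in the example suggests.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """Stand-in for the agent's hypothesis objects (only the fields used here)."""
    title: str
    confidence: float  # assumed to be a fraction between 0 and 1


def confident_hypotheses(hypotheses, threshold=0.7):
    """Return hypotheses at or above `threshold`, highest confidence first."""
    kept = [h for h in hypotheses if h.confidence >= threshold]
    return sorted(kept, key=lambda h: h.confidence, reverse=True)


# Made-up sample data, for illustration only.
sample = [
    Hypothesis("Connection pool exhausted", 0.82),
    Hypothesis("Noisy neighbour on shared node", 0.35),
    Hypothesis("Bad deploy of payment-service", 0.91),
]
for h in confident_hypotheses(sample):
    print(f"[{h.confidence:.0%}] {h.title}")
```

With real results, the same function could be called on result.hypotheses, provided those objects expose a numeric confidence attribute as the example indicates.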

Re-running with a Wider Time Window

If the initial analysis returns low-confidence hypotheses or no clear root cause, the most effective first step is to expand the time window. Incidents often have precursor signals, such as a slow memory leak or a gradual traffic shift, that appear 30–60 minutes before the user-visible symptom. Try extending the window by its own length on each side: if your incident was 14:00–14:30, re-run with 13:30–15:00. You can also add data sources to bring in signals that might correlate with the failure, for example an infrastructure connector if the initial run only used application logs.
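The 14:00–14:30 to 13:30–15:00 example amounts to padding the window by its own length on each side. A minimal sketch of that arithmetic follows; widen_window and its factor parameter are hypothetical names and not part of the agent's API.

```python
from datetime import datetime


def widen_window(start, end, factor=1.0):
    """Extend an incident window on both sides by `factor` times its length."""
    if end <= start:
        raise ValueError("end must be after start")
    padding = (end - start) * factor
    return start - padding, end + padding


start, end = widen_window(
    datetime(2024, 3, 15, 14, 0),
    datetime(2024, 3, 15, 14, 30),
)
print(start.time(), end.time())  # 13:30:00 15:00:00
```

The returned pair can be passed as start_time and end_time to a second agent.analyze() call. A smaller factor, such as 0.5, gives a gentler expansion when wide windows are slow to analyze.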

Interpreting Results

Learn how to read confidence scores, evidence excerpts, and hypothesis rankings once your analysis completes.

Custom Data Sources

Add connectors for observability platforms not included out of the box.
