The DevOps RCA Agent gives you two ways to kick off an analysis: an interactive Streamlit dashboard designed for on-call engineers who need answers fast, and a Python API for teams that want to embed analysis into automated runbooks, CI pipelines, or custom tooling. Both paths queue a Celery task, query your configured data sources, and return a ranked list of root cause hypotheses; the only difference is how you invoke them.

The Streamlit dashboard runs on port 8501 by default and provides a point-and-click interface for incident investigation. No code required.
1. Open the dashboard

Navigate to http://localhost:8501 in your browser. If the application is running inside Docker, ensure port 8501 is mapped to your host. You should see the DevOps RCA Agent home screen with an empty analysis form.

2. Set the incident time window

Enter the Incident Start and Incident End timestamps using the datetime pickers. Times are interpreted in UTC by default. For an incident with a clear start and end, a 30-minute window is a good starting point; you can always widen it later if results are inconclusive.

3. Add incident context (optional)

Paste any relevant alert text, Slack thread summary, or a brief description of the symptom into the Context field, for example, "High 5xx error rate on payment-service starting at 14:00 UTC". The agent uses this text to bias its signal scoring toward the affected service or component.

4. Select data sources

Check the boxes next to each data source you want to include in the analysis. The sources listed correspond to the connectors configured in agent.yaml. Enabling only relevant sources (e.g., Prometheus and Jaeger for a latency issue) produces more focused hypotheses and runs faster.

5. Trigger the analysis

Click Analyze. The agent serializes your request and dispatches it to the Celery worker queue backed by Redis. A task ID appears in the UI confirming that the job has been accepted.

6. Review results

A progress indicator tracks the analysis pipeline in real time. When the Celery task completes, the results panel renders automatically, with no page refresh needed. You’ll see a ranked list of root cause hypotheses, each with a confidence score and expandable evidence excerpts drawn from your selected data sources.
Analysis time scales with the volume of signals retrieved and the number of sources queried. A 30-minute window across three sources typically completes in 20–60 seconds. Very wide windows (several hours) or high-cardinality metrics indices can take several minutes.
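
The form fields in the steps above (time window, optional context, selected sources) can be pictured as one serializable request. The following is a minimal sketch under stated assumptions, not the agent's actual schema: the `AnalysisRequest` class, its field names, and the JSON layout are all hypothetical and chosen only to illustrate UTC handling and serialization before a job is queued.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List


def _as_utc(ts: datetime) -> datetime:
    """Treat naive timestamps as UTC; convert aware ones to UTC."""
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc)


@dataclass
class AnalysisRequest:
    """Hypothetical request shape mirroring the dashboard form fields."""

    incident_start: datetime
    incident_end: datetime
    context: str = ""  # optional free-text hint (alert text, symptom)
    sources: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Normalize both ends of the window to UTC, the documented default.
        self.incident_start = _as_utc(self.incident_start)
        self.incident_end = _as_utc(self.incident_end)
        if self.incident_end <= self.incident_start:
            raise ValueError("incident_end must be after incident_start")

    def to_json(self) -> str:
        """Serialize to JSON with ISO 8601 timestamps."""
        payload = asdict(self)
        payload["incident_start"] = self.incident_start.isoformat()
        payload["incident_end"] = self.incident_end.isoformat()
        return json.dumps(payload)


# Example: a 30-minute window limited to two sources (names are illustrative).
request = AnalysisRequest(
    incident_start=datetime(2024, 1, 1, 14, 0),
    incident_end=datetime(2024, 1, 1, 14, 30),
    context="High 5xx error rate on payment-service starting at 14:00 UTC",
    sources=["prometheus", "jaeger"],
)
print(request.to_json())
```

Validating the window before dispatch keeps an inverted or empty range from reaching the worker queue, where the failure would be harder to trace.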

Re-running with a Wider Time Window

If the initial analysis returns low-confidence hypotheses or no clear root cause, the most effective first step is to expand the time window. Incidents often have precursor signals, such as a slow memory leak or a gradual traffic shift, that appear 30–60 minutes before the user-visible symptom. Try extending the window by its own length on both ends: if your incident was 14:00–14:30, re-run with 13:30–15:00. You can also add data sources to bring in signals that might correlate with the failure; for example, add an infrastructure connector if the initial run only used application logs.
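
The widening arithmetic can be sketched as a small helper. This is an illustrative sketch only: `widen_window` is a hypothetical function, not part of the agent, and it assumes the default padding is the length of the original window, which reproduces the 14:00–14:30 to 13:30–15:00 example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def widen_window(
    start: datetime,
    end: datetime,
    pad: Optional[timedelta] = None,
) -> Tuple[datetime, datetime]:
    """Extend a time window by `pad` on both ends.

    If `pad` is omitted, the window's own length is used, so a
    30-minute window grows by 30 minutes in each direction.
    """
    if end <= start:
        raise ValueError("end must be after start")
    if pad is None:
        pad = end - start
    return start - pad, end + pad


# Example: the incident window used in the text above.
start = datetime(2024, 1, 1, 14, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 1, 14, 30, tzinfo=timezone.utc)
wider_start, wider_end = widen_window(start, end)
print(wider_start.isoformat(), wider_end.isoformat())
```

Passing an explicit `pad` (for example `timedelta(minutes=60)`) covers the case where precursor signals are expected earlier than one window length before the symptom.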

Interpreting Results

Learn how to read confidence scores, evidence excerpts, and hypothesis rankings once your analysis completes.

Custom Data Sources

Add connectors for observability platforms not included out of the box.
