When an engineer submits an incident context through the Streamlit UI, the RCA Agent launches a four-stage pipeline that transforms raw observability signals into a prioritized, evidence-backed list of root cause hypotheses. Each stage runs as an asynchronous Celery task, allowing the agent to fetch from multiple data sources concurrently, process large signal volumes without blocking the UI, and hand off structured intermediate results between stages through Redis. The sections below trace a single analysis request from the moment it enters the system to the ranked output displayed on screen.

The Four Pipeline Stages

Stage 1 — Signal Ingestion

The ingestion stage pulls raw observability data from every connected source that is relevant to the submitted incident context.

The agent receives three inputs from the UI form: a service name, a time window (start and end timestamps), and an alert description. These form the IncidentContext object, which is serialized and passed to the first Celery task, ingest_signals.

Within ingest_signals, the agent iterates over all registered data source connectors and dispatches a sub-task for each one in parallel. Three connector categories are supported out of the box:
  • Log connectors — query a logging backend for entries emitted by the target service within the time window. Log lines are streamed in batches to avoid holding large payloads in memory.
  • Metrics connectors — pull time-series data points for key indicators such as request rate, error rate, latency percentiles, CPU utilization, and memory usage. Each metric is fetched as a sequence of (timestamp, value) tuples.
  • Trace connectors — retrieve distributed trace spans associated with the affected service, including span durations, status codes, and inter-service call graphs.
Because each connector sub-task runs concurrently, the total ingestion time is bounded by the slowest individual source rather than the sum of all sources. Raw results from each connector are pushed onto a shared Redis key that the next stage reads from.
If no live data sources are configured, the ingestion stage loads bundled sample signals from app/data/samples/ so you can explore the full pipeline without connecting an observability backend.
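
As a rough illustration of the three form inputs described above, the sketch below models an IncidentContext and serializes it into a JSON-safe payload. The field names, the to_payload helper, and the sample values are assumptions made for this example; the project's actual class may look different.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json


@dataclass(frozen=True)
class IncidentContext:
    """The three UI form inputs (field names are assumed, not from the project)."""
    service_name: str
    window_start: datetime
    window_end: datetime
    alert_description: str

    def to_payload(self) -> dict:
        """Return a JSON-safe dict that could be passed as a task argument."""
        if self.window_end <= self.window_start:
            raise ValueError("window_end must be after window_start")
        payload = asdict(self)
        # Datetimes are not JSON-serializable, so send them as UTC ISO 8601 strings.
        payload["window_start"] = self.window_start.astimezone(timezone.utc).isoformat()
        payload["window_end"] = self.window_end.astimezone(timezone.utc).isoformat()
        return payload


# Hypothetical incident used only for illustration.
ctx = IncidentContext(
    service_name="checkout-api",
    window_start=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    window_end=datetime(2024, 1, 1, 12, 30, tzinfo=timezone.utc),
    alert_description="5xx error rate above threshold",
)
payload = ctx.to_payload()
print(json.dumps(payload, indent=2))
```

Converting the timestamps to strings before dispatch keeps the payload compatible with a JSON task serializer.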

Stage 2 — Preprocessing & Normalization

Raw signals from different backends arrive in incompatible formats, with misaligned timestamps, varying granularities, and significant noise. The preprocess_signals task normalizes everything into a shared intermediate representation before any AI reasoning occurs.

Key operations performed in this stage:

Timestamp alignment: all events and data points are converted to UTC and snapped to a common 1-second resolution timeline that spans the full incident window. This makes cross-source temporal correlation possible in the next stage.

Log filtering & structuring: log lines are parsed for severity, service name, and structured fields where present (JSON logs, logfmt). Lines at DEBUG severity and lines outside a configurable relevance window are dropped to reduce LLM context size. The remaining lines are deduplicated using a rolling hash to suppress repetitive error floods.

Metric anomaly extraction: rather than passing thousands of raw data points to the LLM, the preprocessing stage runs a statistical anomaly detector over each time series. It identifies change points (moments where a metric deviates significantly from its baseline) and emits a compact anomaly record: { metric, timestamp, baseline_value, observed_value, z_score }. Only anomaly records, not raw series, are forwarded to the AI stage.

Trace span summarization: trace data is flattened into a call graph summary annotated with error spans and high-latency outliers. Spans within normal latency ranges are aggregated rather than enumerated individually.

The output of this stage is a NormalizedSignalBundle, a structured Python object containing aligned log excerpts, anomaly records, and trace summaries, all indexed by timestamp and service name.
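
To make the anomaly record format more concrete, here is a minimal sketch of a z-score based extractor. It assumes the baseline is the mean and standard deviation of the first few points of the series and uses a threshold of 3.0; both choices, and the function name extract_anomalies, are illustrative assumptions rather than the detector the agent actually uses.

```python
from statistics import mean, stdev


def extract_anomalies(metric, series, baseline_points=5, z_threshold=3.0):
    """Return anomaly records for points that deviate from the baseline.

    series is a time-ordered list of (timestamp, value) tuples.
    The baseline window and threshold are assumptions for this sketch.
    """
    baseline_values = [value for _, value in series[:baseline_points]]
    if len(baseline_values) < 2:
        return []  # not enough data to estimate a spread
    baseline = mean(baseline_values)
    spread = stdev(baseline_values)
    if spread == 0:
        return []  # flat baseline: the z-score is undefined in this simple model

    records = []
    for timestamp, value in series[baseline_points:]:
        z_score = (value - baseline) / spread
        if abs(z_score) >= z_threshold:
            records.append({
                "metric": metric,
                "timestamp": timestamp,
                "baseline_value": baseline,
                "observed_value": value,
                "z_score": z_score,
            })
    return records


# Hypothetical error-rate series: (epoch_second, percent).
series = [(0, 1.0), (1, 1.2), (2, 0.8), (3, 1.1), (4, 0.9), (5, 1.1), (6, 9.0)]
anomalies = extract_anomalies("error_rate", series)
for record in anomalies:
    print(record["timestamp"], record["observed_value"], round(record["z_score"], 1))
```

Only the spike at timestamp 6 clears the threshold here, so a seven-point series collapses to a single record.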
The RCA_MAX_LOG_LINES environment variable controls how many log lines survive the filtering step. The default is 500. Lower this value if you encounter LLM context-length errors with verbose services.
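
The log filtering step and the RCA_MAX_LOG_LINES cap might be sketched as follows. This example swaps the rolling hash for a simpler set of SHA-1 digests, keeps the first lines that survive filtering, and assumes each log line is a dict with severity, service, and message keys; all of these are assumptions for illustration.

```python
import hashlib
import os


def filter_logs(lines, max_lines=None):
    """Drop DEBUG lines, suppress exact repeats, and cap the result size."""
    if max_lines is None:
        # Default of 500 matches the documented default for RCA_MAX_LOG_LINES.
        max_lines = int(os.environ.get("RCA_MAX_LOG_LINES", "500"))

    seen = set()
    kept = []
    for line in lines:
        if len(kept) >= max_lines:
            break
        if line["severity"].upper() == "DEBUG":
            continue
        # Digest of service + message stands in for the rolling hash.
        key = f'{line["service"]}|{line["message"]}'.encode("utf-8")
        digest = hashlib.sha1(key).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(line)
    return kept


# Hypothetical log lines used only for illustration.
raw_lines = [
    {"severity": "DEBUG", "service": "checkout-api", "message": "cache hit"},
    {"severity": "ERROR", "service": "checkout-api", "message": "db timeout"},
    {"severity": "ERROR", "service": "checkout-api", "message": "db timeout"},
    {"severity": "WARN", "service": "checkout-api", "message": "retrying request"},
]
filtered = filter_logs(raw_lines, max_lines=500)
print([line["message"] for line in filtered])
```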

Stage 3 — AI Correlation & Hypothesis Generation

The generate_hypotheses task is where the agent's AI reasoning occurs. It receives the NormalizedSignalBundle from the previous stage and orchestrates a multi-step LLM prompt chain designed to identify causal relationships and generate candidate root causes.

The prompt chain runs in three passes:

Pass 1: Timeline reconstruction. The LLM is given the full normalized signal bundle and asked to construct a causal timeline: which anomalies appeared first, which followed, and which are likely correlated. This pass produces a structured timeline of events ordered by inferred causality rather than wall-clock time alone.

Pass 2: Hypothesis generation. Using the causal timeline as context, the LLM is prompted to generate a list of distinct root cause hypotheses. Each hypothesis must cite specific evidence from the signal bundle (a log excerpt, a metric anomaly record, or a trace span) that supports the proposed cause. The model is instructed to generate between three and seven hypotheses ranked by its own internal confidence.

Pass 3: Evidence verification. Each hypothesis from Pass 2 is re-evaluated in isolation: the LLM is shown only the hypothesis and its cited evidence and asked whether the evidence is sufficient, contradicted by other signals, or ambiguous. This self-critique pass filters out hallucinated causes that lack genuine signal support.

The output is a list of Hypothesis objects, each carrying the proposed root cause description, the cited evidence excerpts, a raw LLM confidence score (0.0–1.0), and a verification verdict.
You can swap the underlying LLM by setting RCA_LLM_MODEL in your .env file. Any OpenAI-compatible model endpoint is supported. For cost-sensitive environments, gpt-4o-mini provides a reasonable balance of accuracy and token cost for most incident types.
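
The control flow of the three-pass chain can be sketched without committing to any particular LLM client. In the example below, llm is a hypothetical callable that takes a step name and a payload and returns already-parsed output, and fake_llm is a stub that stands in for a real model. The Hypothesis fields follow the description above, but the exact names and the verdict labels are assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hypothesis:
    description: str
    evidence: list           # excerpts cited from the signal bundle
    llm_confidence: float    # raw 0.0-1.0 score from Pass 2
    verdict: str = "unverified"


def run_prompt_chain(bundle: dict, llm: Callable[[str, dict], object]) -> list:
    """Run the three passes in order and return verified Hypothesis objects."""
    # Pass 1: build a causal timeline from the full bundle.
    timeline = llm("timeline", {"bundle": bundle})
    # Pass 2: generate hypotheses using the timeline as context.
    drafts = llm("hypotheses", {"bundle": bundle, "timeline": timeline})
    hypotheses = [Hypothesis(**draft) for draft in drafts]
    # Pass 3: re-check each hypothesis in isolation, with only its own evidence.
    for hypothesis in hypotheses:
        hypothesis.verdict = llm(
            "verify",
            {"hypothesis": hypothesis.description, "evidence": hypothesis.evidence},
        )
    return hypotheses


def fake_llm(step: str, payload: dict):
    """Stub model for illustration only; a real call would go to an LLM endpoint."""
    if step == "timeline":
        return ["db_latency_spike", "error_rate_spike"]
    if step == "hypotheses":
        return [
            {"description": "Connection pool exhausted",
             "evidence": ["metric: db_latency z_score=6.2"],
             "llm_confidence": 0.8},
            {"description": "Bad deploy", "evidence": [], "llm_confidence": 0.4},
        ]
    # Verification: reject anything that cites no evidence.
    return "confirmed" if payload["evidence"] else "rejected"


results = run_prompt_chain({"logs": [], "anomalies": [], "traces": []}, fake_llm)
for hypothesis in results:
    print(hypothesis.verdict, "-", hypothesis.description)
```

Passing the model in as a callable keeps the chain testable with a stub and independent of whichever model RCA_LLM_MODEL selects.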

Stage 4 — Ranking & Output

The final stage, rank_and_format, takes the list of Hypothesis objects from Stage 3, applies a deterministic scoring function, and produces the ranked output structure displayed in the UI.

The confidence score for each hypothesis is computed from three weighted components:

Component             Weight  Description
LLM confidence        40%     The raw 0.0–1.0 score assigned by the model in Pass 2
Evidence breadth      35%     Number of distinct signal sources (logs, metrics, traces) that support the hypothesis
Verification verdict  25%     Whether the Pass 3 self-critique confirmed, weakened, or rejected the hypothesis

Hypotheses that were rejected during verification are removed entirely from the output. The remaining hypotheses are sorted in descending order of their composite score and wrapped into a RankedAnalysisResult, the final output object returned to the Streamlit UI via the Celery result backend in Redis.

Each entry in the ranked output includes:
  • A plain-English root cause description
  • The composite confidence score (expressed as a percentage in the UI)
  • Verbatim evidence excerpts: the specific log line, metric anomaly record, or trace span that supports the hypothesis
  • The source type and connector name for each piece of evidence, so engineers know exactly where to look next
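
A worked example may help show how the three weights combine. The sketch below assumes evidence breadth is normalized as the fraction of the three source types cited, and that a confirmed verdict scores 1.0 while a weakened verdict scores 0.5. The weights come from the table above; the normalization and verdict values are assumptions.

```python
WEIGHTS = {"llm_confidence": 0.40, "evidence_breadth": 0.35, "verification": 0.25}
SOURCE_TYPES = {"logs", "metrics", "traces"}
# Assumed mapping from verdict to a 0.0-1.0 value; rejected entries are dropped.
VERDICT_SCORES = {"confirmed": 1.0, "weakened": 0.5}


def composite_score(llm_confidence, evidence_sources, verdict):
    """Weighted sum of the three components, in the range 0.0-1.0."""
    breadth = len(set(evidence_sources) & SOURCE_TYPES) / len(SOURCE_TYPES)
    return (
        WEIGHTS["llm_confidence"] * llm_confidence
        + WEIGHTS["evidence_breadth"] * breadth
        + WEIGHTS["verification"] * VERDICT_SCORES[verdict]
    )


def rank(hypotheses):
    """Remove rejected hypotheses, score the rest, and sort best first."""
    kept = [h for h in hypotheses if h["verdict"] != "rejected"]
    scored = [
        {**h, "score": composite_score(h["llm_confidence"], h["sources"], h["verdict"])}
        for h in kept
    ]
    return sorted(scored, key=lambda h: h["score"], reverse=True)


# Hypothetical hypotheses used only for illustration.
candidates = [
    {"title": "Connection pool exhausted", "llm_confidence": 0.9,
     "sources": ["logs"], "verdict": "confirmed"},
    {"title": "Downstream cache outage", "llm_confidence": 0.7,
     "sources": ["logs", "metrics", "traces"], "verdict": "weakened"},
    {"title": "Bad deploy", "llm_confidence": 0.8,
     "sources": ["logs"], "verdict": "rejected"},
]
ranked = rank(candidates)
for entry in ranked:
    print(f'{entry["score"]:.1%}  {entry["title"]}')
```

Under these assumptions the cache outage scores 0.28 + 0.35 + 0.125 = 0.755 and outranks the pool exhaustion hypothesis at about 0.727, despite the lower raw LLM confidence, because it is supported by all three source types.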

Async Architecture: Celery and Redis

The pipeline is designed to be non-blocking from the first byte of data ingested to the final result written back to the UI. Every stage described above runs inside a Celery task rather than inline in the web process, which gives the system three important properties.

Parallelism during ingestion. The ingest_signals task uses Celery's group primitive to fan out to each data source connector concurrently. A Celery chord then waits for all connector sub-tasks to finish before the preprocess_signals task begins. This means a deployment with three connected observability backends completes ingestion in the time it takes the slowest one to respond, not three times that.

UI responsiveness. The Streamlit front-end submits an analysis by calling apply_async on the ingest_signals task and receives a task ID immediately. It then polls the Celery result backend (stored in Redis database 1) on a short interval and updates the UI as each stage's result becomes available. The UI never blocks on a synchronous call to the agent.

Worker scalability. Because tasks are queued in Redis, you can run multiple Celery workers across multiple machines or containers to handle concurrent analyses from different engineers. Each worker pulls tasks from the same queue, and results are written back to the shared Redis result backend regardless of which worker executed the task.

Streamlit UI

    │  apply_async(ingest_signals, incident_context)

Redis Broker (db 0)

    │  task queued

Celery Worker
    ├─ ingest_signals ──────────────────────────────────────────┐
    │   ├─ connector_subtask(logs)     ─────────────────────┐   │
    │   ├─ connector_subtask(metrics)  ─────────────────────┤   │
    │   └─ connector_subtask(traces)   ─────────────────────┘   │
    │                                  (parallel, chord)         │
    │                                                            │
    ├─ preprocess_signals(raw_bundle)  ◄────────────────────────┘
    ├─ generate_hypotheses(normalized_bundle)
    └─ rank_and_format(hypotheses)

          │  result written

    Redis Result Backend (db 1)

          │  polled by UI

    Streamlit UI  ──►  Ranked Hypotheses Panel
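
The fan-out and fan-in behaviour in the diagram can be demonstrated without a broker. The sketch below uses a standard-library thread pool as a stand-in for Celery's group and chord, so it is not the agent's implementation; the connector names and latencies are made up for illustration. It shows why total ingestion time tracks the slowest source rather than the sum.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-connector latencies in seconds (illustrative only).
CONNECTOR_DELAYS = {"logs": 0.1, "metrics": 0.2, "traces": 0.3}


def fetch(connector: str) -> dict:
    """Pretend to query one backend; the sleep simulates network latency."""
    time.sleep(CONNECTOR_DELAYS[connector])
    return {"connector": connector, "signals": []}


def ingest_all(connectors) -> list:
    """Fan out one fetch per connector, then collect every result (fan in)."""
    connectors = list(connectors)
    with ThreadPoolExecutor(max_workers=len(connectors)) as pool:
        # map() preserves input order and blocks until all fetches finish,
        # much like a chord callback waiting on its header group.
        return list(pool.map(fetch, connectors))


started = time.perf_counter()
raw_bundle = ingest_all(CONNECTOR_DELAYS)
elapsed = time.perf_counter() - started
print(f"fetched {len(raw_bundle)} sources in {elapsed:.2f}s")
```

Run sequentially, the three fetches would take about 0.6 seconds; run concurrently, the elapsed time should sit close to the slowest delay of 0.3 seconds.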

Streamlit UI: From Submission to Ranked Output

The Streamlit front-end in app/ui.py acts as a thin orchestration shell around the Celery task chain. It handles three responsibilities: collecting incident context from the engineer, surfacing pipeline progress in real time, and presenting the ranked hypothesis output in an interactive layout.

When the engineer clicks Run Analysis, the UI serializes the form inputs into an IncidentContext dict and calls celery_app.send_task('app.worker.tasks.ingest_signals', args=[incident_context]). It stores the returned AsyncResult object in Streamlit's session state and begins a polling loop using st.empty() placeholders to update the UI without a full page reload.

As each Celery stage completes, its result is written to the Redis result backend keyed by task ID. The UI polling loop reads these intermediate keys and updates status indicators: a spinner transitions to a checkmark for each completed stage, giving engineers visibility into where the pipeline is in real time.

When rank_and_format completes, the UI reads the RankedAnalysisResult from the result backend and renders it as an ordered list of hypothesis cards. Each card shows the hypothesis title, confidence percentage, and a compact evidence summary. Expanding a card reveals the verbatim evidence excerpts and the originating data source connector, giving the engineer everything they need to either accept the hypothesis or drill deeper into the raw signals in their native observability tools.
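
The polling behaviour can be sketched independently of Streamlit and Redis. In the example below, fetch_completed is a hypothetical callable that returns the stage names finished so far, and on_update is a hypothetical callback that a UI could use to flip a spinner to a checkmark. The helper name, interval, and timeout are assumptions, and the stub backend simply completes one stage per poll.

```python
import time

STAGES = ["ingest_signals", "preprocess_signals", "generate_hypotheses", "rank_and_format"]


def poll_pipeline(fetch_completed, on_update, interval=0.5, timeout=300.0):
    """Poll until all stages are complete; return False if the timeout expires."""
    deadline = time.monotonic() + timeout
    reported = set()
    while True:
        completed = set(fetch_completed())
        # Report each newly finished stage exactly once, in pipeline order.
        for stage in STAGES:
            if stage in completed and stage not in reported:
                reported.add(stage)
                on_update(stage)
        if len(reported) == len(STAGES):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stub backend for illustration: one more stage is complete on every poll.
poll_count = {"n": 0}


def fake_fetch_completed():
    poll_count["n"] += 1
    return STAGES[: poll_count["n"]]


updates = []
finished = poll_pipeline(fake_fetch_completed, updates.append, interval=0.0)
print(finished, updates)
```

In a Streamlit app, on_update would typically write into a placeholder created with st.empty() instead of appending to a list.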

Reference Documentation

Architecture Reference

Detailed component diagrams, the full Celery task graph, and notes on deploying the agent in production environments with multiple workers.

Data Sources Reference

Documentation for every built-in connector, the connector interface for building custom integrations, and connector-specific configuration options.
