RCA Agent System Architecture and Component Overview

The DevOps RCA Agent is built around three guiding principles: asynchronous execution, modular composition, and pluggable data sources. Every stage of an analysis — from ingesting raw signals to producing ranked hypotheses — runs as an isolated, retriable unit of work. This design means the system scales horizontally by adding Celery workers, tolerates individual data source failures without aborting a full analysis, and can be extended with new connectors or LLM backends without touching core orchestration logic.

Component Overview

Agent Core

app/agent.py — The central orchestrator. Accepts analysis requests, fans out to enabled data source connectors, aggregates the returned signals, and drives the LLM reasoning pipeline.

Connector Layer

app/connectors/ — A collection of pluggable data source adapters. Each connector implements the BaseConnector interface, exposing fetch_signals() and health_check() methods.

Task Queue

Celery + Redis — Every pipeline stage is a discrete Celery task. Parallel fetch_signals sub-tasks run concurrently across connectors, and results are aggregated without blocking the UI or the main process.

LLM Client

app/llm.py — Wraps the OpenAI (or compatible) API. Manages prompt template rendering, automatic retries with exponential backoff, and token-budget enforcement to keep requests within model context limits.

Streamlit UI

app/ui.py — The browser-based frontend. Provides forms for submitting analysis jobs (time window, context, data sources), a live task-status tracker, and a structured results viewer for browsing hypotheses.

Result Store

Analysis results are persisted to Redis immediately after completion (short-term retention, configurable TTL). An optional database backend (PostgreSQL or SQLite) can be configured for longer-term storage and historical querying.

Agent Core

app/agent.py is the entry point for every analysis. When a request arrives, either from the Streamlit UI or directly via the Python API, the Agent Core validates the parameters, resolves which connectors are enabled, and dispatches the root run_analysis Celery task. Once the fetch_signals sub-tasks complete, the pipeline merges the returned Signal objects, constructs the LLM prompt, and writes the final ranked hypotheses to the result store.

Connector Layer

Each file under app/connectors/ is a self-contained adapter for one data platform. All connectors inherit from BaseConnector:

# app/connectors/base.py
from abc import ABC, abstractmethod
from datetime import datetime
from typing import List
from app.models import Signal

class BaseConnector(ABC):
    """Abstract base class for all RCA data source connectors."""

    @abstractmethod
    def fetch_signals(
        self,
        start_time: datetime,
        end_time: datetime,
        context: str = "",
    ) -> List[Signal]:
        """Fetch normalized signals for the given time window."""
        ...

    @abstractmethod
    def health_check(self) -> bool:
        """Return True if the upstream data source is reachable."""
        ...

Connectors are registered in app/connectors/__init__.py under CONNECTOR_REGISTRY, a plain dict keyed by connector name. To add a new data source, implement BaseConnector and add an entry to the registry.
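As a rough illustration of that workflow, the sketch below defines a toy connector and registers it. To stay self-contained it redeclares a minimal Signal and BaseConnector. The Signal fields, the StaticLogConnector class, the "static_logs" key, and the choice to store classes as registry values are all assumptions made for the example; the real app.models.Signal and registry may look different.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Type


@dataclass
class Signal:
    """Stand-in for app.models.Signal; these fields are assumed."""
    source: str
    name: str
    timestamp: datetime
    severity: float = 0.0


class BaseConnector(ABC):
    """Mirrors the interface shown above so this example runs on its own."""

    @abstractmethod
    def fetch_signals(self, start_time: datetime, end_time: datetime,
                      context: str = "") -> List[Signal]:
        ...

    @abstractmethod
    def health_check(self) -> bool:
        ...


class StaticLogConnector(BaseConnector):
    """Hypothetical connector that serves signals from an in-memory list."""

    def __init__(self, events: List[Signal]) -> None:
        self._events = events

    def fetch_signals(self, start_time: datetime, end_time: datetime,
                      context: str = "") -> List[Signal]:
        # Keep only events that fall inside the requested window.
        return [e for e in self._events if start_time <= e.timestamp <= end_time]

    def health_check(self) -> bool:
        # Nothing remote to contact, so the source is always reachable.
        return True


# Assumed registry shape: connector name -> connector class.
CONNECTOR_REGISTRY: Dict[str, Type[BaseConnector]] = {
    "static_logs": StaticLogConnector,
}

now = datetime(2024, 1, 1, 12, 0)
events = [
    Signal("static_logs", "oom_killer", now - timedelta(minutes=5), severity=0.9),
    Signal("static_logs", "disk_full", now - timedelta(hours=3), severity=0.4),
]
connector = CONNECTOR_REGISTRY["static_logs"](events)
recent = connector.fetch_signals(now - timedelta(minutes=30), now)
```

Only the event from five minutes ago falls inside the 30-minute window, so recent holds a single signal.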

Task Queue (Celery + Redis)

The task queue decouples the UI from long-running data fetches and LLM calls. The key tasks defined in app/worker.py are:

Task	Description
`run_analysis`	Top-level task; coordinates the full pipeline for one analysis request
`fetch_signals`	Per-connector sub-task; calls `connector.fetch_signals()` and returns normalized results
`aggregate_signals`	Merges Signal lists from all completed `fetch_signals` tasks
`call_llm`	Renders the prompt, calls the LLM, and parses the structured JSON response
`rank_hypotheses`	Sorts hypotheses by confidence score and writes results to the store

Redis serves a dual role: it is both the message broker (task routing) and the result backend (storing return values for chord / group primitives).

LLM Client

app/llm.py centralizes all LLM interactions. It loads prompt templates from app/prompts/, injects the aggregated signal summary, and calls the configured API endpoint. Responses are expected in a structured JSON schema:

{
  "hypotheses": [
    {
      "id": "h1",
      "summary": "Memory pressure on node-3 caused OOM kills in the payments service",
      "confidence": 0.87,
      "supporting_signals": ["metric:node_memory_MemAvailable", "log:oom_killer"],
      "recommended_action": "Scale node-3 vertically or reduce JVM heap allocation"
    }
  ]
}
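A response in this shape could be parsed and checked along the following lines. This is a minimal sketch rather than the actual app/llm.py code: the Hypothesis dataclass and parse_hypotheses function are hypothetical names, and the 0.0 to 1.0 confidence range is an assumption inferred from the example value.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    """Hypothetical container mirroring one entry of the JSON schema."""
    id: str
    summary: str
    confidence: float
    supporting_signals: List[str]
    recommended_action: str


def parse_hypotheses(raw: str) -> List[Hypothesis]:
    """Parse the LLM reply and reject entries that do not fit the schema."""
    payload = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(payload, dict) or not isinstance(payload.get("hypotheses"), list):
        raise ValueError("response must be an object with a 'hypotheses' list")
    parsed = []
    for item in payload["hypotheses"]:
        hypothesis = Hypothesis(**item)  # TypeError on missing or unexpected keys
        # Assumed range check; the source only shows 0.87 as an example.
        if not 0.0 <= hypothesis.confidence <= 1.0:
            raise ValueError(f"confidence out of range for {hypothesis.id}")
        parsed.append(hypothesis)
    return parsed


raw = """
{"hypotheses": [{
    "id": "h1",
    "summary": "Memory pressure on node-3 caused OOM kills in the payments service",
    "confidence": 0.87,
    "supporting_signals": ["metric:node_memory_MemAvailable", "log:oom_killer"],
    "recommended_action": "Scale node-3 vertically or reduce JVM heap allocation"
}]}
"""
hypotheses = parse_hypotheses(raw)
```

Failing loudly on a malformed reply gives the caller a clear point at which to retry the LLM call instead of storing a partial result.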

Token budgeting trims the signal payload if the estimated prompt length would exceed the configured LLM_MAX_TOKENS limit, prioritizing higher-severity signals.
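One way such trimming could work is a greedy pass over the signals in descending severity. The sketch below is illustrative only: the four-characters-per-token estimate, the (severity, text) tuple shape, and the reserved_tokens parameter are assumptions, not details taken from app/llm.py.

```python
from typing import List, Tuple


def estimate_tokens(text: str) -> int:
    """Crude heuristic: roughly four characters per token (an assumption)."""
    return max(1, len(text) // 4)


def trim_to_budget(signals: List[Tuple[float, str]], max_tokens: int,
                   reserved_tokens: int = 0) -> List[Tuple[float, str]]:
    """Keep the highest-severity signals whose text fits in the budget."""
    budget = max_tokens - reserved_tokens
    kept: List[Tuple[float, str]] = []
    used = 0
    # sorted() is stable, so signals with equal severity keep their input order.
    for severity, text in sorted(signals, key=lambda s: s[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost > budget:
            continue  # too large; a smaller lower-severity signal may still fit
        kept.append((severity, text))
        used += cost
    return kept


signals = [(0.2, "a" * 40), (0.9, "b" * 40), (0.5, "c" * 40)]
trimmed = trim_to_budget(signals, max_tokens=25)
```

Each sample signal costs an estimated 10 tokens, so a budget of 25 keeps the two most severe and drops the third. A real implementation would likely use the model's tokenizer instead of a character count.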

Streamlit UI

app/ui.py renders two primary views:

Submit Analysis — a form collecting the time window (start/end or relative minutes), free-text incident context, and a multi-select of enabled data sources.
Results Browser — polls the Celery result backend every two seconds and renders each hypothesis as an expandable card with confidence score, supporting signals, and recommended action.
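The polling behaviour of the Results Browser can be approximated with a small loop. This sketch is independent of Streamlit and Celery: wait_for_result, the get_status callable, and the status dict shape are hypothetical, while "SUCCESS" and "FAILURE" are Celery's built-in terminal state names.

```python
import time
from typing import Callable, Dict, Optional


def wait_for_result(
    analysis_id: str,
    get_status: Callable[[str], Dict],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Dict]:
    """Poll until the analysis reaches a terminal state; None on timeout."""
    waited = 0.0
    while waited <= timeout_s:
        status = get_status(analysis_id)
        if status.get("state") in ("SUCCESS", "FAILURE"):
            return status
        sleep(interval_s)
        waited += interval_s
    return None


# Simulated backend: two non-terminal replies, then a finished result.
replies = iter([
    {"state": "PENDING"},
    {"state": "STARTED"},
    {"state": "SUCCESS", "hypotheses": []},
])
final = wait_for_result("demo-id", lambda _id: next(replies), sleep=lambda _s: None)
```

Passing sleep in as a parameter keeps the loop testable without real delays. In the UI, get_status would presumably wrap a lookup against the result backend.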

Data Flow

The end-to-end flow from user request to ranked root-cause hypotheses proceeds through six stages.

User Submits an Analysis Request

The user fills in the Streamlit form (or calls Agent.run() directly) with a time window, optional incident description, and the set of data sources to query. The request is validated and a unique analysis_id is generated.
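The validation and ID generation in this stage might look roughly like the following. AnalysisRequest and its field names are hypothetical, as is the "prometheus" source name; the source does not say how analysis_id is generated, so a random UUID is used here as one plausible choice.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class AnalysisRequest:
    """Hypothetical request payload; validated on construction."""
    start_time: datetime
    end_time: datetime
    sources: List[str]
    context: str = ""
    # Assumed ID scheme: a random 32-character hex UUID.
    analysis_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def __post_init__(self) -> None:
        if self.start_time >= self.end_time:
            raise ValueError("start_time must be earlier than end_time")
        if not self.sources:
            raise ValueError("at least one data source must be selected")


request = AnalysisRequest(
    start_time=datetime(2024, 1, 1, 10, 0),
    end_time=datetime(2024, 1, 1, 11, 0),
    sources=["prometheus"],
    context="Checkout latency spike",
)
```

Validating in the constructor means an invalid time window or an empty source list is rejected before anything is enqueued.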

Agent Core Enqueues the Pipeline

app/agent.py dispatches a run_analysis Celery task carrying the validated request payload. Control returns immediately to the UI, which begins polling for status using the analysis_id.

Parallel Signal Fetching

The Celery worker executes a group of fetch_signals sub-tasks — one per enabled connector — in parallel. Each sub-task contacts its upstream data source, queries the relevant time window, and returns a list of normalized Signal objects.
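The fan-out pattern, including tolerance for a failing source, can be illustrated without a broker. Note that this sketch deliberately swaps the Celery group for the standard library's thread pool so it runs on its own. The fetch_all function and the source names are hypothetical, and signals are reduced to plain strings.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List, Tuple


def fetch_all(
    fetchers: Dict[str, Callable[[], List[str]]],
) -> Tuple[Dict[str, List[str]], Dict[str, str]]:
    """Run every fetcher in parallel; collect results and per-source errors."""
    results: Dict[str, List[str]] = {}
    errors: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(fetchers))) as pool:
        futures = {pool.submit(fn): name for name, fn in fetchers.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:  # one bad source must not abort the rest
                errors[name] = str(exc)
    return results, errors


def broken_source() -> List[str]:
    raise ConnectionError("upstream unreachable")


results, errors = fetch_all({
    "metrics": lambda: ["metric:node_memory_MemAvailable"],
    "logs": lambda: ["log:oom_killer"],
    "traces": broken_source,
})
```

The failing "traces" source ends up in errors while the other two still return their signals, which mirrors the goal of not aborting a full analysis when one data source is down.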

Signal Aggregation

Once all fetch_signals tasks complete, a Celery chord triggers the aggregate_signals task, which merges all Signal lists, deduplicates overlapping entries, and attaches severity weights based on signal type and anomaly score.
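The merge, deduplicate, and weight steps could be sketched as follows. The source does not specify the deduplication key or the weighting formula, so everything here is an assumption: signals are keyed by a hypothetical signal_id, the highest-scoring duplicate wins, and the weight is a per-type factor from a made-up TYPE_WEIGHTS table multiplied by the anomaly score.

```python
from dataclasses import dataclass, replace
from typing import Dict, Iterable, List

# Hypothetical base weights per signal type; real values would be tuned.
TYPE_WEIGHTS: Dict[str, float] = {"log": 1.0, "metric": 0.8, "trace": 0.6}


@dataclass(frozen=True)
class Signal:
    """Stand-in for app.models.Signal; these fields are assumed."""
    signal_id: str        # e.g. "log:oom_killer"
    signal_type: str      # "log", "metric", or "trace"
    anomaly_score: float  # assumed to lie between 0.0 and 1.0
    weight: float = 0.0   # filled in during aggregation


def aggregate(signal_lists: Iterable[List[Signal]]) -> List[Signal]:
    """Merge lists, keep the highest-scoring copy of each ID, attach weights."""
    best: Dict[str, Signal] = {}
    for signals in signal_lists:
        for sig in signals:
            current = best.get(sig.signal_id)
            if current is None or sig.anomaly_score > current.anomaly_score:
                best[sig.signal_id] = sig
    weighted = [
        replace(s, weight=TYPE_WEIGHTS.get(s.signal_type, 0.5) * s.anomaly_score)
        for s in best.values()
    ]
    # Heaviest signals first, so later stages can prioritize them.
    return sorted(weighted, key=lambda s: s.weight, reverse=True)


merged = aggregate([
    [Signal("log:oom_killer", "log", 0.7)],
    [Signal("log:oom_killer", "log", 0.9),
     Signal("metric:node_memory_MemAvailable", "metric", 0.5)],
])
```

The two copies of log:oom_killer collapse into the one with the higher anomaly score, and the result comes back ordered by weight.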

LLM Reasoning

The aggregated, weighted signal set is serialized into a structured prompt and sent to the LLM via app/llm.py. The model returns candidate root-cause hypotheses as structured JSON, each with a confidence score, a plain-language summary, a list of supporting signal IDs, and a recommended remediation action.

Ranking, Storage, and Display

Hypotheses are sorted by descending confidence score. The final result payload is written to the result store under the analysis_id key. The Streamlit UI detects the completed status on its next poll and renders the ranked hypothesis list.
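The final stage amounts to a sort followed by a keyed write. In this sketch a plain dict stands in for the Redis result store, and the rank_and_store function and the "status" field in the stored payload are hypothetical.

```python
import json
from typing import Dict, List


def rank_and_store(analysis_id: str, hypotheses: List[dict],
                   store: Dict[str, str]) -> List[dict]:
    """Sort by descending confidence and save the payload under analysis_id."""
    ranked = sorted(hypotheses, key=lambda h: h["confidence"], reverse=True)
    # Assumed payload shape; the dict is a stand-in for the real result store.
    store[analysis_id] = json.dumps({"status": "completed", "hypotheses": ranked})
    return ranked


store: Dict[str, str] = {}
ranked = rank_and_store(
    "demo-analysis",
    [{"id": "h2", "confidence": 0.41}, {"id": "h1", "confidence": 0.87}],
    store,
)
```

A poller that reads the stored payload can then check the status field and render the hypotheses in the order they were saved.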

Deployment Topology

The recommended deployment uses three Docker services: the Streamlit application, a Celery worker pool, and Redis. All services share environment configuration via an .env file.

# docker-compose.yml
services:
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  worker:
    build: .
    command: celery -A app.worker worker --loglevel=info --concurrency=4
    env_file: .env
    depends_on:
      - redis

  app:
    build: .
    command: streamlit run app/ui.py --server.port 8501
    ports:
      - "8501:8501"
    env_file: .env
    depends_on:
      - redis

For production deployments, consider the following adjustments:

Scaling the worker pool

Run multiple worker replicas against the shared Redis broker to increase analysis throughput. Assign each worker pool to a dedicated Celery queue (e.g., fetch, llm) so that slow LLM calls do not starve data-fetch tasks.

# Dedicated fetch queue — high concurrency
celery -A app.worker worker -Q fetch --concurrency=8

# Dedicated LLM queue: lower concurrency (use --time-limit to cap slow calls)
celery -A app.worker worker -Q llm --concurrency=2
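The -Q flag only controls which queues a worker consumes; tasks also need to be routed to those queues. One way to do that is Celery's task_routes setting, sketched below as a configuration fragment. The module name is hypothetical, and the dotted task names assume Celery's default naming for functions defined in app/worker.py.

```python
# celeryconfig.py (hypothetical module name)
# Route each task to the queue its worker pool consumes. The dotted names
# assume Celery's default task naming for functions in app/worker.py.
task_routes = {
    "app.worker.fetch_signals": {"queue": "fetch"},
    "app.worker.call_llm": {"queue": "llm"},
}
```

Tasks not listed in task_routes go to Celery's default queue, which is named "celery" unless configured otherwise, so at least one worker still needs to consume that queue for run_analysis, aggregate_signals, and rank_hypotheses.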

Persisting results beyond Redis TTL

Set RESULT_BACKEND_DB_URL to a PostgreSQL or SQLite connection string to enable the optional database result store. Redis results are still written for fast polling, but the database copy persists indefinitely and supports historical queries.

Running behind a reverse proxy

When placing the Streamlit UI behind Nginx or a cloud load balancer, set --server.baseUrlPath to match your path prefix and ensure WebSocket connections are forwarded, as Streamlit’s live-update mechanism relies on them.

For a full reference of all environment variables consumed by each service, see the Environment Variable Reference.
