The DevOps RCA Agent is built around three guiding principles: asynchronous execution, modular composition, and pluggable data sources. Every stage of an analysis, from ingesting raw signals to producing ranked hypotheses, runs as an isolated, retriable unit of work. This design means the system scales horizontally by adding Celery workers, tolerates individual data source failures without aborting a full analysis, and can be extended with new connectors or LLM backends without touching core orchestration logic.
Component Overview
- Agent Core (app/agent.py): The central orchestrator. Accepts analysis requests, fans out to enabled data source connectors, aggregates the returned signals, and drives the LLM reasoning pipeline.
- Connector Layer (app/connectors/): A collection of pluggable data source adapters. Each connector implements the BaseConnector interface, exposing fetch_signals() and health_check() methods.
- Task Queue (Celery + Redis): Every pipeline stage is a discrete Celery task. Parallel fetch_signals sub-tasks run concurrently across connectors, and results are aggregated without blocking the UI or the main process.
- LLM Client (app/llm.py): Wraps the OpenAI (or compatible) API. Manages prompt template rendering, automatic retries with exponential backoff, and token-budget enforcement to keep requests within model context limits.
- Streamlit UI (app/ui.py): The browser-based frontend. Provides forms for submitting analysis jobs (time window, context, data sources), a live task-status tracker, and a structured results viewer for browsing hypotheses.
- Result Store: Analysis results are persisted to Redis immediately after completion (short-term retention, configurable TTL). An optional database backend (PostgreSQL or SQLite) can be configured for longer-term storage and historical querying.
Agent Core
app/agent.py is the entry point for every analysis. When a request arrives — either from the Streamlit UI or directly via the Python API — the Agent Core validates the parameters, resolves which connectors are enabled, and dispatches the root run_analysis Celery task. After all sub-tasks complete, it merges the returned Signal objects, constructs the LLM prompt, and writes the final ranked hypotheses back to the result store.
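The validation step can be pictured with a small sketch. This is only an illustration under assumptions: `AnalysisRequest`, `validate_request`, and the connector names in `KNOWN_CONNECTORS` are hypothetical and are not taken from app/agent.py.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from uuid import uuid4

# Hypothetical connector names; the real ones would come from the registry.
KNOWN_CONNECTORS = {"logs", "metrics", "traces"}


@dataclass
class AnalysisRequest:
    """Assumed request shape: a time window, data sources, optional context."""
    start: datetime
    end: datetime
    sources: list[str]
    context: str = ""
    # A unique id generated per request, used later for status polling.
    analysis_id: str = field(default_factory=lambda: uuid4().hex)


def validate_request(request, known=KNOWN_CONNECTORS):
    """Return the enabled connector names, or raise ValueError."""
    if request.end <= request.start:
        raise ValueError("time window must end after it starts")
    if not request.sources:
        raise ValueError("at least one data source is required")
    unknown = sorted(set(request.sources) - set(known))
    if unknown:
        raise ValueError(f"unknown data sources: {unknown}")
    # Preserve request order while dropping duplicates.
    return list(dict.fromkeys(request.sources))


if __name__ == "__main__":
    now = datetime.now()
    req = AnalysisRequest(now - timedelta(minutes=30), now, ["logs", "metrics"])
    print(req.analysis_id, validate_request(req))
```

Rejecting a bad request before anything is enqueued keeps malformed payloads out of the worker pool, which is presumably why validation sits in the Agent Core rather than in the tasks themselves.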
Connector Layer
Each file under app/connectors/ is a self-contained adapter for one data platform. All connectors inherit from BaseConnector, which defines the fetch_signals() and health_check() methods.
Connectors are registered in app/connectors/__init__.py under CONNECTOR_REGISTRY, a plain dict keyed by connector name. Adding a new data source means implementing BaseConnector and adding an entry to the registry.
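A minimal sketch of what such an interface and registry might look like follows. Only the names BaseConnector, fetch_signals(), health_check(), Signal, and CONNECTOR_REGISTRY come from this page; the method signatures, the `Signal` fields, the `StaticConnector` class, and the choice to store classes (rather than instances) in the registry are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Signal:
    # Field names are assumptions; the page only names the Signal type.
    id: str
    source: str
    kind: str
    severity: float
    message: str


class BaseConnector(ABC):
    """Interface sketch: the two method names are from the docs, signatures are assumed."""

    @abstractmethod
    def fetch_signals(self, start: datetime, end: datetime) -> list[Signal]:
        """Return normalized signals observed between start and end."""

    @abstractmethod
    def health_check(self) -> bool:
        """Return True when the upstream data source is reachable."""


class StaticConnector(BaseConnector):
    """Hypothetical connector that serves canned signals, for illustration only."""

    def __init__(self, signals):
        self._signals = list(signals)

    def fetch_signals(self, start, end):
        return list(self._signals)

    def health_check(self):
        return True


# A plain dict keyed by connector name, mirroring the registry described above.
CONNECTOR_REGISTRY = {"static": StaticConnector}
```

Because the abstract base refuses to instantiate until both methods are implemented, an incomplete connector fails as soon as it is constructed instead of partway through an analysis.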
Task Queue (Celery + Redis)
The task queue decouples the UI from long-running data fetches and LLM calls. The key tasks defined in app/worker.py are:
| Task | Description |
|---|---|
| run_analysis | Top-level task; coordinates the full pipeline for one analysis request |
| fetch_signals | Per-connector sub-task; calls connector.fetch_signals() and returns normalized results |
| aggregate_signals | Merges Signal lists from all completed fetch_signals tasks |
| call_llm | Renders the prompt, calls the LLM, and parses the structured JSON response |
| rank_hypotheses | Sorts hypotheses by confidence score and writes results to the store |
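The fan-out to fetch_signals and the fan-in to aggregate_signals are expressed with Celery's canvas primitives (chord / group).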
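The fan-out/fan-in shape of the pipeline can be sketched without a broker. The example below is a stand-in, not the project's Celery code: it uses `concurrent.futures` from the standard library in place of Celery's group and chord, represents signals as plain dicts, and assumes that deduplication keys on a signal `id`. The function names echo the task table, but the bodies are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_signals(name, fetcher):
    """Per-connector unit of work: a failure yields an empty list, not an abort."""
    try:
        return list(fetcher())
    except Exception as exc:  # broad on purpose: one bad source must not sink the run
        print(f"connector {name!r} failed: {exc}")
        return []


def aggregate_signals(signal_lists):
    """Merge per-connector lists, keeping the first signal seen for each id."""
    merged = {}
    for signals in signal_lists:
        for signal in signals:
            merged.setdefault(signal["id"], signal)
    return list(merged.values())


def run_analysis(fetchers):
    """Fan out one fetch per connector, then fan in once all have finished."""
    with ThreadPoolExecutor(max_workers=max(1, len(fetchers))) as pool:
        futures = [pool.submit(fetch_signals, n, f) for n, f in fetchers.items()]
        results = [future.result() for future in futures]
    return aggregate_signals(results)
```

In the real system the same shape would run across worker processes rather than threads, with the aggregation step firing only after every fetch has reported back.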
LLM Client
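app/llm.py centralizes all LLM interactions. It loads prompt templates from app/prompts/, injects the aggregated signal summary, and calls the configured API endpoint. Responses are expected as structured JSON: each hypothesis carries a confidence score, a plain-language summary, a list of supporting signal IDs, and a recommended remediation action.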
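As an illustration of consuming such a response, the sketch below parses a JSON reply into typed objects and orders them best first. The hypothesis fields this page describes (confidence score, summary, supporting signal IDs, recommended action) are mapped to key names that are assumptions, as are the top-level `hypotheses` key and the 0 to 1 confidence range.

```python
import json
from dataclasses import dataclass


@dataclass
class Hypothesis:
    summary: str
    confidence: float
    supporting_signals: list[str]
    recommended_action: str


def parse_hypotheses(raw):
    """Parse the model's JSON reply and return hypotheses, best first."""
    payload = json.loads(raw)
    hypotheses = []
    for item in payload["hypotheses"]:  # assumed top-level key
        confidence = float(item["confidence"])
        if not 0.0 <= confidence <= 1.0:  # assumed score range
            raise ValueError(f"confidence out of range: {confidence}")
        hypotheses.append(Hypothesis(
            summary=item["summary"],
            confidence=confidence,
            supporting_signals=list(item.get("supporting_signals", [])),
            recommended_action=item.get("recommended_action", ""),
        ))
    # Same ordering rule as rank_hypotheses: highest confidence first.
    return sorted(hypotheses, key=lambda h: h.confidence, reverse=True)
```

Validating the reply at the boundary means a malformed or out-of-range answer surfaces as an exception that the retry logic can act on, rather than as a confusing card in the UI.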
Before the prompt is sent, the signal summary is trimmed to fit within the LLM_MAX_TOKENS limit, prioritizing higher-severity signals.
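One way to picture severity-first trimming is a greedy pass over the signals. This is a sketch under assumptions, not the logic in app/llm.py: `estimate_tokens` is a crude character-count heuristic rather than a real tokenizer, signals are plain dicts with hypothetical `severity` and `message` keys, and `max_tokens` stands in for the LLM_MAX_TOKENS setting.

```python
def estimate_tokens(text):
    """Rough heuristic (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def fit_to_budget(signals, max_tokens):
    """Keep the highest-severity signals whose combined estimate fits max_tokens."""
    kept, used = [], 0
    for signal in sorted(signals, key=lambda s: s["severity"], reverse=True):
        cost = estimate_tokens(signal["message"])
        if used + cost > max_tokens:
            continue  # skip it; a shorter, lower-severity signal may still fit
        kept.append(signal)
        used += cost
    return kept
```

A greedy pass is simple and predictable, at the cost of sometimes leaving a little budget unused; a production client would more likely count tokens with the model's own tokenizer.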
Streamlit UI
app/ui.py renders two primary views:
- Submit Analysis — a form collecting the time window (start/end or relative minutes), free-text incident context, and a multi-select of enabled data sources.
- Results Browser — polls the Celery result backend every two seconds and renders each hypothesis as an expandable card with confidence score, supporting signals, and recommended action.
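The two-second polling behaviour can be sketched as a small loop. This is an assumption-laden illustration rather than the code in app/ui.py: `get_status` is a hypothetical callable returning a dict with a `state` key, the timeout is invented, and the terminal states reuse Celery's standard `SUCCESS` and `FAILURE` state names.

```python
import time


def poll_until_done(get_status, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Call get_status() every `interval` seconds until it reports a final state."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status["state"] in {"SUCCESS", "FAILURE"}:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("analysis did not finish before the timeout")
        sleep(interval)
```

Passing `sleep` as a parameter keeps the loop testable without real waiting; a Streamlit page would more likely trigger a rerun on each tick than block inside a loop like this.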
Data Flow
The end-to-end flow from user request to ranked root-cause hypotheses proceeds through six stages.
1. User Submits an Analysis Request
   The user fills in the Streamlit form (or calls Agent.run() directly) with a time window, optional incident description, and the set of data sources to query. The request is validated and a unique analysis_id is generated.
2. Agent Core Enqueues the Pipeline
   app/agent.py dispatches a run_analysis Celery task carrying the validated request payload. Control returns immediately to the UI, which begins polling for status using the analysis_id.
3. Parallel Signal Fetching
   The Celery worker executes a group of fetch_signals sub-tasks (one per enabled connector) in parallel. Each sub-task contacts its upstream data source, queries the relevant time window, and returns a list of normalized Signal objects.
4. Signal Aggregation
   Once all fetch_signals tasks complete (via a Celery chord), the aggregate_signals task merges all Signal lists, deduplicates overlapping entries, and attaches severity weights based on signal type and anomaly score.
5. LLM Reasoning
   The aggregated, weighted signal set is serialized into a structured prompt and sent to the LLM via app/llm.py. The model returns candidate root-cause hypotheses as structured JSON, each with a confidence score, a plain-language summary, a list of supporting signal IDs, and a recommended remediation action.
6. Hypothesis Ranking
   The rank_hypotheses task sorts the hypotheses by confidence score and writes them to the result store, where the Results Browser picks them up.
Deployment Topology
The recommended deployment uses three Docker services: the Streamlit application, a Celery worker pool, and Redis. All services share environment configuration via a .env file.
Scaling the worker pool
Run multiple worker replicas behind a shared Redis broker to increase analysis throughput. Each worker should be assigned to a dedicated Celery queue (e.g., fetch, llm) so that slow LLM calls do not starve data-fetch tasks.
Persisting results beyond Redis TTL
Set RESULT_BACKEND_DB_URL to a PostgreSQL or SQLite connection string to enable the optional database result store. Redis results are still written for fast polling, but the database copy persists indefinitely and supports historical queries.
Running behind a reverse proxy
When placing the Streamlit UI behind Nginx or a cloud load balancer, set --server.baseUrlPath to match your path prefix and ensure WebSocket connections are forwarded, as Streamlit’s live-update mechanism relies on them.
For a full reference of all environment variables consumed by each service, see the Environment Variable Reference.