Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/vrashmanyu605-eng/devops-root-cause-analysis-agent/llms.txt

Use this file to discover all available pages before exploring further.

The DevOps RCA Agent uses a .env file to manage runtime configuration. This approach keeps sensitive credentials out of source code and lets you maintain separate configurations for development, staging, and production environments without changing any application logic. Every variable listed here can also be set as a real OS environment variable — the agent reads os.environ with .env file values as fallback defaults via python-dotenv.

Getting Started

Copy the provided example file to create your local configuration:
cp .env.example .env
Then open .env in your editor and fill in the required values. A full example file with all supported variables is shown below.
.env
# ── AI / LLM ─────────────────────────────────────────────
OPENAI_API_KEY=sk-...
OPENAI_BASE_URL=                    # optional: override for Azure, Groq, Ollama, etc.
LLM_MODEL=gpt-4o
LLM_MAX_TOKENS=4096
LLM_TEMPERATURE=0.2

# ── Redis / Celery ────────────────────────────────────────
REDIS_URL=redis://localhost:6379/0
CELERY_CONCURRENCY=4
CELERY_TASK_TIMEOUT=300

# ── Application ───────────────────────────────────────────
LOG_LEVEL=INFO
ANALYSIS_WINDOW_MINUTES=60
MAX_HYPOTHESES=5

# ── Observability Integrations ────────────────────────────
PROMETHEUS_URL=http://prometheus:9090
ELASTICSEARCH_URL=http://elasticsearch:9200
JAEGER_QUERY_URL=http://jaeger:16686
LOKI_URL=http://loki:3100
Never commit your .env file to version control. It contains secrets such as API keys and database credentials. Ensure .env is listed in your .gitignore before running git add.

AI / LLM Settings

These variables control which language model the agent uses for hypothesis generation and how it samples responses. The agent constructs structured prompts from correlated signals and submits them to the configured LLM endpoint, so model choice directly affects analysis quality and latency.
OPENAI_API_KEY
string
required
The API key for your LLM provider. For OpenAI, retrieve this from platform.openai.com/api-keys. The agent also accepts keys for OpenAI-compatible endpoints such as Azure OpenAI, Groq, or a local Ollama instance — pair this with OPENAI_BASE_URL to point the client at the correct host.
OPENAI_BASE_URL
string
Override the base URL used for all LLM API calls. Leave unset to use the default OpenAI endpoint (https://api.openai.com/v1). Set this when routing requests through an OpenAI-compatible provider (e.g. https://api.groq.com/openai/v1 for Groq, or http://localhost:11434/v1 for a local Ollama instance).
LLM_MODEL
string
default:"gpt-4o"
The model identifier sent in every chat completion request. Use any model name recognized by your provider, such as gpt-4o, gpt-4-turbo, gpt-3.5-turbo, or a fine-tuned deployment name. Larger context models are recommended when ANALYSIS_WINDOW_MINUTES is high, as signal payloads can be substantial.
LLM_MAX_TOKENS
integer
default:"4096"
The maximum number of tokens the model may generate in a single response. Increase this if the agent is truncating hypothesis explanations for complex incidents. Decrease it to reduce cost on simpler analyses. This maps directly to the max_tokens parameter in the chat completions API.
LLM_TEMPERATURE
float
default:"0.2"
Sampling temperature for the model. A value of 0.0 makes responses fully deterministic; values approaching 1.0 introduce more variability. The default of 0.2 balances reproducibility with the natural-language fluency needed for hypothesis narratives. Avoid setting this above 0.5 in production — higher values can produce hallucinated correlations.

Redis / Celery Settings

The agent uses Celery backed by Redis to run analysis tasks asynchronously, allowing the Streamlit UI to remain responsive while correlations and LLM calls are processed in the background. Redis also serves as the result backend, storing completed analysis payloads until the UI polls for them.
REDIS_URL
string
required
Full connection URL for your Redis instance, used as both the Celery broker and the result backend. Supported schemes:
FormatExample
Localredis://localhost:6379/0
With passwordredis://:yourpassword@localhost:6379/0
TLS (Redis Cloud / Upstash)rediss://user:pass@host:6380/0
Database index 0 is the default; use a different index (e.g. /1) to isolate the agent’s queues from other applications sharing the same Redis instance.
CELERY_CONCURRENCY
integer
default:"4"
The number of worker processes (prefork pool) that Celery spawns. Each worker process can handle one analysis task at a time. Set this to the number of CPU cores available on your worker host for CPU-bound workloads, or higher (up to 2×) for I/O-bound tasks dominated by external API calls. Overridden at runtime with --concurrency when launching the worker manually.
CELERY_TASK_TIMEOUT
integer
default:"300"
Hard timeout in seconds for a single analysis task. If a task does not complete within this window — for example because an LLM call hangs or a datasource is unreachable — Celery terminates it and marks it as failed. The UI surfaces a timeout error with the task ID so the incident can be resubmitted. Increase this for analysis windows exceeding 120 minutes where data volumes are large.

Application Settings

These variables tune core application behavior, including how much historical data the agent fetches around an incident and how verbose its logging output is.
LOG_LEVEL
string
default:"INFO"
Controls the minimum severity level written to stdout and the log file. Accepted values (least to most verbose): ERROR, WARNING, INFO, DEBUG. Use DEBUG during local development to trace inter-component calls and prompt payloads. Set ERROR or WARNING in high-throughput production environments to reduce log volume.
ANALYSIS_WINDOW_MINUTES
integer
default:"60"
The number of minutes before and after an incident’s reported start time from which the agent fetches logs, metrics, and traces. A window of 60 means the agent queries data from [incident_time - 60m, incident_time + 60m] — a 2-hour span centered on the event. Widening this window improves detection of slow-building anomalies but increases query cost and LLM token usage.
MAX_HYPOTHESES
integer
default:"5"
The maximum number of ranked root-cause hypotheses the agent returns per analysis run. The LLM is explicitly prompted to generate no more than this many candidates, and the ranking step filters further based on min_confidence_score from agent.yaml. Setting this to 1 forces the agent to commit to a single best explanation, which can be useful in automated remediation pipelines.
All application settings can also be configured in agent.yaml (see Agent Settings). Environment variables always take precedence over file-based config, so you can keep sensible defaults in agent.yaml and use .env to override them per deployment.

Build docs developers (and LLMs) love