While environment variables handle secrets and infrastructure addresses, the agent.yaml file at the project root controls how the agent thinks and behaves. This file is the single place to tune the analysis pipeline: which data sources are active, how hypotheses are scored and ranked, which LLM prompt templates are used, and how the Streamlit UI presents results. You can commit agent.yaml to version control safely — it contains no credentials, only structural and behavioral configuration.
Environment variables always take precedence over values in agent.yaml. This means you can commit sensible defaults in agent.yaml and override specific keys per deployment using your .env file or CI/CD secrets — without modifying the YAML. For example, setting LLM_MODEL=gpt-3.5-turbo in .env overrides llm.model: gpt-4o in agent.yaml.
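The precedence rule can be illustrated with a minimal Python sketch. It is not the agent's actual loader: the `ENV_OVERRIDES` table and the `resolve_setting` helper are hypothetical names, and only the four variable-to-key pairs named on this page are listed. Coercing the environment string to the type of the YAML default is also an assumption.

```python
from typing import Any, Mapping

# Hypothetical mapping from environment variable names to agent.yaml key paths.
# Only the pairs documented on this page are included.
ENV_OVERRIDES = {
    "LLM_MODEL": ("llm", "model"),
    "LLM_TEMPERATURE": ("llm", "temperature"),
    "ANALYSIS_WINDOW_MINUTES": ("analysis", "window_minutes"),
    "MAX_HYPOTHESES": ("analysis", "max_hypotheses"),
}


def resolve_setting(yaml_config: Mapping[str, Any], env: Mapping[str, str], env_name: str) -> Any:
    """Return the environment value if it is set, otherwise the agent.yaml value."""
    section, key = ENV_OVERRIDES[env_name]
    yaml_value = yaml_config[section][key]
    raw = env.get(env_name)
    if raw is None:
        return yaml_value
    # Environment values are strings; coerce to the type of the YAML default (assumption).
    return type(yaml_value)(raw)


# Values as they would appear after parsing the default agent.yaml.
yaml_config = {
    "llm": {"model": "gpt-4o", "temperature": 0.2},
    "analysis": {"window_minutes": 60, "max_hypotheses": 5},
}
env = {"LLM_MODEL": "gpt-3.5-turbo"}  # stands in for os.environ

print(resolve_setting(yaml_config, env, "LLM_MODEL"))        # gpt-3.5-turbo
print(resolve_setting(yaml_config, env, "LLM_TEMPERATURE"))  # 0.2
```

Passing a plain mapping instead of reading `os.environ` directly keeps the sketch easy to test; in a real process you would pass `os.environ`.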

Full Example Configuration

The following file shows all available sections and their default values. Copy this into your project root as agent.yaml and adjust as needed.
agent.yaml
analysis:
  window_minutes: 60
  max_hypotheses: 5
  min_confidence_score: 0.3
  sources:            # valid values: prometheus, datadog, elasticsearch, loki, jaeger, otel
    - prometheus
    - elasticsearch
    - jaeger

ranking:
  weights:
    log_anomaly: 0.4
    metric_spike: 0.35
    trace_error_rate: 0.25

llm:
  model: gpt-4o
  temperature: 0.2
  system_prompt_template: prompts/system.txt
  user_prompt_template: prompts/user.txt

ui:
  page_title: "RCA Agent"
  show_raw_signals: true
  max_evidence_snippets: 10

analysis Section

The analysis block defines the scope and data inputs for each analysis run. These settings determine how broad a net the agent casts when gathering evidence before reasoning begins.
analysis.window_minutes (integer, default: 60)
Minutes of telemetry to fetch on either side of the incident start time, producing a total window of 2 × window_minutes. Wider windows improve detection of gradual degradations and upstream causes but increase query volume and LLM token cost. Overridden by the ANALYSIS_WINDOW_MINUTES environment variable.
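As a rough illustration of the window arithmetic, the following sketch computes the query bounds from an incident start time. The `analysis_window` function name and the sample timestamp are made up for this example; only the "window_minutes on either side" rule comes from the description above.

```python
from datetime import datetime, timedelta, timezone


def analysis_window(incident_start: datetime, window_minutes: int = 60) -> tuple[datetime, datetime]:
    """Return (start, end) spanning window_minutes before and after the incident start."""
    delta = timedelta(minutes=window_minutes)
    return incident_start - delta, incident_start + delta


# Made-up incident time, used only to show the arithmetic.
incident_start = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
start, end = analysis_window(incident_start, window_minutes=60)

print(start.isoformat())  # 2024-01-15T11:00:00+00:00
print(end.isoformat())    # 2024-01-15T13:00:00+00:00
print(end - start)        # 2:00:00
```

With the default of 60, the agent would therefore query two hours of telemetry per source.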
analysis.max_hypotheses (integer, default: 5)
Maximum number of root-cause hypotheses to generate and return per run. The LLM is instructed to produce at most this many candidates. After LLM generation, hypotheses below min_confidence_score are dropped, so the final list may be shorter. Overridden by the MAX_HYPOTHESES environment variable.
analysis.min_confidence_score (float, default: 0.3)
Minimum confidence score (0.0–1.0) for a hypothesis to appear in the final results. Hypotheses generated by the LLM are scored by the ranking step; any below this threshold are silently filtered. Lower this value to surface speculative hypotheses during investigations; raise it to reduce noise in automated pipelines.
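The interaction between max_hypotheses and min_confidence_score might look like the sketch below. The `filter_hypotheses` helper, the `Hypothesis` type, and the sample candidates are hypothetical. Treating a score equal to the threshold as passing, and sorting the survivors by confidence, are assumptions rather than documented behavior.

```python
from typing import TypedDict


class Hypothesis(TypedDict):
    title: str
    confidence: float


def filter_hypotheses(
    candidates: list[Hypothesis],
    max_hypotheses: int = 5,
    min_confidence_score: float = 0.3,
) -> list[Hypothesis]:
    """Cap the candidate list, then drop candidates scored below the threshold."""
    capped = candidates[:max_hypotheses]
    kept = [h for h in capped if h["confidence"] >= min_confidence_score]
    # Highest-confidence hypotheses first (ordering is an assumption).
    return sorted(kept, key=lambda h: h["confidence"], reverse=True)


# Made-up scored candidates for illustration.
candidates: list[Hypothesis] = [
    {"title": "Bad deploy", "confidence": 0.41},
    {"title": "Connection pool exhaustion", "confidence": 0.82},
    {"title": "DNS flapping", "confidence": 0.12},
]

result = filter_hypotheses(candidates)
print([h["title"] for h in result])
# ['Connection pool exhaustion', 'Bad deploy']
```

Because filtering happens after the cap, a run configured for five hypotheses can return fewer than five, as in this example.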
analysis.sources (array, default: [prometheus, elasticsearch, jaeger])
List of data source identifiers to query during analysis. Valid values are: prometheus, datadog, elasticsearch, loki, jaeger, otel. Only sources with corresponding credentials configured in .env will succeed at query time — listing a source here without credentials will log a warning and skip that source rather than failing the entire run.
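The warn-and-skip behavior could be sketched as follows. Everything in `CREDENTIAL_VARS` is a placeholder: this page does not name the credential variables, so consult the Environment Variables reference for the real ones. The `usable_sources` helper and the logger name are also hypothetical, and rejecting unknown identifiers with an error is an assumption.

```python
import logging
from typing import Iterable, Mapping

logger = logging.getLogger("rca_agent.sources")  # hypothetical logger name

VALID_SOURCES = ("prometheus", "datadog", "elasticsearch", "loki", "jaeger", "otel")

# Placeholder credential variable names, invented for this sketch.
CREDENTIAL_VARS = {
    "prometheus": "PROMETHEUS_URL",
    "datadog": "DATADOG_API_KEY",
    "elasticsearch": "ELASTICSEARCH_URL",
    "loki": "LOKI_URL",
    "jaeger": "JAEGER_URL",
    "otel": "OTEL_ENDPOINT",
}


def usable_sources(configured: Iterable[str], env: Mapping[str, str]) -> list[str]:
    """Return configured sources that have credentials, warning about the rest."""
    usable = []
    for name in configured:
        if name not in VALID_SOURCES:
            # Assumption: unknown identifiers are a configuration error.
            raise ValueError(f"unknown source in analysis.sources: {name!r}")
        var = CREDENTIAL_VARS[name]
        if not env.get(var):
            # Skip this source only; the rest of the run continues.
            logger.warning("Skipping source %r: %s is not set", name, var)
            continue
        usable.append(name)
    return usable


env = {"PROMETHEUS_URL": "http://localhost:9090", "JAEGER_URL": "http://localhost:16686"}
print(usable_sources(["prometheus", "elasticsearch", "jaeger"], env))
# ['prometheus', 'jaeger']
```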

ranking Section

After the LLM produces candidate hypotheses, the ranking step assigns a composite confidence score to each one using weighted signal contributions. Adjusting these weights lets you express which signal types are most reliable in your environment.
ranking.weights.log_anomaly (float, default: 0.4)
Weight assigned to the log anomaly signal — error rate spikes, novel error message clusters, and sudden increases in log volume. In most environments, logs are the highest-fidelity source for diagnosing application-layer failures, which is why this weight is the largest by default.
ranking.weights.metric_spike (float, default: 0.35)
Weight assigned to metric anomalies detected in Prometheus or Datadog — CPU saturation, memory pressure, latency percentile increases, and throughput drops. Metric signals are highly reliable for infrastructure-layer causes.
ranking.weights.trace_error_rate (float, default: 0.25)
Weight assigned to trace-level evidence — error spans, high-latency service calls, and dependency failures visible in Jaeger or OTEL traces. Traces provide precise call-graph context but can be incomplete if sampling rates are low, hence the lower default weight.
The three weights must sum to 1.0. The agent validates this constraint at startup and refuses to start if the sum differs from 1.0 by more than a tolerance of 0.001. If you add custom signal types via the plugin system, adjust all three weights accordingly.
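A minimal sketch of the startup check and of one plausible way to combine the weights is shown below. The source only says the score uses "weighted signal contributions", so the plain weighted sum in `composite_score` is an assumption, as are both function names and the sample signal strengths.

```python
DEFAULT_WEIGHTS = {"log_anomaly": 0.4, "metric_spike": 0.35, "trace_error_rate": 0.25}


def validate_weights(weights: dict[str, float], tolerance: float = 0.001) -> None:
    """Raise ValueError if the weights do not sum to 1.0 within the tolerance."""
    total = sum(weights.values())
    if abs(total - 1.0) > tolerance:
        raise ValueError(f"ranking.weights must sum to 1.0, got {total:.4f}")


def composite_score(signals: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-signal strengths in [0, 1]; missing signals count as 0."""
    return sum(weight * signals.get(name, 0.0) for name, weight in weights.items())


validate_weights(DEFAULT_WEIGHTS)  # passes silently

# Made-up signal strengths for a single hypothesis.
signals = {"log_anomaly": 0.9, "metric_spike": 0.5, "trace_error_rate": 0.2}
print(round(composite_score(signals, DEFAULT_WEIGHTS), 3))  # 0.585
```

Since the weights sum to 1.0 and each signal lies in [0, 1], a weighted sum of this form stays in the same 0.0 to 1.0 range that min_confidence_score is compared against.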

llm Section

The llm block configures the language model integration at the application level. The model and temperature values can be overridden by the corresponding environment variables (LLM_MODEL and LLM_TEMPERATURE).
llm.model (string, default: "gpt-4o")
The model identifier used for hypothesis generation. Accepts any model name valid for your configured provider. Overridden by the LLM_MODEL environment variable.
llm.temperature (float, default: 0.2)
Sampling temperature for the model during hypothesis generation. See the Environment Variables reference for detailed guidance on choosing a temperature value. Overridden by the LLM_TEMPERATURE environment variable.
llm.system_prompt_template (string, default: "prompts/system.txt")
Path to the system prompt template file, relative to the project root. This file defines the model’s role, reasoning approach, and output format constraints. See Prompt Templates below.
llm.user_prompt_template (string, default: "prompts/user.txt")
Path to the user prompt template file, relative to the project root. This file contains the Jinja2 template that formats the collected signals into the user message sent to the model. See Prompt Templates below.

ui Section

The ui block controls the Streamlit front-end. Streamlit is the interactive layer through which engineers submit incidents, monitor analysis progress, and explore hypothesis evidence.
ui.page_title (string, default: "RCA Agent")
The browser tab title and Streamlit page title displayed in the header. Customize this if you’re deploying the agent with a team- or product-specific name (e.g. "Platform RCA — Payments Team").
ui.show_raw_signals (boolean, default: true)
When true, the UI renders an expandable Raw Signals panel on the analysis results page, showing the exact log excerpts, metric time-series JSON, and trace span lists that were passed to the LLM. Disable this for a cleaner read-only view in shared dashboards where raw data may be noisy or sensitive.
ui.max_evidence_snippets (integer, default: 10)
The maximum number of supporting evidence snippets (log lines, metric data points, or trace spans) shown per hypothesis in the UI. Increasing this provides more context for investigation but can make the results page harder to scan. The full evidence set is always available via the API regardless of this display limit.

Prompt Templates

The agent uses two Jinja2 template files to construct LLM prompts. These files live in the prompts/ directory at the project root and are referenced by path in agent.yaml.

prompts/system.txt

The system prompt establishes the model’s identity, reasoning framework, and output contract. It instructs the model to act as a senior site reliability engineer, reason step-by-step from evidence to cause, and return structured JSON with hypothesis fields: title, description, confidence, and evidence_references. Customize the system prompt to:
  • Encode team-specific runbook conventions or terminology
  • Add output format constraints for downstream automation (e.g. requiring a remediation_steps field)
  • Tune the reasoning style (e.g. emphasizing five-whys methodology)

prompts/user.txt

The user prompt template is rendered at runtime for each analysis run. It receives a Jinja2 context object with the following variables:
| Variable | Type | Description |
| --- | --- | --- |
| incident | object | Incident metadata: id, title, start_time, severity |
| log_anomalies | list | Top log error clusters and volume spikes |
| metric_anomalies | list | Metric time-series with detected change points |
| trace_errors | list | Error spans and slow call chains from Jaeger/OTEL |
| analysis_window | object | start and end ISO timestamps for the query window |
prompts/user.txt
You are analyzing incident: {{ incident.title }} (severity: {{ incident.severity }})
Analysis window: {{ analysis_window.start }} → {{ analysis_window.end }}

## Log Anomalies
{% for anomaly in log_anomalies %}
- [{{ anomaly.service }}] {{ anomaly.message_pattern }} ({{ anomaly.count }} occurrences)
{% endfor %}

## Metric Anomalies
{% for metric in metric_anomalies %}
- {{ metric.name }}: baseline {{ metric.baseline }} → spike {{ metric.peak }} at {{ metric.peak_time }}
{% endfor %}

## Trace Errors
{% for span in trace_errors %}
- {{ span.service }} → {{ span.operation }}: {{ span.error_message }} (duration: {{ span.duration_ms }}ms)
{% endfor %}

Based on the evidence above, identify up to {{ max_hypotheses }} root cause hypotheses.
Return a JSON array of hypothesis objects with fields: title, description, confidence, evidence_references.
Modifying the output format specification in prompts/system.txt — such as adding or renaming JSON fields — requires corresponding updates to the agent’s hypothesis parser. Mismatched field names will cause parsing errors and failed analysis runs.
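To show why mismatched field names cause failed runs, here is a minimal sketch of a strict hypothesis parser. It is not the agent's actual parser: `parse_hypotheses` and `REQUIRED_FIELDS` are hypothetical names, and the only facts taken from this page are the four expected fields and the JSON-array output format.

```python
import json

# Fields the default system prompt asks the model to return.
REQUIRED_FIELDS = {"title", "description", "confidence", "evidence_references"}


def parse_hypotheses(raw: str) -> list[dict]:
    """Parse the model output and fail loudly if any expected field is missing."""
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of hypothesis objects")
    for index, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"hypothesis {index} is not a JSON object")
        missing = REQUIRED_FIELDS - set(item)
        if missing:
            raise ValueError(f"hypothesis {index} is missing fields: {sorted(missing)}")
    return data


# Made-up model output that follows the default contract.
valid = '[{"title": "Bad deploy", "description": "Rollout at 12:01", "confidence": 0.7, "evidence_references": ["log-3"]}]'
print(parse_hypotheses(valid)[0]["title"])  # Bad deploy

# Output produced after renaming "confidence" to "score" in the prompt only.
renamed = '[{"title": "Bad deploy", "description": "Rollout at 12:01", "score": 0.7, "evidence_references": []}]'
try:
    parse_hypotheses(renamed)
except ValueError as exc:
    print(exc)  # hypothesis 0 is missing fields: ['confidence']
```

If you add a field such as remediation_steps to the prompt, the parser's expected field set needs the same change.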
