While environment variables handle secrets and infrastructure addresses, the agent.yaml file at the project root controls how the agent thinks and behaves. This file is the single place to tune the analysis pipeline: which data sources are active, how hypotheses are scored and ranked, which LLM prompt templates are used, and how the Streamlit UI presents results. You can commit agent.yaml to version control safely: it contains no credentials, only structural and behavioral configuration.
Full Example Configuration
The following file shows all available sections and their default values. Copy this into your project root as agent.yaml and adjust as needed.
agent.yaml
analysis Section
The analysis block defines the scope and data inputs for each analysis run. These settings determine how broad a net the agent casts when gathering evidence before reasoning begins.
Minutes of telemetry to fetch on either side of the incident start time, producing a total window of 2 × window_minutes. Wider windows improve detection of gradual degradations and upstream causes but increase query volume and LLM token cost. Overridden by the ANALYSIS_WINDOW_MINUTES environment variable.
Maximum number of root-cause hypotheses to generate and return per run. The LLM is instructed to produce at most this many candidates. After LLM generation, hypotheses below min_confidence_score are dropped, so the final list may be shorter. Overridden by the MAX_HYPOTHESES environment variable.
Minimum confidence score (0.0–1.0) for a hypothesis to appear in the final results. Hypotheses generated by the LLM are scored by the ranking step; any below this threshold are silently filtered. Lower this value to surface speculative hypotheses during investigations; raise it to reduce noise in automated pipelines.
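The interaction between these three settings can be sketched in a few lines of Python. This is a minimal illustration, not the agent's actual code: the function names analysis_window and filter_hypotheses, the dictionary shape of a hypothesis, the inclusive threshold comparison, and the sample values (30 minutes, a 0.4 threshold, a cap of 5) are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta


def analysis_window(incident_start: datetime, window_minutes: int) -> tuple[datetime, datetime]:
    """Return (start, end) covering window_minutes on either side of the incident."""
    delta = timedelta(minutes=window_minutes)
    return incident_start - delta, incident_start + delta


def filter_hypotheses(hypotheses: list[dict], min_confidence_score: float, max_hypotheses: int) -> list[dict]:
    """Drop hypotheses below the threshold, then return at most max_hypotheses, best first."""
    kept = [h for h in hypotheses if h["confidence"] >= min_confidence_score]
    kept.sort(key=lambda h: h["confidence"], reverse=True)
    return kept[:max_hypotheses]


# Illustrative values only; these are not the project's documented defaults.
start, end = analysis_window(datetime(2024, 1, 1, 12, 0), window_minutes=30)
print(end - start)  # 1:00:00, i.e. 2 x window_minutes

# Made-up candidates standing in for scored LLM output.
candidates = [
    {"title": "Connection pool exhaustion", "confidence": 0.82},
    {"title": "Noisy neighbour on node", "confidence": 0.35},
    {"title": "Bad deploy", "confidence": 0.61},
]
final = filter_hypotheses(candidates, min_confidence_score=0.4, max_hypotheses=5)
print([h["title"] for h in final])  # ['Connection pool exhaustion', 'Bad deploy']
```

The cap was 5, but only two hypotheses survive the threshold, which is why the final list can be shorter than the configured maximum.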
List of data source identifiers to query during analysis. Valid values are: prometheus, datadog, elasticsearch, loki, jaeger, otel. Only sources with corresponding credentials configured in .env will succeed at query time; listing a source here without credentials will log a warning and skip that source rather than failing the entire run.
ranking Section
After the LLM produces candidate hypotheses, the ranking step assigns a composite confidence score to each one using weighted signal contributions. Adjusting these weights lets you express which signal types are most reliable in your environment.
Weight assigned to the log anomaly signal — error rate spikes, novel error message clusters, and sudden increases in log volume. In most environments, logs are the highest-fidelity source for diagnosing application-layer failures, which is why this weight is the largest by default.
Weight assigned to metric anomalies detected in Prometheus or Datadog — CPU saturation, memory pressure, latency percentile increases, and throughput drops. Metric signals are highly reliable for infrastructure-layer causes.
Weight assigned to trace-level evidence — error spans, high-latency service calls, and dependency failures visible in Jaeger or OTEL traces. Traces provide precise call-graph context but can be incomplete if sampling rates are low, hence the lower default weight.
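As a rough illustration of how weighted signal contributions could combine into one composite score, and of the startup check on the weights' sum described in this section, consider the sketch below. The weight values (0.5, 0.3, 0.2), the key names logs, metrics, and traces, and the assumption that each signal is a strength between 0.0 and 1.0 combined by a plain weighted sum are all hypothetical; the agent's real scoring formula may differ.

```python
# Hypothetical weights; the key names and values are assumptions for this sketch.
weights = {"logs": 0.5, "metrics": 0.3, "traces": 0.2}


def validate_weights(weights: dict[str, float], tolerance: float = 0.001) -> None:
    """Raise ValueError if the weights do not sum to 1.0 within the tolerance."""
    total = sum(weights.values())
    if abs(total - 1.0) > tolerance:
        raise ValueError(f"ranking weights sum to {total:.3f}, expected 1.0")


def composite_score(signals: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-signal strengths, each assumed to lie in [0.0, 1.0]."""
    return sum(weights[name] * signals.get(name, 0.0) for name in weights)


validate_weights(weights)

# A hypothesis strongly supported by logs, weakly by metrics, with no trace evidence.
score = composite_score({"logs": 0.9, "metrics": 0.4, "traces": 0.0}, weights)
print(round(score, 2))  # 0.57
```

Under this reading, shifting weight toward one signal type directly raises the score of hypotheses backed mainly by that signal.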
The three weights must sum to 1.0. The agent validates this constraint at startup and refuses to start if the sum differs from 1.0 by more than a floating-point tolerance of 0.001. If you add custom signal types via the plugin system, adjust all three weights accordingly.
llm Section
The llm block configures the language model integration at the application level. Values here can all be overridden by corresponding environment variables (LLM_MODEL, LLM_TEMPERATURE).
The model identifier used for hypothesis generation. Accepts any model name valid for your configured provider. Overridden by the LLM_MODEL environment variable.
Sampling temperature for the model during hypothesis generation. See the Environment Variables reference for detailed guidance on choosing a temperature value. Overridden by the LLM_TEMPERATURE environment variable.
Path to the system prompt template file, relative to the project root. This file defines the model’s role, reasoning approach, and output format constraints. See Prompt Templates below.
Path to the user prompt template file, relative to the project root. This file contains the Jinja2 template that formats the collected signals into the user message sent to the model. See Prompt Templates below.
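The override order described in this section (environment variable first, then agent.yaml) could look like the following sketch. The resolve_llm_settings helper, the model and temperature key names, and the sample values are assumptions; only the LLM_MODEL and LLM_TEMPERATURE variable names come from this page.

```python
import os


def resolve_llm_settings(yaml_llm: dict, env=None) -> dict:
    """Return LLM settings, letting environment variables override agent.yaml values."""
    env = os.environ if env is None else env
    settings = dict(yaml_llm)  # copy so the parsed YAML is left untouched
    if "LLM_MODEL" in env:
        settings["model"] = env["LLM_MODEL"]
    if "LLM_TEMPERATURE" in env:
        # Environment values are always strings, so convert explicitly.
        settings["temperature"] = float(env["LLM_TEMPERATURE"])
    return settings


# Stand-in for the parsed llm block; the key names and values are placeholders.
yaml_llm = {"model": "example-model", "temperature": 0.2}

resolved = resolve_llm_settings(yaml_llm, env={"LLM_TEMPERATURE": "0.0"})
print(resolved)  # {'model': 'example-model', 'temperature': 0.0}
```

Passing env explicitly keeps the example deterministic; calling the helper without it reads the real process environment.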
ui Section
The ui block controls the Streamlit front-end. Streamlit is the interactive layer through which engineers submit incidents, monitor analysis progress, and explore hypothesis evidence.
The browser tab title and Streamlit page title displayed in the header. Customize this if you’re deploying the agent with a team- or product-specific name (e.g. "Platform RCA - Payments Team").
When true, the UI renders an expandable Raw Signals panel on the analysis results page, showing the exact log excerpts, metric time-series JSON, and trace span lists that were passed to the LLM. Disable this for a cleaner read-only view in shared dashboards where raw data may be noisy or sensitive.
The maximum number of supporting evidence snippets (log lines, metric data points, or trace spans) shown per hypothesis in the UI. Increasing this provides more context for investigation but can make the results page harder to scan. The full evidence set is always available via the API regardless of this display limit.
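The evidence display limit can be thought of as a truncation applied only at render time. In the sketch below, the function name evidence_for_display, the limit of 3, and the sample snippets are assumptions made for illustration, not the agent's actual code or default.

```python
def evidence_for_display(evidence: list[str], max_items: int) -> tuple[list[str], int]:
    """Return the snippets to show and the count of snippets hidden by the limit."""
    shown = evidence[:max_items]
    return shown, len(evidence) - len(shown)


# Made-up evidence snippets for one hypothesis.
all_evidence = [
    "ERROR db pool timeout",
    "p99 latency change point at 12:03",
    "span checkout to payments failed",
    "ERROR retry budget exhausted",
]
shown, hidden = evidence_for_display(all_evidence, max_items=3)
print(len(shown), hidden)  # 3 1
# all_evidence is unchanged, mirroring how the full set stays available via the API.
```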
Prompt Templates
The agent uses two Jinja2 template files to construct LLM prompts. These files live in the prompts/ directory at the project root and are referenced by path in agent.yaml.
prompts/system.txt
The system prompt establishes the model’s identity, reasoning framework, and output contract. It instructs the model to act as a senior site reliability engineer, reason step-by-step from evidence to cause, and return structured JSON with hypothesis fields: title, description, confidence, and evidence_references.
Customize the system prompt to:
- Encode team-specific runbook conventions or terminology
- Add output format constraints for downstream automation (e.g. requiring a remediation_steps field)
- Tune the reasoning style (e.g. emphasizing five-whys methodology)
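Because the system prompt asks for structured JSON, downstream code usually needs to check the reply before relying on it. The sketch below parses a reply and verifies the four hypothesis fields named in this section. The parse_hypotheses helper, the assumption that the reply is a JSON array of objects, and the sample reply are hypothetical.

```python
import json

# Field names taken from the output contract described on this page.
REQUIRED_FIELDS = {"title", "description", "confidence", "evidence_references"}


def parse_hypotheses(raw_reply: str) -> list[dict]:
    """Parse the model reply and check each hypothesis has the expected fields."""
    hypotheses = json.loads(raw_reply)
    if not isinstance(hypotheses, list):
        raise ValueError("expected a JSON array of hypotheses")
    for index, hypothesis in enumerate(hypotheses):
        if not isinstance(hypothesis, dict):
            raise ValueError(f"hypothesis {index} is not a JSON object")
        missing = REQUIRED_FIELDS - set(hypothesis)
        if missing:
            raise ValueError(f"hypothesis {index} is missing fields: {sorted(missing)}")
    return hypotheses


# Made-up reply used only to exercise the parser.
sample_reply = json.dumps([
    {
        "title": "Database connection pool exhaustion",
        "description": "Checkout requests queued waiting for connections.",
        "confidence": 0.8,
        "evidence_references": ["log-cluster-3", "metric-db-connections"],
    }
])
print(parse_hypotheses(sample_reply)[0]["title"])  # Database connection pool exhaustion
```

If you extend the output contract, for instance with a remediation_steps field, adding the new name to REQUIRED_FIELDS keeps the check in step with the prompt.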
prompts/user.txt
The user prompt template is rendered at runtime for each analysis run. It receives a Jinja2 context object with the following variables:
| Variable | Type | Description |
|---|---|---|
| incident | object | Incident metadata: id, title, start_time, severity |
| log_anomalies | list | Top log error clusters and volume spikes |
| metric_anomalies | list | Metric time-series with detected change points |
| trace_errors | list | Error spans and slow call chains from Jaeger/OTEL |
| analysis_window | object | start and end ISO timestamps for the query window |
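To make the shape of this context concrete, the sketch below assembles one in Python using plain dictionaries. The build_prompt_context helper, the contents of each list item, and the sample incident are assumptions for illustration; only the five top-level variable names and the incident and analysis_window fields come from the table above. In practice the resulting dictionary would be handed to the Jinja2 template when it is rendered.

```python
from datetime import datetime


def build_prompt_context(incident, log_anomalies, metric_anomalies, trace_errors,
                         window_start: datetime, window_end: datetime) -> dict:
    """Assemble the five variables the user prompt template receives."""
    return {
        "incident": incident,
        "log_anomalies": log_anomalies,
        "metric_anomalies": metric_anomalies,
        "trace_errors": trace_errors,
        # The table describes start and end as ISO timestamps.
        "analysis_window": {
            "start": window_start.isoformat(),
            "end": window_end.isoformat(),
        },
    }


context = build_prompt_context(
    incident={
        "id": "INC-1",
        "title": "Checkout latency spike",
        "start_time": "2024-01-01T12:00:00",
        "severity": "high",
    },
    log_anomalies=[{"message": "db pool timeout", "count": 120}],  # item shape is assumed
    metric_anomalies=[],
    trace_errors=[],
    window_start=datetime(2024, 1, 1, 11, 30),
    window_end=datetime(2024, 1, 1, 12, 30),
)
print(context["analysis_window"])
# {'start': '2024-01-01T11:30:00', 'end': '2024-01-01T12:30:00'}
```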
Example user prompt template
prompts/user.txt