Troubleshooting Common RCA Agent Issues and Errors

When something goes wrong, the most effective first step is to check three sources in order: the service logs (Celery worker stdout, Streamlit stderr), the connector health endpoints (which surface data source reachability independently of a full analysis run), and the Celery task state stored in Redis. Most failures fall into one of three categories — startup problems, analysis-time failures, or connectivity issues — and the sections below address each with targeted diagnostic commands and configuration fixes.

Startup Issues
Analysis Issues
Connectivity Issues

Celery worker fails to start

The most common cause is a missing or malformed REDIS_URL environment variable. The worker process exits immediately if it cannot connect to the Redis broker during startup.

Diagnosis

First, confirm that Redis is reachable from the worker container or host:

redis-cli -u "$REDIS_URL" ping
# Expected output: PONG

Then verify the worker can register itself with the broker:

celery -A app.worker inspect ping
# Expected output: {"celery@<hostname>": {"ok": "pong"}}

If inspect ping times out with no response, the worker process has not started or cannot reach the broker.

Common causes and fixes

REDIS_URL is not set — add it to your .env file: REDIS_URL=redis://localhost:6379/0
Redis is not running — start it with docker compose up redis or redis-server
Firewall or network policy blocking port 6379 between the worker and Redis containers
CELERY_BROKER_URL is set to a different value than REDIS_URL — ensure both point to the same Redis instance

Never omit the database index suffix (e.g., /0) from REDIS_URL. Some Redis client libraries treat a bare redis://host:port URL differently from redis://host:port/0, which can cause the broker and result backend to use different databases and break task result retrieval.
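
The two configuration checks above (an explicit database index, and a broker URL that matches REDIS_URL) can be sketched as a small pre-flight script. This is a minimal illustration, not part of the agent: the function names are hypothetical, and it only inspects the URL strings without contacting Redis.

```python
from urllib.parse import urlparse


def redis_db_index(url):
    """Return the explicit database index from a redis:// URL, or None if it is omitted."""
    path = urlparse(url).path.strip("/")
    return int(path) if path.isdigit() else None


def check_redis_settings(env):
    """Return a list of human-readable problems found in the Redis-related settings."""
    redis_url = env.get("REDIS_URL")
    if not redis_url:
        return ["REDIS_URL is not set"]
    problems = []
    if redis_db_index(redis_url) is None:
        problems.append("REDIS_URL has no explicit database index (for example /0)")
    broker_url = env.get("CELERY_BROKER_URL")
    if broker_url and broker_url != redis_url:
        problems.append("CELERY_BROKER_URL differs from REDIS_URL")
    return problems
```

Calling check_redis_settings(os.environ) in the worker's environment returns an empty list when neither problem is present.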

Streamlit UI shows a blank page

A blank page at http://localhost:8501 usually means Streamlit started but encountered an import or runtime error before rendering, or the browser is connecting before the server is ready.

Diagnosis

Check the terminal output where streamlit run was launched for Python tracebacks. Streamlit prints errors to stderr before the server begins accepting connections:

streamlit run app/ui.py --server.port 8501 2>&1 | head -50

Verify that port 8501 is not already bound by another process:

lsof -i :8501
# Should return no output if the port is free
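
On hosts without lsof (for example Windows), a rough equivalent of the port check can be written in Python. This is a sketch with a hypothetical helper name; unlike lsof it only reports whether something is listening, not which process owns the port.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0
```

port_in_use(8501) returning True before Streamlit is launched suggests another process already holds the port.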

Fixes

If a traceback is present, resolve the underlying Python error (see the ImportError or ModuleNotFoundError section below).
If the port is in use, stop the conflicting process or change the port: streamlit run app/ui.py --server.port 8502
Clear Streamlit’s module and data cache if a stale cache is causing unexpected state:

streamlit cache clear

On some Linux hosts, Streamlit’s file watcher uses inotify. If you see “inotify watch limit reached”, increase the system limit:

echo fs.inotify.max_user_watches=524288 | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

ImportError or ModuleNotFoundError on startup

This error means Python cannot locate one or more packages that the agent depends on. It most often occurs when the virtual environment is not activated or dependencies were never installed.

Diagnosis

# Confirm the active Python environment
which python
pip list | grep -E "celery|streamlit|openai|redis"

If the grep returns no output, the packages are missing from the current environment.

Fixes

Activate the virtual environment and install dependencies:

source .venv/bin/activate          # Linux / macOS
# .venv\Scripts\activate           # Windows

pip install -r requirements.txt

If you are using Docker and see this error inside the container, the image may have been built before requirements.txt was last updated. Rebuild without cache:

docker compose build --no-cache worker app

Always pin your dependency versions in requirements.txt (e.g., celery==5.3.6) to avoid silent breakage when upstream packages release incompatible updates.
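
The package check from the diagnosis step can also be run from inside Python, which confirms that the interpreter actually launching the agent can see the modules. This is a minimal sketch: the module list mirrors the four names in the pip check above and is an assumption about what requirements.txt contains.

```python
from importlib.util import find_spec

# Import names taken from the pip check above; extend with the rest of requirements.txt.
REQUIRED_MODULES = ["celery", "streamlit", "openai", "redis"]


def missing_modules(names):
    """Return the top-level module names that cannot be imported in this interpreter."""
    return [name for name in names if find_spec(name) is None]
```

Printing sys.executable next to missing_modules(REQUIRED_MODULES) shows both which interpreter is active and what it lacks.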

Analysis returns no hypotheses

When the analysis completes but the results page shows no hypotheses, it almost always means no signals were successfully fetched: the LLM received an empty or near-empty signal payload and could not identify any patterns.

Diagnosis

Run a manual health check on all registered connectors to see which ones are reachable:

from app.connectors import CONNECTOR_REGISTRY

for name, cls in CONNECTOR_REGISTRY.items():
    connector = cls()
    status = "OK" if connector.health_check() else "FAIL"
    print(f"{name}: {status}")

Also check whether any data actually exists in the configured sources for the requested time window. A common mistake is querying a time range in the distant past for which log retention has already expired.

Fixes

Fix any connectors reporting FAIL (see the Connector health check fails section below).
Widen the analysis window by increasing ANALYSIS_WINDOW_MINUTES in your .env file.
Verify data exists in the source for the exact time range you specified — query Prometheus or Elasticsearch directly using their native UIs to confirm.
Enable at least two data source types (metric + log) to give the LLM enough corroborating evidence to generate hypotheses.
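
The retention pitfall mentioned in the diagnosis can be checked before an analysis is submitted. The sketch below assumes you know each source's retention period (it is not something the agent exposes in this document), and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def window_within_retention(start, end, retention, now=None):
    """Return True if the window [start, end] lies entirely inside the retention period."""
    now = now or datetime.now(timezone.utc)
    if start >= end:
        raise ValueError("start must be earlier than end")
    oldest_available = now - retention
    return start >= oldest_available and end <= now
```

For example, with a 15-day retention period, a window that starts 30 days ago returns False, so an empty result for that window says nothing about connector health.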

All confidence scores are very low

Confidence scores below 0.3 across all hypotheses indicate that the LLM found some signal anomalies but could not correlate them into a coherent root-cause narrative. This is usually a data density problem, not an LLM problem.

Diagnosis

Inspect the raw signal payload logged at DEBUG level (see Enabling Debug Logging below). Look for:

Only one signal type present (e.g., only metric, no log or trace)
Very few total signals (fewer than ~10 data points)
Signals spread across a very long time window with no obvious clustering
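
The first two patterns in the list above can be spotted mechanically once the payload is extracted from the debug log. The sketch below assumes the payload is a list of dictionaries with a "type" field such as "metric", "log", or "trace"; the real payload shape is not documented here, so adjust the field name to match what your logs show.

```python
from collections import Counter


def summarize_signals(signals, min_total=10):
    """Summarize a list of signal dicts and flag the sparse-data patterns listed above."""
    by_type = Counter(signal.get("type", "unknown") for signal in signals)
    warnings = []
    if len(by_type) < 2:
        warnings.append("fewer than two signal types present")
    if len(signals) < min_total:
        warnings.append(f"fewer than {min_total} signals in total")
    return {"total": len(signals), "by_type": dict(by_type), "warnings": warnings}
```

An empty warnings list does not guarantee high confidence scores; it only rules out the two most common data density problems.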

Fixes

Enable additional data source connectors to increase signal diversity. Cross-source correlation is the primary driver of high confidence scores.
Lower MIN_CONFIDENCE_SCORE in your .env file to surface lower-certainty hypotheses that might still be actionable.
Verify that LLM_TEMPERATURE is set to a low value (recommended: 0.2–0.4). A high temperature causes the model to generate more speculative, lower-confidence output.

# Recommended LLM settings for deterministic, high-confidence output
LLM_TEMPERATURE=0.2
LLM_MAX_TOKENS=4096

Analysis task times out

Task timeouts occur when one or more fetch_signals sub-tasks or the call_llm task exceed the configured time limit. Celery marks the task as FAILURE with a TimeLimitExceeded exception.

Diagnosis

Check the Celery worker logs for the specific task that timed out:

docker compose logs worker | grep -E "TimeLimitExceeded|FAILURE|soft time limit"

The log line will include the task name and the connector or pipeline stage responsible.

Fixes

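Increase CELERY_TASK_TIMEOUT (hard limit, in seconds) and CELERY_TASK_SOFT_TIMEOUT (soft limit, in seconds) in .env. In Celery, exceeding the soft limit raises SoftTimeLimitExceeded inside the task so it can clean up or retry before the hard limit terminates it, so keep the soft value below the hard one.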
Reduce the number of enabled connectors for the analysis, or shorten the time window to reduce data volume.
Investigate data source API latency independently — a slow Elasticsearch cluster or an overloaded Prometheus server will time out the corresponding connector task even with generous limits.

# Measure Prometheus query latency directly
time curl -s "$PROMETHEUS_URL/api/v1/query?query=up" | jq '.status'

Setting CELERY_TASK_TIMEOUT above 300 seconds is not recommended in production. If analyses routinely take longer than five minutes, the correct fix is to reduce the analysis scope or scale the worker pool, not to raise the timeout further.
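
The guidance above can be turned into a quick sanity check on the two timeout variables. This is a sketch under two assumptions: that both values are whole seconds, and that the agent maps them onto Celery's hard and soft time limits, in which case the soft limit is only useful when it is lower than the hard one. The function name is hypothetical.

```python
def check_timeouts(env, recommended_max=300):
    """Return warnings about the task timeout settings in a mapping of environment variables."""
    try:
        hard = int(env["CELERY_TASK_TIMEOUT"])
        soft = int(env["CELERY_TASK_SOFT_TIMEOUT"])
    except (KeyError, ValueError):
        return ["CELERY_TASK_TIMEOUT and CELERY_TASK_SOFT_TIMEOUT must both be set to whole seconds"]
    warnings = []
    if soft >= hard:
        warnings.append("soft limit should be lower than the hard limit so cleanup can run")
    if hard > recommended_max:
        warnings.append(f"hard limit exceeds the recommended {recommended_max} seconds")
    return warnings
```

Running check_timeouts(os.environ) after editing .env catches inverted or oversized limits before the worker is restarted.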

Redis connection refused

A Connection refused error on redis://host:6379 means Redis is not running, or the host/port in REDIS_URL does not match where Redis is actually listening.

Diagnosis

Verify the Redis container is running:

docker ps --filter name=redis
# Should show a redis container with status "Up"

Test the connection directly:

redis-cli ping
# If REDIS_URL is non-default:
redis-cli -u "$REDIS_URL" ping

Fixes

Start Redis: docker compose up -d redis
Confirm REDIS_URL uses the correct format: redis://host:port/db — for example, redis://localhost:6379/0
If running in Docker Compose, ensure the service name matches: redis://redis:6379/0 (not localhost), since containers resolve each other by service name on the shared network.

The db component of the URL (the trailing /0) selects the Redis logical database. The broker and result backend must use the same database index, or task results will be invisible to the worker that dispatched them.
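
When redis-cli is not installed in the worker container, a plain TCP probe from Python can still distinguish the failure modes described above, such as a refused connection versus a service name that does not resolve. This is a minimal sketch with a hypothetical function name; it tests only TCP reachability and does not speak the Redis protocol, so redis-cli ping remains the authoritative check.

```python
import socket
from urllib.parse import urlparse


def probe_redis(url, timeout=2.0):
    """Try a plain TCP connection to the host and port in a redis:// URL and classify the result."""
    parsed = urlparse(url)
    host = parsed.hostname or "localhost"
    port = parsed.port or 6379
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "reachable"
    except socket.gaierror:
        return "hostname does not resolve"
    except ConnectionRefusedError:
        return "connection refused"
    except socket.timeout:
        return "timed out"
    except OSError as exc:
        return f"other error: {exc}"
```

Inside Docker Compose, "hostname does not resolve" usually points at a wrong service name, while "connection refused" with localhost as the host points at the container-versus-host mix-up described above.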

LLM API returns 401 Unauthorized

A 401 Unauthorized response from the LLM API means the OPENAI_API_KEY (or equivalent) is missing, malformed, or has been revoked.

Diagnosis

Check that the key is present in the running process environment:

echo $OPENAI_API_KEY
# Should print your key starting with "sk-..."

Test the key directly against the API:

curl https://api.openai.com/v1/models \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -s | jq '.error // "OK"'

Fixes

Add or update OPENAI_API_KEY in your .env file and restart all services.
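Verify in the OpenAI dashboard that the key is still active. An exhausted usage quota normally produces a 429 response rather than 401, so check the quota separately if the status code differs.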
If using a compatible API (e.g., Azure OpenAI, Ollama), confirm that LLM_BASE_URL is set to the correct endpoint and that LLM_API_KEY is used in place of OPENAI_API_KEY where applicable.

Never commit your API key to version control. Use .env files (listed in .gitignore) or a secrets manager. If a key is accidentally exposed, rotate it immediately in the provider dashboard.
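
To confirm the key is present without echoing the full secret into a terminal or shared log, a small helper can report only its length and prefix. This is an illustrative sketch with a hypothetical function name; the whitespace and quote checks target common .env copy-paste mistakes and are an assumption about what "malformed" typically means in practice.

```python
def describe_api_key(env, name="OPENAI_API_KEY"):
    """Describe the key stored under `name` without revealing more than its first three characters."""
    value = env.get(name)
    if value is None:
        return f"{name} is not set"
    if not value:
        return f"{name} is set but empty"
    if value != value.strip() or value[0] in "\"'" or value[-1] in "\"'":
        return f"{name} has surrounding whitespace or quotes"
    return f"{name} is set ({len(value)} characters, starts with {value[:3]!r})"
```

Call describe_api_key(os.environ) inside the worker process or container, since that is the environment the LLM client actually reads.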

Connector health check fails

If connector.health_check() returns False, the connector cannot reach its upstream data source. This will cause the corresponding fetch_signals sub-task to be skipped, reducing the signal set available to the LLM.

Diagnosis

Run health checks for all registered connectors programmatically:

from app.connectors import CONNECTOR_REGISTRY

for name, cls in CONNECTOR_REGISTRY.items():
    connector = cls()
    status = "OK" if connector.health_check() else "FAIL"
    print(f"{name}: {status}")

For each connector reporting FAIL, test the upstream URL directly:

# Prometheus
curl -s "$PROMETHEUS_URL/-/healthy"

# Elasticsearch
curl -s -u "$ELASTICSEARCH_USERNAME:$ELASTICSEARCH_PASSWORD" \
  "$ELASTICSEARCH_URL/_cluster/health" | jq '.status'

# Loki
curl -s "$LOKI_URL/ready"

# Jaeger
curl -s "$JAEGER_QUERY_URL/api/services" | jq 'length'

Fixes

Ensure the relevant service is running and the URL in the corresponding environment variable is correct.
Check network reachability from the worker container to the data source (DNS resolution, firewall rules, VPN).
For TLS-protected endpoints, verify the certificate chain is trusted by the worker’s Python environment. Set REQUESTS_CA_BUNDLE to your custom CA bundle path if needed.
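
If a connector raises an exception instead of returning False, the diagnostic loop shown earlier stops at the first failure and hides the status of the remaining connectors. The variant below is a sketch that catches exceptions per connector. It relies only on the registry shape and health_check() method shown in this section; the two connector classes are hypothetical stand-ins so the example runs on its own.

```python
def run_health_checks(registry):
    """Run health_check() on every connector class and return a name -> status mapping."""
    results = {}
    for name, cls in registry.items():
        try:
            results[name] = "OK" if cls().health_check() else "FAIL"
        except Exception as exc:  # a connector that raises should not hide the others
            results[name] = f"ERROR ({type(exc).__name__}: {exc})"
    return results


# Hypothetical stand-ins so the sketch runs on its own; in the agent,
# pass CONNECTOR_REGISTRY from app.connectors instead.
class HealthyConnector:
    def health_check(self):
        return True


class UnreachableConnector:
    def health_check(self):
        raise ConnectionError("upstream unreachable")


demo_results = run_health_checks(
    {"healthy": HealthyConnector, "unreachable": UnreachableConnector}
)
```

The exception type in an ERROR entry often narrows the cause faster than a bare FAIL, for example a TLS error versus a DNS failure.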

Enabling Debug Logging

Detailed debug output is the fastest way to understand exactly what the agent is doing at each pipeline stage. Enable it for the Celery worker and the Streamlit UI independently.

Celery worker with debug logging:

LOG_LEVEL=DEBUG celery -A app.worker worker --loglevel=debug

Streamlit UI with debug logging:

LOG_LEVEL=DEBUG streamlit run app/ui.py

At DEBUG level, the worker logs the full signal payload before it is sent to the LLM, the rendered prompt template, and the raw LLM JSON response. This makes it straightforward to verify that the correct signals are being fetched and that the LLM prompt is populated as expected.

Pipe debug output to a file for easier analysis: LOG_LEVEL=DEBUG celery -A app.worker worker --loglevel=debug 2>&1 | tee worker-debug.log. You can then grep for specific task IDs or connector names without scrolling through the live terminal.

If none of the steps above resolve your issue, open a GitHub issue in the repository and attach the debug log output (redact any API keys or secrets). Include the analysis ID, the list of enabled connectors, and the environment variable names (not values) that are set in your .env file.
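
Before attaching a debug log, the redaction step can be partly automated. The sketch below masks two patterns: NAME=value assignments whose name contains KEY, SECRET, PASSWORD, or TOKEN, and bare tokens beginning with sk-. Both patterns are assumptions about how secrets appear in your logs, so review the output manually before sharing it.

```python
import re

# Matches assignments such as OPENAI_API_KEY=... or ELASTICSEARCH_PASSWORD=...
ASSIGNMENT = re.compile(
    r"\b([A-Za-z0-9_]*(?:KEY|SECRET|PASSWORD|TOKEN)[A-Za-z0-9_]*)=(\S+)",
    re.IGNORECASE,
)
# Matches bare API keys that start with "sk-"
BARE_KEY = re.compile(r"\bsk-[A-Za-z0-9_-]{8,}")


def redact(text):
    """Replace likely secrets in a log excerpt with a fixed placeholder."""
    text = ASSIGNMENT.sub(r"\1=***REDACTED***", text)
    return BARE_KEY.sub("***REDACTED***", text)
```

Variable names are preserved and only values are masked, which matches the request above to include the names but not the values.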
