Troubleshooting

This guide covers the most common issues encountered when running Optio day-to-day. For each problem you’ll find symptoms, how to confirm the cause, and the steps to fix it.

Start here: open the Cluster view (/cluster) in the UI. It shows the live state of every repository pod, including pods that are stuck in provisioning, errored, or being terminated. This is the fastest way to get an overview before digging deeper.

Many pod-level issues are recorded in the pod_health_events table. Each row captures the event type (crashed, oom_killed, restarted, healthy, orphan_cleaned), the pod name, the affected repository, and a message. If you have direct database access, this table is often the quickest way to understand what happened to a pod.

Task issues

Tasks stuck in queued state

Symptoms: Tasks stay in queued for an unusually long time and never advance to provisioning or running.

Diagnosis:

Check the global concurrency limit: the default is 5 concurrent tasks (OPTIO_MAX_CONCURRENT). If 5 tasks are already running or provisioning, new tasks will wait.
Check the per-repository concurrency limit in the repository settings (maxConcurrentTasks, default 2). If your repo already has 2 running tasks, additional tasks for that repo will queue.
Look for tasks that are stuck in provisioning or running indefinitely — these may need to be manually cancelled to free up slots.
Check the task worker logs for repeated re-queue messages, which indicate the concurrency ceiling is being hit on every poll cycle.
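
The two limits above can be checked together. The following is a minimal sketch of that gating logic, not Optio's actual scheduler: the function name, the (repo, state) tuple shape, and the choice to count provisioning tasks toward the per-repository limit are all assumptions made for illustration.

```python
from collections import Counter

GLOBAL_LIMIT = 5      # OPTIO_MAX_CONCURRENT default stated in this guide
PER_REPO_LIMIT = 2    # maxConcurrentTasks default stated in this guide
ACTIVE_STATES = {"provisioning", "running"}


def blocking_reason(repo, tasks, global_limit=GLOBAL_LIMIT, per_repo_limit=PER_REPO_LIMIT):
    """Return why a queued task for `repo` cannot start, or None if a slot is free.

    `tasks` is an iterable of (repo, state) pairs; this shape is hypothetical.
    """
    # Only tasks that currently hold a slot count against the limits.
    active = [r for r, state in tasks if state in ACTIVE_STATES]
    if len(active) >= global_limit:
        return "global limit reached"
    # Assumption: provisioning tasks also count toward the per-repository limit.
    if Counter(active)[repo] >= per_repo_limit:
        return "per-repository limit reached"
    return None
```

If the function reports a limit while the listed active tasks are not making progress, those tasks are the ones to cancel.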

Fix:

Cancel any stuck provisioning or running tasks via the task detail page (Cancel task action).
If the queue is legitimately full, increase OPTIO_MAX_CONCURRENT in your Helm values or raise maxConcurrentTasks for the affected repository.
Use Bulk actions → Cancel all active from the tasks list if you need to clear the queue entirely and start fresh.

Tasks stuck in provisioning state

Symptoms: Tasks move from queued to provisioning but stay there: the agent never starts and no logs appear.

Diagnosis:

Go to the Cluster view (/cluster) and check the pod state for the affected repository. A pod stuck in provisioning or error state will block all tasks for that repo.
Run kubectl get pods -n optio and look for pods in Pending, ImagePullBackOff, CrashLoopBackOff, or Error state.
Run kubectl describe pod <pod-name> -n optio and read the Events section for the root cause.
Check PVC availability: kubectl get pvc -n optio. A pod can stay pending indefinitely if its persistent volume claim cannot be bound.
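
When there are many pods, the same check can be scripted. This sketch assumes the output of kubectl get pods -n optio -o json is passed in as a string; the helper name is hypothetical, and the field paths follow the standard Kubernetes Pod status layout.

```python
import json

# Waiting reasons that usually mean a pod will not recover on its own.
PROBLEM_REASONS = {"ImagePullBackOff", "ErrImagePull", "CrashLoopBackOff"}


def unhealthy_pods(kubectl_json):
    """Return (pod name, reason) pairs for pods that look stuck."""
    problems = []
    for pod in json.loads(kubectl_json).get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        # Reasons such as ImagePullBackOff are reported per container, not per pod.
        waiting = [
            cs["state"]["waiting"].get("reason", "")
            for cs in status.get("containerStatuses", [])
            if "waiting" in cs.get("state", {})
        ]
        reason = next((r for r in waiting if r in PROBLEM_REASONS), None)
        # Fall back to the pod phase when no container reports a known reason.
        if reason is None and status.get("phase") in ("Pending", "Failed"):
            reason = status["phase"]
        if reason:
            problems.append((name, reason))
    return problems
```

Any pod this returns is a candidate for kubectl describe pod, which shows the underlying events.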

Fix:

If the pod is in ImagePullBackOff, see the Pod won’t start / image pull errors section below.
If the PVC is stuck, check that your cluster has a default storage class and sufficient capacity.
Once the underlying pod issue is resolved, the task worker will retry automatically on the next poll cycle. You can also manually retry the task from its detail page.

Agent and authentication issues

Agent fails immediately with an auth error

Symptoms: A task transitions quickly from running to failed. The error message mentions an API key, authentication, or an invalid token.

Diagnosis:

Check that the CLAUDE_AUTH_MODE secret is set to either api-key or oauth-token. If it is missing, the agent cannot determine how to authenticate.
For API key mode: confirm the ANTHROPIC_API_KEY secret exists in the Secrets page.
For OAuth token mode: confirm the CLAUDE_CODE_OAUTH_TOKEN secret exists. The token may have expired, and OAuth tokens from claude setup-token can have limited scopes.
Call GET /api/auth/status to check the validity of the stored credentials.
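
The secret checks above reduce to a small decision table. The sketch below models it with a plain dictionary of secret names to values; that dictionary and the function name are assumptions for illustration, and the check only confirms that the right secrets are present, not that the credentials are still valid.

```python
# Which secret each auth mode needs, per the checks listed above.
REQUIRED_BY_MODE = {
    "api-key": "ANTHROPIC_API_KEY",
    "oauth-token": "CLAUDE_CODE_OAUTH_TOKEN",
}


def auth_problems(secrets):
    """Return a list of problems; an empty list means the secret names check out.

    `secrets` is a hypothetical dict of secret name -> value.
    """
    mode = secrets.get("CLAUDE_AUTH_MODE")
    if mode is None:
        return ["CLAUDE_AUTH_MODE is not set"]
    if mode not in REQUIRED_BY_MODE:
        return [f"CLAUDE_AUTH_MODE has unsupported value {mode!r}"]
    needed = REQUIRED_BY_MODE[mode]
    # An empty value is treated the same as a missing secret.
    if not secrets.get(needed):
        return [f"{needed} is required for {mode} mode but is missing"]
    return []
```

Note that having ANTHROPIC_API_KEY set does not help when the mode is oauth-token, which is a plausible cause of an immediate failure after switching modes.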

Fix:

Add or update the missing secret via the Secrets page (/secrets).
For OAuth token mode, re-extract the token from the macOS Keychain using the one-liner in the setup wizard and paste it as the new CLAUDE_CODE_OAUTH_TOKEN secret value.
After updating secrets, retry the task from its detail page.

Pod and infrastructure issues

Pod won’t start / image pull errors

Symptoms: A pod stays in Pending or ImagePullBackOff. Tasks for the affected repository never advance past provisioning.

Diagnosis:

Run kubectl get pods -n optio and identify the pod with a non-Running status.
Run kubectl describe pod <pod-name> -n optio and read the Events section. An ImagePullBackOff event indicates the agent image cannot be fetched.
Verify the agent image exists: docker images | grep optio-agent.
Check the configured image pull policy. In local development, the policy must be Never so Kubernetes uses the locally built image instead of trying to pull from a registry.

Fix:

If using local images, ensure OPTIO_IMAGE_PULL_POLICY=Never is set (or agent.imagePullPolicy=Never in Helm values).
Rebuild the agent images: ./images/build.sh.
If deploying to a remote cluster, push the agent image to a container registry and set the pull policy to IfNotPresent or Always.
After fixing the image issue, delete the stuck pod record from the Cluster view and retry the task to trigger creation of a fresh pod.

Pod OOM-killed or crashed

Symptoms: A running task suddenly fails with an error about the pod crashing or being killed. The task may have been running for a while before failing.

Diagnosis:

Open the Cluster view (/cluster) and check the pod’s health history. OOM kills and crashes are recorded there.
Query the pod_health_events table for rows with event_type = 'oom_killed' or event_type = 'crashed' for the affected repository.
The cleanup worker (which runs every 60 seconds) automatically detects crashed pods, records the event, fails any tasks that were running on the dead pod, and removes the pod record so the next task recreates it cleanly.
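
The query described above might look like the following. This is a sketch against an in-memory SQLite database so that it runs anywhere; only the table name, the event_type column, and its values come from this guide. The other column names (pod_name, repo_url, message) and the sample rows are assumptions, and the real database engine and schema may differ.

```python
import sqlite3

# Hypothetical schema and made-up rows, used only to make the query runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pod_health_events "
    "(event_type TEXT, pod_name TEXT, repo_url TEXT, message TEXT)"
)
conn.executemany(
    "INSERT INTO pod_health_events VALUES (?, ?, ?, ?)",
    [
        ("oom_killed", "pod-a-1", "repo-a", "memory limit exceeded"),
        ("healthy", "pod-b-1", "repo-b", "ok"),
        ("crashed", "pod-a-2", "repo-a", "container exited"),
    ],
)


def failure_events(conn, repo_url):
    """Return OOM-kill and crash events recorded for one repository."""
    rows = conn.execute(
        "SELECT event_type, pod_name, message FROM pod_health_events "
        "WHERE repo_url = ? AND event_type IN ('oom_killed', 'crashed')",
        (repo_url,),
    )
    return rows.fetchall()
```

Repeated oom_killed rows for the same repository point to a memory limit that is too low rather than a one-off failure.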

Fix:

Increase the pod’s memory limit in the Helm chart, or switch to a higher-capacity image preset (e.g., upgrade from node to full).
Reduce the number of concurrent agents per pod (maxAgentsPerPod) to lower peak memory usage.
Once the resource limits are updated, retry the failed task. A new pod will be provisioned automatically.

Connectivity and logs

WebSocket connection drops / no live logs

Symptoms: The log viewer in a task’s detail page shows no output, or the connection indicator shows “disconnected.” Logs may appear after the task completes but not in real time.

Diagnosis:

Live logs are streamed over WebSocket via a Redis pub/sub channel. Check that Redis is running and reachable from the API server.
Confirm REDIS_URL is correctly set in the API configuration.
Verify that the web application’s NEXT_PUBLIC_API_URL environment variable points to the correct API host and port. A mismatch will cause WebSocket connections to fail silently.
If your deployment is behind a load balancer or ingress, check that WebSocket upgrades (Upgrade: websocket) are not being stripped or blocked.
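
To see which WebSocket address a given NEXT_PUBLIC_API_URL implies, the scheme mapping can be sketched as below. The path argument is hypothetical, since this guide does not name the log-streaming endpoint; only the http to ws and https to wss mapping is standard.

```python
from urllib.parse import urlsplit, urlunsplit

SCHEME_MAP = {"http": "ws", "https": "wss"}


def websocket_url(api_url, path):
    """Derive a WebSocket URL from an HTTP(S) API base URL.

    `path` is a placeholder; the real endpoint path is not given in this guide.
    """
    parts = urlsplit(api_url)
    if parts.scheme not in SCHEME_MAP:
        raise ValueError(f"expected an http or https URL, got {api_url!r}")
    # Keep host and port, swap the scheme, and drop any query or fragment.
    return urlunsplit((SCHEME_MAP[parts.scheme], parts.netloc, path, "", ""))
```

If the web UI is served over https but the API URL is plain http, the derived address is ws://, which browsers block from secure pages. That produces the same silent failure described above.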

Fix:

Restart Redis if it is unhealthy and verify connectivity with redis-cli ping.
Update the NEXT_PUBLIC_API_URL in the web deployment and roll out the change.
For ingress-based deployments, add WebSocket-specific annotations (e.g., nginx.ingress.kubernetes.io/proxy-read-timeout).

CI and PR watcher

CI status never updates / PR watcher seems stuck

Symptoms: A task is in pr_opened state and the CI status badge never changes from pending, even after checks complete on GitHub. Review tasks are not being triggered automatically.

Diagnosis:

The PR watcher polls GitHub every 30 seconds. If no updates are appearing after several minutes, the watcher may be stalled.
Check that the GITHUB_TOKEN secret is set and has not expired. The watcher needs a valid token to call the GitHub API.
Look at the API server logs for errors from the PR watcher worker (pr-watcher-worker.ts). Rate-limit errors from the GitHub API are a common cause.
Verify that the PR URL stored on the task is correct and that the PR is still open on GitHub.

Fix:

Rotate the GITHUB_TOKEN secret via the Secrets page if it has expired.
If the watcher is rate-limited, increase OPTIO_PR_WATCH_INTERVAL (default 30s) to reduce API call frequency.
To force an immediate PR status check, you can manually trigger a review via the task detail page (Request review action) or use POST /api/tasks/:id/review.
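
A rough estimate helps when choosing a new OPTIO_PR_WATCH_INTERVAL. The sketch below assumes the watcher makes a fixed number of GitHub API calls per open PR on each poll; the default of 2 calls per PR is a guess for illustration, not a documented figure. GitHub's REST API typically allows 5,000 requests per hour for a token-authenticated user.

```python
GITHUB_HOURLY_LIMIT = 5000  # typical REST limit for a token-authenticated user


def requests_per_hour(interval_seconds, open_prs, calls_per_pr=2):
    """Estimate hourly GitHub API calls made by the PR watcher.

    `calls_per_pr` is an assumption; the watcher's real call count may differ.
    """
    polls_per_hour = 3600 / interval_seconds
    return polls_per_hour * open_prs * calls_per_pr


def fits_budget(interval_seconds, open_prs, calls_per_pr=2):
    """True when the estimate stays within the hourly limit."""
    return requests_per_hour(interval_seconds, open_prs, calls_per_pr) <= GITHUB_HOURLY_LIMIT
```

Under these assumptions, 20 open PRs at the default 30-second interval come to 4,800 calls per hour, close to the limit, while a 60-second interval halves that to 2,400.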

Authentication and login

OAuth login fails

Symptoms: Clicking Sign in with GitHub (or Google/GitLab) redirects to the provider, but after authorizing you are sent back to an error page or the login page again. You may see an invalid_state error.

Diagnosis:

Confirm that API_PUBLIC_URL and WEB_PUBLIC_URL are set to the actual public URLs of your deployment. These are used to construct the OAuth callback URL.
Check that the OAuth callback URL is registered with the provider. For GitHub, it should be {API_PUBLIC_URL}/api/auth/github/callback. An unregistered callback URL causes the provider to reject the flow.
An invalid_state error means the CSRF state token expired. This happens when more than 10 minutes pass between clicking the login button and completing the OAuth flow (e.g., a slow redirect or a cached login page).
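
Both checks above are easy to get subtly wrong, so here is a minimal sketch of each. The GitHub callback pattern and the 10-minute window come from this guide; applying the same pattern to other providers, and the function names, are assumptions.

```python
from datetime import datetime, timedelta, timezone

STATE_TTL = timedelta(minutes=10)  # expiry window described in this guide


def callback_url(api_public_url, provider):
    """Build the callback URL to register with the OAuth provider.

    The pattern is documented for GitHub; other providers are assumed to match.
    """
    # Strip a trailing slash so the result does not contain "//api".
    return f"{api_public_url.rstrip('/')}/api/auth/{provider}/callback"


def state_expired(issued_at, now):
    """True when the CSRF state token is older than the allowed window."""
    return now - issued_at > STATE_TTL
```

The value returned by callback_url must match the URL registered with the provider exactly, including scheme and port, which is why a trailing slash in API_PUBLIC_URL is worth checking.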

Fix:

Update API_PUBLIC_URL and WEB_PUBLIC_URL in your Helm values or environment config to match the actual deployment URLs, then redeploy.
Register the correct callback URL in your OAuth application settings on the provider’s dashboard.
For invalid_state errors, ask the user to start the login flow again from a fresh browser tab.

Recovering stuck or failed tasks

Optio provides two recovery actions on any task’s detail page:

Action	What it does
Force restart	Starts a fresh agent session on the existing PR branch. Use this when the agent failed mid-run but the PR and branch are still valid.
Force redo	Clears all task state and re-runs from scratch — new branch, new worktree, new agent session. Use this when the branch or PR is corrupted, or you want a completely clean attempt.

Both actions are also available via the API:

POST /api/tasks/:id/force-restart
POST /api/tasks/:id/force-redo
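
For scripted recovery, the two endpoints can be wrapped in a small helper. This sketch only builds the request and leaves sending it to the caller; the bearer-token header is an assumption, since this guide does not describe how the API authenticates callers.

```python
import urllib.request

RECOVERY_ACTIONS = ("force-restart", "force-redo")


def recovery_request(api_base, task_id, action, token=None):
    """Build a POST request for one of the two recovery endpoints."""
    if action not in RECOVERY_ACTIONS:
        raise ValueError(f"unknown recovery action: {action!r}")
    url = f"{api_base.rstrip('/')}/api/tasks/{task_id}/{action}"
    req = urllib.request.Request(url, method="POST")
    if token:
        # Assumption: the API accepts a bearer token; adjust to your deployment.
        req.add_header("Authorization", f"Bearer {token}")
    return req
```

Pass the returned object to urllib.request.urlopen to send it. Since force-redo discards the existing branch and worktree, prefer force-restart unless a clean attempt is really needed.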
