This guide covers the most common issues encountered when running Optio day-to-day. For each problem you’ll find symptoms, how to confirm the cause, and the steps to fix it.

Start here: Open the Cluster view (`/cluster`) in the UI. It shows the live state of every repository pod, including pods that are stuck in provisioning, errored, or being terminated. This is the fastest way to get an overview before digging deeper.
Many pod-level issues are recorded in the `pod_health_events` table. Each row captures the event type (`crashed`, `oom_killed`, `restarted`, `healthy`, `orphan_cleaned`), the pod name, the affected repository, and a message. If you have direct database access, this table is often the quickest way to understand what happened to a pod.

Task issues
Tasks stuck in queued state
Symptoms: Tasks stay in `queued` for an unusually long time and never advance to `provisioning` or `running`.

Diagnosis:
- Check the global concurrency limit: the default is 5 concurrent tasks (`OPTIO_MAX_CONCURRENT`). If 5 tasks are already `running` or `provisioning`, new tasks will wait.
- Check the per-repository concurrency limit in the repository settings (`maxConcurrentTasks`, default 2). If your repo already has 2 running tasks, additional tasks for that repo will queue.
- Look for tasks that are stuck in `provisioning` or `running` indefinitely; these may need to be manually cancelled to free up slots.
- Check the task worker logs for repeated `re-queue` messages, which indicate the concurrency ceiling is being hit on every poll cycle.

Fix:
- Cancel any stuck `provisioning` or `running` tasks via the task detail page (Cancel task action).
- If the queue is legitimately full, increase `OPTIO_MAX_CONCURRENT` in your Helm values or raise `maxConcurrentTasks` for the affected repository.
- Use Bulk actions → Cancel all active from the tasks list if you need to clear the queue entirely and start fresh.
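
The two concurrency limits described in this section reduce to a simple admission rule. The following is a minimal sketch of that rule as described here, not Optio's actual scheduler code; `can_start_task` and its argument order are hypothetical, and the defaults mirror the documented values (5 global, 2 per repository).

```shell
# Hypothetical helper illustrating the documented limits; not part of Optio.
# Args: <tasks counted globally> <tasks counted for this repo> [global limit] [repo limit]
# Returns 0 if a queued task could start, 1 if it must keep waiting.
can_start_task() {
  cst_global=$1
  cst_repo=$2
  cst_max_global=${3:-5}   # documented default of OPTIO_MAX_CONCURRENT
  cst_max_repo=${4:-2}     # documented default of maxConcurrentTasks
  [ "$cst_global" -lt "$cst_max_global" ] && [ "$cst_repo" -lt "$cst_max_repo" ]
}
```

For example, `can_start_task 5 0` fails because the global ceiling is reached even though the repository has free slots, while `can_start_task 3 2` fails on the per-repository limit alone.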
Tasks stuck in provisioning state
Symptoms: Tasks move from `queued` to `provisioning` but stay there: the agent never starts and no logs appear.

Diagnosis:
- Go to the Cluster view (`/cluster`) and check the pod state for the affected repository. A pod stuck in `provisioning` or `error` state will block all tasks for that repo.
- Run `kubectl get pods -n optio` and look for pods in `Pending`, `ImagePullBackOff`, `CrashLoopBackOff`, or `Error` state.
- Run `kubectl describe pod <pod-name> -n optio` and read the Events section for the root cause.
- Check PVC availability: `kubectl get pvc -n optio`. A pod can stay pending indefinitely if its persistent volume claim cannot be bound.

Fix:
- If the pod is in `ImagePullBackOff`, see the Pod won’t start / image pull errors section below.
- If the PVC is stuck, check that your cluster has a default storage class and sufficient capacity.
- Once the underlying pod issue is resolved, the task worker will retry automatically on the next poll cycle. You can also manually retry the task from its detail page.
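
To narrow a long pod listing down to the problem states named in the diagnosis steps, the output of `kubectl get pods` can be filtered on its STATUS column. This is a small sketch, assuming the default column layout (NAME, READY, STATUS, ...); `unhealthy_pods` is a hypothetical helper name, not an Optio or kubectl command.

```shell
# Hypothetical helper: print "<pod> <status>" for every pod whose STATUS
# column is neither Running nor Completed.
# Reads `kubectl get pods --no-headers` output on stdin.
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage (requires cluster access):
#   kubectl get pods -n optio --no-headers | unhealthy_pods
```

Each pod it prints is a candidate for `kubectl describe pod <pod-name> -n optio`.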
Agent and authentication issues
Agent fails immediately with an auth error
Symptoms: A task transitions quickly from `running` to `failed`. The error message mentions an API key, authentication, or an invalid token.

Diagnosis:
- Check that the `CLAUDE_AUTH_MODE` secret is set to either `api-key` or `oauth-token`. If it is missing, the agent cannot determine how to authenticate.
- For API key mode: confirm the `ANTHROPIC_API_KEY` secret exists in the Secrets page.
- For OAuth token mode: confirm the `CLAUDE_CODE_OAUTH_TOKEN` secret exists. This token may have expired; OAuth tokens from `claude setup-token` can have limited scopes.
- Call `GET /api/auth/status` to check the validity of the stored credentials.

Fix:
- Add or update the missing secret via the Secrets page (`/secrets`).
- For OAuth token mode, re-extract the token from the macOS Keychain using the one-liner in the setup wizard and paste it as the new `CLAUDE_CODE_OAUTH_TOKEN` secret value.
- After updating secrets, retry the task from its detail page.
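
The relationship between the auth mode and the secret it depends on can be summarised as a lookup. This is an illustrative sketch based only on the secret names mentioned in this section; `required_secret_for_mode` is a hypothetical helper, not an Optio command.

```shell
# Hypothetical helper: print the secret that must exist for a given
# CLAUDE_AUTH_MODE value, or fail for any other value.
required_secret_for_mode() {
  case "$1" in
    api-key)     echo "ANTHROPIC_API_KEY" ;;
    oauth-token) echo "CLAUDE_CODE_OAUTH_TOKEN" ;;
    *)
      echo "unsupported CLAUDE_AUTH_MODE: '$1'" >&2
      return 1
      ;;
  esac
}
```

An empty or misspelled mode falls through to the error branch, which matches the symptom of an agent that cannot determine how to authenticate.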
Pod and infrastructure issues
Pod won't start / image pull errors
Symptoms: A pod stays in `Pending` or `ImagePullBackOff`. Tasks for the affected repository never advance past `provisioning`.

Diagnosis:
- Run `kubectl get pods -n optio` and identify the pod with a non-`Running` status.
- Run `kubectl describe pod <pod-name> -n optio` and read the Events section. An `ImagePullBackOff` event indicates the agent image cannot be fetched.
- Verify the agent image exists: `docker images | grep optio-agent`.
- Check the configured image pull policy. In local development, the policy must be `Never` so Kubernetes uses the locally built image instead of trying to pull from a registry.

Fix:
- If using local images, ensure `OPTIO_IMAGE_PULL_POLICY=Never` is set (or `agent.imagePullPolicy=Never` in Helm values).
- Rebuild the agent images: `./images/build.sh`.
- If deploying to a remote cluster, push the agent image to a container registry and set the pull policy to `IfNotPresent` or `Always`.
- After fixing the image issue, delete the stuck pod record from the Cluster view and retry the task to trigger a fresh pod creation.
Pod OOM-killed or crashed
Symptoms: A running task suddenly fails with an error about the pod crashing or being killed. The task may have been running for a while before failing.

Diagnosis:
- Open the Cluster view (`/cluster`) and check the pod’s health history. OOM kills and crashes are recorded there.
- Query the `pod_health_events` table for rows with `event_type = 'oom_killed'` or `event_type = 'crashed'` for the affected repository.
- The cleanup worker (which runs every 60 seconds) automatically detects crashed pods, records the event, fails any tasks that were running on the dead pod, and removes the pod record so the next task recreates it cleanly.

Fix:
- Increase the pod’s memory limit in the Helm chart or by using a higher-capacity image preset (e.g., upgrade from `node` to `full`).
- Reduce the number of concurrent agents per pod (`maxAgentsPerPod`) to lower peak memory usage.
- Once the resource limits are updated, retry the failed task. A new pod will be provisioned automatically.
Connectivity and logs
WebSocket connection drops / no live logs
Symptoms: The log viewer in a task’s detail page shows no output, or the connection indicator shows “disconnected.” Logs may appear after the task completes but not in real time.

Diagnosis:
- Live logs are streamed over WebSocket via a Redis pub/sub channel. Check that Redis is running and reachable from the API server.
- Confirm `REDIS_URL` is correctly set in the API configuration.
- Verify that the web application’s `NEXT_PUBLIC_API_URL` environment variable points to the correct API host and port. A mismatch will cause WebSocket connections to fail silently.
- If your deployment is behind a load balancer or ingress, check that WebSocket upgrades (`Upgrade: websocket`) are not being stripped or blocked.

Fix:
- Restart Redis if it is unhealthy and verify connectivity with `redis-cli ping`.
- Update the `NEXT_PUBLIC_API_URL` in the web deployment and roll out the change.
- For ingress-based deployments, add WebSocket-specific annotations (e.g., `nginx.ingress.kubernetes.io/proxy-read-timeout`).
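
When checking `NEXT_PUBLIC_API_URL`, it can help to see which WebSocket origin that value implies. The sketch below assumes the web app derives the WebSocket origin by swapping only the URL scheme (`http` to `ws`, `https` to `wss`); how Optio actually builds its WebSocket URL is not documented here, and `ws_origin_from_api_url` is a hypothetical helper.

```shell
# Hypothetical helper: map an HTTP(S) API URL to the matching WS(S) origin.
# Fails for any other scheme.
ws_origin_from_api_url() {
  case "$1" in
    https://*) printf 'wss://%s\n' "${1#https://}" ;;
    http://*)  printf 'ws://%s\n' "${1#http://}" ;;
    *)
      echo "not an http(s) URL: '$1'" >&2
      return 1
      ;;
  esac
}
```

A page served over HTTPS can only open `wss://` connections, so an `http://` value is a likely culprit when the UI itself is served over HTTPS.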
CI and PR watcher
CI status never updates / PR watcher seems stuck
Symptoms: A task is in `pr_opened` state and the CI status badge never changes from `pending`, even after checks complete on GitHub. Review tasks are not being triggered automatically.

Diagnosis:
- The PR watcher polls GitHub every 30 seconds. If no updates are appearing after several minutes, the watcher may be stalled.
- Check that the `GITHUB_TOKEN` secret is set and has not expired. The watcher needs a valid token to call the GitHub API.
- Look at the API server logs for errors from the PR watcher worker (`pr-watcher-worker.ts`). Rate-limit errors from the GitHub API are a common cause.
- Verify that the PR URL stored on the task is correct and that the PR is still open on GitHub.

Fix:
- Rotate the `GITHUB_TOKEN` secret via the Secrets page if it has expired.
- If the watcher is rate-limited, increase `OPTIO_PR_WATCH_INTERVAL` (default 30s) to reduce API call frequency.
- To force an immediate PR status check, you can manually trigger a review via the task detail page (Request review action) or use `POST /api/tasks/:id/review`.
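
Whether the watcher is likely to hit GitHub's rate limit can be estimated with simple arithmetic. The sketch below assumes each poll costs a fixed number of API calls per watched PR; the real figure depends on the watcher's implementation and is left as a parameter. `watcher_requests_per_hour` is a hypothetical helper. For comparison, GitHub's REST API allows 5,000 requests per hour for a user authenticated with a personal access token.

```shell
# Hypothetical helper: estimate GitHub API requests per hour.
# Args: <poll interval in seconds> <PRs being watched> <API calls per PR per poll>
# Uses integer arithmetic, multiplying before dividing to limit truncation.
watcher_requests_per_hour() {
  echo $(( 3600 * $2 * $3 / $1 ))
}
```

With 10 watched PRs and an assumed 2 calls per poll, a 30-second interval works out to 2,400 requests per hour (`watcher_requests_per_hour 30 10 2`); doubling the interval to 60 seconds halves that to 1,200.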
Authentication and login
OAuth login fails
Symptoms: Clicking Sign in with GitHub (or Google/GitLab) redirects to the provider, but after authorizing you are sent back to an error page or the login page again. You may see an `invalid_state` error.

Diagnosis:
- Confirm that `API_PUBLIC_URL` and `WEB_PUBLIC_URL` are set to the actual public URLs of your deployment. These are used to construct the OAuth callback URL.
- Check that the OAuth callback URL is registered with the provider. For GitHub, it should be `{API_PUBLIC_URL}/api/auth/github/callback`. An unregistered callback URL causes the provider to reject the flow.
- An `invalid_state` error means the CSRF state token expired. This happens when more than 10 minutes pass between clicking the login button and completing the OAuth flow (e.g., a slow redirect or a cached login page).

Fix:
- Update `API_PUBLIC_URL` and `WEB_PUBLIC_URL` in your Helm values or environment config to match the actual deployment URLs, then redeploy.
- Register the correct callback URL in your OAuth application settings on the provider’s dashboard.
- For `invalid_state` errors, ask the user to start the login flow again from a fresh browser tab.
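
A trailing slash or a wrong host in `API_PUBLIC_URL` is easy to miss, so it can help to print the exact callback URL and compare it character by character with the one registered at the provider. This is a minimal sketch using the GitHub pattern given in this section; `github_callback_url` is a hypothetical helper, and stripping a trailing slash is an assumption about how the value should be normalised, not documented Optio behaviour.

```shell
# Hypothetical helper: build the GitHub OAuth callback URL from API_PUBLIC_URL.
# Removes a single trailing slash so the result never contains "//api".
github_callback_url() {
  gcu_base=${1%/}
  printf '%s/api/auth/github/callback\n' "$gcu_base"
}

# Usage:
#   github_callback_url "$API_PUBLIC_URL"
```

The printed value should match the callback URL in the GitHub OAuth application settings exactly, including scheme and port.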
Recovering stuck or failed tasks
Optio provides two recovery actions on any task’s detail page:

| Action | What it does |
|---|---|
| Force restart | Starts a fresh agent session on the existing PR branch. Use this when the agent failed mid-run but the PR and branch are still valid. |
| Force redo | Clears all task state and re-runs from scratch — new branch, new worktree, new agent session. Use this when the branch or PR is corrupted, or you want a completely clean attempt. |
