When an agent crashes mid-sprint or a run is killed unexpectedly, RepoKernel does not leave you with a corrupted repository. A crashed or runaway agent leaves a resumable run, not a broken state: the transaction journal under <git-common-dir>/repokernel/journal/ records every in-flight multi-file operation so rk recover can replay, mark complete, or quarantine it safely. Most failures can be healed and resumed without touching sprint files by hand.
Crashes leave operational state (run records, worktrees, journals) in need of repair — they do not corrupt your tracked sprint or registry files. The recovery tools operate on the per-clone state directory, not on your Git history.
rk recover — audit and repair operational state
rk recover audits and repairs operational state under <git-common-dir>/repokernel/. It covers four phases: journal replay, worktree cleanup, stale run records, and leaked lane claims.
Journal classification
When --apply is passed, every pending journal entry is classified and handled as follows:
| Classification | Detection | Outcome |
|---|---|---|
| safe_replay | Schema valid, content hashes match nextHash, each uncompleted step’s current file matches prevHash | Replay incomplete steps, verify nextHash, rename pending → done |
| already_applied | Schema valid, every uncompleted step’s current file already matches nextHash | Mark each step complete, rename pending → done. No file mutation |
| diverged | Schema valid, but for some step the current file matches neither prevHash nor nextHash | Quarantine to .unrecoverable.<ts>.<rand>.json, surface P1 finding, exit non-zero |
| unknown_schema | JSON parses but schemaVersion is outside the supported range | Leave pending (a newer rk may know how to replay), surface P1 finding, exit non-zero |
| corrupt | JSON.parse throws, schema validation fails, or step.content SHA does not match step.nextHash | Quarantine, surface P1 finding, exit non-zero |
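
The classification rules in the table can be read as a small decision procedure. The following is a minimal Python sketch of that reading, not RepoKernel's actual implementation. The field names schemaVersion, prevHash, nextHash, and content come from the table; the steps, path, and completed fields, the supported version range, and the choice of SHA-256 are assumptions made for illustration. A journal where some uncompleted steps are still at prevHash and others already at nextHash is treated here as safe_replay, which the table does not state explicitly.

```python
import hashlib
import json

# Assumption: the supported schemaVersion range is not given on this page.
SUPPORTED_SCHEMA_VERSIONS = range(1, 2)
# Assumption: "path" is a hypothetical field name; the others appear in the table.
STEP_KEYS = {"path", "content", "prevHash", "nextHash"}


def sha256_hex(data: bytes) -> str:
    # Assumption: SHA-256. The page only says "SHA".
    return hashlib.sha256(data).hexdigest()


def classify_journal(raw, current_hash):
    """Classify one pending journal entry.

    raw: the journal file's text.
    current_hash: callable(path) returning the hex digest of the file
        currently on disk, or None if the file does not exist.
    """
    try:
        entry = json.loads(raw)
    except json.JSONDecodeError:
        return "corrupt"

    # Schema validation failures count as corrupt.
    if not isinstance(entry, dict) or not isinstance(entry.get("schemaVersion"), int):
        return "corrupt"
    if entry["schemaVersion"] not in SUPPORTED_SCHEMA_VERSIONS:
        return "unknown_schema"

    steps = entry.get("steps")
    if not isinstance(steps, list):
        return "corrupt"
    for step in steps:
        if not isinstance(step, dict) or not STEP_KEYS <= step.keys():
            return "corrupt"
        if not isinstance(step["content"], str):
            return "corrupt"
        # The recorded content must hash to the recorded nextHash.
        if sha256_hex(step["content"].encode("utf-8")) != step["nextHash"]:
            return "corrupt"

    # Only steps not yet marked complete are compared against the disk.
    pending = [s for s in steps if not s.get("completed", False)]
    on_disk = [(s, current_hash(s["path"])) for s in pending]

    if any(h not in (s["prevHash"], s["nextHash"]) for s, h in on_disk):
        return "diverged"
    if all(h == s["nextHash"] for s, h in on_disk):
        return "already_applied"
    return "safe_replay"
```

Passing the disk lookup in as current_hash keeps the function pure, so the five outcomes can be exercised without touching a real repository.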
rk recover --apply writes a recover.report.json with the full set of actions taken. Unrecoverable journals are kept indefinitely as forensic state; completed journals are garbage-collected keeping the most recent 50.
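
The retention rule can be sketched in a few lines of Python. This is an illustration under assumptions, not RepoKernel's code: the (name, completed_at) tuple shape is hypothetical, and only the .unrecoverable. name fragment and the limit of 50 come from this page.

```python
def journals_to_delete(journals, keep=50):
    """Pick completed journals that fall outside the retention window.

    journals: list of (name, completed_at) tuples (hypothetical shape).
    Quarantined journals are never selected, matching the rule that
    unrecoverable journals are kept indefinitely.
    """
    completed = [j for j in journals if ".unrecoverable." not in j[0]]
    newest_first = sorted(completed, key=lambda j: j[1], reverse=True)
    return [name for name, _ in newest_first[keep:]]
```

With 60 completed journals, the sketch selects the 10 oldest for deletion and leaves every quarantined file in place.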
Resuming a paused run
Runs pause for several reasons: awaiting review, hitting a --limit cap, agent failure, or a merge conflict on a parallel sprint branch. List paused runs and resume one by its run ID, for example rk run --resume RUN-NNN.
The --resume flag looks up the paused run record and picks up from where it stopped. You do not need to re-pass --agent or --limit because the run record stores those values.
Halt reasons and how to recover
| Halt reason | What happened | Recovery |
|---|---|---|
| awaiting_review | Sprint complete; run waiting for a human review verdict | rk review-verdict R-NNN accepted then rk run --resume RUN-NNN |
| awaiting_reviews | Parallel wave complete; all sprints in the wave need verdicts | Set a verdict for each review, then rk run --resume RUN-NNN |
| limit_reached | Run hit the --limit N cap | rk run --resume RUN-NNN to continue |
| agent_failed:<sprint-id> | Agent returned failed or blocked | rk run logs RUN-NNN <sprint-id>, fix the issue, start a fresh run |
| merge_conflict:<sprint-id> | Parallel sprint branch could not merge cleanly into the epic worktree | Resolve the conflict manually in the epic worktree, then start a fresh run |
| epic_completed | All sprints shipped; epic is done | rk epic ship E-001 to mark done, validate, and check registry |
| no_runnable_sprints | Nothing eligible to run | rk next --json for blocking reasons; fix dependencies or add sprints to the queue |
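
One useful way to read the table is that only three halt reasons are resumable; the rest need a fresh run or a different command. The Python sketch below encodes that reading as a lookup. The function name recovery_plan and its return shape are hypothetical; the command strings are copied from the table, with R-NNN and E-001 left as the placeholders they are there.

```python
def recovery_plan(halt_reason, run_id):
    """Return (resumable, steps) for a halt reason from the table above.

    halt_reason may carry a sprint ID after a colon, as in
    "agent_failed:<sprint-id>". Steps are a mix of rk commands and
    manual actions, in the order the table lists them.
    """
    kind, _, sprint_id = halt_reason.partition(":")
    resume = f"rk run --resume {run_id}"

    if kind == "awaiting_review":
        return True, ["rk review-verdict R-NNN accepted", resume]
    if kind == "awaiting_reviews":
        return True, ["set a verdict for each review in the wave", resume]
    if kind == "limit_reached":
        return True, [resume]
    if kind == "agent_failed":
        return False, [f"rk run logs {run_id} {sprint_id}",
                       "fix the issue",
                       "start a fresh run"]
    if kind == "merge_conflict":
        return False, [f"resolve the conflict for {sprint_id} in the epic worktree",
                       "start a fresh run"]
    if kind == "epic_completed":
        return False, ["rk epic ship E-001"]
    if kind == "no_runnable_sprints":
        return False, ["rk next --json",
                       "fix dependencies or add sprints to the queue"]
    raise ValueError(f"unknown halt reason: {halt_reason!r}")
```

Raising on an unknown reason is a deliberate choice in this sketch, so that a halt reason added by a newer rk is surfaced instead of silently mapped to the wrong recovery.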
rk fix — apply safe mechanical repairs
rk fix targets lifecycle-command fragments — partial mutations left by a crash between two sequential file writes. It covers findings like SHIPPED_SPRINT_IN_QUEUE and CANCELLED_SPRINT_IN_QUEUE that rk recover does not address.
rk fix --apply always runs both live and audit validator scopes, so it repairs historical-hygiene gaps as well as current-state fragments in one pass.
rk doctor — diagnose setup problems
rk doctor runs a comprehensive diagnostic over your RepoKernel installation. Its checks include the .gitattributes merge driver entry and all three merge.repokernel-registry.* git config keys. Missing or drifted entries are reported with exact remediation commands. It exits 1 when setup is incomplete.
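
To make the merge driver checks concrete, here is a rough Python sketch of what such a check could look like. It is not how rk doctor is implemented. It assumes the three keys are the ones Git itself defines for a custom merge driver (name, driver, recursive) and that the .gitattributes entry uses the standard merge=<driver> attribute; neither detail is confirmed by this page. The config text is expected to be the output of git config --get-regexp for the merge.repokernel-registry prefix.

```python
DRIVER = "repokernel-registry"
# Assumption: the "three keys" are Git's standard custom merge driver keys.
EXPECTED_KEYS = {f"merge.{DRIVER}.{k}" for k in ("name", "driver", "recursive")}


def has_merge_attribute(gitattributes_text):
    """True if any non-comment line assigns merge=repokernel-registry."""
    for line in gitattributes_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # First token is the path pattern; the rest are attributes.
        _pattern, *attrs = line.split()
        if f"merge={DRIVER}" in attrs:
            return True
    return False


def missing_driver_keys(config_output):
    """Return the expected keys absent from `git config --get-regexp` output.

    Each output line has the form "<key> <value>".
    """
    present = {
        line.split(None, 1)[0]
        for line in config_output.splitlines()
        if line.strip()
    }
    return sorted(EXPECTED_KEYS - present)
```

A setup would count as complete in this sketch when has_merge_attribute returns True and missing_driver_keys returns an empty list.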
Crash recovery procedure
Preview what's broken
Run rk recover --preview to see all findings without making any changes. This is safe to run at any time and shows you exactly what the apply step will do.
Heal operational state
Run rk recover --apply to replay safe journal entries, quarantine unrecoverable ones, clean up orphaned worktrees and stale run records, and write recover.report.json.
Apply lifecycle repairs
If any SHIPPED_SPRINT_IN_QUEUE or similar findings are present, apply safe mechanical repairs with rk fix --apply.
Check what's still blocking
After recovery, inspect remaining blockers before resuming by running rk validate and rk next --json. rk validate surfaces any remaining P0/P1 findings that would block the run loop. rk next --json shows per-slot blocking reasons when nothing is runnable.
Fixing a stuck active sprint
If a run terminated abnormally (crash, manual kill), a sprint may be left in active status with no paused run record. To recover: