When an agent crashes mid-sprint or a run is killed unexpectedly, RepoKernel leaves you with a resumable run, not a corrupted repository. The transaction journal under <git-common-dir>/repokernel/journal/ records every in-flight multi-file operation, so rk recover can replay it, mark it complete, or quarantine it safely. Most failures can be healed and resumed without touching sprint files by hand.
Crashes leave operational state (run records, worktrees, journals) in need of repair, but they do not corrupt your tracked sprint or registry files. The recovery tools operate on the per-clone state directory, not on your Git history.

rk recover — audit and repair operational state

rk recover audits and repairs operational state under <git-common-dir>/repokernel/. It covers four phases: journal replay, worktree cleanup, stale run records, and leaked lane claims.
rk recover --preview        # list findings without mutating anything (default)
rk recover --dry-run        # alias for --preview
rk recover --apply          # heal everything: replay journals, quarantine, write recover.report.json
rk recover --journal-only --apply   # skip worktrees, runs, and lane-claim phases
rk recover --json           # JSON output of findings and planned actions

Journal classification

When --apply is passed, every pending journal entry is classified and handled as follows:
| Classification | Detection | Outcome |
| --- | --- | --- |
| safe_replay | Schema valid, content hashes match nextHash, each uncompleted step’s current file matches prevHash | Replay incomplete steps, verify nextHash, rename pending → done |
| already_applied | Schema valid, every uncompleted step’s current file already matches nextHash | Mark each step complete, rename pending → done. No file mutation |
| diverged | Schema valid, but for some step the current file matches neither prevHash nor nextHash | Quarantine to .unrecoverable.<ts>.<rand>.json, surface P1 finding, exit non-zero |
| unknown_schema | JSON parses but schemaVersion is outside the supported range | Leave pending (a newer rk may know how to replay), surface P1 finding, exit non-zero |
| corrupt | JSON.parse throws, schema validation fails, or step.content SHA does not match step.nextHash | Quarantine, surface P1 finding, exit non-zero |
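The classification rules in the table above can be sketched as a small decision function. This is an illustrative Python sketch, not RepoKernel's implementation: the field names `steps`, `path`, and `completed`, the use of SHA-256, and the supported version tuple are assumptions, and only `schemaVersion`, `content`, `prevHash`, and `nextHash` come from this page. The table does not say how a journal is handled when some uncompleted steps match prevHash and others already match nextHash, so the sketch assumes that case is still a safe replay.

```python
import hashlib
import json
from typing import Callable, Optional

# Assumption: the real supported schemaVersion range is not stated on this page.
SUPPORTED_SCHEMA_VERSIONS = (1,)


def sha256(text: str) -> str:
    """Hex digest of UTF-8 text. SHA-256 is an assumed hash algorithm."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def classify(raw: str, read_file: Callable[[str], Optional[str]]) -> str:
    """Classify one pending journal entry; read_file returns None for a missing file."""
    try:
        journal = json.loads(raw)
        version = journal["schemaVersion"]
    except (ValueError, KeyError, TypeError):
        return "corrupt"  # unparseable JSON or no schemaVersion field

    if version not in SUPPORTED_SCHEMA_VERSIONS:
        return "unknown_schema"  # left pending for a newer rk

    try:
        steps = [
            (s["path"], s["content"], s["prevHash"], s["nextHash"], bool(s.get("completed")))
            for s in journal["steps"]
        ]
        # The recorded content must hash to nextHash, otherwise the entry is corrupt.
        if any(sha256(content) != nxt for _, content, _, nxt, _ in steps):
            return "corrupt"
    except (KeyError, TypeError, AttributeError):
        return "corrupt"  # schema validation failure

    all_applied = True
    for path, _, prev, nxt, done in steps:
        if done:
            continue
        text = read_file(path)
        current = sha256(text) if text is not None else None
        if current == nxt:
            continue
        if current != prev:
            return "diverged"  # file matches neither prevHash nor nextHash
        all_applied = False

    # Assumption: a mix of prevHash and nextHash matches is treated as replayable.
    return "already_applied" if all_applied else "safe_replay"
```

Passing `read_file` as a parameter keeps the sketch independent of the filesystem, so the same function can be exercised against an in-memory dictionary.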
rk recover --apply writes a recover.report.json with the full set of actions taken. Unrecoverable journals are kept indefinitely as forensic state; completed journals are garbage-collected keeping the most recent 50.
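The retention rule for completed journals can be illustrated with a short Python sketch. It is a guess at the shape of the logic, not RepoKernel's code: the `.done.json` suffix and the use of modification times to decide recency are assumptions made for illustration, and only the `.unrecoverable.` naming and the limit of 50 come from this page.

```python
def plan_journal_gc(journals: list[tuple[str, float]], keep: int = 50) -> list[str]:
    """Return the names of completed journals that would be deleted.

    journals holds (file name, modification time) pairs. Only names ending in
    the hypothetical ".done.json" suffix are candidates, so quarantined
    ".unrecoverable." files and pending entries are never selected.
    """
    done = [(name, mtime) for name, mtime in journals if name.endswith(".done.json")]
    # Newest first, then drop everything beyond the retention window.
    done.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in done[keep:]]
```

With 60 completed journals and one quarantined journal, this plan selects the 10 oldest completed journals and leaves the quarantined file alone.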

Resuming a paused run

Runs pause for several reasons — awaiting review, hitting a --limit cap, agent failure, or a merge conflict on a parallel sprint branch. List paused runs and resume them:
rk runs                        # list all runs and their status
rk runs --status paused        # filter to paused runs only
rk run inspect RUN-001         # show run state and next steps
rk run logs RUN-001            # show logs for a run
rk run --resume RUN-001        # resume from the last incomplete sprint
rk run abort RUN-001           # discard a paused run (sprints keep their current status)
The --resume flag looks up the paused run record and picks up from where it stopped. You do not need to re-pass --agent or --limit — the run record stores those values.

Halt reasons and how to recover

| Halt reason | What happened | Recovery |
| --- | --- | --- |
| awaiting_review | Sprint complete; run waiting for a human review verdict | rk review-verdict R-NNN accepted, then rk run --resume RUN-NNN |
| awaiting_reviews | Parallel wave complete; all sprints in the wave need verdicts | Set a verdict for each review, then rk run --resume RUN-NNN |
| limit_reached | Run hit the --limit N cap | rk run --resume RUN-NNN to continue |
| agent_failed:<sprint-id> | Agent returned failed or blocked | rk run logs RUN-NNN <sprint-id>, fix the issue, start a fresh run |
| merge_conflict:<sprint-id> | Parallel sprint branch could not merge cleanly into the epic worktree | Resolve the conflict manually in the epic worktree, then start a fresh run |
| epic_completed | All sprints shipped; epic is done | rk epic ship E-001 to mark done, validate, and check registry |
| no_runnable_sprints | Nothing eligible to run | rk next --json for blocking reasons; fix dependencies or add sprints to the queue |
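For scripting around a paused run, the table above can be encoded as a lookup from halt reason to next steps. The Python sketch below is a hypothetical helper, not part of rk: the function name and the `manual:` prefix are invented for illustration, while the commands are copied from the table, with R-NNN and E-001 kept as the page's own placeholders.

```python
def recovery_steps(halt_reason: str, run_id: str) -> list[str]:
    """Map a halt reason such as "agent_failed:S-002" to suggested next steps."""
    # Reasons like agent_failed and merge_conflict carry a sprint ID after a colon.
    kind, _, sprint_id = halt_reason.partition(":")
    if kind in ("agent_failed", "merge_conflict") and not sprint_id:
        raise ValueError(f"{kind} requires a sprint id, e.g. {kind}:S-002")

    resume = f"rk run --resume {run_id}"
    table = {
        "awaiting_review": ["rk review-verdict R-NNN accepted", resume],
        "awaiting_reviews": ["manual: set a verdict for each review in the wave", resume],
        "limit_reached": [resume],
        "agent_failed": [
            f"rk run logs {run_id} {sprint_id}",
            "manual: fix the issue",
            "manual: start a fresh run",
        ],
        "merge_conflict": [
            "manual: resolve the conflict in the epic worktree",
            "manual: start a fresh run",
        ],
        "epic_completed": ["rk epic ship E-001"],
        "no_runnable_sprints": [
            "rk next --json",
            "manual: fix dependencies or add sprints to the queue",
        ],
    }
    if kind not in table:
        raise ValueError(f"unknown halt reason: {halt_reason}")
    return table[kind]
```

Note that only the first three reasons end in a resume; a failed agent or a merge conflict leads to a fresh run instead.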

rk fix — apply safe mechanical repairs

rk fix targets lifecycle-command fragments — partial mutations left by a crash between two sequential file writes. It covers findings like SHIPPED_SPRINT_IN_QUEUE and CANCELLED_SPRINT_IN_QUEUE that rk recover does not address.
rk fix --preview        # show what would be fixed
rk fix --apply --yes    # write all repairs classified as safe
rk fix --apply always runs both live and audit validator scopes, so it repairs historical-hygiene gaps as well as current-state fragments in one pass.

rk doctor — diagnose setup problems

rk doctor runs a comprehensive diagnostic over your RepoKernel installation:
rk doctor
It checks config, git setup, paths, queues, registry integrity, the .gitattributes merge driver entry, and all three merge.repokernel-registry.* git config keys. Missing or drifted entries are reported with exact remediation commands. The command exits with status 1 when setup is incomplete.

Crash recovery procedure

1. Preview what's broken

Run rk recover --preview to see all findings without making any changes. This is safe to run at any time and shows you exactly what the apply step will do.
rk recover --preview
2. Heal operational state

Run rk recover --apply to replay safe journal entries, quarantine unrecoverable ones, clean up orphaned worktrees and stale run records, and write recover.report.json.
rk recover --apply
3. Apply lifecycle repairs

If any SHIPPED_SPRINT_IN_QUEUE or similar findings are present, apply safe mechanical repairs:
rk fix --preview
rk fix --apply --yes
4. Check what's still blocking

After recovery, inspect remaining blockers before resuming:
rk validate --json
rk next --json
rk validate surfaces any remaining P0/P1 findings that would block the run loop. rk next --json shows per-slot blocking reasons when nothing is runnable.
5. Resume or retry

Resume the paused run with --resume or, if the sprint itself failed, start a fresh run that retries it:
rk run --resume RUN-001          # resume from last incomplete sprint
# or
rk run T-NNN --agent claude      # retry a specific sprint
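The five steps above can be strung together in a small driver. The following Python sketch is one possible wrapper, not an official tool: the function names are hypothetical, and stopping at the first non-zero exit is a conservative assumption, since this page only documents non-zero exits for rk recover --apply findings and for rk doctor. The rk commands themselves are taken verbatim from the procedure.

```python
import subprocess
from typing import Callable, Optional


def recovery_plan(
    resume_run: Optional[str] = None,
    retry_sprint: Optional[str] = None,
    agent: str = "claude",
) -> list[list[str]]:
    """Build the command sequence for the crash recovery procedure."""
    plan = [
        ["rk", "recover", "--preview"],    # step 1: list findings, no mutation
        ["rk", "recover", "--apply"],      # step 2: heal operational state
        ["rk", "fix", "--preview"],        # step 3: show lifecycle repairs
        ["rk", "fix", "--apply", "--yes"],
        ["rk", "validate", "--json"],      # step 4: remaining blockers
        ["rk", "next", "--json"],
    ]
    # Step 5: resume a paused run, or retry a specific sprint in a fresh run.
    if resume_run is not None:
        plan.append(["rk", "run", "--resume", resume_run])
    elif retry_sprint is not None:
        plan.append(["rk", "run", retry_sprint, "--agent", agent])
    return plan


def run_plan(
    plan: list[list[str]],
    runner: Callable[[list[str]], int] = lambda argv: subprocess.run(argv).returncode,
) -> list[list[str]]:
    """Run commands in order and return those executed, stopping after the first failure."""
    executed = []
    for argv in plan:
        executed.append(argv)
        if runner(argv) != 0:
            break  # assumption: a non-zero exit needs a human look before continuing
    return executed
```

A call such as `run_plan(recovery_plan(resume_run="RUN-001"))` would walk the whole procedure. Passing a custom `runner` allows a dry run that only prints each command.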

Fixing a stuck active sprint

If a run terminated abnormally (crash, manual kill), a sprint may be left in active status with no paused run record. To recover:
# 1. See what is active
rk status --json
rk validate --json

# 2. If the sprint should be reviewed
rk review S-002
rk review-verdict R-002 accepted
rk ship S-002

# 3. Start a new run for the next sprint
rk run E-001 --agent claude --limit 1

Registry drift after recovery

After manual lifecycle operations, the registry may be out of sync with disk state. Regenerate it:
rk registry --write    # regenerate from disk state
rk registry --check    # verify integrity
