Claimcheck’s default mode solves a subtle evaluation problem: if you show a model the natural-language requirement before asking it to read the lemma, the model reads the lemma through that requirement. It anchors on what the requirement says and confirms that the lemma looks consistent — even when it isn’t. Two-pass mode prevents this by making structural separation a physical constraint, not just a prompt instruction. The model that informalizes the lemma literally cannot see the requirement; the model that compares sees both only after the back-translation is complete.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/metareflection/claimcheck/llms.txt
Use this file to discover all available pages before exploring further.
Why structural separation matters
When a single model evaluates a lemma against a requirement in one pass, it is susceptible to confirmation bias. Primed with “the lemma should prove X,” the model reads the ensures clause and finds X, even when the clause actually proves something weaker, narrower, or entirely different. This is anchoring bias: the first piece of information disproportionately shapes the evaluation. Two-pass mode eliminates the anchor by splitting the work across two separate API calls with different models:- Pass 1 (informalize) — the model sees only the Dafny code. It has no prior for what the lemma is supposed to mean.
- Pass 2 (compare) — a different model sees the original requirement, the back-translation from Pass 1, and the Dafny code. It compares meaning, not wording.
Structural separation is stronger than prompt-level separation. In single-prompt mode the model is instructed to informalize before reading the requirement, but nothing prevents look-ahead. In two-pass mode the requirement is not present in the Pass 1 prompt at all. See Single-prompt mode for a detailed comparison.
Algorithm flow
Receive input
Claimcheck accepts an array of claims, each with a
requirement (natural language), a lemmaName, and dafnyCode (the lemma’s requires/ensures contract plus proof body). A domain string is included in both prompts for context.Pass 1 — Informalize (Haiku, one batched call)
A single API call sends all lemma bodies to
claude-haiku-4-5-20251001. The prompt contains only the Dafny code — no requirements. For each lemma the model produces a structured informalization: a plain-English back-translation, separated preconditions and postcondition, scope characterization, a strength rating, and a confidence score.All lemmas are processed in one batch, so this step costs one API call regardless of how many lemmas are in the input.Pre-checks — deterministic diagnostics
Between the two passes, Claimcheck runs two deterministic checks against the informalizations:
- Trivial-strength flag — any lemma rated
trivialis logged. A trivial ensures clause that restates the requires clause or is a tautology almost always indicates a mismatch. - Duplicate postcondition detection — if two or more lemmas produce the same postcondition text, they are logged together. Identical postconditions across different requirements suggest one lemma may be doing double duty.
Pass 2 — Compare (Sonnet, one batched call)
A single API call sends all requirement-lemma pairs to
claude-sonnet-4-5-20250929. Each pair includes the original requirement, the full back-translation from Pass 1 (naturalLanguage, preconditions, postcondition, scope, strength), and the Dafny code.The model checks each pair for five specific weakening patterns and produces a structured comparison result per pair.All pairs are processed in one batch, so this step also costs one API call.Pass 1 output fields
The informalize step produces the following structured fields for each lemma:| Field | Type | Description |
|---|---|---|
naturalLanguage | string | Plain-English back-translation of what the lemma guarantees. The model is instructed to be literal — describe what the code says, not what the author intended. |
preconditions | string | The requires clauses translated to English. |
postcondition | string | The ensures clauses translated to English. |
scope | string | What the lemma applies to: a single state, a state transition, all reachable states, etc. |
strength | enum | How strong the claim is. One of trivial, weak, moderate, or strong. |
confidence | number | Model’s self-reported confidence (0–1) that the back-translation is faithful to the code. |
Strength ratings
| Value | Meaning |
|---|---|
trivial | The ensures clause restates the requires clause, is a tautology, or follows trivially from definitions. |
weak | The lemma says very little — for example, it ensures a value exists but not what it equals. |
moderate | The lemma makes a substantive claim about behavior. |
strong | The lemma significantly constrains the system’s behavior. |
Pass 2 output fields
The compare step produces the following structured fields for each requirement-lemma pair:| Field | Type | Description |
|---|---|---|
match | boolean | true if the lemma faithfully expresses the requirement; false if there is any meaningful discrepancy. |
discrepancy | string | Description of exactly what the lemma gets wrong or misses. Present when match is false. |
weakeningType | enum | Category of weakening detected, or none if match is true. |
explanation | string | Reasoning behind the comparison verdict. |
Weakening categories
The compare step classifies any mismatch into one of five categories:| Value | Description |
|---|---|
tautology | The ensures clause restates the requires clause — the lemma proves nothing. |
weakened-postcondition | The ensures clause says less than the requirement asks — for example, the requirement says “exactly 5 colors” but the lemma ensures “at least 1 color.” |
narrowed-scope | The lemma covers only a subset of the cases described by the requirement. |
missing-case | The requirement has multiple conditions but the lemma captures only some of them. |
wrong-property | The lemma proves something related but different from what the requirement asks. |
Performance
Two-pass mode makes exactly 2 batched API calls for any number of lemmas — one for informalization and one for comparison. This is the same cost whether you have 1 lemma or 100. In benchmarks across 5 domains with 36 requirement-lemma pairs, two-pass achieved 96.3% accuracy (104/108 correct across 3 runs) at approximately 108 seconds per run.Default models and overrides
| Step | Default model | Override flag |
|---|---|---|
| Informalize (Pass 1) | claude-haiku-4-5-20251001 | --informalize-model <id> |
| Compare (Pass 2) | claude-sonnet-4-5-20250929 | --compare-model <id> |
Model IDs vary by backend. When using
--vertex, --bedrock, or the Claude Code backend, Claimcheck substitutes the appropriate model identifiers automatically. You only need to override if you want a non-default model.Related
- Single-prompt mode and verdict categories — how the alternative mode differs and when to use it