Two-pass round-trip verification algorithm

Claimcheck’s default mode solves a subtle evaluation problem: if you show a model the natural-language requirement before asking it to read the lemma, the model reads the lemma through that requirement. It anchors on what the requirement says and confirms that the lemma looks consistent — even when it isn’t. Two-pass mode prevents this by making structural separation a physical constraint, not just a prompt instruction. The model that informalizes the lemma literally cannot see the requirement; the model that compares sees both only after the back-translation is complete.

Why structural separation matters

When a single model evaluates a lemma against a requirement in one pass, it is susceptible to confirmation bias. Primed with “the lemma should prove X,” the model reads the ensures clause and finds X, even when the clause actually proves something weaker, narrower, or entirely different. This is anchoring bias: the first piece of information disproportionately shapes the evaluation. Two-pass mode eliminates the anchor by splitting the work across two separate API calls with different models:

Pass 1 (informalize) — the model sees only the Dafny code. It has no prior for what the lemma is supposed to mean.
Pass 2 (compare) — a different model sees the original requirement, the back-translation from Pass 1, and the Dafny code. It compares meaning, not wording.

Because Pass 1 never receives the requirements array, any match or mismatch it surfaces is purely structural — a property of what the code actually says.

Structural separation is stronger than prompt-level separation. In single-prompt mode the model is instructed to informalize before reading the requirement, but nothing prevents look-ahead. In two-pass mode the requirement is not present in the Pass 1 prompt at all. See Single-prompt mode for a detailed comparison.

Algorithm flow

Receive input

Claimcheck accepts an array of claims, each with a requirement (natural language), a lemmaName, and dafnyCode (the lemma’s requires/ensures contract plus proof body). A domain string is included in both prompts for context.

Pass 1 — Informalize (Haiku, one batched call)

A single API call sends all lemma bodies to claude-haiku-4-5-20251001. The prompt contains only the Dafny code — no requirements. For each lemma the model produces a structured informalization: a plain-English back-translation, separated preconditions and postcondition, scope characterization, a strength rating, and a confidence score.All lemmas are processed in one batch, so this step costs one API call regardless of how many lemmas are in the input.

Pre-checks — deterministic diagnostics

Between the two passes, Claimcheck runs two deterministic checks against the informalizations:

Trivial-strength flag — any lemma rated trivial is logged. A trivial ensures clause that restates the requires clause or is a tautology almost always indicates a mismatch.
Duplicate postcondition detection — if two or more lemmas produce the same postcondition text, they are logged together. Identical postconditions across different requirements suggest one lemma may be doing double duty.

Both checks are diagnostic warnings only. They do not change the final verdict; that determination belongs to Pass 2.

Pass 2 — Compare (Sonnet, one batched call)

A single API call sends all requirement-lemma pairs to claude-sonnet-4-5-20250929. Each pair includes the original requirement, the full back-translation from Pass 1 (naturalLanguage, preconditions, postcondition, scope, strength), and the Dafny code.The model checks each pair for five specific weakening patterns and produces a structured comparison result per pair.All pairs are processed in one batch, so this step also costs one API call.

Verdict

Each pair’s verdict is derived directly from the match field in the comparison output:

match: true → confirmed — the lemma faithfully expresses the requirement
match: false → disputed — a discrepancy was found, with a category and explanation

Pass 1 output fields

The informalize step produces the following structured fields for each lemma:

Field	Type	Description
`naturalLanguage`	string	Plain-English back-translation of what the lemma guarantees. The model is instructed to be literal — describe what the code says, not what the author intended.
`preconditions`	string	The requires clauses translated to English.
`postcondition`	string	The ensures clauses translated to English.
`scope`	string	What the lemma applies to: a single state, a state transition, all reachable states, etc.
`strength`	enum	How strong the claim is. One of `trivial`, `weak`, `moderate`, or `strong`.
`confidence`	number	Model’s self-reported confidence (0–1) that the back-translation is faithful to the code.

Strength ratings

Value	Meaning
`trivial`	The ensures clause restates the requires clause, is a tautology, or follows trivially from definitions.
`weak`	The lemma says very little — for example, it ensures a value exists but not what it equals.
`moderate`	The lemma makes a substantive claim about behavior.
`strong`	The lemma significantly constrains the system’s behavior.

A trivial strength rating from Pass 1 is a strong signal of a mismatch. Unless the original requirement is itself trivial, the compare model will almost always flag it as a discrepancy.

Pass 2 output fields

The compare step produces the following structured fields for each requirement-lemma pair:

Field	Type	Description
`match`	boolean	`true` if the lemma faithfully expresses the requirement; `false` if there is any meaningful discrepancy.
`discrepancy`	string	Description of exactly what the lemma gets wrong or misses. Present when `match` is `false`.
`weakeningType`	enum	Category of weakening detected, or `none` if `match` is `true`.
`explanation`	string	Reasoning behind the comparison verdict.

Weakening categories

The compare step classifies any mismatch into one of five categories:

Value	Description
`tautology`	The ensures clause restates the requires clause — the lemma proves nothing.
`weakened-postcondition`	The ensures clause says less than the requirement asks — for example, the requirement says “exactly 5 colors” but the lemma ensures “at least 1 color.”
`narrowed-scope`	The lemma covers only a subset of the cases described by the requirement.
`missing-case`	The requirement has multiple conditions but the lemma captures only some of them.
`wrong-property`	The lemma proves something related but different from what the requirement asks.

Performance

Two-pass mode makes exactly 2 batched API calls for any number of lemmas — one for informalization and one for comparison. This is the same cost whether you have 1 lemma or 100. In benchmarks across 5 domains with 36 requirement-lemma pairs, two-pass achieved 96.3% accuracy (104/108 correct across 3 runs) at approximately 108 seconds per run.

Default models and overrides

Step	Default model	Override flag
Informalize (Pass 1)	`claude-haiku-4-5-20251001`	`--informalize-model <id>`
Compare (Pass 2)	`claude-sonnet-4-5-20250929`	`--compare-model <id>`

# Override both models
claimcheck \
  -m mappings/counter.json \
  --dfy claims/counter.dfy \
  -d counter \
  --informalize-model claude-haiku-4-5-20251001 \
  --compare-model claude-sonnet-4-5-20250929

Model IDs vary by backend. When using --vertex, --bedrock, or the Claude Code backend, Claimcheck substitutes the appropriate model identifiers automatically. You only need to override if you want a non-default model.

Single-prompt mode and verdict categories — how the alternative mode differs and when to use it

Get Started

How It Works

CLI Reference

Library & Backends

Benchmarks

Guides

Two-pass round-trip verification algorithm

Why structural separation matters

Algorithm flow

Pass 1 output fields

Strength ratings

Pass 2 output fields

Weakening categories

Performance

Default models and overrides

Build docs developers (and LLMs) love

Get Started

How It Works

CLI Reference

Library & Backends

Benchmarks

Guides

Documentation Index

​Why structural separation matters

​Algorithm flow

​Pass 1 output fields

​Strength ratings

​Pass 2 output fields

​Weakening categories

​Performance

​Default models and overrides

​Related

Build docs developers (and LLMs) love

Why structural separation matters

Algorithm flow

Pass 1 output fields

Strength ratings

Pass 2 output fields

Weakening categories

Performance

Default models and overrides

Related