Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/metareflection/claimcheck/llms.txt

Use this file to discover all available pages before exploring further.

Claimcheck’s default mode solves a subtle evaluation problem: if you show a model the natural-language requirement before asking it to read the lemma, the model reads the lemma through that requirement. It anchors on what the requirement says and confirms that the lemma looks consistent — even when it isn’t. Two-pass mode prevents this by making structural separation a physical constraint, not just a prompt instruction. The model that informalizes the lemma literally cannot see the requirement; the model that compares sees both only after the back-translation is complete.

Why structural separation matters

When a single model evaluates a lemma against a requirement in one pass, it is susceptible to confirmation bias. Primed with “the lemma should prove X,” the model reads the ensures clause and finds X, even when the clause actually proves something weaker, narrower, or entirely different. This is anchoring bias: the first piece of information disproportionately shapes the evaluation. Two-pass mode eliminates the anchor by splitting the work across two separate API calls with different models:
  • Pass 1 (informalize) — the model sees only the Dafny code. It has no prior for what the lemma is supposed to mean.
  • Pass 2 (compare) — a different model sees the original requirement, the back-translation from Pass 1, and the Dafny code. It compares meaning, not wording.
Because Pass 1 never receives the requirements array, any match or mismatch it surfaces is purely structural — a property of what the code actually says.
Structural separation is stronger than prompt-level separation. In single-prompt mode the model is instructed to informalize before reading the requirement, but nothing prevents look-ahead. In two-pass mode the requirement is not present in the Pass 1 prompt at all. See Single-prompt mode for a detailed comparison.

Algorithm flow

1

Receive input

Claimcheck accepts an array of claims, each with a requirement (natural language), a lemmaName, and dafnyCode (the lemma’s requires/ensures contract plus proof body). A domain string is included in both prompts for context.
2

Pass 1 — Informalize (Haiku, one batched call)

A single API call sends all lemma bodies to claude-haiku-4-5-20251001. The prompt contains only the Dafny code — no requirements. For each lemma the model produces a structured informalization: a plain-English back-translation, separated preconditions and postcondition, scope characterization, a strength rating, and a confidence score.All lemmas are processed in one batch, so this step costs one API call regardless of how many lemmas are in the input.
3

Pre-checks — deterministic diagnostics

Between the two passes, Claimcheck runs two deterministic checks against the informalizations:
  1. Trivial-strength flag — any lemma rated trivial is logged. A trivial ensures clause that restates the requires clause or is a tautology almost always indicates a mismatch.
  2. Duplicate postcondition detection — if two or more lemmas produce the same postcondition text, they are logged together. Identical postconditions across different requirements suggest one lemma may be doing double duty.
Both checks are diagnostic warnings only. They do not change the final verdict; that determination belongs to Pass 2.
4

Pass 2 — Compare (Sonnet, one batched call)

A single API call sends all requirement-lemma pairs to claude-sonnet-4-5-20250929. Each pair includes the original requirement, the full back-translation from Pass 1 (naturalLanguage, preconditions, postcondition, scope, strength), and the Dafny code.The model checks each pair for five specific weakening patterns and produces a structured comparison result per pair.All pairs are processed in one batch, so this step also costs one API call.
5

Verdict

Each pair’s verdict is derived directly from the match field in the comparison output:
  • match: trueconfirmed — the lemma faithfully expresses the requirement
  • match: falsedisputed — a discrepancy was found, with a category and explanation

Pass 1 output fields

The informalize step produces the following structured fields for each lemma:
FieldTypeDescription
naturalLanguagestringPlain-English back-translation of what the lemma guarantees. The model is instructed to be literal — describe what the code says, not what the author intended.
preconditionsstringThe requires clauses translated to English.
postconditionstringThe ensures clauses translated to English.
scopestringWhat the lemma applies to: a single state, a state transition, all reachable states, etc.
strengthenumHow strong the claim is. One of trivial, weak, moderate, or strong.
confidencenumberModel’s self-reported confidence (0–1) that the back-translation is faithful to the code.

Strength ratings

ValueMeaning
trivialThe ensures clause restates the requires clause, is a tautology, or follows trivially from definitions.
weakThe lemma says very little — for example, it ensures a value exists but not what it equals.
moderateThe lemma makes a substantive claim about behavior.
strongThe lemma significantly constrains the system’s behavior.
A trivial strength rating from Pass 1 is a strong signal of a mismatch. Unless the original requirement is itself trivial, the compare model will almost always flag it as a discrepancy.

Pass 2 output fields

The compare step produces the following structured fields for each requirement-lemma pair:
FieldTypeDescription
matchbooleantrue if the lemma faithfully expresses the requirement; false if there is any meaningful discrepancy.
discrepancystringDescription of exactly what the lemma gets wrong or misses. Present when match is false.
weakeningTypeenumCategory of weakening detected, or none if match is true.
explanationstringReasoning behind the comparison verdict.

Weakening categories

The compare step classifies any mismatch into one of five categories:
ValueDescription
tautologyThe ensures clause restates the requires clause — the lemma proves nothing.
weakened-postconditionThe ensures clause says less than the requirement asks — for example, the requirement says “exactly 5 colors” but the lemma ensures “at least 1 color.”
narrowed-scopeThe lemma covers only a subset of the cases described by the requirement.
missing-caseThe requirement has multiple conditions but the lemma captures only some of them.
wrong-propertyThe lemma proves something related but different from what the requirement asks.

Performance

Two-pass mode makes exactly 2 batched API calls for any number of lemmas — one for informalization and one for comparison. This is the same cost whether you have 1 lemma or 100. In benchmarks across 5 domains with 36 requirement-lemma pairs, two-pass achieved 96.3% accuracy (104/108 correct across 3 runs) at approximately 108 seconds per run.

Default models and overrides

StepDefault modelOverride flag
Informalize (Pass 1)claude-haiku-4-5-20251001--informalize-model <id>
Compare (Pass 2)claude-sonnet-4-5-20250929--compare-model <id>
# Override both models
claimcheck \
  -m mappings/counter.json \
  --dfy claims/counter.dfy \
  -d counter \
  --informalize-model claude-haiku-4-5-20251001 \
  --compare-model claude-sonnet-4-5-20250929
Model IDs vary by backend. When using --vertex, --bedrock, or the Claude Code backend, Claimcheck substitutes the appropriate model identifiers automatically. You only need to override if you want a non-default model.

Build docs developers (and LLMs) love