Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/metareflection/claimcheck/llms.txt

Use this file to discover all available pages before exploring further.

When you need more granular failure analysis than a binary confirmed/disputed verdict, single-prompt mode trades throughput for detail. Instead of two batched calls shared across all lemmas, it makes one API call per requirement-lemma pair and returns a four-way verdict that distinguishes partial justification from outright mismatch, and flags vacuous lemmas separately from lemmas that prove the wrong thing.

How single-prompt mode works

Enable it with the --single-prompt flag:
claimcheck \
  -m mappings/counter.json \
  --dfy claims/counter.dfy \
  -d counter \
  --single-prompt
For each requirement-lemma pair, Claimcheck sends a single prompt to claude-sonnet-4-5-20250929 (overridable with --model). The prompt asks the model to work in two named passes:
  1. Pass 1 (within the prompt) — Informalize the lemma in plain English, stating what it guarantees and under what conditions. The model is instructed to do this before reading the natural-language requirement.
  2. Pass 2 (within the same prompt) — Read the NL requirement and compare. Answer three structured questions: does the ensures clause express the NL claim (Yes / Partially / No), is the guarantee vacuous, and are there surprising restrictions in the requires clauses?
The model returns a structured result with a claimcheck field containing the full verdict detail, informalization, and sub-question answers.

Separation: prompt-level vs structural

The key difference from two-pass mode is where the separation happens:
Two-pass modeSingle-prompt mode
Separation typeStructuralPrompt-level
How it worksRequirement is absent from the Pass 1 prompt entirelyModel is instructed to informalize before reading the requirement
EnforcementPhysical — the requirement string is not in the API callBehavioral — relies on the model following instructions
Look-ahead riskNone — model cannot see what it is not givenPossible — nothing prevents the model from considering both artifacts together
Structural separation is stronger. In two-pass mode the informalization model has no opportunity to anchor on the requirement because it is not present. In single-prompt mode the model sees both artifacts in the same context window and is asked to evaluate them sequentially; a capable model generally follows this instruction, but it is not guaranteed.

Two-pass vs single-prompt comparison

Two-passSingle-prompt
API calls per run2 (batched)N (one per pair)
Accuracy (benchmarks)96.3% (104/108)86.1% (31/36)
Time per run (36 pairs)~108s~353s
Verdict richnessconfirmed / disputedJUSTIFIED / PARTIALLY_JUSTIFIED / NOT_JUSTIFIED / VACUOUS
Vacuity analysisNo (trivial-strength flag only)Yes
Surprising restrictionsNoYes
Default modelhaiku (Pass 1) + sonnet (Pass 2)sonnet
Benchmark figures are from 5 domains (counter, kanban, colorwheel, canon, delegation-auth) with 36 requirement-lemma pairs including 8 deliberately bogus lemmas. Two-pass had 3 runs; single-prompt had 1 run.

Verdict categories

Single-prompt mode returns one of four verdicts per pair:

JUSTIFIED

The lemma’s contract expresses the requirement. The ensures clause may be stronger than the requirement asks — that is acceptable. This is the only passing verdict.

PARTIALLY_JUSTIFIED

The ensures clause covers some but not all of the NL claim. The model identifies what is missing. Maps to weakened-postcondition in the weakening type taxonomy.

NOT_JUSTIFIED

There is a meaningful discrepancy: the lemma is weaker than required, proves something different, or misses key aspects of the requirement.

VACUOUS

The ensures clause already follows trivially from the requires clauses. The lemma proves nothing beyond its own assumptions. Maps to tautology in the weakening type taxonomy.

Vacuity analysis

A vacuous lemma is one whose ensures clause restates a requires clause without unfolding any definitions. For example:
lemma TallyMonotonic(ballots: seq<int>)
  requires Count(ballots) >= 0
  ensures Count(ballots) >= 0
{}
This lemma verifies in Dafny — it is trivially true. But it proves nothing: the postcondition is identical to the precondition. Claimcheck’s single-prompt mode detects this and returns VACUOUS, with a vacuousExplanation describing what the ensures clause repeats. Important distinction: A lemma that extracts a concrete consequence from an invariant is not vacuous. For example:
lemma CounterNonNegative(m: int)
  requires Inv(m)
  ensures m >= 0
{}
Here the ensures clause is not inside the requires clause — it is a specific projection of what Inv(m) entails. Single-prompt mode is calibrated not to flag this pattern.

Surprising restrictions

Beyond vacuity, the model checks whether the requires clauses impose preconditions that the NL requirement does not mention. Standard invariant dependencies (e.g. requires Inv(state)) and well-formedness conditions are expected and are not flagged. But a requires clause that meaningfully restricts when the property holds — in a way the requirement does not acknowledge — is reported in the surprisingRestrictions field. For example, if the requirement says “adding a ballot cannot decrease any candidate’s tally” but the lemma additionally requires |ballots| > 0, that restriction narrows the scope of the guarantee and would be surfaced here.

Output structure

In single-prompt mode each result includes a nested claimcheck field with the full structured output from the model. This preserves the richer sub-question answers alongside the mapped comparison shape used for report compatibility.
{
  "requirement": "Order of ballots doesn't affect the tally",
  "lemmaName": "OrderIrrelevant",
  "status": "disputed",
  "comparison": {
    "match": false,
    "discrepancy": "Lemma only proves prepend vs. append give the same count",
    "weakeningType": "narrowed-scope",
    "explanation": "...",
    "claimcheck": {
      "lemmaName": "OrderIrrelevant",
      "informalization": "This lemma guarantees that prepending an element gives the same count as appending it...",
      "ensuresMatchesNL": "Partially",
      "ensuresExplanation": "The requirement asks for full permutation invariance; the lemma covers only one specific reordering.",
      "vacuous": false,
      "vacuousExplanation": "",
      "surprisingRestrictions": "None",
      "verdict": "PARTIALLY_JUSTIFIED"
    }
  }
}

When to use single-prompt mode

Use single-prompt mode when:
  • You want to distinguish partial coverage from outright mismatch
  • You need explicit vacuity detection with explanations
  • You are debugging a disputed result and want the model’s full reasoning chain
  • You are running a small number of pairs where the per-call overhead is acceptable
Use two-pass mode (the default) when:
  • You are running a large batch and need results quickly
  • You want maximum accuracy — two-pass achieves 96.3% vs 86.1% in benchmarks
  • Structural separation is important for your trust model

Build docs developers (and LLMs) love