Single-prompt mode and verdict categories

When you need more granular failure analysis than a binary confirmed/disputed verdict, single-prompt mode trades throughput for detail. Instead of two batched calls shared across all lemmas, it makes one API call per requirement-lemma pair and returns a four-way verdict that distinguishes partial justification from outright mismatch, and flags vacuous lemmas separately from lemmas that prove the wrong thing.

How single-prompt mode works

Enable it with the --single-prompt flag:

claimcheck \
  -m mappings/counter.json \
  --dfy claims/counter.dfy \
  -d counter \
  --single-prompt

For each requirement-lemma pair, Claimcheck sends a single prompt to claude-sonnet-4-5-20250929 (overridable with --model). The prompt asks the model to work in two named passes:

Pass 1 (within the prompt) — Informalize the lemma in plain English, stating what it guarantees and under what conditions. The model is instructed to do this before reading the natural-language requirement.
Pass 2 (within the same prompt) — Read the NL requirement and compare. Answer three structured questions: does the ensures clause express the NL claim (Yes / Partially / No), is the guarantee vacuous, and are there surprising restrictions in the requires clauses?

The model returns a structured result with a claimcheck field containing the full verdict detail, informalization, and sub-question answers.

Separation: prompt-level vs structural

The key difference from two-pass mode is where the separation happens:

	Two-pass mode	Single-prompt mode
Separation type	Structural	Prompt-level
How it works	Requirement is absent from the Pass 1 prompt entirely	Model is instructed to informalize before reading the requirement
Enforcement	Physical — the requirement string is not in the API call	Behavioral — relies on the model following instructions
Look-ahead risk	None — model cannot see what it is not given	Possible — nothing prevents the model from considering both artifacts together

Structural separation is stronger. In two-pass mode the informalization model has no opportunity to anchor on the requirement because it is not present. In single-prompt mode the model sees both artifacts in the same context window and is asked to evaluate them sequentially; a capable model generally follows this instruction, but it is not guaranteed.

Two-pass vs single-prompt comparison

	Two-pass	Single-prompt
API calls per run	2 (batched)	N (one per pair)
Accuracy (benchmarks)	96.3% (104/108)	86.1% (31/36)
Time per run (36 pairs)	~108s	~353s
Verdict richness	confirmed / disputed	JUSTIFIED / PARTIALLY_JUSTIFIED / NOT_JUSTIFIED / VACUOUS
Vacuity analysis	No (trivial-strength flag only)	Yes
Surprising restrictions	No	Yes
Default model	haiku (Pass 1) + sonnet (Pass 2)	sonnet

Benchmark figures are from 5 domains (counter, kanban, colorwheel, canon, delegation-auth) with 36 requirement-lemma pairs including 8 deliberately bogus lemmas. Two-pass had 3 runs; single-prompt had 1 run.

Verdict categories

Single-prompt mode returns one of four verdicts per pair:

JUSTIFIED

The lemma’s contract expresses the requirement. The ensures clause may be stronger than the requirement asks — that is acceptable. This is the only passing verdict.

PARTIALLY_JUSTIFIED

The ensures clause covers some but not all of the NL claim. The model identifies what is missing. Maps to weakened-postcondition in the weakening type taxonomy.

NOT_JUSTIFIED

There is a meaningful discrepancy: the lemma is weaker than required, proves something different, or misses key aspects of the requirement.

VACUOUS

The ensures clause already follows trivially from the requires clauses. The lemma proves nothing beyond its own assumptions. Maps to tautology in the weakening type taxonomy.

Vacuity analysis

A vacuous lemma is one whose ensures clause restates a requires clause without unfolding any definitions. For example:

lemma TallyMonotonic(ballots: seq<int>)
  requires Count(ballots) >= 0
  ensures Count(ballots) >= 0
{}

This lemma verifies in Dafny — it is trivially true. But it proves nothing: the postcondition is identical to the precondition. Claimcheck’s single-prompt mode detects this and returns VACUOUS, with a vacuousExplanation describing what the ensures clause repeats. Important distinction: A lemma that extracts a concrete consequence from an invariant is not vacuous. For example:

lemma CounterNonNegative(m: int)
  requires Inv(m)
  ensures m >= 0
{}

Here the ensures clause is not inside the requires clause — it is a specific projection of what Inv(m) entails. Single-prompt mode is calibrated not to flag this pattern.

Surprising restrictions

Beyond vacuity, the model checks whether the requires clauses impose preconditions that the NL requirement does not mention. Standard invariant dependencies (e.g. requires Inv(state)) and well-formedness conditions are expected and are not flagged. But a requires clause that meaningfully restricts when the property holds — in a way the requirement does not acknowledge — is reported in the surprisingRestrictions field. For example, if the requirement says “adding a ballot cannot decrease any candidate’s tally” but the lemma additionally requires |ballots| > 0, that restriction narrows the scope of the guarantee and would be surfaced here.

Output structure

In single-prompt mode each result includes a nested claimcheck field with the full structured output from the model. This preserves the richer sub-question answers alongside the mapped comparison shape used for report compatibility.

{
  "requirement": "Order of ballots doesn't affect the tally",
  "lemmaName": "OrderIrrelevant",
  "status": "disputed",
  "comparison": {
    "match": false,
    "discrepancy": "Lemma only proves prepend vs. append give the same count",
    "weakeningType": "narrowed-scope",
    "explanation": "...",
    "claimcheck": {
      "lemmaName": "OrderIrrelevant",
      "informalization": "This lemma guarantees that prepending an element gives the same count as appending it...",
      "ensuresMatchesNL": "Partially",
      "ensuresExplanation": "The requirement asks for full permutation invariance; the lemma covers only one specific reordering.",
      "vacuous": false,
      "vacuousExplanation": "",
      "surprisingRestrictions": "None",
      "verdict": "PARTIALLY_JUSTIFIED"
    }
  }
}

When to use single-prompt mode

Use single-prompt mode when:

You want to distinguish partial coverage from outright mismatch
You need explicit vacuity detection with explanations
You are debugging a disputed result and want the model’s full reasoning chain
You are running a small number of pairs where the per-call overhead is acceptable

Use two-pass mode (the default) when:

You are running a large batch and need results quickly
You want maximum accuracy — two-pass achieves 96.3% vs 86.1% in benchmarks
Structural separation is important for your trust model

Two-pass round-trip verification algorithm — how the default mode works and why structural separation matters
Benchmark results — full accuracy and timing data across all variants

Get Started

How It Works

CLI Reference

Library & Backends

Benchmarks

Guides

Single-prompt mode and verdict categories

How single-prompt mode works

Separation: prompt-level vs structural

Two-pass vs single-prompt comparison

Verdict categories

JUSTIFIED

PARTIALLY_JUSTIFIED

NOT_JUSTIFIED

VACUOUS

Vacuity analysis

Surprising restrictions

Output structure

When to use single-prompt mode

Build docs developers (and LLMs) love

Get Started

How It Works

CLI Reference

Library & Backends

Benchmarks

Guides

Documentation Index

​How single-prompt mode works

​Separation: prompt-level vs structural

​Two-pass vs single-prompt comparison

​Verdict categories

JUSTIFIED

PARTIALLY_JUSTIFIED

NOT_JUSTIFIED

VACUOUS

​Vacuity analysis

​Surprising restrictions

​Output structure

​When to use single-prompt mode

​Related

Build docs developers (and LLMs) love

How single-prompt mode works

Separation: prompt-level vs structural

Two-pass vs single-prompt comparison

Verdict categories

Vacuity analysis

Surprising restrictions

Output structure

When to use single-prompt mode

Related