End-to-end workflow for auditing Dafny claims

This guide walks you through the full Claimcheck workflow: authoring a mapping file that links natural-language requirements to Dafny lemmas, running the audit, understanding what each result status means, and acting on disputed claims based on the specific type of discrepancy detected. It also covers advanced options — adding Dafny verification, using single-prompt mode for richer analysis, and integrating Claimcheck into CI pipelines and programmatic toolchains.

The workflow cycle

Claimcheck is designed to slot into an iterative development loop. You write or revise lemmas, record which requirement each lemma covers, run the audit, fix what fails, and repeat until all claims are confirmed.

Write or update Dafny lemmas

Author lemmas in a .dfy file that formalize your natural-language requirements. Each lemma should express exactly one requirement so that the mapping stays unambiguous.

lemma CounterNonNegative(m: int)
  requires Inv(m)
  ensures m >= 0
{}

Create the mapping file

Write a JSON file that pairs each requirement string with the name of the lemma that covers it. The array order does not matter.

[
  {
    "requirement": "The counter value is always non-negative",
    "lemmaName": "CounterNonNegative"
  },
  {
    "requirement": "Adding a ballot can't decrease a tally",
    "lemmaName": "TallyMonotonic"
  }
]

See Structuring the mapping file for best practices.

Run Claimcheck

Point Claimcheck at the mapping and the .dfy file:

claimcheck \
  -m mapping.json \
  --dfy claims.dfy \
  -d mydomain \
  --json

Claimcheck runs a two-pass round-trip: it first informalizes each lemma back to plain English (without seeing the requirement), then compares the back-translation to the original requirement.

Interpret results and fix disputed claims

Each entry is tagged confirmed, disputed, or error. Disputed entries include a weakeningType that tells you what kind of fix is needed. See Interpreting results for the full breakdown.

Re-run until all claims are confirmed

After fixing disputed lemmas or refining requirements, re-run Claimcheck. Repeat until every entry reports confirmed.

In practice, a single iteration catches most issues. The DEMO election-tally example went from 2 disputed to 0 in one round of fixes.

Structuring the mapping file

The mapping file is a JSON array of { requirement, lemmaName } objects.

[
  {
    "requirement": "The counter value is always non-negative",
    "lemmaName": "CounterNonNegative"
  },
  {
    "requirement": "Order of ballots doesn't affect the tally",
    "lemmaName": "OrderIrrelevant"
  }
]

Best practices:

One requirement per lemma. Splitting a broad requirement into multiple lemmas makes disputes easier to localize.
Write requirements before lemmas. Drafting the natural-language requirement first reduces anchoring — you’re less likely to unconsciously shape the requirement to match the lemma you already wrote.
Use precise language. Vague requirements like “the system behaves correctly” are hard for Claimcheck to evaluate. Prefer specific, observable properties: “the counter never exceeds 100.”
Keep lemmaName in sync. If you rename a lemma in the .dfy file, update the mapping. A mismatched name produces an error result, not a meaningful audit.

The --module flag is only needed when using --verify with module-based .dfy files. It is not required for semantic auditing alone.

Interpreting results

Each mapping entry resolves to one of three statuses:

Status	Meaning	Action
`confirmed`	Round-trip passed — the lemma faithfully expresses the requirement	None needed
`disputed`	Round-trip failed — discrepancy between lemma meaning and requirement	Fix the lemma or refine the requirement
`error`	Lemma not found in the source file	Check that `lemmaName` matches exactly

A confirmed result means no action is required. The back-translation aligned with the original requirement, so the lemma is doing what it claims. A disputed result means Claimcheck detected a gap. The output includes a discrepancy field describing what went wrong and a weakeningType field that categorizes the failure. Use the weakeningType to decide how to fix it.

Acting on weakeningType categories

tautology — add a real postcondition

The lemma proves something trivially true, typically because the conclusion was smuggled into the preconditions. The requires clause contains the same assertion as ensures, making the proof vacuously valid.Fix: Remove the extra requires clause and write a genuine ensures that captures the requirement. If Dafny can’t verify the empty body without the extra precondition, the requirement may not hold in the domain — which is exactly the information you need.

// Before: tautology
lemma CounterNeverExceedsHundred(m: D.Model)
  requires D.Inv(m)
  requires m <= 100    // ← smuggled conclusion
  ensures m <= 100     // ← trivially true
{}

// After: real postcondition (may fail if domain has no upper bound)
lemma CounterNeverExceedsHundred(m: D.Model)
  requires D.Inv(m)
  ensures m <= 100
{}

weakened-postcondition — strengthen the ensures clause

The lemma proves a weaker property than the requirement asks for. The ensures clause is a genuine consequence of the invariant, but it expresses something narrower or less specific than the original intent.Fix: Replace the ensures clause with one that directly expresses the requirement. This may require discovering or defining the right predicate in your domain spec. If Dafny cannot verify the stronger ensures, the requirement may not be provable from the invariant alone — which is honest and useful information.

// Before: weakened ensures (just checks palette size, not mood constraints)
lemma MoodConstraintsSatisfied(m: D.Model)
  requires D.Inv(m)
  ensures |m.colors| == 5

// After: expresses the actual requirement
lemma MoodConstraintsSatisfied(m: D.Model)
  requires D.Inv(m)
  ensures CWSpec.SatisfiesMood(m.colors, m.mood)

narrowed-scope — broaden requires or rethink the lemma

The lemma is valid but only over a subset of the cases the requirement covers. The requires clause restricts the domain too aggressively, making the lemma true in a narrower scenario than the requirement intends.Fix: Relax or remove the narrowing requires clause and verify whether the property holds over the full intended scope. If it doesn’t, the requirement may need to be split into cases, or the domain invariant may need strengthening.Example from the demo: OrderIrrelevant only proved that prepend and append give the same count (one specific reordering), not that any permutation of ballots produces the same tally. Fix: rewrite using multisets to cover all permutations.

wrong-property — rewrite the lemma from scratch

The lemma proves something unrelated to the requirement. The ensures clause addresses a different property entirely, not just a weaker or narrower version.Fix: Discard the existing ensures clause and rethink the formalization from scratch, starting from the natural-language requirement. Consider whether the domain spec exposes the right predicates to express the requirement.Example from the demo: TallyMonotonic (requirement: “adding a ballot can’t decrease a tally”) proved Count(...) >= 0 — trivially true since counts are natural numbers. The fix was to compare the tally before and after adding a ballot.

Running with —verify

Add --verify to also run dafny verify on each lemma. This checks Dafny correctness in addition to semantic faithfulness.

claimcheck \
  -m mapping.json \
  --dfy claims.dfy \
  --module MyDomain \
  -d mydomain \
  --verify

--verify requires dafny in your PATH. The --module flag is required when using --verify with module-based .dfy files.

A verify-failed result means the lemma does not compile or the proof is invalid — a separate problem from the semantic audit. Fix Dafny verification failures first, then re-run the semantic check.

Using —single-prompt for richer dispute analysis

By default, Claimcheck uses a two-pass architecture: a fast back-translation model (Haiku) informalizes lemmas without seeing the requirements, and a comparison model (Sonnet) checks alignment. This gives the best accuracy and speed (96.3% at ~108 seconds per run). Add --single-prompt to switch to a single LLM call per pair that performs both passes sequentially and returns richer verdict categories:

claimcheck \
  -m mapping.json \
  --dfy claims.dfy \
  -d mydomain \
  --single-prompt \
  --json

Single-prompt mode returns four verdict categories instead of confirmed/disputed:

Verdict	Meaning
`JUSTIFIED`	Lemma faithfully expresses the requirement
`PARTIALLY_JUSTIFIED`	Partially matches but misses part of the requirement
`NOT_JUSTIFIED`	Does not express the requirement
`VACUOUS`	Lemma is vacuously true (tautology or empty ensures)

Use --single-prompt when you want more explanatory detail on disputed claims — for example, during active development or code review. Use the default two-pass mode in CI for speed and accuracy.

Integration tips

Running in CI

Add Claimcheck to your CI pipeline with --json to get machine-readable output, then fail the build on any disputed or error results:

claimcheck \
  -m mapping.json \
  --dfy claims.dfy \
  -d mydomain \
  --json > results.json

# Fail if any disputed or error results exist
node -e "
  const r = require('./results.json');
  const bad = r.filter(x => x.status !== 'confirmed');
  if (bad.length) { console.error(bad); process.exit(1); }
"

Using —json for machine-readable output

--json outputs a JSON object to stdout with domain, timestamp, results, and tokenUsage fields. Each entry in results contains at minimum requirement, lemmaName, and status. Disputed entries also include discrepancy and weakeningType.

{
  "domain": "counter",
  "timestamp": "2024-11-01T10:23:45.123Z",
  "results": [
    { "requirement": "The counter value is always non-negative", "lemmaName": "CounterNonNegative", "status": "confirmed" },
    {
      "requirement": "Adding a ballot can't decrease a tally",
      "lemmaName": "TallyMonotonic",
      "status": "disputed",
      "discrepancy": "Lemma only checks non-negativity, not monotonicity across ballot additions.",
      "weakeningType": "wrong-property"
    }
  ],
  "tokenUsage": { "input": 1240, "output": 312 }
}

Using stdin mode for programmatic use

If you have already extracted lemma code (for example, from a build system or code generation pipeline), use --stdin to pipe pre-extracted claims directly:

echo '{
  "claims": [
    {
      "requirement": "The counter value is always non-negative",
      "lemmaName": "CounterNonNegative",
      "dafnyCode": "lemma CounterNonNegative(m: int)\n  requires Inv(m)\n  ensures m >= 0\n{}"
    }
  ],
  "domain": "counter"
}' | claimcheck --stdin

Stdin mode is pure JSON-in / JSON-out with no file extraction step, making it straightforward to integrate with code generation tools or programmatic workflows.

Library usage

For tighter integration, import Claimcheck directly as a library:

import { claimcheck } from 'claimcheck';

const { results, tokenUsage } = await claimcheck({
  claims: [
    {
      requirement: 'The counter value is always non-negative',
      lemmaName: 'CounterNonNegative',
      dafnyCode: 'lemma CounterNonNegative(m: int)\n  requires Inv(m)\n  ensures m >= 0\n{}',
    },
  ],
  domain: 'counter',
});

Get Started

How It Works

CLI Reference

Library & Backends

Benchmarks

Guides

End-to-end workflow for auditing Dafny claims

The workflow cycle

Structuring the mapping file

Interpreting results

Acting on weakeningType categories

Running with —verify

Using —single-prompt for richer dispute analysis

Integration tips

Running in CI

Using —json for machine-readable output

Using stdin mode for programmatic use

Library usage

Build docs developers (and LLMs) love

Get Started

How It Works

CLI Reference

Library & Backends

Benchmarks

Guides

Documentation Index

​The workflow cycle

​Structuring the mapping file

​Interpreting results

​Acting on weakeningType categories

​Running with —verify

​Using —single-prompt for richer dispute analysis

​Integration tips

​Running in CI

​Using —json for machine-readable output

​Using stdin mode for programmatic use

​Library usage

Build docs developers (and LLMs) love

The workflow cycle

Structuring the mapping file

Interpreting results

Acting on weakeningType categories

Running with —verify

Using —single-prompt for richer dispute analysis

Integration tips

Running in CI

Using —json for machine-readable output

Using stdin mode for programmatic use

Library usage