Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/metareflection/claimcheck/llms.txt

Use this file to discover all available pages before exploring further.

Claimcheck was benchmarked against five Dafny test projects spanning 36 requirement-lemma pairs. To stress-test edge cases, 8 of those pairs are deliberately bogus: tautologies that prove trivially true statements, weakened postconditions that pass Dafny verification while understating the intended guarantee, and narrowed-scope lemmas that cover only a subset of the cases the requirement describes. The results show that structural separation — informalizing each lemma without seeing the requirement, then comparing — is both the most accurate approach and the fastest, beating the single-prompt and Claude Code baselines by wide margins.

Benchmark summary

  • 5 domains: counter, kanban, colorwheel, canon, delegation-auth
  • 36 requirement-lemma pairs across all domains
  • 8 deliberately bogus lemmas — tautologies, weakened postconditions, and narrowed-scope claims designed to be plausible-looking but semantically wrong

Results

VariantAccuracyTime/runAPI calls/run
Two-pass (default)96.3% (104/108)~108s2 (batch informalize + batch compare)
Single-prompt86.1% (31/36)~353s36 (one per pair)
Claude Code (bench-cc)69.4% (25/36)~693s36 (one claude -p per pair)
Two-pass results are averaged over 3 runs. Single-prompt and Claude Code results are from 1 run each.

Key takeaways

  • Structural separation wins. Two-pass is the most accurate mode because the informalization step never sees the requirement — it can’t anchor on it. The comparison step then works from an independent summary, not a biased reading.
  • Batching gives a 3x speed advantage. Two-pass completes in ~108s using only 2 API calls (one batch informalize, one batch compare), versus ~353s for single-prompt at 36 individual calls.
  • Prompting alone is not enough. Single-prompt instructs the model to summarize before judging, but without structural enforcement the model still sees both inputs. Accuracy drops 10 percentage points.
  • General-purpose tooling hurts accuracy and speed. Claude Code’s system prompt and lack of structured output cost it 27 percentage points compared to two-pass, and it runs slowest despite the same number of API calls.

Test domains

ProjectClaims fileDomain module
countertest/integration/claims/counter.dfyCounterDomain
kanbantest/integration/claims/kanban.dfyKanbanDomain
colorwheeltest/integration/claims/colorwheel.dfyColorWheelDomain
canontest/integration/claims/canon.dfyCanonDomain
delegation-authtest/integration/claims/delegation-auth.dfyDelegationAuthDomain
Each claims file includes its domain from ../dafny-replay/. Mappings live in test/integration/mappings/.

Running the benchmarks

1

Run all test projects

Run all five domains at once, or pass a domain name to run a single project:
node test/integration/run-all.js
node test/integration/run-all.js counter
2

Run benchmark variants

Run each variant with the desired number of runs and a label for the result file:
node eval/bench.js --runs 3 --label two-pass
node eval/bench.js --runs 3 --label single-prompt --single-prompt
node eval/bench-cc.js --runs 1 --label cc-sonnet
3

Run a single domain or lemma

Scope a benchmark run to one domain or a single lemma:
node eval/bench-cc.js --runs 1 --label test --domain counter
node eval/bench-cc.js --runs 1 --label test --domain counter --lemma CounterNonNegative
4

Compare results

Compare two labeled result files side by side:
node eval/compare.js two-pass single-prompt
node eval/compare.js two-pass cc-sonnet

Demo case study: election tally

This walkthrough uses a verified election tally system — a Count function with 11 lemmas covering bounds, vote effects, aggregation, unanimity, and monotonicity — to show what claimcheck catches in practice. Starting point: Claude Code generates the lemmas, Dafny verifies all 14 proofs with 0 errors. Every proof is mathematically correct. Claimcheck audit: 9 confirmed, 2 disputed.

Catch 1 — OrderIrrelevant (narrowed scope)

  • Requirement: “Order of ballots doesn’t affect the tally”
  • Lemma: Proves only that prepend and append give the same count
  • Problem: That is one specific reordering — not any permutation. The lemma looks right at a glance and Dafny has no opinion on the gap. Claimcheck flags it.

Catch 2 — TallyMonotonic (tautology / wrong property)

  • Requirement: “Adding a ballot can’t decrease a tally”
  • Lemma: Proves Count(...) >= 0, which is trivially true since counts are natural numbers
  • Problem: The lemma should compare the tally before and after adding a ballot. Instead it checks non-negativity — a tautology masquerading as monotonicity.
After fixes: OrderIrrelevant was rewritten using multisets for full permutation invariance; the tautology was removed. One re-run: 10 confirmed, 0 disputed.
The election tally demo illustrates the core value: Dafny proves your code matches your spec. Claimcheck asks whether your spec matches your intent. The gap between them is where bugs hide.

Build docs developers (and LLMs) love