Claimcheck was benchmarked against five Dafny test projects spanning 36 requirement-lemma pairs. To stress-test edge cases, 8 of those pairs are deliberately bogus: tautologies that prove trivially true statements, weakened postconditions that pass Dafny verification while understating the intended guarantee, and narrowed-scope lemmas that cover only a subset of the cases the requirement describes. The results show that structural separation — informalizing each lemma without seeing the requirement, then comparing — is both the most accurate approach and the fastest, beating the single-prompt and Claude Code baselines by wide margins.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/metareflection/claimcheck/llms.txt
Use this file to discover all available pages before exploring further.
Benchmark summary
- 5 domains: counter, kanban, colorwheel, canon, delegation-auth
- 36 requirement-lemma pairs across all domains
- 8 deliberately bogus lemmas — tautologies, weakened postconditions, and narrowed-scope claims designed to be plausible-looking but semantically wrong
Results
| Variant | Accuracy | Time/run | API calls/run |
|---|---|---|---|
| Two-pass (default) | 96.3% (104/108) | ~108s | 2 (batch informalize + batch compare) |
| Single-prompt | 86.1% (31/36) | ~353s | 36 (one per pair) |
Claude Code (bench-cc) | 69.4% (25/36) | ~693s | 36 (one claude -p per pair) |
Two-pass results are averaged over 3 runs. Single-prompt and Claude Code results are from 1 run each.
Key takeaways
- Structural separation wins. Two-pass is the most accurate mode because the informalization step never sees the requirement — it can’t anchor on it. The comparison step then works from an independent summary, not a biased reading.
- Batching gives a 3x speed advantage. Two-pass completes in ~108s using only 2 API calls (one batch informalize, one batch compare), versus ~353s for single-prompt at 36 individual calls.
- Prompting alone is not enough. Single-prompt instructs the model to summarize before judging, but without structural enforcement the model still sees both inputs. Accuracy drops 10 percentage points.
- General-purpose tooling hurts accuracy and speed. Claude Code’s system prompt and lack of structured output cost it 27 percentage points compared to two-pass, and it runs slowest despite the same number of API calls.
Test domains
| Project | Claims file | Domain module |
|---|---|---|
| counter | test/integration/claims/counter.dfy | CounterDomain |
| kanban | test/integration/claims/kanban.dfy | KanbanDomain |
| colorwheel | test/integration/claims/colorwheel.dfy | ColorWheelDomain |
| canon | test/integration/claims/canon.dfy | CanonDomain |
| delegation-auth | test/integration/claims/delegation-auth.dfy | DelegationAuthDomain |
../dafny-replay/. Mappings live in test/integration/mappings/.
Running the benchmarks
Run benchmark variants
Run each variant with the desired number of runs and a label for the result file:
Demo case study: election tally
This walkthrough uses a verified election tally system — aCount function with 11 lemmas covering bounds, vote effects, aggregation, unanimity, and monotonicity — to show what claimcheck catches in practice.
Starting point: Claude Code generates the lemmas, Dafny verifies all 14 proofs with 0 errors. Every proof is mathematically correct.
Claimcheck audit: 9 confirmed, 2 disputed.
Catch 1 — OrderIrrelevant (narrowed scope)
- Requirement: “Order of ballots doesn’t affect the tally”
- Lemma: Proves only that prepend and append give the same count
- Problem: That is one specific reordering — not any permutation. The lemma looks right at a glance and Dafny has no opinion on the gap. Claimcheck flags it.
Catch 2 — TallyMonotonic (tautology / wrong property)
- Requirement: “Adding a ballot can’t decrease a tally”
- Lemma: Proves
Count(...) >= 0, which is trivially true since counts are natural numbers - Problem: The lemma should compare the tally before and after adding a ballot. Instead it checks non-negativity — a tautology masquerading as monotonicity.
OrderIrrelevant was rewritten using multisets for full permutation invariance; the tautology was removed. One re-run: 10 confirmed, 0 disputed.