Benchmark results: accuracy across five domains

Claimcheck was benchmarked against five Dafny test projects spanning 36 requirement-lemma pairs. To stress-test edge cases, 8 of those pairs are deliberately bogus: tautologies that prove trivially true statements, weakened postconditions that pass Dafny verification while understating the intended guarantee, and narrowed-scope lemmas that cover only a subset of the cases the requirement describes. The results show that structural separation — informalizing each lemma without seeing the requirement, then comparing — is both the most accurate approach and the fastest, beating the single-prompt and Claude Code baselines by wide margins.

Benchmark summary

5 domains: counter, kanban, colorwheel, canon, delegation-auth
36 requirement-lemma pairs across all domains
8 deliberately bogus lemmas — tautologies, weakened postconditions, and narrowed-scope claims designed to be plausible-looking but semantically wrong

Results

Variant	Accuracy	Time/run	API calls/run
Two-pass (default)	96.3% (104/108)	~108s	2 (batch informalize + batch compare)
Single-prompt	86.1% (31/36)	~353s	36 (one per pair)
Claude Code (`bench-cc`)	69.4% (25/36)	~693s	36 (one `claude -p` per pair)

Two-pass results are averaged over 3 runs. Single-prompt and Claude Code results are from 1 run each.

Key takeaways

Structural separation wins. Two-pass is the most accurate mode because the informalization step never sees the requirement — it can’t anchor on it. The comparison step then works from an independent summary, not a biased reading.
Batching gives a 3x speed advantage. Two-pass completes in ~108s using only 2 API calls (one batch informalize, one batch compare), versus ~353s for single-prompt at 36 individual calls.
Prompting alone is not enough. Single-prompt instructs the model to summarize before judging, but without structural enforcement the model still sees both inputs. Accuracy drops 10 percentage points.
General-purpose tooling hurts accuracy and speed. Claude Code’s system prompt and lack of structured output cost it 27 percentage points compared to two-pass, and it runs slowest despite the same number of API calls.

Test domains

Project	Claims file	Domain module
counter	`test/integration/claims/counter.dfy`	CounterDomain
kanban	`test/integration/claims/kanban.dfy`	KanbanDomain
colorwheel	`test/integration/claims/colorwheel.dfy`	ColorWheelDomain
canon	`test/integration/claims/canon.dfy`	CanonDomain
delegation-auth	`test/integration/claims/delegation-auth.dfy`	DelegationAuthDomain

Each claims file includes its domain from ../dafny-replay/. Mappings live in test/integration/mappings/.

Running the benchmarks

Run all test projects

Run all five domains at once, or pass a domain name to run a single project:

node test/integration/run-all.js
node test/integration/run-all.js counter

Run benchmark variants

Run each variant with the desired number of runs and a label for the result file:

node eval/bench.js --runs 3 --label two-pass
node eval/bench.js --runs 3 --label single-prompt --single-prompt
node eval/bench-cc.js --runs 1 --label cc-sonnet

Run a single domain or lemma

Scope a benchmark run to one domain or a single lemma:

node eval/bench-cc.js --runs 1 --label test --domain counter
node eval/bench-cc.js --runs 1 --label test --domain counter --lemma CounterNonNegative

Compare results

Compare two labeled result files side by side:

node eval/compare.js two-pass single-prompt
node eval/compare.js two-pass cc-sonnet

Demo case study: election tally

This walkthrough uses a verified election tally system — a Count function with 11 lemmas covering bounds, vote effects, aggregation, unanimity, and monotonicity — to show what claimcheck catches in practice. Starting point: Claude Code generates the lemmas, Dafny verifies all 14 proofs with 0 errors. Every proof is mathematically correct. Claimcheck audit: 9 confirmed, 2 disputed.

Catch 1 — `OrderIrrelevant` (narrowed scope)

Requirement: “Order of ballots doesn’t affect the tally”
Lemma: Proves only that prepend and append give the same count
Problem: That is one specific reordering — not any permutation. The lemma looks right at a glance and Dafny has no opinion on the gap. Claimcheck flags it.

Catch 2 — `TallyMonotonic` (tautology / wrong property)

Requirement: “Adding a ballot can’t decrease a tally”
Lemma: Proves Count(...) >= 0, which is trivially true since counts are natural numbers
Problem: The lemma should compare the tally before and after adding a ballot. Instead it checks non-negativity — a tautology masquerading as monotonicity.

After fixes: OrderIrrelevant was rewritten using multisets for full permutation invariance; the tautology was removed. One re-run: 10 confirmed, 0 disputed.

The election tally demo illustrates the core value: Dafny proves your code matches your spec. Claimcheck asks whether your spec matches your intent. The gap between them is where bugs hide.

Get Started

How It Works

CLI Reference

Library & Backends

Benchmarks

Guides

Benchmark results: accuracy across five domains

Benchmark summary

Results

Key takeaways

Test domains

Running the benchmarks

Demo case study: election tally

Catch 1 — `OrderIrrelevant` (narrowed scope)

Catch 2 — `TallyMonotonic` (tautology / wrong property)

Build docs developers (and LLMs) love

Get Started

How It Works

CLI Reference

Library & Backends

Benchmarks

Guides

Documentation Index

​Benchmark summary

​Results

​Key takeaways

​Test domains

​Running the benchmarks

​Demo case study: election tally

​Catch 1 — OrderIrrelevant (narrowed scope)

​Catch 2 — TallyMonotonic (tautology / wrong property)

Build docs developers (and LLMs) love

Benchmark summary

Results

Key takeaways

Test domains

Running the benchmarks

Demo case study: election tally

Catch 1 — `OrderIrrelevant` (narrowed scope)

Catch 2 — `TallyMonotonic` (tautology / wrong property)