Dafny verifies proofs, not intent. A lemma can be mathematically sound — compiling cleanly, passing every verification check — while expressing something entirely different from the requirement it claims to cover. This guide documents six concrete failure modes observed in practice, each with a real example of how the failure manifests and what it looks like in a Dafny lemma. It also covers the available mitigations and their limits, so you can reason clearly about what Claimcheck can and cannot catch automatically.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/metareflection/claimcheck/llms.txt
Use this file to discover all available pages before exploring further.
Failure modes
Failure Mode 1: Tautology via extra requires
Failure Mode 1: Tautology via extra requires
ensures clause is just a restatement of a requires clause — so Dafny accepts the lemma even when the underlying property does not hold in the domain.Example — requirement: “The counter never exceeds 100”ensures clause that Dafny cannot prove.Failure Mode 2: Weakened ensures
Failure Mode 2: Weakened ensures
ensures clause is a genuine consequence of the invariant, but it expresses something narrower than the requirement — often a structural property already covered by a different requirement entirely.Example 1 — requirement: “When a mood is set, all colors satisfy the mood constraints”SatisfiesMood doesn’t exist as a named predicate, we instead express… the palette length is 5.”Example 2 — requirement: “Hues follow the selected harmony pattern”ensures clause.Observed in: Opus (all configurations) on the hardest colorwheel requirements. Sonnet instead writes the correct ensures (CWSpec.SatisfiesMood(...), CWSpec.ColorsFollowHarmony(...)) and correctly fails to prove them — an honest failure, not a silent false success.Failure Mode 3: Correct formalization, proof is hard
Failure Mode 3: Correct formalization, proof is hard
ensures clause genuinely expresses the requirement — but Dafny cannot verify the lemma with an empty body, and the LLM cannot provide sufficient proof hints. This is an honest failure, distinct from modes 1 and 2.Example — requirement: “The constraint ID allocator is always fresh”ensures clause correctly expresses the requirement: every existing constraint has an ID strictly less than the next allocator value. The formalization is right. But the proof needs more than an empty body, and the LLM cannot supply the right hints. This was 0 of 3 across all models and configurations.--verify to surface these failures explicitly.Failure Mode 4: Nondeterministic batch formalization
Failure Mode 4: Nondeterministic batch formalization
- Benchmark result: 6 of 18 (2 of 3 runs failed entirely)
- Manual single run: 6 of 6 correct lemmas
Failure Mode 5: Direct strategy conflates trivial and cheating
Failure Mode 5: Direct strategy conflates trivial and cheating
ensures clause. Both look identical from a verification standpoint.Correct, legitimately simple:ensures clause against the requirement.This is precisely the gap Claimcheck fills: the round-trip check evaluates semantic faithfulness, not just proof validity.Failure Mode 6: Phase 2 never fires for weakened ensures
Failure Mode 6: Phase 2 never fires for weakened ensures
ensures to something that passes with an empty body, Phase 1 succeeds and Phase 2 never runs. The pipeline has no opportunity to detect or correct the bad formalization — it declares victory.Sonnet’s approach is arguably better here: it writes the correct ensures (CWSpec.SatisfiesMood(...)) which fails Phase 1, goes to Phase 2, and honestly reports that it cannot prove the property. This surfaces the gap.Opus takes the worse path: it weakens the ensures to |m.colors| == 5, passes Phase 1, and reports false success. The pipeline never tries to prove the right property.Why this is hard
The core problem is that checking “does thisensures clause express this English requirement?” is itself an AI problem. Dafny can verify proofs but not intent. Any pipeline that trusts a model to faithfully translate requirements into formal statements is vulnerable to the model cheating — either by assuming the conclusion (failure mode 1) or weakening the claim (failure mode 2).
Claimcheck addresses this with a structural round-trip: it informalizes the lemma back to English without seeing the original requirement, then compares. This prevents anchoring bias and catches both classes of silent failure.
Possible mitigations
None of the following fully eliminates all failure modes, but they can significantly reduce the attack surface.Structural constraint (strongest for tautologies)
Structural constraint (strongest for tautologies)
ensures expression; the pipeline injects requires D.Inv(m) and {}. No extra requires clauses are allowed.This eliminates Failure Mode 1 entirely — there is nowhere to smuggle the conclusion. Failure Mode 2 (weakened ensures) remains the primary concern.Strip-and-re-verify
Strip-and-re-verify
requires clauses beyond the invariant and re-verify. If the lemma fails after stripping, the original success depended on an extra requires — a strong signal of Failure Mode 1.This catches tautologies mechanically without requiring a semantic check.Duplicate ensures detection
Duplicate ensures detection
ensures clauses. This catches the Opus colorwheel case where both “mood constraints satisfied” and “hues follow harmony” mapped to ensures |m.colors| == 5.Two distinct requirements sharing an identical ensures clause is a strong signal that at least one of them is wrong.LLM self-critique
LLM self-critique
ensures clause faithfully expresses the requirement. This is essentially what Claimcheck’s comparison pass does in two-pass mode.Reliability is mixed — the same model that wrote a weak ensures may also validate it. Structural separation (using a different model for critique than for formalization) improves reliability, which is why Claimcheck uses Haiku for informalization and Sonnet for comparison.Test oracles
Test oracles
m == 200 && Inv(m).Test oracles provide ground truth that is independent of both the formalization model and the verification tool.