Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/metareflection/claimcheck/llms.txt

Use this file to discover all available pages before exploring further.

Dafny verifies proofs, not intent. A lemma can be mathematically sound — compiling cleanly, passing every verification check — while expressing something entirely different from the requirement it claims to cover. This guide documents six concrete failure modes observed in practice, each with a real example of how the failure manifests and what it looks like in a Dafny lemma. It also covers the available mitigations and their limits, so you can reason clearly about what Claimcheck can and cannot catch automatically.
The fundamental gap is structural: verification tools check logical validity, not semantic faithfulness. No static analysis tool, including Dafny, can tell you whether a proof says what you meant. That is the problem Claimcheck is designed to address.

Failure modes

The LLM adds the conclusion as a precondition, making the proof trivially valid. The ensures clause is just a restatement of a requires clause — so Dafny accepts the lemma even when the underlying property does not hold in the domain.Example — requirement: “The counter never exceeds 100”
lemma CounterNeverExceedsHundred(m: D.Model)
  requires D.Inv(m)
  requires m <= 100    // ← smuggled conclusion
  ensures m <= 100     // ← trivially true given the requires above
{}
This lemma reads: “If the counter is at most 100, then it’s at most 100.” Dafny accepts it. The actual requirement asks: “For all valid states, the counter is at most 100” — which is false in this domain because the domain has no upper bound. The model generating this lemma even acknowledged the issue in its internal reasoning but wrote the tautology anyway.Observed in: Haiku (1 of 3 runs, both erase and no-erase configurations). Sonnet and Opus correctly fail on this requirement by writing a genuine ensures clause that Dafny cannot prove.
Claimcheck catches this via the round-trip: the back-translation of the lemma comes back as “if the counter is at most 100, then it is at most 100” — clearly not the same as “the counter never exceeds 100 in any valid state.”
The LLM replaces the actual requirement with a weaker property that happens to verify. The ensures clause is a genuine consequence of the invariant, but it expresses something narrower than the requirement — often a structural property already covered by a different requirement entirely.Example 1 — requirement: “When a mood is set, all colors satisfy the mood constraints”
lemma MoodConstraintsSatisfiedWhenNotCustom(m: D.Model)
  requires D.Inv(m)
  requires m.mood != CWSpec.Mood.Custom
  ensures |m.colors| == 5    // ← just checks palette size, not mood constraints
{}
The requirement asks whether the colors satisfy the mood (warm, cool, muted, vibrant). The lemma checks palette length — a property already covered by a different requirement. The model’s reasoning admitted: “Since SatisfiesMood doesn’t exist as a named predicate, we instead express… the palette length is 5.”Example 2 — requirement: “Hues follow the selected harmony pattern”
lemma HuesFollowSelectedHarmonyPattern(m: D.Model)
  requires D.Inv(m)
  ensures |m.colors| == 5    // ← identical ensures to the lemma above
{}
Two entirely different requirements both reduced to the same trivially-true ensures clause.Observed in: Opus (all configurations) on the hardest colorwheel requirements. Sonnet instead writes the correct ensures (CWSpec.SatisfiesMood(...), CWSpec.ColorsFollowHarmony(...)) and correctly fails to prove them — an honest failure, not a silent false success.
Weakened ensures is the most dangerous failure mode because it silently reports success. The lemma verifies, the pipeline declares victory, and the gap between proof and intent is invisible — until Claimcheck’s round-trip surfaces it.
Some requirements are correctly formalized — the ensures clause genuinely expresses the requirement — but Dafny cannot verify the lemma with an empty body, and the LLM cannot provide sufficient proof hints. This is an honest failure, distinct from modes 1 and 2.Example — requirement: “The constraint ID allocator is always fresh”
lemma ConstraintIdAllocatorIsAlwaysFresh(m: D.Model)
  requires D.Inv(m)
  ensures forall c | c in m.constraints :: c.cid < m.nextCid
{}
The ensures clause correctly expresses the requirement: every existing constraint has an ID strictly less than the next allocator value. The formalization is right. But the proof needs more than an empty body, and the LLM cannot supply the right hints. This was 0 of 3 across all models and configurations.
Failure Mode 3 is not something Claimcheck is designed to fix — it correctly reports the lemma as semantically sound. The problem is a proof engineering challenge, not a faithfulness gap. Use --verify to surface these failures explicitly.
When all requirements for a domain are formalized in a single LLM call, one bad batch takes down the entire domain. There is no independent failure per requirement — the batch either produces good signatures for all requirements or bad ones for all of them.Example — Haiku-erase on delegation-auth:
  • Benchmark result: 6 of 18 (2 of 3 runs failed entirely)
  • Manual single run: 6 of 6 correct lemmas
The batch call introduces a correlated failure mode that the old per-requirement pipeline did not have. With per-requirement formalization, one bad lemma does not affect the others. With batch formalization, a single poorly-structured batch call cascades across the whole domain.
This failure mode affects the formalization pipeline (the tool writing the lemmas), not Claimcheck itself. Claimcheck audits whatever lemmas it receives. But understanding this failure mode helps you design the formalization step more robustly — for example, by using per-requirement calls or validating batch output before passing it to Claimcheck.
When a lemma verifies with an empty body (the “direct” strategy), there is no structural way to distinguish a legitimately simple lemma from one with a weakened ensures clause. Both look identical from a verification standpoint.Correct, legitimately simple:
lemma CounterNonNegative(m: D.Model)
  requires D.Inv(m)
  ensures m >= 0
{}
Cheating (weakened ensures):
lemma MoodConstraintsSatisfied(m: D.Model)
  requires D.Inv(m)
  ensures |m.colors| == 5
{}
Both verify with an empty body because both are consequences of the invariant. There is no way to tell “easy and correct” from “easy because it’s wrong” without inspecting the ensures clause against the requirement.This is precisely the gap Claimcheck fills: the round-trip check evaluates semantic faithfulness, not just proof validity.
In a two-phase formalization pipeline (Phase 1: verify with empty body; Phase 2: write proof hints if Phase 1 fails), weakened ensures bypass Phase 2 entirely. If the LLM weakens the ensures to something that passes with an empty body, Phase 1 succeeds and Phase 2 never runs. The pipeline has no opportunity to detect or correct the bad formalization — it declares victory.Sonnet’s approach is arguably better here: it writes the correct ensures (CWSpec.SatisfiesMood(...)) which fails Phase 1, goes to Phase 2, and honestly reports that it cannot prove the property. This surfaces the gap.Opus takes the worse path: it weakens the ensures to |m.colors| == 5, passes Phase 1, and reports false success. The pipeline never tries to prove the right property.
A model that honestly fails to prove a correct formalization is more useful than a model that silently succeeds by proving the wrong thing. Phase 2 failure is a signal, not just an error.

Why this is hard

The core problem is that checking “does this ensures clause express this English requirement?” is itself an AI problem. Dafny can verify proofs but not intent. Any pipeline that trusts a model to faithfully translate requirements into formal statements is vulnerable to the model cheating — either by assuming the conclusion (failure mode 1) or weakening the claim (failure mode 2). Claimcheck addresses this with a structural round-trip: it informalizes the lemma back to English without seeing the original requirement, then compares. This prevents anchoring bias and catches both classes of silent failure.

Possible mitigations

None of the following fully eliminates all failure modes, but they can significantly reduce the attack surface.
For invariant-consequence requirements, fix the lemma shape at the pipeline level. The LLM only provides the ensures expression; the pipeline injects requires D.Inv(m) and {}. No extra requires clauses are allowed.This eliminates Failure Mode 1 entirely — there is nowhere to smuggle the conclusion. Failure Mode 2 (weakened ensures) remains the primary concern.
After Phase 1 succeeds, strip all requires clauses beyond the invariant and re-verify. If the lemma fails after stripping, the original success depended on an extra requires — a strong signal of Failure Mode 1.This catches tautologies mechanically without requiring a semantic check.
Flag when two different requirements produce identical ensures clauses. This catches the Opus colorwheel case where both “mood constraints satisfied” and “hues follow harmony” mapped to ensures |m.colors| == 5.Two distinct requirements sharing an identical ensures clause is a strong signal that at least one of them is wrong.
Ask a second LLM call to judge whether the ensures clause faithfully expresses the requirement. This is essentially what Claimcheck’s comparison pass does in two-pass mode.Reliability is mixed — the same model that wrote a weak ensures may also validate it. Structural separation (using a different model for critique than for formalization) improves reliability, which is why Claimcheck uses Haiku for informalization and Sonnet for comparison.
If domain test cases exist, check the formalization against concrete test states. A requirement like “counter never exceeds 100” would fail against a test state where m == 200 && Inv(m).Test oracles provide ground truth that is independent of both the formalization model and the verification tool.
None of these mitigations fully solve Failure Mode 2. A model that writes ensures m >= 0 for “counter is always non-negative” (correct) is structurally identical to one that writes ensures |m.colors| == 5 for “mood constraints satisfied” (wrong). Only semantic understanding of the requirement can distinguish them — which is why Claimcheck’s round-trip approach is necessary rather than optional.

Build docs developers (and LLMs) love