The same two-pass pipeline that detects mismatches between Dafny lemmas and natural-language requirements applies directly to general claim verification against textual evidence. Given any (claim, evidence) pair, the pipeline summarizes the evidence without seeing the claim, then compares that summary to the claim, which prevents the model from anchoring on the claim while reading the evidence. This benchmark suite measures whether that structural separation improves verdict accuracy across four datasets: SciFact (biomedical), FEVER (general facts), VitaminC (contrastive Wikipedia edits), and HealthVer (health and COVID-19).
Why two-pass prevents anchoring bias
When a model sees both a claim and evidence at the same time, it tends to read the evidence through the lens of the claim, selectively emphasizing details that confirm or deny it rather than judging what the evidence independently establishes. The two-pass approach closes this off structurally:

- Pass 1 (Summarize): The model reads the evidence without seeing the claim and produces a neutral summary of what the evidence establishes.
- Pass 2 (Compare): The model compares that summary to the claim and issues a verdict.

The benchmark compares this against two single-call modes:

- Baseline: The model sees the claim and evidence together and judges directly (one call).
- Single-prompt: The model sees both but is instructed to summarize first, then judge (one call). This tests whether a prompt instruction alone prevents anchoring without structural enforcement.
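
The structural separation in the two-pass mode can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual implementation: the prompt templates, the `two_pass_verdict` function, and the `ask` callable (any function that sends one prompt to a model and returns its text reply) are all hypothetical names introduced here.

```python
from typing import Callable

# Hypothetical prompt templates; the benchmark's real prompts are not shown here.
SUMMARIZE_PROMPT = (
    "Summarize what the following evidence establishes. "
    "Do not speculate beyond it.\n\nEvidence:\n{evidence}"
)
COMPARE_PROMPT = (
    "Evidence summary:\n{summary}\n\nClaim:\n{claim}\n\n"
    "Answer with exactly one of: SUPPORTS, REFUTES, NEI."
)

LABELS = ("SUPPORTS", "REFUTES", "NEI")


def two_pass_verdict(claim: str, evidence: str, ask: Callable[[str], str]) -> str:
    """Run a two-pass check; `ask` sends one prompt to a model and returns its text."""
    # Pass 1: the prompt contains only the evidence, so the claim cannot bias the summary.
    summary = ask(SUMMARIZE_PROMPT.format(evidence=evidence))
    # Pass 2: the model sees the summary and the claim, not the raw evidence.
    reply = ask(COMPARE_PROMPT.format(summary=summary, claim=claim))
    # Take the earliest known label in the reply; fall back to NEI if none appears.
    upper = reply.upper()
    found = [(upper.find(label), label) for label in LABELS if label in upper]
    return min(found)[1] if found else "NEI"
```

The point of the design is that the first call has no access to the claim string at all, so anchoring is ruled out by construction rather than by instruction. The single-prompt mode, by contrast, would place both the claim and the evidence in one prompt and rely on the model to follow the "summarize first" instruction.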
VitaminC is the sharpest test of anchoring resistance: its contrastive pairs are minimal edits to Wikipedia sentences that flip the correct label. A model that anchors on the claim can mistake a one-word change for a non-issue.
Datasets
| Dataset | Domain | Entries | Labels | Evidence source |
|---|---|---|---|---|
| SciFact | Biomedical | 321 | SUPPORTS / REFUTES / NEI | Research paper abstracts |
| FEVER | General facts | 9,999 | SUPPORTS / REFUTES / NEI | Wikipedia sentences |
| VitaminC | General facts | 63,054 | SUPPORTS / REFUTES / NEI | Wikipedia (contrastive pairs) |
| HealthVer | Health / COVID-19 | 3,740 | SUPPORTS / REFUTES / NEI | PubMed evidence sentences |
Modes
| Mode | API calls | Description |
|---|---|---|
| `baseline` | 1 | Claim and evidence together; model judges directly |
| `single-prompt` | 1 | Model instructed to summarize first, then judge |
| `two-pass` | 2 | Structural separation: summarize without claim, then compare |
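
Comparing modes comes down to scoring each mode's verdicts against the gold labels. The sketch below shows one way to compute overall and per-label accuracy from `(gold, predicted)` pairs. The function names and the pair format are assumptions made for illustration; the layout of the benchmark's result files is not described here, so loading them is left out.

```python
from collections import Counter


def accuracy(pairs):
    """Fraction of (gold, predicted) pairs where the labels match."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(gold == pred for gold, pred in pairs) / len(pairs)


def per_label_accuracy(pairs):
    """Accuracy broken down by gold label (SUPPORTS / REFUTES / NEI)."""
    total, correct = Counter(), Counter()
    for gold, pred in pairs:
        total[gold] += 1
        correct[gold] += gold == pred  # True counts as 1, False as 0
    return {label: correct[label] / total[label] for label in total}
```

A per-label breakdown is worth keeping alongside the overall figure, since a mode could gain on one label (for example REFUTES) while losing on another.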
Downloading datasets
Download FEVER, VitaminC, and HealthVer
Run the provided download scripts. FEVER is the largest at ~1.6 GB. VitaminC requires Python’s `datasets` library.
Download SciFact
SciFact is not included in the download scripts. Fetch it directly from the SciFact S3 bucket. The extracted data should land at `data/scifact/data/`.
Running benchmarks
All four benchmark scripts share the same CLI interface; the available flags are listed under CLI options.
SciFact examples
FEVER, VitaminC, and HealthVer examples
Results are saved to `eval/results/<label>.json`. FEVER and VitaminC default to a 500-entry random sample (`--seed 42`). Pass `--sample 0` to run the full dataset.
CLI options
| Flag | Default | Description |
|---|---|---|
| `--mode` | `baseline` | Run mode: `baseline`, `single-prompt`, or `two-pass` |
| `--label` | auto | Label used for the result file name |
| `--model` | `claude-sonnet-4-5-20250929` | Model ID to use |
| `--backend` | `api` | `api` (Anthropic API) or `cc` (Claude Code CLI) |
| `--limit N` | 0 (all) | Maximum number of entries to run |
| `--offset N` | 0 | Skip the first N entries |
| `--sample N` | 500 (FEVER/VitaminC) | Random sample size |
| `--seed N` | 42 | Random seed for sampling |
| `--verbose` | off | Print prompts and API response details |
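
The selection flags can be read as a small filtering pipeline over the dataset entries. The following Python sketch illustrates how `--sample`, `--seed`, `--offset`, and `--limit` might compose. The order of operations (sample, then offset, then limit) and the `select_entries` helper are assumptions for illustration only, and Python's random generator will not reproduce the exact sample the benchmark scripts draw.

```python
import random


def select_entries(entries, sample=0, seed=42, offset=0, limit=0):
    """Assumed order: sample, then offset, then limit. A value of 0 disables a step."""
    selected = list(entries)
    # A fixed seed makes the random sample reproducible across runs.
    if sample and sample < len(selected):
        selected = random.Random(seed).sample(selected, sample)
    # Skip the first `offset` entries.
    selected = selected[offset:]
    # Cap the number of entries; 0 means "all".
    if limit:
        selected = selected[:limit]
    return selected
```

Using a dedicated `random.Random(seed)` instance rather than the module-level generator keeps the sample independent of any other randomness in the program, which is what makes a fixed seed useful when comparing modes on the same subset.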