
The same two-pass pipeline that detects mismatches between Dafny lemmas and natural-language requirements applies directly to general claim verification against textual evidence. Given any (claim, evidence) pair, the pipeline first summarizes the evidence without seeing the claim, then compares that summary to the claim. This prevents the model from anchoring on the claim while it reads the evidence. This benchmark suite measures whether that structural separation improves verdict accuracy across four datasets: SciFact (biomedical), FEVER (general facts), VitaminC (contrastive Wikipedia edits), and HealthVer (health and COVID-19).

Why two-pass prevents anchoring bias

When a model sees both a claim and evidence at the same time, it tends to read the evidence through the lens of the claim, selectively emphasizing details that confirm or contradict it rather than judging what the evidence independently establishes. The two-pass approach rules this out structurally:
  1. Pass 1 (Summarize): The model reads the evidence without seeing the claim and produces a neutral summary of what the evidence establishes.
  2. Pass 2 (Compare): The model compares that summary to the claim and issues a verdict.
Two control modes test whether prompting alone is sufficient:
  • Baseline: Model sees claim and evidence together, judges directly (one call).
  • Single-prompt: Model sees both but is instructed to summarize first, then judge (one call). Tests whether a prompt instruction prevents anchoring without structural enforcement.
VitaminC is the sharpest test of anchoring resistance: its contrastive pairs are minimal edits to Wikipedia sentences that flip the correct label. A model that anchors on the claim can mistake a one-word change for a non-issue.
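The separation described above can be sketched in a few lines of JavaScript. This is a minimal illustration, not the project's actual implementation: `callModel` is a hypothetical function that maps a prompt string to a response string, the prompt wording is invented, and the sketch assumes that pass 2 sees only the summary and the claim, not the raw evidence.

```javascript
// Hypothetical model interface: callModel(prompt) resolves to a response string.
// A real client would call an LLM API here.

// Two-pass: the claim is withheld from the first call.
async function twoPass(callModel, claim, evidence) {
  // Pass 1 (Summarize): the prompt contains the evidence only.
  const summary = await callModel(
    `Summarize what the following evidence establishes.\n\nEvidence:\n${evidence}`
  );
  // Pass 2 (Compare): the prompt contains the summary and the claim.
  const verdict = await callModel(
    "Compare the summary to the claim. " +
      "Answer SUPPORTS, REFUTES, or NOT_ENOUGH_INFO.\n\n" +
      `Summary:\n${summary}\n\nClaim:\n${claim}`
  );
  return { summary, verdict: verdict.trim() };
}

// Baseline: one call in which claim and evidence are visible together.
async function baseline(callModel, claim, evidence) {
  const verdict = await callModel(
    "Does the evidence support the claim? " +
      "Answer SUPPORTS, REFUTES, or NOT_ENOUGH_INFO.\n\n" +
      `Claim:\n${claim}\n\nEvidence:\n${evidence}`
  );
  return { verdict: verdict.trim() };
}

// Demo with a stub model that records every prompt it receives.
const seen = [];
const stubModel = async (prompt) => {
  seen.push(prompt);
  return seen.length === 1 ? "Aspirin reduced fever in the trial." : "SUPPORTS";
};

twoPass(stubModel, "Aspirin lowers fever.", "In a trial, aspirin reduced fever.")
  .then(({ verdict }) => console.log(verdict, seen.length)); // prints: SUPPORTS 2
```

The point of the structure is that no prompt wording can leak the claim into pass 1, because the claim is never passed to that call.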

Datasets

| Dataset | Domain | Entries | Labels | Evidence source |
| --- | --- | --- | --- | --- |
| SciFact | Biomedical | 321 | SUPPORTS / REFUTES / NEI | Research paper abstracts |
| FEVER | General facts | 9,999 | SUPPORTS / REFUTES / NEI | Wikipedia sentences |
| VitaminC | General facts | 63,054 | SUPPORTS / REFUTES / NEI | Wikipedia (contrastive pairs) |
| HealthVer | Health / COVID-19 | 3,740 | SUPPORTS / REFUTES / NEI | PubMed evidence sentences |

All four datasets use the same three-label scheme: SUPPORTS, REFUTES, and NOT_ENOUGH_INFO (NEI).
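Because every dataset shares the three-label scheme, one scoring routine can serve all four. The following is a rough sketch of accuracy scoring with a per-label breakdown; the `{ gold, predicted }` entry shape is an assumption made for illustration and is not necessarily how the benchmark scripts store results.

```javascript
const LABELS = ["SUPPORTS", "REFUTES", "NOT_ENOUGH_INFO"];

// Computes overall accuracy plus per-gold-label counts.
// Entry shape { gold, predicted } is hypothetical.
function score(entries) {
  const perLabel = Object.fromEntries(
    LABELS.map((label) => [label, { total: 0, correct: 0 }])
  );
  let correct = 0;
  for (const { gold, predicted } of entries) {
    if (!LABELS.includes(gold)) throw new Error(`Unknown gold label: ${gold}`);
    perLabel[gold].total += 1;
    if (gold === predicted) {
      perLabel[gold].correct += 1;
      correct += 1;
    }
  }
  const accuracy = entries.length ? correct / entries.length : 0;
  return { accuracy, perLabel };
}

const result = score([
  { gold: "SUPPORTS", predicted: "SUPPORTS" },
  { gold: "REFUTES", predicted: "SUPPORTS" },
  { gold: "NOT_ENOUGH_INFO", predicted: "NOT_ENOUGH_INFO" },
  { gold: "REFUTES", predicted: "REFUTES" },
]);
console.log(result.accuracy); // 0.75
```

A per-label breakdown is useful here because a mode could gain accuracy on one label while losing it on another, which a single accuracy number would hide.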

Modes

| Mode | API calls | Description |
| --- | --- | --- |
| baseline | 1 | Claim and evidence together; model judges directly |
| single-prompt | 1 | Model instructed to summarize first, then judge |
| two-pass | 2 | Structural separation: summarize without claim, then compare |
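The per-entry call counts in the table translate directly into the size of a run. As a small worked example (assuming one attempt per call, with no retries or caching, which this page does not specify):

```javascript
// API calls per entry, taken from the modes table.
const CALLS_PER_ENTRY = { baseline: 1, "single-prompt": 1, "two-pass": 2 };

// Total API calls for a run over `entries` items in the given mode.
function totalCalls(mode, entries) {
  if (!(mode in CALLS_PER_ENTRY)) throw new Error(`Unknown mode: ${mode}`);
  return CALLS_PER_ENTRY[mode] * entries;
}

console.log(totalCalls("baseline", 321)); // 321  (all SciFact entries)
console.log(totalCalls("two-pass", 321)); // 642
console.log(totalCalls("two-pass", 500)); // 1000 (a 500-entry sample)
```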

Downloading datasets

1. Download FEVER, VitaminC, and HealthVer

Run the provided download scripts. FEVER is the largest at ~1.6 GB.

    bash data/download-fever.sh
    bash data/download-vitaminc.sh
    bash data/download-healthver.sh

VitaminC requires Python’s datasets library. Install it before running the VitaminC script:

    pip install datasets

2. Download SciFact

SciFact is not included in the download scripts. Fetch it directly from the SciFact S3 bucket:

    mkdir -p data/scifact && cd data/scifact
    curl -sL https://scifact.s3-us-west-2.amazonaws.com/release/latest/data.tar.gz | tar xz

The extracted data should land at data/scifact/data/. Because the first command changes into data/scifact, return to the repository root (cd ../..) before the next step.
3. Install Node dependencies

    npm install

Running benchmarks

All four benchmark scripts share the same CLI interface, where <name> is scifact, fever, vitaminc, or healthver:

    node eval/bench-<name>.js --mode <mode> --label <label> [options]

SciFact examples

    # Quick smoke test (5 entries)
    node eval/bench-scifact.js --mode baseline --label test --limit 5

    # Full comparison across all three modes
    node eval/bench-scifact.js --mode baseline --label scifact-baseline
    node eval/bench-scifact.js --mode single-prompt --label scifact-single
    node eval/bench-scifact.js --mode two-pass --label scifact-two-pass

    # Use a different model
    node eval/bench-scifact.js --mode two-pass --model claude-opus-4-6 --label scifact-opus

    # Compare two result files
    node eval/compare-mystery.js scifact-baseline scifact-two-pass
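
The three-mode comparison above can also be generated programmatically. The sketch below only builds and prints the command lines, using the flags documented on this page; `benchArgs` is a hypothetical helper, and its `<dataset>-<mode>` label scheme is an illustrative choice (it yields `scifact-single-prompt` rather than the `scifact-single` label used above).

```javascript
const MODES = ["baseline", "single-prompt", "two-pass"];

// Builds the argument list for one benchmark run.
// `extra` carries any additional flags, for example ["--limit", "5"].
function benchArgs(dataset, mode, extra = []) {
  return [
    `eval/bench-${dataset}.js`,
    "--mode", mode,
    "--label", `${dataset}-${mode}`,
    ...extra,
  ];
}

// One smoke-test command per mode, limited to 5 entries each.
const runs = MODES.map((mode) => benchArgs("scifact", mode, ["--limit", "5"]));
for (const args of runs) {
  console.log(["node", ...args].join(" "));
}
```

To execute the runs instead of printing them, each argument list could be passed to Node's `child_process.spawnSync("node", args, { stdio: "inherit" })`.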

FEVER, VitaminC, and HealthVer examples

    # FEVER with default 500-entry sample
    node eval/bench-fever.js --mode baseline --label fever-baseline
    node eval/bench-fever.js --mode two-pass --label fever-two-pass

    # Full FEVER (no sampling, ~10K entries)
    node eval/bench-fever.js --mode baseline --label fever-full --sample 0

    # VitaminC (contrastive; best test of anchoring)
    node eval/bench-vitaminc.js --mode baseline --label vitaminc-baseline
    node eval/bench-vitaminc.js --mode two-pass --label vitaminc-two-pass

    # HealthVer (small enough to run in full)
    node eval/bench-healthver.js --mode baseline --label healthver-baseline
    node eval/bench-healthver.js --mode two-pass --label healthver-two-pass

Results are saved to eval/results/<label>.json. FEVER and VitaminC default to a 500-entry random sample (--seed 42). Pass --sample 0 to run the full dataset.
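
A fixed seed matters because it lets different modes be compared on the same entries. The sketch below shows one common way to get a reproducible sample: a small seeded PRNG (mulberry32) driving a partial Fisher-Yates shuffle. It is an assumption-laden illustration, not the sampler the benchmark scripts actually use, so it will not reproduce their exact selection.

```javascript
// mulberry32: a small deterministic PRNG returning floats in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Draws `n` items without replacement using a partial Fisher-Yates shuffle.
// n = 0 is treated as "no sampling" and returns every item in original order,
// mirroring how --sample 0 is described above.
function sample(items, n, seed) {
  if (n === 0 || n >= items.length) return items.slice();
  const rand = mulberry32(seed);
  const pool = items.slice();
  for (let i = 0; i < n; i++) {
    const j = i + Math.floor(rand() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, n);
}

// Same seed, same input: the two samples are identical.
const ids = Array.from({ length: 9999 }, (_, i) => i);
const first = sample(ids, 500, 42);
const second = sample(ids, 500, 42);
console.log(first.length, first.every((id, i) => id === second[i])); // 500 true
```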

CLI options

| Flag | Default | Description |
| --- | --- | --- |
| --mode | baseline | Run mode: baseline, single-prompt, or two-pass |
| --label | auto | Label used for the result file name |
| --model | claude-sonnet-4-5-20250929 | Model ID to use |
| --backend | api | api (Anthropic API) or cc (Claude Code CLI) |
| --limit N | 0 (all) | Maximum number of entries to run |
| --offset N | 0 | Skip the first N entries |
| --sample N | 500 (FEVER/VitaminC) | Random sample size |
| --seed N | 42 | Random seed for sampling |
| --verbose | off | Print prompts and API response details |
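
For readers wrapping these scripts, the flags and defaults in the table can be modeled with Node's built-in `util.parseArgs` (the `default` field needs Node 18.11 or later). This is a sketch of one possible parser, not the scripts' own argument handling; `parseBenchArgs` is a hypothetical name, and leaving `--label` and `--sample` undefined when omitted is an assumption standing in for the "auto" and per-dataset defaults.

```javascript
const { parseArgs } = require("node:util");

const MODES = ["baseline", "single-prompt", "two-pass"];

// Parses the documented flags. parseArgs supports only string and boolean
// options, so numeric flags are parsed as strings and converted afterwards.
function parseBenchArgs(argv) {
  const { values } = parseArgs({
    args: argv,
    options: {
      mode: { type: "string", default: "baseline" },
      label: { type: "string" },
      model: { type: "string", default: "claude-sonnet-4-5-20250929" },
      backend: { type: "string", default: "api" },
      limit: { type: "string", default: "0" },
      offset: { type: "string", default: "0" },
      sample: { type: "string" },
      seed: { type: "string", default: "42" },
      verbose: { type: "boolean", default: false },
    },
  });
  if (!MODES.includes(values.mode)) {
    throw new Error(`Unknown --mode: ${values.mode}`);
  }
  return {
    ...values,
    limit: Number(values.limit),
    offset: Number(values.offset),
    seed: Number(values.seed),
    sample: values.sample === undefined ? undefined : Number(values.sample),
  };
}

// In a script this would typically be parseBenchArgs(process.argv.slice(2)).
const opts = parseBenchArgs(["--mode", "two-pass", "--limit", "5", "--verbose"]);
console.log(opts.mode, opts.limit, opts.seed, opts.verbose); // two-pass 5 42 true
```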