Fact-checking benchmarks: SciFact, FEVER, VitaminC, HealthVer

The same two-pass pipeline that detects mismatches between Dafny lemmas and natural-language requirements applies directly to general claim verification against textual evidence. Given any (claim, evidence) pair, the pipeline summarizes the evidence without seeing the claim, then compares that summary to the claim — preventing the model from anchoring on the claim while reading the evidence. This benchmark suite measures whether that structural separation improves verdict accuracy across four datasets: SciFact (biomedical), FEVER (general facts), VitaminC (contrastive Wikipedia edits), and HealthVer (health and COVID-19).

Why two-pass prevents anchoring bias

When a model sees both a claim and evidence at the same time, it tends to read the evidence through the lens of the claim — selectively emphasizing details that confirm or deny it rather than judging what the evidence independently establishes. The two-pass approach closes this off structurally:

Pass 1 (Summarize): The model reads the evidence without seeing the claim and produces a neutral summary of what the evidence establishes.
Pass 2 (Compare): The model compares that summary to the claim and issues a verdict.

Two control modes test whether prompting alone is sufficient:

Baseline: Model sees claim and evidence together, judges directly (one call).
Single-prompt: Model sees both but is instructed to summarize first, then judge (one call). Tests whether a prompt instruction prevents anchoring without structural enforcement.
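
The three modes can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: `callModel` is a hypothetical async function that takes a prompt string and resolves to the model's reply, and the prompt wording and the `VERDICT:` reply format are assumptions made for the example. The point to notice is that the Pass 1 prompt in `runTwoPass` is built without the claim, so the summary cannot be shaped by it.

```javascript
// Sketch only: `callModel` is a hypothetical (prompt) => Promise<string>,
// and the prompt text below is illustrative, not the project's real prompts.
const FORMAT =
  "End your reply with one line: VERDICT: SUPPORTS, VERDICT: REFUTES, or VERDICT: NOT_ENOUGH_INFO.";
const VERDICT_RE = /VERDICT:\s*(SUPPORTS|REFUTES|NOT_ENOUGH_INFO)/;

// Extract the verdict label from a reply; null if the reply has none.
function parseVerdict(reply) {
  const match = VERDICT_RE.exec(reply);
  return match ? match[1] : null;
}

// Baseline: claim and evidence in one prompt, judged directly (one call).
async function runBaseline(callModel, claim, evidence) {
  const reply = await callModel(
    `Evidence:\n${evidence}\n\nClaim:\n${claim}\n\n${FORMAT}`
  );
  return parseVerdict(reply);
}

// Single-prompt: same inputs, plus an instruction to summarize first (one call).
async function runSinglePrompt(callModel, claim, evidence) {
  const reply = await callModel(
    `Evidence:\n${evidence}\n\nClaim:\n${claim}\n\n` +
      `First summarize what the evidence establishes on its own, then judge the claim.\n${FORMAT}`
  );
  return parseVerdict(reply);
}

// Two-pass: structural separation across two calls.
async function runTwoPass(callModel, claim, evidence) {
  // Pass 1 (Summarize): the claim is deliberately absent from this prompt.
  const summary = await callModel(
    `Summarize neutrally what the following text establishes.\n\nText:\n${evidence}`
  );
  // Pass 2 (Compare): the model sees only the summary and the claim.
  const reply = await callModel(
    `Evidence summary:\n${summary}\n\nClaim:\n${claim}\n\n${FORMAT}`
  );
  return parseVerdict(reply);
}
```

Passing `callModel` in as a parameter keeps the sketch independent of any particular backend and makes the separation easy to check with a stub that records the prompts it receives.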

VitaminC is the sharpest test of anchoring resistance: its contrastive pairs are minimal edits to Wikipedia sentences that flip the correct label. A model that anchors on the claim can mistake a one-word change for a non-issue.

Datasets

Dataset	Domain	Entries	Labels	Evidence source
SciFact	Biomedical	321	SUPPORTS / REFUTES / NEI	Research paper abstracts
FEVER	General facts	9,999	SUPPORTS / REFUTES / NEI	Wikipedia sentences
VitaminC	General facts	63,054	SUPPORTS / REFUTES / NEI	Wikipedia (contrastive pairs)
HealthVer	Health / COVID-19	3,740	SUPPORTS / REFUTES / NEI	PubMed evidence sentences

All four datasets use the same three-label scheme: SUPPORTS, REFUTES, and NOT_ENOUGH_INFO (NEI).
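
Because every dataset is scored against the same three labels, a loader will typically need to map each release's raw label strings onto that scheme. The sketch below shows one way to do so. The raw spellings it accepts (for example `CONTRADICT` or `Neutral`) are assumptions about how the upstream releases spell their labels, and `normalizeLabel` is a hypothetical helper, not a function from this repository.

```javascript
// Sketch only: the raw spellings listed here are assumptions about the
// upstream dataset releases; adjust them to match the files you download.
const LABEL_ALIASES = {
  SUPPORTS: ["SUPPORTS", "SUPPORT"],
  REFUTES: ["REFUTES", "REFUTE", "CONTRADICT"],
  NOT_ENOUGH_INFO: ["NOT ENOUGH INFO", "NEI", "NEUTRAL"],
};

// Build a reverse lookup: raw spelling -> canonical label.
const CANONICAL_BY_ALIAS = new Map();
for (const [canonical, aliases] of Object.entries(LABEL_ALIASES)) {
  for (const alias of aliases) CANONICAL_BY_ALIAS.set(alias, canonical);
}

// Map a raw label to SUPPORTS, REFUTES, or NOT_ENOUGH_INFO.
// Throws on anything unrecognized so bad rows are not silently mis-scored.
function normalizeLabel(raw) {
  const key = String(raw).trim().toUpperCase().replace(/[\s_-]+/g, " ");
  const canonical = CANONICAL_BY_ALIAS.get(key);
  if (!canonical) throw new Error(`Unrecognized label: ${raw}`);
  return canonical;
}
```

Throwing on unknown spellings, rather than defaulting to NEI, keeps a labeling mismatch from quietly inflating or deflating accuracy.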

Modes

Mode	API calls	Description
`baseline`	1	Claim and evidence together; model judges directly
`single-prompt`	1	Model instructed to summarize first, then judge
`two-pass`	2	Structural separation: summarize without claim, then compare

Downloading datasets

Download FEVER, VitaminC, and HealthVer

Run the provided download scripts. FEVER is the largest at ~1.6 GB.

bash data/download-fever.sh
bash data/download-vitaminc.sh
bash data/download-healthver.sh

The VitaminC script requires Python’s datasets library. Install it before running download-vitaminc.sh:

pip install datasets

Download SciFact

SciFact is not included in the download scripts. Fetch it directly from the SciFact S3 bucket:

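# Extract into data/scifact without leaving the repository root;
# -f makes curl fail on an HTTP error instead of piping an error page to tar
mkdir -p data/scifact
curl -fsSL https://scifact.s3-us-west-2.amazonaws.com/release/latest/data.tar.gz | tar xz -C data/scifact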

The extracted data should land at data/scifact/data/.

Install Node dependencies

npm install

Running benchmarks

All four benchmark scripts share the same command-line interface, where <name> is scifact, fever, vitaminc, or healthver:

node eval/bench-<name>.js --mode <mode> --label <label> [options]

SciFact examples

# Quick smoke test (5 entries)
node eval/bench-scifact.js --mode baseline --label test --limit 5

# Full comparison across all three modes
node eval/bench-scifact.js --mode baseline --label scifact-baseline
node eval/bench-scifact.js --mode single-prompt --label scifact-single
node eval/bench-scifact.js --mode two-pass --label scifact-two-pass

# Use a different model
node eval/bench-scifact.js --mode two-pass --model claude-opus-4-6 --label scifact-opus

# Compare two result files
node eval/compare-mystery.js scifact-baseline scifact-two-pass
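
When comparing two runs, overall accuracy can hide offsetting changes, so it helps to also count which entries one mode fixed and which it broke. The sketch below illustrates that idea under an assumed result schema: each run is an array of `{ id, gold, predicted }` records. The actual layout of the result files and the output of the comparison script may differ; `accuracy` and `compareRuns` are hypothetical helpers.

```javascript
// Sketch only: assumes each run is an array of { id, gold, predicted }
// records. The real result files may use a different layout.
function accuracy(run) {
  if (run.length === 0) return 0;
  const correct = run.filter((r) => r.predicted === r.gold).length;
  return correct / run.length;
}

// Compare run B against run A on the entries they share (matched by id).
function compareRuns(runA, runB) {
  const bById = new Map(runB.map((r) => [r.id, r]));
  let shared = 0;
  let fixed = 0; // wrong in A, right in B
  let broken = 0; // right in A, wrong in B
  for (const a of runA) {
    const b = bById.get(a.id);
    if (!b) continue; // entry missing from B, e.g. different --limit
    shared++;
    const aOk = a.predicted === a.gold;
    const bOk = b.predicted === b.gold;
    if (!aOk && bOk) fixed++;
    if (aOk && !bOk) broken++;
  }
  return {
    accuracyA: accuracy(runA),
    accuracyB: accuracy(runB),
    shared,
    fixed,
    broken,
  };
}
```

Two runs with identical accuracy can still differ on many individual entries, which is why `fixed` and `broken` are reported separately.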

FEVER, VitaminC, and HealthVer examples

# FEVER with default 500-entry sample
node eval/bench-fever.js --mode baseline --label fever-baseline
node eval/bench-fever.js --mode two-pass --label fever-two-pass

# Full FEVER (no sampling, ~10K entries)
node eval/bench-fever.js --mode baseline --label fever-full --sample 0

# VitaminC (contrastive — best test of anchoring)
node eval/bench-vitaminc.js --mode baseline --label vitaminc-baseline
node eval/bench-vitaminc.js --mode two-pass --label vitaminc-two-pass

# HealthVer (small enough to run in full)
node eval/bench-healthver.js --mode baseline --label healthver-baseline
node eval/bench-healthver.js --mode two-pass --label healthver-two-pass

Results are saved to eval/results/<label>.json. FEVER and VitaminC default to a 500-entry random sample drawn with seed 42 (override with --sample and --seed). Pass --sample 0 to run the full dataset.
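
A fixed seed matters because it lets every mode be evaluated on the same entries, so differences between runs reflect the mode and not the sample. The sketch below shows one common way to get a reproducible sample in JavaScript: a small seeded generator (mulberry32) driving a partial Fisher-Yates shuffle. It is an assumption-laden illustration; the benchmark scripts may use a different generator or sampling method, and `sampleEntries` is a hypothetical helper.

```javascript
// Sketch only: the benchmark scripts' real sampling method is not shown in
// this page; this is one standard way to make a sample reproducible.

// mulberry32: a small deterministic PRNG returning floats in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Return `sampleSize` distinct entries chosen with the given seed.
// A sampleSize of 0 means "no sampling": return every entry.
function sampleEntries(entries, sampleSize, seed = 42) {
  if (sampleSize === 0 || sampleSize >= entries.length) return entries.slice();
  const rand = mulberry32(seed);
  const pool = entries.slice(); // never mutate the caller's array
  // Partial Fisher-Yates: only the first `sampleSize` slots need shuffling.
  for (let i = 0; i < sampleSize; i++) {
    const j = i + Math.floor(rand() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, sampleSize);
}
```

Calling `sampleEntries(entries, 500, 42)` twice on the same input yields the same 500 entries in the same order, which is the property that makes cross-mode comparisons meaningful.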

CLI options

Flag	Default	Description
`--mode`	`baseline`	Run mode: `baseline`, `single-prompt`, or `two-pass`
`--label`	auto	Label used for the result file name
`--model`	`claude-sonnet-4-5-20250929`	Model ID to use
`--backend`	`api`	`api` (Anthropic API) or `cc` (Claude Code CLI)
`--limit N`	0 (all)	Maximum number of entries to run
`--offset N`	0	Skip the first N entries
`--sample N`	500 (FEVER/VitaminC)	Random sample size
`--seed N`	42	Random seed for sampling
`--verbose`	off	Print prompts and API response details
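
For readers who want to see how these flags and defaults fit together, the sketch below parses them with Node's built-in `util.parseArgs` (available in Node 18.3 and later, with `default` support from 18.11). It is an illustration of the table above, not the scripts' actual parser. `parseBenchArgs` and `applyWindow` are hypothetical helpers, and the order in which `--offset` and `--limit` are applied is an assumption.

```javascript
// Sketch only: mirrors the options table; the real scripts may parse
// arguments differently.
const { parseArgs } = require("node:util");

const MODES = ["baseline", "single-prompt", "two-pass"];

// Convert a flag value to a non-negative integer or throw.
function toCount(name, text) {
  const n = Number(text);
  if (!Number.isInteger(n) || n < 0) {
    throw new Error(`--${name} expects a non-negative integer, got "${text}"`);
  }
  return n;
}

// `defaultSample` would be 500 for FEVER/VitaminC and 0 (no sampling) otherwise.
function parseBenchArgs(argv, defaultSample = 0) {
  const { values } = parseArgs({
    args: argv,
    options: {
      mode: { type: "string", default: "baseline" },
      label: { type: "string" },
      model: { type: "string", default: "claude-sonnet-4-5-20250929" },
      backend: { type: "string", default: "api" },
      limit: { type: "string", default: "0" },
      offset: { type: "string", default: "0" },
      sample: { type: "string" },
      seed: { type: "string", default: "42" },
      verbose: { type: "boolean", default: false },
    },
  });
  if (!MODES.includes(values.mode)) {
    throw new Error(`--mode must be one of: ${MODES.join(", ")}`);
  }
  return {
    mode: values.mode,
    label: values.label ?? null, // null stands in for the auto-generated label
    model: values.model,
    backend: values.backend,
    limit: toCount("limit", values.limit),
    offset: toCount("offset", values.offset),
    sample: values.sample === undefined ? defaultSample : toCount("sample", values.sample),
    seed: toCount("seed", values.seed),
    verbose: values.verbose,
  };
}

// Assumed semantics: skip `offset` entries, then keep at most `limit` (0 = all).
function applyWindow(entries, { offset, limit }) {
  const end = limit > 0 ? offset + limit : undefined;
  return entries.slice(offset, end);
}
```

For example, `parseBenchArgs(["--mode", "two-pass", "--limit", "5"])` yields `mode: "two-pass"` and `limit: 5`, with every other option left at its default.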
