Running the Halgorithem benchmark suite and test scripts

Halgorithem ships with two benchmark approaches. The first is a unit benchmark (bench.py) that runs 8 pre-labeled claims about BASIC programming language history against local truth files, giving you an immediate accuracy percentage with no external dependencies. The second is an Engine-based real-world test (test.py) that calls an LLM to generate text, scrapes reference web pages (Wikipedia and Britannica) for source material, then runs the full hallucination detection pipeline end to end. Both benchmarks use the same compare_with_reasoning API under the hood.

Unit benchmark (bench.py)

The unit benchmark is self-contained: it reads source text from the sources/ directory and classifies 8 fixed claims, printing an accuracy report when finished.

Activate your virtual environment

Before running any benchmark, activate the virtual environment you created during installation.

source venv/bin/activate

Run the benchmark

From the project root, execute bench.py directly with Python.

python bench.py

The script loads sources/basic.txt and sources/basic2.txt, classifies each of the 8 test claims, and prints a report. A clean run with no mismatches looks like this:

Benchmark Report
================================================================================

================================================================================
Accuracy: 100.0%
================================================================================

Interpret the output

If any claim is misclassified, a block like the following appears between the two separator rows:

--------------------------------------------------------------------------------
Claim: BASIC was created by NASA
Expected: HALLUCINATION
Predicted: WEAK_SUPPORT
Score: 0.312

Each mismatch shows the claim text, what label was expected, what Halgorithem predicted, and the raw similarity score. Use these details to decide whether to adjust the threshold parameter.
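The report layout shown above can be approximated with a small helper. This is a minimal sketch, not the code in bench.py: the `Case` tuple, the `build_report` function, the 80-character separator width, and the sample score of 0.85 are all assumptions made for illustration.

```python
from typing import NamedTuple

SEPARATOR = "=" * 80  # assumed width of the report separator rows
DIVIDER = "-" * 80    # assumed width of the mismatch divider row


class Case(NamedTuple):
    """One benchmark claim with its expected and predicted labels (hypothetical type)."""
    claim: str
    expected: str
    predicted: str
    score: float


def build_report(cases: list[Case]) -> str:
    """List only the mismatched cases, then the overall accuracy."""
    lines = ["Benchmark Report", SEPARATOR]
    correct = 0
    for case in cases:
        if case.predicted == case.expected:
            correct += 1
            continue
        lines += [
            DIVIDER,
            f"Claim: {case.claim}",
            f"Expected: {case.expected}",
            f"Predicted: {case.predicted}",
            f"Score: {case.score:.3f}",
        ]
    accuracy = 100.0 * correct / len(cases) if cases else 0.0
    lines += ["", SEPARATOR, f"Accuracy: {accuracy:.1f}%", SEPARATOR]
    return "\n".join(lines)


# Illustrative data: the second case mirrors the mismatch example above,
# the first case and its score are made up.
cases = [
    Case("BASIC was developed in 1964", "SUPPORTED", "SUPPORTED", 0.85),
    Case("BASIC was created by NASA", "HALLUCINATION", "WEAK_SUPPORT", 0.312),
]
report = build_report(cases)
print(report)
```

Accuracy here is simply the share of claims whose predicted label equals the expected label, so one mismatch out of the 8 benchmark claims would lower the figure by 12.5 percentage points.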

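The default threshold is 0.30. The threshold acts as a cutoff on the similarity score: in the mismatch above, a score of 0.312 sits just above 0.30 and is labeled WEAK_SUPPORT rather than HALLUCINATION. Raising the threshold makes Halgorithem more aggressive, so more borderline claims are flagged as HALLUCINATION. Lowering it makes the classifier more lenient, so borderline claims move from HALLUCINATION to WEAK_SUPPORT. Start with 0.30 and tune based on the mismatch output.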
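To get a feel for how the threshold interacts with scores, the sketch below sweeps a few candidate values over a handful of similarity scores. It assumes that a claim is flagged when its score falls below the threshold, which is what the 0.312 / WEAK_SUPPORT example suggests. The `label_at` function and the `NOT_FLAGGED` label are hypothetical simplifications; the real classifier distinguishes more statuses.

```python
def label_at(score: float, threshold: float) -> str:
    """Assumed rule: scores below the threshold are flagged as HALLUCINATION."""
    return "HALLUCINATION" if score < threshold else "NOT_FLAGGED"


# Illustrative scores; 0.312 comes from the mismatch example, the rest are made up.
scores = [0.12, 0.28, 0.312, 0.47]

for threshold in (0.25, 0.30, 0.35):
    flagged = [s for s in scores if label_at(s, threshold) == "HALLUCINATION"]
    print(f"threshold={threshold:.2f} flagged={len(flagged)} of {len(scores)}")
```

Under this assumption the number of flagged scores can only grow as the threshold rises, so a claim scored at 0.312 is flagged at 0.35 but not at 0.30.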

How bench.py works

The benchmark instantiates Halgorithm with small chunks (2 sentences, 1 sentence overlap) and calls compare_with_reasoning for each claim:

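from Halgorithem import Halgorithm

# Small chunks: 2 sentences per chunk, overlapping by 1 sentence
algo = Halgorithm(sentences_per_chunk=2, sentence_overlap=1)

# Check one claim against both local truth files
results = algo.compare_with_reasoning(
    truth_file_paths=["sources/basic.txt", "sources/basic2.txt"],
    ai_output="BASIC was developed in 1964",
    threshold=0.30
)

# Each result dict carries a status label and a similarity score
for result in results:
    print(result["status"], result["score"])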

compare_with_reasoning returns a list of result dicts, one per meaningful claim found in ai_output. Each dict contains at least a status field (SUPPORTED, WEAK_SUPPORT, HALLUCINATION, or CONTRADICTION) and a score field holding the best cosine similarity found across all source chunks.
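Because each result is a plain dict, the returned list is easy to summarize with the standard library. The sketch below uses hypothetical result dicts containing only the `status` and `score` fields described above; real results may carry additional fields, and the scores shown are made up.

```python
from collections import Counter

# Hypothetical results shaped like the dicts described above.
results = [
    {"status": "SUPPORTED", "score": 0.81},
    {"status": "HALLUCINATION", "score": 0.12},
    {"status": "WEAK_SUPPORT", "score": 0.312},
]

# Count how many claims fall under each status label.
counts = Counter(r["status"] for r in results)

# Collect the claims that most need manual review.
flagged = [r for r in results if r["status"] in {"HALLUCINATION", "CONTRADICTION"}]

# Find the claim with the weakest match against the sources.
lowest = min(results, key=lambda r: r["score"])

print(dict(counts))
print(f"{len(flagged)} flagged, lowest score {lowest['score']:.3f}")
```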

Source files used

File	Contents
`sources/basic.txt`	Primary reference text about BASIC programming language history
`sources/basic2.txt`	Secondary reference text with additional BASIC history details

Engine-based real-world test (test.py)

The test.py script runs a full end-to-end test: it sends a prompt to an LLM, scrapes two Apollo 11 reference pages, and verifies every claim in the generated response.

from engine import run

result = run(
    prompt="What was the Apollo 11 mission? ...",
    urls=[
        "https://en.wikipedia.org/wiki/Apollo_11",
        "https://www.britannica.com/event/Apollo-11"
    ],
    threshold=0.30
)
print(result["summary"])

The run function returns a dict with the following keys:

Key	Description
`ai_output`	The raw text generated by the LLM
`sources`	List of URLs that were scraped for source material
`summary`	Human-readable report of supported and flagged claims
`claims`	List of per-claim result dicts from `compare_with_reasoning`
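A caller will often want to check that all four keys are present and then pull out the claims that were not fully supported. The following is a minimal sketch under that assumption: `flagged_claims` is a hypothetical helper, and the sample dict uses placeholder values rather than real output from `run`.

```python
EXPECTED_KEYS = {"ai_output", "sources", "summary", "claims"}


def flagged_claims(result: dict) -> list[dict]:
    """Return every claim whose status is anything other than SUPPORTED."""
    missing = EXPECTED_KEYS - result.keys()
    if missing:
        raise KeyError(f"run() result is missing keys: {sorted(missing)}")
    return [c for c in result["claims"] if c["status"] != "SUPPORTED"]


# Placeholder data shaped like the documented return value.
sample = {
    "ai_output": "(generated text)",
    "sources": ["https://en.wikipedia.org/wiki/Apollo_11"],
    "summary": "(human-readable report)",
    "claims": [
        {"status": "SUPPORTED", "score": 0.78},
        {"status": "HALLUCINATION", "score": 0.14},
    ],
}

for claim in flagged_claims(sample):
    print(claim["status"], claim["score"])
```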

Run it from the project root:

python test.py

The Engine-based test (test.py) requires a valid OPENAI_API_KEY environment variable to generate the AI output. Set it before running:

export OPENAI_API_KEY="sk-..."
python test.py

The unit benchmark (bench.py) has no such requirement and can be run offline.
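If you would rather fail early with a clear message than hit an error partway through test.py, a small pre-flight check can confirm the variable is set. This is a sketch only; `has_api_key` is a hypothetical helper and is not part of Halgorithem.

```python
import os


def has_api_key(env=None) -> bool:
    """Return True when OPENAI_API_KEY is present and not blank."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if __name__ == "__main__":
    if has_api_key():
        print("OPENAI_API_KEY is set; test.py can call the LLM.")
    else:
        print("OPENAI_API_KEY is not set; export it before running test.py.")
```

The optional `env` argument lets you pass a plain dict in place of `os.environ`, which keeps the check easy to exercise without touching the real environment.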
