Halgorithem ships with two benchmark approaches. The first is a unit benchmark (bench.py) that runs 8 pre-labeled claims about BASIC programming language history against local truth files, giving you an immediate accuracy percentage with no external dependencies. The second is an Engine-based real-world test (test.py) that calls an LLM to generate text, scrapes Wikipedia pages for source material, then runs the full hallucination detection pipeline end to end. Both benchmarks use the same compare_with_reasoning API under the hood.
Unit benchmark (bench.py)
The unit benchmark is self-contained: it reads source text from the sources/ directory and classifies 8 fixed claims, printing an accuracy report when finished.
Activate your virtual environment
Before running any benchmark, activate the virtual environment you created during installation.
Run the benchmark
From the project root, execute bench.py directly with Python. The script loads sources/basic.txt and sources/basic2.txt, classifies each of the 8 test claims, and prints a report. A clean run with no mismatches looks like this:
Interpret the output
If any claim is misclassified, a block like the following appears between the two separator rows:
Each mismatch shows the claim text, the expected label, the label Halgorithem predicted, and the raw similarity score. Use these details to decide whether to adjust the threshold parameter.
How bench.py works
The benchmark instantiates Halgorithm with small chunks (2 sentences per chunk, with a 1-sentence overlap between consecutive chunks) and calls compare_with_reasoning once for each claim.
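To make the chunk setting concrete, here is a minimal sketch of what "2 sentences, 1 sentence overlap" can mean in practice. It is not Halgorithem's own chunker: the function name `chunk_sentences`, its parameters, and the naive regex sentence splitter are all assumptions made for illustration.

```python
import re


def chunk_sentences(text, size=2, overlap=1):
    """Split text into windows of `size` sentences that share `overlap` sentences.

    Hypothetical helper for illustration only; Halgorithem's real chunking
    logic may split sentences and handle edge cases differently.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    step = size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + size]))
        if start + size >= len(sentences):
            break  # this window already reaches the end of the text
    return chunks


sample = "BASIC appeared in 1964. It was designed at Dartmouth. It targeted beginners."
chunks = chunk_sentences(sample)
for chunk in chunks:
    print(chunk)
```

With three sentences, this yields two chunks, and the middle sentence appears in both. The overlap keeps a fact that straddles a sentence boundary from being split across chunks that never see each other.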
compare_with_reasoning returns a list of result dicts — one per meaningful claim found in ai_output. Each dict contains at minimum a status field (SUPPORTED, WEAK_SUPPORT, HALLUCINATION, or CONTRADICTION) and a score field with the best cosine similarity found across all source chunks.
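As a rough illustration of how result dicts of this shape can be turned into an accuracy percentage and a mismatch list, consider the sketch below. It does not reproduce bench.py: the helper `accuracy_report`, the sample results, and the assumption of exactly one result per expected label are hypothetical. Only the `status` and `score` fields come from the description above.

```python
def accuracy_report(expected_labels, results):
    """Compare expected labels with the `status` of each result dict.

    Returns (accuracy_percent, mismatches). Assumes one result per label,
    which is an assumption of this sketch rather than documented behavior.
    """
    if len(expected_labels) != len(results):
        raise ValueError("expected_labels and results must have the same length")
    mismatches = []
    for index, (expected, result) in enumerate(zip(expected_labels, results)):
        if result["status"] != expected:
            mismatches.append({
                "index": index,
                "expected": expected,
                "predicted": result["status"],
                "score": result["score"],
            })
    correct = len(results) - len(mismatches)
    accuracy = 100.0 * correct / len(results) if results else 0.0
    return accuracy, mismatches


# Hypothetical stand-ins for the dicts returned by compare_with_reasoning.
results = [
    {"status": "SUPPORTED", "score": 0.91},
    {"status": "HALLUCINATION", "score": 0.12},
    {"status": "WEAK_SUPPORT", "score": 0.48},
    {"status": "SUPPORTED", "score": 0.77},
]
expected = ["SUPPORTED", "HALLUCINATION", "SUPPORTED", "SUPPORTED"]

accuracy, mismatches = accuracy_report(expected, results)
print(f"Accuracy: {accuracy:.1f}%")
for item in mismatches:
    print(item)
```

Keeping the raw score next to each mismatch is what makes threshold tuning practical: a WEAK_SUPPORT result whose score sits just under the cutoff suggests a different fix than one that is far below it.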
Source files used
| File | Contents |
|---|---|
| sources/basic.txt | Primary reference text about BASIC programming language history |
| sources/basic2.txt | Secondary reference text with additional BASIC history details |
Engine-based real-world test (test.py)
The test.py script runs a full end-to-end test: it sends a prompt to an LLM, scrapes two Apollo 11 reference pages, and verifies every claim in the generated response.
The run function returns a dict with the following keys:
| Key | Description |
|---|---|
| ai_output | The raw text generated by the LLM |
| sources | List of URLs that were scraped for source material |
| summary | Human-readable report of supported and flagged claims |
| claims | List of per-claim result dicts from compare_with_reasoning |
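A short sketch of how a caller might consume a dict with these keys follows. The `report` value is a hand-written stand-in, not real output from test.py, and the URLs and text are placeholders. Treating HALLUCINATION and CONTRADICTION as the "flagged" statuses is also an assumption of this sketch.

```python
# Assumption: these two statuses are the ones worth flagging for review.
FLAGGED_STATUSES = {"HALLUCINATION", "CONTRADICTION"}


def flagged_claims(report):
    """Return the per-claim result dicts whose status is in FLAGGED_STATUSES."""
    return [claim for claim in report["claims"] if claim.get("status") in FLAGGED_STATUSES]


# Hypothetical stand-in for the dict returned by run; values are placeholders.
report = {
    "ai_output": "placeholder LLM text",
    "sources": ["https://example.org/source-a", "https://example.org/source-b"],
    "summary": "placeholder summary",
    "claims": [
        {"status": "SUPPORTED", "score": 0.88},
        {"status": "HALLUCINATION", "score": 0.09},
        {"status": "WEAK_SUPPORT", "score": 0.51},
    ],
}

flagged = flagged_claims(report)
print(f"{len(flagged)} of {len(report['claims'])} claims flagged")
print(f"Sources scraped: {len(report['sources'])}")
```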
The Engine-based test (test.py) requires a valid OPENAI_API_KEY environment variable to generate the AI output. Set it before running test.py. The unit benchmark (bench.py) has no such requirement and can be run offline.
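If you want to fail fast when the key is missing, a small pre-flight check along these lines can help. The helper name `has_openai_key` is hypothetical and not part of Halgorithem. It only confirms that the variable is present and non-empty, not that the key is valid.

```python
import os


def has_openai_key(environ=os.environ):
    """Return True when OPENAI_API_KEY is set to a non-empty value."""
    return bool(environ.get("OPENAI_API_KEY", "").strip())


if not has_openai_key():
    print("OPENAI_API_KEY is not set; test.py requires it to generate the AI output.")
```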