Halgorithem ships with two benchmark approaches. The first is a unit benchmark (bench.py) that runs 8 pre-labeled claims about the history of the BASIC programming language against local truth files and reports an accuracy percentage; it needs no API key and can be run offline. The second is an Engine-based real-world test (test.py) that calls an LLM to generate text, scrapes reference web pages for source material, and then runs the full hallucination detection pipeline end to end. Both use the same compare_with_reasoning API under the hood.

Unit benchmark (bench.py)

The unit benchmark is self-contained: it reads source text from the sources/ directory and classifies 8 fixed claims, printing an accuracy report when finished.

1. Activate your virtual environment

Before running any benchmark, activate the virtual environment you created during installation.
source venv/bin/activate

2. Run the benchmark

From the project root, execute bench.py directly with Python.
python bench.py
The script loads sources/basic.txt and sources/basic2.txt, classifies each of the 8 test claims, and prints a report. A clean run with no mismatches looks like this:
Benchmark Report
================================================================================

================================================================================
Accuracy: 100.0%
================================================================================

3. Interpret the output

If any claim is misclassified, a block like the following appears between the two separator rows:
--------------------------------------------------------------------------------
Claim: BASIC was created by NASA
Expected: HALLUCINATION
Predicted: WEAK_SUPPORT
Score: 0.312
Each mismatch shows the claim text, what label was expected, what Halgorithem predicted, and the raw similarity score. Use these details to decide whether to adjust the threshold parameter.
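The default threshold is 0.30. Raising it makes Halgorithem more aggressive: more borderline claims fall below the cutoff and are flagged as HALLUCINATION. Lowering it makes the classifier more lenient: borderline claims move from HALLUCINATION to WEAK_SUPPORT. In the mismatch shown above, the score of 0.312 sits just over 0.30, so only a threshold above 0.312 would flag that claim. Start with 0.30 and tune based on the mismatch output.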
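To make the effect of the threshold concrete, the sketch below applies a plain cutoff to the score from the mismatch block above. The `is_flagged` helper is hypothetical and assumes only that a score below the threshold is labeled HALLUCINATION; how Halgorithem separates SUPPORTED from WEAK_SUPPORT or detects CONTRADICTION is not modeled here.

```python
def is_flagged(score: float, threshold: float = 0.30) -> bool:
    """Hypothetical helper: treat a score below the threshold as HALLUCINATION."""
    return score < threshold


score = 0.312  # score taken from the mismatch block above
for threshold in (0.25, 0.30, 0.35):
    label = "HALLUCINATION" if is_flagged(score, threshold) else "not flagged"
    print(f"threshold={threshold:.2f} -> {label}")
```

Under that assumption, only the 0.35 setting flags the claim, which is why a borderline mismatch like this one is a prompt to try a slightly higher threshold.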

How bench.py works

The benchmark instantiates Halgorithm with small chunks (2 sentences, 1 sentence overlap) and calls compare_with_reasoning for each claim:
from Halgorithem import Halgorithm

algo = Halgorithm(sentences_per_chunk=2, sentence_overlap=1)

results = algo.compare_with_reasoning(
    truth_file_paths=["sources/basic.txt", "sources/basic2.txt"],
    ai_output="BASIC was developed in 1964",
    threshold=0.30
)
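The chunking parameters can be pictured as a sliding window over sentences. The following is a minimal sketch, not Halgorithem's actual implementation: `chunk_sentences` is a hypothetical helper, and it assumes the text has already been split into sentences.

```python
from typing import List


def chunk_sentences(sentences: List[str], sentences_per_chunk: int = 2,
                    sentence_overlap: int = 1) -> List[str]:
    """Hypothetical helper: group sentences into overlapping chunks."""
    if sentence_overlap >= sentences_per_chunk:
        raise ValueError("sentence_overlap must be smaller than sentences_per_chunk")
    step = sentences_per_chunk - sentence_overlap
    chunks = []
    for start in range(0, len(sentences), step):
        window = sentences[start:start + sentences_per_chunk]
        chunks.append(" ".join(window))
        if start + sentences_per_chunk >= len(sentences):
            break  # this window already reached the end of the text
    return chunks


# Placeholder sentences; real input would come from the files in sources/.
sentences = ["First sentence.", "Second sentence.", "Third sentence."]
chunks = chunk_sentences(sentences)
print(chunks)
```

With 2 sentences per chunk and 1 sentence of overlap, every adjacent pair of sentences ends up together in some chunk, so a claim that depends on two neighboring sentences can still be matched against a single chunk.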
compare_with_reasoning returns a list of result dicts — one per meaningful claim found in ai_output. Each dict contains at minimum a status field (SUPPORTED, WEAK_SUPPORT, HALLUCINATION, or CONTRADICTION) and a score field with the best cosine similarity found across all source chunks.
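Given that result shape, a benchmark loop can be sketched as follows. This is an illustration rather than the contents of bench.py: `run_benchmark` and the `predict` callable are hypothetical, the expected labels and scores are made up, and the stub predictor stands in for a call to compare_with_reasoning so the example runs without the source files.

```python
from typing import Callable, Dict, List, Tuple

SEP = "=" * 80


def run_benchmark(cases: List[Tuple[str, str]],
                  predict: Callable[[str], Dict]) -> float:
    """Hypothetical loop: compare each predicted status with its expected label."""
    correct = 0
    print("Benchmark Report")
    print(SEP)
    for claim, expected in cases:
        result = predict(claim)  # must provide "status" and "score"
        if result["status"] == expected:
            correct += 1
            continue
        # Only mismatches are printed, mirroring the report format above.
        print("-" * 80)
        print(f"Claim: {claim}")
        print(f"Expected: {expected}")
        print(f"Predicted: {result['status']}")
        print(f"Score: {result['score']:.3f}")
    accuracy = 100.0 * correct / len(cases) if cases else 0.0
    print(SEP)
    print(f"Accuracy: {accuracy:.1f}%")
    print(SEP)
    return accuracy


def stub_predict(claim: str) -> Dict:
    """Stub with made-up scores, used only to exercise the loop."""
    if "NASA" in claim:
        return {"status": "WEAK_SUPPORT", "score": 0.312}
    return {"status": "SUPPORTED", "score": 0.9}


# Illustrative cases; the labels here are assumptions, not the bench.py data.
cases = [
    ("BASIC was developed in 1964", "SUPPORTED"),
    ("BASIC was created by NASA", "HALLUCINATION"),
]
accuracy = run_benchmark(cases, stub_predict)
```

Passing the predictor in as a callable keeps the loop independent of how claims are classified, so the same reporting code could wrap a real call to compare_with_reasoning.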

Source files used

| File | Contents |
| --- | --- |
| sources/basic.txt | Primary reference text about BASIC programming language history |
| sources/basic2.txt | Secondary reference text with additional BASIC history details |

Engine-based real-world test (test.py)

The test.py script runs a full end-to-end test: it sends a prompt to an LLM, scrapes two Apollo 11 reference pages, and verifies every claim in the generated response.
from engine import run

result = run(
    prompt="What was the Apollo 11 mission? ...",
    urls=[
        "https://en.wikipedia.org/wiki/Apollo_11",
        "https://www.britannica.com/event/Apollo-11"
    ],
    threshold=0.30
)
print(result["summary"])
The run function returns a dict with the following keys:
| Key | Description |
| --- | --- |
| ai_output | The raw text generated by the LLM |
| sources | List of URLs that were scraped for source material |
| summary | Human-readable report of supported and flagged claims |
| claims | List of per-claim result dicts from compare_with_reasoning |
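The claims list lends itself to a quick tally by status. The sketch below works on a hand-written dict that mirrors the documented keys; the values are placeholders rather than real output, and `count_by_status` is a hypothetical helper, not part of the engine module.

```python
from collections import Counter
from typing import Dict


def count_by_status(result: Dict) -> Counter:
    """Hypothetical helper: tally the status field of each claim dict."""
    return Counter(claim["status"] for claim in result["claims"])


# Placeholder data shaped like the documented return value of run().
sample_result = {
    "ai_output": "...",
    "sources": ["https://en.wikipedia.org/wiki/Apollo_11"],
    "summary": "...",
    "claims": [
        {"status": "SUPPORTED", "score": 0.82},
        {"status": "HALLUCINATION", "score": 0.12},
        {"status": "SUPPORTED", "score": 0.71},
    ],
}

counts = count_by_status(sample_result)
# Anything that is not fully supported is collected for manual review.
flagged = [c for c in sample_result["claims"] if c["status"] != "SUPPORTED"]
print(dict(counts))
print(f"{len(flagged)} claim(s) need review")
```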
Run it from the project root:
python test.py
The Engine-based test (test.py) requires a valid OPENAI_API_KEY environment variable to generate the AI output. Set it before running:
export OPENAI_API_KEY="sk-..."
python test.py
The unit benchmark (bench.py) has no such requirement and can be run offline.
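A small pre-flight check can fail fast when the key is missing. This is only a sketch: `require_api_key` is a hypothetical helper and is not part of test.py.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: return OPENAI_API_KEY or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it before running test.py")
    return key


# Explicit mappings are used here so the demo does not depend on the shell.
print(require_api_key({"OPENAI_API_KEY": "sk-..."}))
try:
    require_api_key({})
except RuntimeError as exc:
    print(f"error: {exc}")
```

Calling `require_api_key()` with no argument reads the real environment, which is the form a script would use before invoking the engine.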
