Offline metrics evaluate a response that has already been generated — they do not require hooking into the LLM generation process. You provide a query, an answer, and the retrieved documents, and TrustifAI scores the answer across four orthogonal dimensions of trustworthiness. Each metric is independent, configurable, and can be individually enabled or disabled based on your use case.
Metric overview
| Metric | What it measures | Default weight | Primary signal |
|---|---|---|---|
| Evidence Coverage | Factual grounding of every claim | 0.40 | Hallucination detection |
| Semantic Drift | Topical alignment with source documents | 0.30 | Topic drift detection |
| Epistemic Consistency | Stability across re-generations | 0.20 | Model inconsistency |
| Source Diversity | Breadth of sources used | 0.10 | Over-reliance on single source |
Evidence Coverage
Evidence Coverage checks whether every claim in the response is actually supported by the retrieved documents. It is the most heavily weighted metric by default (0.40) because unsupported claims are the most direct form of hallucination.
How it works: The answer is passed to an LLM alongside the retrieved documents and the original query. The LLM performs sentence-level entailment checking: it breaks the answer into individual sentences and labels each one `supported: true` or `supported: false` relative to the document content. The final score is the fraction of supported sentences.
```
Evidence Coverage score = supported_sentences / total_sentences
```
This LLM-based NLI (Natural Language Inference) approach is more robust than embedding similarity alone because it reasons about meaning, not just vector proximity.
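To make the loop concrete, here is a minimal sketch. `check_entailment` is a hypothetical stand-in for the LLM entailment call, replaced with naive token overlap only so the snippet runs end to end; it is not TrustifAI's implementation.

```python
# Illustrative sketch of Evidence Coverage scoring, not TrustifAI's internals.

def check_entailment(sentence: str, documents: list[str]) -> bool:
    """Hypothetical stand-in for the LLM entailment call.

    The real check prompts an LLM to label the sentence supported/unsupported
    against the documents; naive token overlap is used here so the sketch runs.
    """
    tokens = set(sentence.lower().split())
    return any(len(tokens & set(doc.lower().split())) > len(tokens) // 2
               for doc in documents)

def evidence_coverage_score(answer_sentences: list[str],
                            documents: list[str]) -> float:
    # Score = supported_sentences / total_sentences.
    supported = sum(check_entailment(s, documents) for s in answer_sentences)
    return supported / len(answer_sentences) if answer_sentences else 1.0
```

With five of six sentences supported, the score is 5/6 ≈ 0.83, matching the Result details example below.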
Threshold labels
| Score | Label | Interpretation |
|---|---|---|
| ≥ 0.85 | Strong Grounding | All or nearly all claims are document-supported |
| ≥ 0.60 | Partial Grounding | Some claims may not be fully supported |
| < 0.60 | Likely Hallucinated Answer | Many claims lack source support |
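Label assignment is a plain threshold comparison; a minimal sketch, with cutoffs matching the STRONG_GROUNDING and PARTIAL_GROUNDING keys from the configuration below:

```python
def coverage_label(score: float, strong: float = 0.85, partial: float = 0.60) -> str:
    # Cutoffs correspond to STRONG_GROUNDING / PARTIAL_GROUNDING in the config.
    if score >= strong:
        return "Strong Grounding"
    if score >= partial:
        return "Partial Grounding"
    return "Likely Hallucinated Answer"
```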
Configuration
```yaml
metrics:
  - type: "evidence_coverage"
    enabled: true
    params:
      strategy: "llm"  # LLM-based entailment (default)
      STRONG_GROUNDING: 0.85
      PARTIAL_GROUNDING: 0.60
```
Result details
```json
{
  "score": 0.83,
  "label": "Partial Grounding",
  "details": {
    "total_sentences": 6,
    "supported_sentences": 5,
    "unsupported_sentences": ["The company was founded in 1998."],
    "failed_checks": 0
  }
}
```
Semantic Drift
Semantic Drift measures how closely the response stays within the semantic envelope of the retrieved documents. A high-scoring response is topically close to the source material; a low-scoring one has drifted into territory not covered by the documents.
How it works: TrustifAI splits each document into individual sentences and embeds them all. It then computes the cosine similarity between the answer embedding (and, separately, the query embedding) and each sentence embedding, keeping the best match. The final score reflects the peak semantic alignment between the answer and the document corpus.
```
Semantic Drift score = max cosine_similarity(answer_emb, sentence_emb_i)
                       across all document sentences
```
Semantic Drift uses embedding similarity, not LLM reasoning. It catches cases where the response is topically off-base, but will not catch factual errors within the same topic. Use it alongside Evidence Coverage for full hallucination coverage.
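As a minimal sketch of the computation, assuming the answer and document sentences have already been embedded (the embedding model itself is a configuration detail not shown here):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_drift_score(answer_emb: np.ndarray,
                         sentence_embs: list[np.ndarray]) -> float:
    # Peak alignment: the single best match across all document sentences.
    return max(cosine_similarity(answer_emb, s) for s in sentence_embs)
```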
Threshold labels
| Score | Label | Interpretation |
|---|---|---|
| ≥ 0.85 | Strong Alignment | Answer is semantically well-anchored in the documents |
| ≥ 0.60 | Partial Alignment | Some claims may not align with source documents |
| < 0.60 | Likely Hallucinated Answer | Answer diverges significantly from source content |
Configuration
```yaml
metrics:
  - type: "semantic_drift"
    enabled: true
    params:
      STRONG_ALIGNMENT: 0.85
      PARTIAL_ALIGNMENT: 0.60
```
Result details
```json
{
  "score": 0.91,
  "label": "Strong Alignment",
  "details": {
    "total_documents": 3,
    "total_sentences_checked": 24,
    "best_matching_sentence": "New Delhi is the capital of India.",
    "explanation": "Answer semantically aligned with source documents."
  }
}
```
Epistemic Consistency
Epistemic Consistency measures how stable the LLM’s responses are when asked the same question multiple times with different random seeds. Hallucinated answers tend to vary wildly between runs — a model that is genuinely confident produces semantically similar responses even under stochastic conditions.
How it works: TrustifAI generates k additional responses to the same query at elevated temperatures (randomly sampled from [0.7, 0.8, 0.9, 1.0]). Each sample is embedded and compared to the original answer using cosine similarity. The final score is the mean cosine similarity across all samples; higher scores indicate lower semantic variance between runs.
```
Epistemic Consistency score = mean(cosine_similarity(original_answer, sample_i))
                              for i in 1..k
```
The number of samples k is controlled by `k_samples` in your config. Setting `k_samples: 0` skips generation and assumes full consistency (score = 1.0), which is useful for cost-sensitive pipelines.
Epistemic Consistency makes k additional LLM calls per evaluation. This increases both latency and API cost proportionally. Start with `k_samples: 3` and increase only if you need higher confidence in the stability estimate.
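A sketch of the aggregation step, assuming the original answer and the k re-generated samples have already been embedded (the generation and embedding calls are outside this snippet):

```python
import numpy as np

def consistency_score(original_emb: np.ndarray,
                      sample_embs: list[np.ndarray]) -> float:
    # k_samples == 0: generation was skipped, assume full consistency.
    if not sample_embs:
        return 1.0
    sims = [float(np.dot(original_emb, s)
                  / (np.linalg.norm(original_emb) * np.linalg.norm(s)))
            for s in sample_embs]
    return float(np.mean(sims))  # mean similarity across the k samples
```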
Threshold labels
| Score | Label | Interpretation |
|---|---|---|
| ≥ 0.85 | Stable Consistency | Model produces highly consistent responses |
| ≥ 0.60 | Fragile Consistency | Some variation but core content is maintained |
| < 0.60 | Unreliable | Model produces highly inconsistent responses |
Configuration
```yaml
metrics:
  - type: "consistency"
    enabled: true
    params:
      k_samples: 3  # number of re-generations; 0 skips sampling (score = 1.0)
      STABLE_CONSISTENCY: 0.85
      FRAGILE_CONSISTENCY: 0.60
```
Result details
```json
{
  "score": 0.88,
  "label": "Stable Consistency",
  "details": {
    "explanation": "Model produces highly consistent responses.",
    "generated_responses": ["The capital is New Delhi.", "New Delhi serves as India's capital.", ...],
    "std_dev": 0.04,
    "uncertainty": 0.02
  },
  "execution_metadata": {
    "total_cost_usd": 0.000218
  }
}
```
Source Diversity
Source Diversity measures whether the response draws on multiple independent sources or leans entirely on a single document. Answers synthesized from multiple distinct sources are generally more trustworthy than those derived from a single reference.
How it works: TrustifAI resolves a unique source ID for each retrieved document, using the `source`, `file_path`, or `url` metadata fields if present, or falling back to a SHA-256 content hash. It then counts the number of distinct source IDs and computes a normalized score that combines a diversity ratio with a count-based exponential reward:
```
score = 0.6 * (unique_sources / total_docs) + 0.4 * (1 - exp(-unique_sources / 2))
```
The exponential term rewards having more than one source non-linearly — the jump from 1 to 2 sources matters more than the jump from 5 to 6. If only one document is retrieved and it is the only semantically relevant one, TrustifAI considers this “justified” and assigns a score of 0.8 rather than penalizing the response.
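The two pieces sketch out as follows; the document shape (a `metadata` dict plus a `content` string) is assumed for illustration and is not taken from TrustifAI's actual types:

```python
import hashlib
import math

def resolve_source_id(doc: dict) -> str:
    # Prefer explicit metadata; fall back to a SHA-256 hash of the content.
    for key in ("source", "file_path", "url"):
        value = doc.get("metadata", {}).get(key)
        if value:
            return value
    return hashlib.sha256(doc["content"].encode("utf-8")).hexdigest()

def source_diversity_score(docs: list[dict],
                           justified_single_source: bool = False) -> float:
    if justified_single_source:
        return 0.8  # lone-but-relevant source is not penalized
    unique_sources = len({resolve_source_id(d) for d in docs})
    ratio = unique_sources / len(docs)            # diversity ratio
    reward = 1 - math.exp(-unique_sources / 2)    # exponential count reward
    return 0.6 * ratio + 0.4 * reward
```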
Threshold labels
| Score | Label | Interpretation |
|---|---|---|
| ≥ 0.85 | High Trust | Multiple independent sources used |
| ≥ 0.60 | Moderate Trust | Limited corroboration from multiple sources |
| < 0.60 | Low Trust | Single or very limited sources used |
Configuration
```yaml
metrics:
  - type: "source_diversity"
    enabled: true
    params:
      HIGH_DIVERSITY: 0.85
      MODERATE_DIVERSITY: 0.60
```
Result details
```json
{
  "score": 0.72,
  "label": "Moderate Trust",
  "details": {
    "explanation": "Limited corroboration from multiple sources.",
    "unique_sources": 2,
    "total_documents": 5,
    "relevant_documents": 3,
    "justified_single_source": false
  }
}
```
Accessing individual metric scores
`get_trust_score()` returns all active metric scores in the `details` dictionary. Access individual metric results from the return value:
```python
from trustifai import Trustifai, MetricContext

trust_engine = Trustifai(config_path="config_file.yaml")

# Build the context from the query, answer, and retrieved documents.
# Field names here are illustrative; see the Offline Metrics API reference.
context = MetricContext(
    query="What is the capital of India?",
    answer="New Delhi is the capital of India.",
    documents=["New Delhi is the capital of India."],
)
result = trust_engine.get_trust_score(context)

# Access individual metric scores from the result
coverage = result["details"]["evidence_coverage"]
drift = result["details"]["semantic_drift"]
stability = result["details"]["consistency"]
diversity = result["details"]["source_diversity"]
print(f"Evidence Coverage: {coverage['score']} ({coverage['label']})")
print(f"Semantic Drift: {drift['score']} ({drift['label']})")
```
For advanced use cases, you can also instantiate individual metric classes directly. See the Offline Metrics API reference for class signatures.