Offline metrics evaluate a response that has already been generated — they do not require hooking into the LLM generation process. You provide a query, an answer, and the retrieved documents, and TrustifAI scores the answer across four orthogonal dimensions of trustworthiness. Each metric is independent, configurable, and can be individually enabled or disabled based on your use case.
Metric overview
| Metric | What it measures | Default weight | Primary signal |
|---|---|---|---|
| Evidence Coverage | Factual grounding of every claim | 0.40 | Hallucination detection |
| Semantic Drift | Topical alignment with source documents | 0.30 | Topic drift detection |
| Epistemic Consistency | Stability across re-generations | 0.20 | Model inconsistency |
| Source Diversity | Breadth of sources used | 0.10 | Over-reliance on single source |
Evidence Coverage
Evidence Coverage checks whether every claim in the response is actually supported by the retrieved documents. It is the most heavily weighted metric by default (0.40) because unsupported claims are the most direct form of hallucination.
How it works: The answer is passed to an LLM alongside the retrieved documents and the original query. The LLM performs sentence-level entailment checking: it breaks the answer into individual sentences and labels each one `supported: true` or `supported: false` relative to the document content. The final score is the fraction of supported sentences.
```
Evidence Coverage score = supported_sentences / total_sentences
```
This LLM-based NLI (Natural Language Inference) approach is more robust than embedding similarity alone because it reasons about meaning, not just vector proximity.
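To make the loop concrete, here is a minimal sketch. `check_entailment` is a hypothetical stand-in for the LLM entailment call, replaced with naive token overlap only so the snippet runs end to end; it is not TrustifAI's implementation.

```python
# Illustrative sketch of Evidence Coverage scoring, not TrustifAI's internals.

def check_entailment(sentence: str, documents: list[str]) -> bool:
    """Hypothetical stand-in for the LLM entailment call.

    The real check prompts an LLM to label the sentence supported/unsupported
    against the documents; naive token overlap is used here so the sketch runs.
    """
    tokens = set(sentence.lower().split())
    return any(len(tokens & set(doc.lower().split())) > len(tokens) // 2
               for doc in documents)

def evidence_coverage_score(answer_sentences: list[str],
                            documents: list[str]) -> float:
    # Score = supported_sentences / total_sentences.
    supported = sum(check_entailment(s, documents) for s in answer_sentences)
    return supported / len(answer_sentences) if answer_sentences else 1.0
```

With five of six sentences supported, the score is 5/6 ≈ 0.83, matching the Result details example below.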
Threshold labels
| Score | Label | Interpretation |
|---|---|---|
| ≥ 0.85 | Strong Grounding | All or nearly all claims are document-supported |
| ≥ 0.60 | Partial Grounding | Some claims may not be fully supported |
| < 0.60 | Likely Hallucinated Answer | Many claims lack source support |
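Label assignment is a plain threshold comparison; a minimal sketch, with cutoffs matching the STRONG_GROUNDING and PARTIAL_GROUNDING keys from the configuration below:

```python
def coverage_label(score: float, strong: float = 0.85, partial: float = 0.60) -> str:
    # Cutoffs correspond to STRONG_GROUNDING / PARTIAL_GROUNDING in the config.
    if score >= strong:
        return "Strong Grounding"
    if score >= partial:
        return "Partial Grounding"
    return "Likely Hallucinated Answer"
```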
Configuration
```yaml
metrics:
  - type: "evidence_coverage"
    enabled: true
    params:
      strategy: "llm"  # LLM-based entailment (default)
      STRONG_GROUNDING: 0.85
      PARTIAL_GROUNDING: 0.60
```
Result details
```json
{
  "score": 0.83,
  "label": "Partial Grounding",
  "details": {
    "total_sentences": 6,
    "supported_sentences": 5,
    "unsupported_sentences": ["The company was founded in 1998."],
    "failed_checks": 0
  }
}
```
Semantic Drift
Semantic Drift measures how closely the response stays within the semantic envelope of the retrieved documents. A high-scoring response is topically close to the source material; a low-scoring one has drifted into territory not covered by the documents.
How it works: TrustifAI splits each document into individual sentences and embeds them all. It then computes the cosine similarity between the answer embedding (and, separately, the query embedding) and each sentence embedding, keeping the best match. The final score reflects the peak semantic alignment between the answer and the document corpus.
```
Semantic Drift score = max cosine_similarity(answer_emb, sentence_emb_i)
                       across all document sentences
```
Semantic Drift uses embedding similarity, not LLM reasoning. It catches cases where the response is topically off-base, but will not catch factual errors within the same topic. Use it alongside Evidence Coverage for full hallucination coverage.
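As a minimal sketch of the computation, assuming the answer and document sentences have already been embedded (the embedding model itself is a configuration detail not shown here):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_drift_score(answer_emb: np.ndarray,
                         sentence_embs: list[np.ndarray]) -> float:
    # Peak alignment: the single best match across all document sentences.
    return max(cosine_similarity(answer_emb, s) for s in sentence_embs)
```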
Threshold labels
| Score | Label | Interpretation |
|---|---|---|
| ≥ 0.85 | Strong Alignment | Answer is semantically well-anchored in the documents |
| ≥ 0.60 | Partial Alignment | Some claims may not align with source documents |
| < 0.60 | Likely Hallucinated Answer | Answer diverges significantly from source content |
Configuration
```yaml
metrics:
  - type: "semantic_drift"
    enabled: true
    params:
      STRONG_ALIGNMENT: 0.85
      PARTIAL_ALIGNMENT: 0.60
```
Result details
```json
{
  "score": 0.91,
  "label": "Strong Alignment",
  "details": {
    "total_documents": 3,
    "total_sentences_checked": 24,
    "best_matching_sentence": "New Delhi is the capital of India.",
    "explanation": "Answer semantically aligned with source documents."
  }
}
```
Epistemic Consistency
Epistemic Consistency measures how stable the LLM’s responses are when asked the same question multiple times with different random seeds. Hallucinated answers tend to vary wildly between runs — a model that is genuinely confident produces semantically similar responses even under stochastic conditions.
How it works: TrustifAI generates k additional responses to the same query at elevated temperatures (randomly sampled from [0.7, 0.8, 0.9, 1.0]). Each sample is embedded and compared to the original answer using cosine similarity. The final score is the mean cosine similarity across all samples; higher scores indicate lower semantic variance between runs.
```
Epistemic Consistency score = mean(cosine_similarity(original_answer, sample_i))
                              for i in 1..k
```
The number of samples k is controlled by `k_samples` in your config. Setting `k_samples: 0` skips generation and assumes full consistency (score = 1.0), which is useful for cost-sensitive pipelines.
Epistemic Consistency makes k additional LLM calls per evaluation. This increases both latency and API cost proportionally. Start with `k_samples: 3` and increase only if you need higher confidence in the stability estimate.
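A sketch of the aggregation step, assuming the original answer and the k re-generated samples have already been embedded (the generation and embedding calls are outside this snippet):

```python
import numpy as np

def consistency_score(original_emb: np.ndarray,
                      sample_embs: list[np.ndarray]) -> float:
    # k_samples == 0: generation was skipped, assume full consistency.
    if not sample_embs:
        return 1.0
    sims = [float(np.dot(original_emb, s)
                  / (np.linalg.norm(original_emb) * np.linalg.norm(s)))
            for s in sample_embs]
    return float(np.mean(sims))  # mean similarity across the k samples
```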
Threshold labels
| Score | Label | Interpretation |
|---|---|---|
| ≥ 0.85 | Stable Consistency | Model produces highly consistent responses |
| ≥ 0.60 | Fragile Consistency | Some variation but core content is maintained |
| < 0.60 | Unreliable | Model produces highly inconsistent responses |
Configuration
```yaml
metrics:
  - type: "consistency"
    enabled: true
    params:
      k_samples: 3  # number of re-generations; 0 skips sampling (score = 1.0)
      STABLE_CONSISTENCY: 0.85
      FRAGILE_CONSISTENCY: 0.60
```
Result details
```json
{
  "score": 0.88,
  "label": "Stable Consistency",
  "details": {
    "explanation": "Model produces highly consistent responses.",
    "generated_responses": ["The capital is New Delhi.", "New Delhi serves as India's capital.", ...],
    "std_dev": 0.04,
    "uncertainty": 0.02
  },
  "execution_metadata": {
    "total_cost_usd": 0.000218
  }
}
```
Source Diversity
Source Diversity measures whether the response draws on multiple independent sources or leans entirely on a single document. Answers synthesized from multiple distinct sources are generally more trustworthy than those derived from a single reference.
How it works: TrustifAI resolves a unique source ID for each retrieved document, using the `source`, `file_path`, or `url` metadata fields if present, or falling back to a SHA-256 content hash. It then counts the number of distinct source IDs and computes a normalized score that combines a diversity ratio with a count-based exponential reward:
```
score = 0.6 * (unique_sources / total_docs) + 0.4 * (1 - exp(-unique_sources / 2))
```
The exponential term rewards having more than one source non-linearly — the jump from 1 to 2 sources matters more than the jump from 5 to 6. If only one document is retrieved and it is the only semantically relevant one, TrustifAI considers this “justified” and assigns a score of 0.8 rather than penalizing the response.
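The two pieces sketch out as follows; the document shape (a `metadata` dict plus a `content` string) is assumed for illustration and is not taken from TrustifAI's actual types:

```python
import hashlib
import math

def resolve_source_id(doc: dict) -> str:
    # Prefer explicit metadata; fall back to a SHA-256 hash of the content.
    for key in ("source", "file_path", "url"):
        value = doc.get("metadata", {}).get(key)
        if value:
            return value
    return hashlib.sha256(doc["content"].encode("utf-8")).hexdigest()

def source_diversity_score(docs: list[dict],
                           justified_single_source: bool = False) -> float:
    if justified_single_source:
        return 0.8  # lone-but-relevant source is not penalized
    unique_sources = len({resolve_source_id(d) for d in docs})
    ratio = unique_sources / len(docs)            # diversity ratio
    reward = 1 - math.exp(-unique_sources / 2)    # exponential count reward
    return 0.6 * ratio + 0.4 * reward
```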
Threshold labels
| Score | Label | Interpretation |
|---|---|---|
| ≥ 0.85 | High Trust | Multiple independent sources used |
| ≥ 0.60 | Moderate Trust | Limited corroboration from multiple sources |
| < 0.60 | Low Trust | Single or very limited sources used |
Configuration
```yaml
metrics:
  - type: "source_diversity"
    enabled: true
    params:
      HIGH_DIVERSITY: 0.85
      MODERATE_DIVERSITY: 0.60
```
Result details
```json
{
  "score": 0.72,
  "label": "Moderate Trust",
  "details": {
    "explanation": "Limited corroboration from multiple sources.",
    "unique_sources": 2,
    "total_documents": 5,
    "relevant_documents": 3,
    "justified_single_source": false
  }
}
```
Accessing individual metric scores
`get_trust_score()` returns all active metric scores in the `details` dictionary. Access individual metric results from the return value:
```python
from trustifai import Trustifai, MetricContext

trust_engine = Trustifai(config_path="config_file.yaml")

# Build the context from the query, answer, and retrieved documents.
# Field names here are illustrative; see the Offline Metrics API reference.
context = MetricContext(
    query="What is the capital of India?",
    answer="New Delhi is the capital of India.",
    documents=["New Delhi is the capital of India."],
)
result = trust_engine.get_trust_score(context)

# Access individual metric scores from the result
coverage = result["details"]["evidence_coverage"]
drift = result["details"]["semantic_drift"]
stability = result["details"]["consistency"]
diversity = result["details"]["source_diversity"]
print(f"Evidence Coverage: {coverage['score']} ({coverage['label']})")
print(f"Semantic Drift: {drift['score']} ({drift['label']})")
```
For advanced use cases, you can also instantiate individual metric classes directly. See the Offline Metrics API reference for class signatures.