TrustifAI’s four offline metrics are computed post-generation using the query, answer, and retrieved documents from a MetricContext. They run automatically when you call get_trust_score and are exported from trustifai.metrics. Each metric is a subclass of BaseMetric and returns a MetricResult.
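As a rough sketch, the inputs and outputs above can be pictured as two small dataclasses. The field names here are illustrative assumptions for this example, not the library’s actual definitions:

```python
from dataclasses import dataclass, field

# Hypothetical shapes for illustration only -- field names are assumptions,
# not TrustifAI's actual class definitions.
@dataclass
class MetricContext:
    query: str            # the user's question
    answer: str           # the LLM's generated answer
    documents: list[str]  # retrieved document texts

@dataclass
class MetricResult:
    score: float                                  # normalized to [0.0, 1.0]
    label: str                                    # threshold-derived label
    details: dict = field(default_factory=dict)   # per-metric extras

ctx = MetricContext(
    query="What is the capital of France?",
    answer="Paris is the capital of France.",
    documents=["Paris has been the capital of France since 987 AD."],
)
```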
EvidenceCoverageMetric — LLM-based sentence entailment
EvidenceCoverageMetric measures how much of the LLM’s answer is factually grounded in the retrieved documents. It sends the query, full answer, and document texts to your LLM in a single structured prompt that asks for sentence-level entailment judgments. The final score is the ratio of supported sentences to total sentences:

score = supported_sentences / total_sentences

Labels (configurable via thresholds):
- "Strong Grounding" — high coverage
- "Partial Grounding" — moderate coverage
- "Likely Hallucinated Answer" — low coverage
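A minimal sketch of the scoring step, assuming the per-sentence entailment judgments have already been obtained from the LLM; the threshold values here are illustrative, not the library’s defaults:

```python
def coverage_score(judgments: list[bool]) -> tuple[float, str]:
    """Turn per-sentence entailment judgments into a coverage score.

    In the real metric the judgments come from a single structured LLM
    prompt; here they are passed in directly. Thresholds are assumptions.
    """
    if not judgments:
        return 0.0, "Empty Answer"
    score = sum(judgments) / len(judgments)   # supported / total
    if score >= 0.8:
        label = "Strong Grounding"
    elif score >= 0.5:
        label = "Partial Grounding"
    else:
        label = "Likely Hallucinated Answer"
    return score, label

# Example: 3 of 4 answer sentences entailed by the documents.
print(coverage_score([True, True, True, False]))  # (0.75, 'Partial Grounding')
```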
Returns score=0.0 with label="Empty Answer" if context.answer is empty or context.documents is an empty list.
SemanticDriftMetric — cosine similarity to best document sentence
SemanticDriftMetric detects semantic drift between the answer and the retrieved context without requiring an LLM call. It tokenizes every document into sentences, embeds them in a single batched embedding call, then finds the sentence with the highest cosine similarity to either the answer or the query. The best-match score is the metric output, in the range [0.0, 1.0].

Labels (configurable via thresholds):
- "Strong Alignment" — answer closely mirrors the documents
- "Partial Alignment" — answer partially aligns
- "Likely Hallucinated Answer" — answer diverges significantly
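The matching logic can be sketched with a toy bag-of-words embedding standing in for the real embedding model; everything here is illustrative, not the library’s implementation:

```python
import math
import re

def embed(text: str) -> dict[str, float]:
    """Toy bag-of-words vector -- a stand-in for the real embedding model."""
    vec: dict[str, float] = {}
    for tok in re.findall(r"[a-z']+", text.lower()):
        vec[tok] = vec.get(tok, 0.0) + 1.0
    return vec

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_match(answer: str, query: str, documents: list[str]) -> float:
    """Highest cosine similarity between any document sentence and
    either the answer or the query -- the metric's output."""
    targets = [embed(answer), embed(query)]
    sentences = [s for d in documents for s in re.split(r"(?<=[.!?])\s+", d) if s]
    return max(
        (cosine(embed(s), t) for s in sentences for t in targets),
        default=0.0,
    )
```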
Long best-matching sentences are truncated to 150 characters in the details dict with a " ... [truncated]" suffix.
EpistemicConsistencyMetric — stochastic re-generation agreement
EpistemicConsistencyMetric probes whether the model is confident in its answer by generating k alternative responses to the same question at randomized temperatures (0.7–1.0) and measuring cosine similarity between each re-generation and the original answer. High mean similarity indicates stable knowledge; low similarity indicates the model is guessing. The score is the mean cosine similarity between the k stochastic re-generations and the original answer. Default k=3 (set via config.k_samples).

Labels (configurable via thresholds):
- "Stable Consistency" — high agreement across re-generations
- "Fragile Consistency" — moderate agreement
- "Unreliable" — low agreement
a_calculate is a native async implementation that fires all k LLM generation calls concurrently with asyncio.gather, making it significantly faster than the synchronous calculate path in async server environments (FastAPI, etc.). The synchronous calculate method runs the same logic in an isolated thread to avoid event loop conflicts.
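The concurrent re-generation pattern can be sketched as follows, with a stub standing in for the real async LLM client and a toy word-overlap similarity in place of embedding cosine similarity:

```python
import asyncio
import random

async def fake_llm(prompt: str, temperature: float) -> str:
    """Stand-in for a real async LLM call (the actual client is an assumption)."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return "Paris is the capital of France."

def similarity(a: str, b: str) -> float:
    """Toy word-overlap similarity in place of embedding cosine similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

async def consistency(query: str, original: str, k: int = 3) -> float:
    # Fire all k re-generations concurrently, each at a random temperature
    # in [0.7, 1.0], mirroring the a_calculate path described above.
    temps = [random.uniform(0.7, 1.0) for _ in range(k)]
    regens = await asyncio.gather(*(fake_llm(query, t) for t in temps))
    return sum(similarity(r, original) for r in regens) / k

score = asyncio.run(
    consistency("Capital of France?", "Paris is the capital of France.")
)
```

Because the stub always returns the same answer, the sketch scores 1.0; a real model that guesses differently on each re-generation would score lower.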
SourceDiversityMetric — normalized unique source count
SourceDiversityMetric measures how many independent sources back the retrieved documents. It uses a two-part formula that rewards both the diversity ratio (unique sources / total docs) and the raw source count via exponential decay, producing a score that saturates toward 1.0 as source variety increases.

Labels (configurable via thresholds):
- "High Trust" — diverse multi-source retrieval
- "Moderate Trust" — limited corroboration
- "Low Trust" — single or near-single source
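Since the exact weights and decay constant are not given here, the following is only one plausible shaping of such a two-part formula, not the library’s actual one:

```python
import math

def diversity_score(source_ids: list[str]) -> float:
    """Illustrative two-part score: a diversity ratio blended with an
    exponentially saturating reward for raw unique-source count.
    Weights (0.5/0.5) and the decay constant (0.5) are assumptions.
    """
    if not source_ids:
        return 0.0
    unique = len(set(source_ids))
    ratio = unique / len(source_ids)            # unique sources / total docs
    saturation = 1.0 - math.exp(-0.5 * unique)  # saturates toward 1.0
    return 0.5 * ratio + 0.5 * saturation

# Five documents from five distinct sources outscore five from one source.
single = diversity_score(["a"] * 5)
varied = diversity_score(["a", "b", "c", "d", "e"])
```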
Source identity is read from document metadata keys (source, file_path, url, filename, doc_id). If no metadata key is found, a SHA-256 hash of the document text is used as the source identifier.

This metric makes no LLM or embedding API calls beyond the query embedding already computed by get_trust_score; execution_metadata.total_cost_usd is always 0.0.
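The metadata-key fallback chain can be sketched as follows; the function shape is illustrative, but the key order and SHA-256 fallback follow the description above:

```python
import hashlib

# Metadata keys checked in order, per the docs above.
SOURCE_KEYS = ("source", "file_path", "url", "filename", "doc_id")

def source_id(metadata: dict, text: str) -> str:
    """Return the first populated metadata key, else a SHA-256 of the text."""
    for key in SOURCE_KEYS:
        value = metadata.get(key)
        if value:
            return str(value)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# A url in metadata wins; a bare document falls back to its content hash.
print(source_id({"url": "https://example.com/a"}, "doc text"))
print(source_id({}, "doc text"))
```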