TrustifAI’s four offline metrics are computed post-generation using the query, answer, and retrieved documents from a MetricContext. They run automatically when you call get_trust_score and are exported from trustifai.metrics. Each metric is a subclass of BaseMetric and returns a MetricResult.
from trustifai.metrics import (
    EvidenceCoverageMetric,
    SemanticDriftMetric,
    EpistemicConsistencyMetric,
    SourceDiversityMetric,
)
EvidenceCoverageMetric measures how much of the LLM’s answer is factually grounded in the retrieved documents. It sends the query, full answer, and document texts to your LLM in a single structured prompt that asks for sentence-level entailment judgments. The final score is the ratio of supported sentences to total sentences.
class EvidenceCoverageMetric(BaseMetric):
    def calculate(self, context: MetricContext) -> MetricResult: ...
    async def a_calculate(self, context: MetricContext) -> MetricResult: ...
Score formula: supported_sentences / total_sentences
Labels (configurable via thresholds):
  • "Strong Grounding" — high coverage
  • "Partial Grounding" — moderate coverage
  • "Likely Hallucinated Answer" — low coverage
Example details dict:
{
    "explanation": "Fully supported by source documents.",
    "total_sentences": 5,
    "supported_sentences": 4,
    "unsupported_sentences": ["The event happened in 1945."],
    "failed_checks": 0,
    "failed_reason": None,
}
Returns score=0.0 with label="Empty Answer" if context.answer is empty or context.documents is an empty list.
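The score formula reduces to a simple ratio. As a minimal sketch, assuming the LLM's sentence-level entailment judgments have already been collected into a list of booleans (the `judgments` list is illustrative, not part of the trustifai API):

```python
def coverage_score(judgments: list[bool]) -> float:
    """supported_sentences / total_sentences, with the empty-input fallback."""
    if not judgments:
        return 0.0  # mirrors score=0.0 for an empty answer or empty documents
    return sum(judgments) / len(judgments)

# 4 of 5 sentences supported, as in the example details dict above
print(coverage_score([True, True, True, True, False]))  # 0.8
```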
SemanticDriftMetric detects semantic drift between the answer and the retrieved context without requiring an LLM call. It tokenizes every document into sentences, embeds them in a single batched embedding call, then finds the sentence with the highest cosine similarity to either the answer or the query. The best-match score is the metric output.
class SemanticDriftMetric(BaseMetric):
    def calculate(self, context: MetricContext) -> MetricResult: ...
Score: cosine similarity between the answer (or query) embedding and the best-matching document sentence embedding. Range: [0.0, 1.0].
Labels (configurable via thresholds):
  • "Strong Alignment" — answer closely mirrors the documents
  • "Partial Alignment" — answer partially aligns
  • "Likely Hallucinated Answer" — answer diverges significantly
Example details dict:
{
    "explanation": "Answer semantically aligned with source documents.",
    "total_documents": 3,
    "total_sentences_checked": 17,
    "best_matching_sentence": "The Eiffel Tower stands 330 metres tall and was completed in 1889.",
}
Long best-matching sentences are truncated to 150 characters in the details dict with a " ... [truncated]" suffix.
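The best-match step can be sketched without any LLM call. This is a minimal illustration, assuming embeddings are already computed; the toy 3-d vectors and the `cosine` helper are stand-ins, not part of the trustifai API:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def best_match_score(answer_vec: list[float],
                     sentence_vecs: list[list[float]]) -> float:
    # highest cosine similarity between the answer and any document sentence
    return max(cosine(answer_vec, v) for v in sentence_vecs)

answer = [0.9, 0.1, 0.0]                        # stand-in answer embedding
sentences = [[0.8, 0.2, 0.1], [0.0, 1.0, 0.0]]  # stand-in sentence embeddings
score = best_match_score(answer, sentences)     # first sentence wins, score near 1.0
```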
EpistemicConsistencyMetric probes whether the model is confident in its answer by generating k alternative responses to the same question at randomized temperatures (0.7–1.0) and measuring cosine similarity between each re-generation and the original answer. High mean similarity indicates stable knowledge; low similarity indicates the model is guessing.
class EpistemicConsistencyMetric(BaseMetric):
    def calculate(self, context: MetricContext) -> MetricResult: ...
    async def a_calculate(self, context: MetricContext) -> MetricResult: ...
Score: mean cosine similarity across k stochastic re-generations and the original answer. Default k=3 (set via config.k_samples).
Labels (configurable via thresholds):
  • "Stable Consistency" — high agreement across re-generations
  • "Fragile Consistency" — moderate agreement
  • "Unreliable" — low agreement
Example details dict:
{
    "explanation": "Model produces highly consistent responses.",
    "generated_responses": [
        "The Eiffel Tower is located in Paris and is 330 m tall.",
        "Paris hosts the Eiffel Tower, completed in 1889 at 330 metres.",
        "The Eiffel Tower, 330 m, stands in Paris, France.",
    ],
    "std_dev": 0.03,
    "uncertainty": 0.02,
}
Async variant: a_calculate is a native async implementation that fires all k LLM generation calls concurrently with asyncio.gather, making it significantly faster than the synchronous calculate path in async server environments (FastAPI, etc.). The synchronous calculate method runs the same logic in an isolated thread to avoid event loop conflicts.
# In an async context (FastAPI, async pipeline)
result = await metric.a_calculate(context)
Set k_samples: 0 in your config to skip re-generation entirely. The metric will return score=1.0 with label="Stable Consistency" at zero cost, useful when evaluating on a cost budget.
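The aggregation step and the k_samples: 0 short-circuit can be sketched as follows. The similarity values are illustrative; in the real metric they come from embedding the re-generations:

```python
import statistics

def consistency_score(similarities: list[float], k_samples: int) -> float:
    """Mean cosine similarity across k re-generations, or 1.0 when skipped."""
    if k_samples == 0:
        return 1.0  # skip re-generation entirely: "Stable Consistency" at zero cost
    return statistics.fmean(similarities)

# three re-generations compared against the original answer
score = consistency_score([0.94, 0.91, 0.97], k_samples=3)  # mean of the three
```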
SourceDiversityMetric measures how many independent sources back the retrieved documents. It uses a two-part formula that rewards both diversity ratio (unique sources / total docs) and raw source count via exponential decay, producing a score that saturates toward 1.0 as source variety increases.
class SourceDiversityMetric(BaseMetric):
    def calculate(self, context: MetricContext) -> MetricResult: ...
Score formula:
score = 0.6 × (unique_sources / total_docs) + 0.4 × (1 − exp(−unique_sources / 2))
When a single source is identified but is the only document semantically relevant to the query (cosine similarity ≥ 0.5), the single-source penalty is waived and the score is set to 0.8.
Labels (configurable via thresholds):
  • "High Trust" — diverse multi-source retrieval
  • "Moderate Trust" — limited corroboration
  • "Low Trust" — single or near-single source
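Plugging numbers into the formula above, a sketch of just the two-part score (omitting the single-source waiver):

```python
import math

def diversity_score(unique_sources: int, total_docs: int) -> float:
    ratio = unique_sources / total_docs        # diversity-ratio term
    decay = 1 - math.exp(-unique_sources / 2)  # saturating source-count term
    return 0.6 * ratio + 0.4 * decay

# 4 unique sources across 5 documents, as in the details dict below
score = diversity_score(4, 5)  # ≈ 0.83
```

The exponential term is what makes the score saturate toward 1.0: each additional unique source adds less than the last.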
Example details dict:
{
    "explanation": "Multiple independent sources used for answer.",
    "unique_sources": 4,
    "total_documents": 5,
    "relevant_documents": 4,
    "justified_single_source": False,
}
Source identification: sources are resolved from document metadata using known source fields (source, file_path, url, filename, doc_id). If no metadata key is found, a SHA-256 hash of the document text is used as the source identifier.
This metric makes no LLM or embedding API calls beyond the query embedding already computed by get_trust_score. execution_metadata.total_cost_usd is always 0.0.
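The source-resolution fallback described above might be sketched like this. The field precedence is an assumption (the docs name the fields but not their order), and `resolve_source` is a hypothetical helper, not part of the trustifai API:

```python
import hashlib

# assumed precedence of known metadata keys
SOURCE_FIELDS = ("source", "file_path", "url", "filename", "doc_id")

def resolve_source(metadata: dict, text: str) -> str:
    """Resolve a source identifier from metadata, hashing the text as a fallback."""
    for field in SOURCE_FIELDS:
        value = metadata.get(field)
        if value:
            return str(value)
    # no known metadata key found: use a content hash as the identifier
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

resolve_source({"url": "https://example.com/report"}, "...")  # -> the URL
resolve_source({}, "bare text")  # -> 64-char SHA-256 hex digest
```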
