

TrustifAI’s metric system is a plugin registry. Every built-in metric — evidence coverage, epistemic consistency, semantic drift, and source diversity — is registered against a string key and instantiated at evaluation time. You can add your own metrics by following the same three-step pattern: inherit, register, configure.
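To make the registry idea concrete, here is a minimal standalone sketch of a string-keyed plugin registry — illustrative only, not TrustifAI's actual internals (the `MetricRegistry` and `EvidenceCoverage` names here are hypothetical stand-ins):

```python
# Minimal sketch of a string-keyed metric registry (illustrative only;
# TrustifAI's real internals may differ).
class MetricRegistry:
    _metrics: dict[str, type] = {}

    @classmethod
    def register(cls, key: str, metric_cls: type) -> None:
        # Each metric class is registered once against a unique string key
        if key in cls._metrics:
            raise ValueError(f"metric key already registered: {key!r}")
        cls._metrics[key] = metric_cls

    @classmethod
    def create(cls, key: str):
        # Instantiated at evaluation time, looked up by its string key
        return cls._metrics[key]()


class EvidenceCoverage:
    """Hypothetical stand-in for a built-in metric class."""


MetricRegistry.register("evidence_coverage", EvidenceCoverage)
metric = MetricRegistry.create("evidence_coverage")
```

The string key is what ties the Python class to the `type` field in the YAML config.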

The three-step process

1. Inherit from BaseMetric and implement calculate()

Your class must inherit from BaseMetric and implement calculate(context: MetricContext) -> MetricResult. The context argument carries the query, answer, documents, and pre-computed embeddings for a single evaluation.
from trustifai.metrics import BaseMetric
from trustifai.structures import MetricContext, MetricResult

class MyCustomMetric(BaseMetric):
    def calculate(self, context: MetricContext) -> MetricResult:
        # Your evaluation logic here
        score = 0.9  # float in [0.0, 1.0]
        return MetricResult(
            score=score,
            label="High",
            details={"note": "example"},
        )

2. Register the metric class

Call Trustifai.register_metric with a unique string key. This key must match the type field you will add to config_file.yaml.
from trustifai import Trustifai

Trustifai.register_metric("my_custom_metric", MyCustomMetric)
Registration is a class-level operation — call it once, before you instantiate any Trustifai engine.

3. Add the metric to config_file.yaml

Add entries to both the metrics list (to set thresholds and mark it enabled) and the score_weights list (to assign its contribution to the final Trust Score). Weights across all enabled metrics must sum to at most 1.0.
metrics:
  # ... existing metrics ...
  - type: "my_custom_metric"
    enabled: true
    params:
      MY_HIGH_THRESHOLD: 0.80
      MY_LOW_THRESHOLD: 0.50

score_weights:
  # ... existing weights (reduce others to make room) ...
  - type: "my_custom_metric"
    params:
      weight: 0.10

Full example: TemporalConsistencyMetric

The following example detects temporal hallucinations — cases where the answer references dates or times that are not present in the retrieved documents. It is the canonical custom metric example from the TrustifAI README.

Metric implementation

from trustifai.metrics import BaseMetric
from trustifai.structures import MetricContext, MetricResult


class TemporalConsistencyMetric(BaseMetric):
    """Detects temporal hallucinations — when the answer references dates or
    times that don't match the retrieved documents."""

    def calculate(self, context: MetricContext) -> MetricResult:
        # Extract dates from answer and documents
        answer_dates = self._extract_dates(context.answer)

        doc_dates = set()
        for doc in context.documents:
            doc_dates.update(self._extract_dates(doc.page_content))

        # No temporal claims in the answer — award full score
        if not answer_dates:
            return MetricResult(
                score=1.0,
                label="No Temporal Claims",
                details={"answer_dates": [], "doc_dates": list(doc_dates)},
            )

        supported_dates = [d for d in answer_dates if d in doc_dates]
        unsupported_dates = [d for d in answer_dates if d not in doc_dates]

        # answer_dates is guaranteed non-empty here (early return above)
        score = len(supported_dates) / len(answer_dates)

        # Read custom thresholds from config (with sensible defaults)
        high_threshold = getattr(self.config.thresholds, "TEMPORALLY_CONSISTENT", 0.8)
        low_threshold = getattr(self.config.thresholds, "PARTIAL_TEMPORAL_ISSUES", 0.5)

        if score >= high_threshold:
            label = "Temporally Consistent"
        elif score >= low_threshold:
            label = "Partial Temporal Issues"
        else:
            label = "Temporal Hallucination Detected"

        return MetricResult(
            score=score,
            label=label,
            details={
                "answer_dates": answer_dates,
                "supported_dates": supported_dates,
                "unsupported_dates": unsupported_dates,
                "doc_dates": list(doc_dates),
            },
        )

    def _extract_dates(self, text: str) -> list[str]:
        """Stub — replace with a real date extraction implementation."""
        import re
        pattern = r"\b\d{4}\b"   # simple 4-digit year extraction
        return re.findall(pattern, text)
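A slightly more capable replacement for the stub might look like the following — still a regex heuristic, not a production date parser, and the pattern shown is our own suggestion rather than anything TrustifAI ships:

```python
import re

# Matches ISO dates (1889-03-31), "Month YYYY" phrases, and plausible
# 4-digit years (1500-2099). Still a heuristic -- swap in a real date
# parser for production use.
DATE_PATTERN = re.compile(
    r"\b\d{4}-\d{2}-\d{2}\b"                              # ISO dates
    r"|\b(?:January|February|March|April|May|June|July|August"
    r"|September|October|November|December)\s+\d{4}\b"    # Month YYYY
    r"|\b(?:1[5-9]|20)\d{2}\b"                            # 4-digit years
)


def extract_dates(text: str) -> list[str]:
    # All groups are non-capturing, so findall returns full matches
    return DATE_PATTERN.findall(text)
```

Because the alternation tries the more specific patterns first, an ISO date is captured whole rather than as a bare year.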

Registration and usage

from trustifai import Trustifai, MetricContext

# Register before instantiating any engine
Trustifai.register_metric("temporal_consistency", TemporalConsistencyMetric)

# The engine picks up the new metric from config_file.yaml
trust_engine = Trustifai("config_file.yaml")

context = MetricContext(
    query="When was the Eiffel Tower built?",
    answer="The Eiffel Tower was built in 1889.",
    documents=["The Eiffel Tower was constructed from 1887 to 1889."],
)

result = trust_engine.get_trust_score(context)
print(result)

Updated config_file.yaml

metrics:
  - type: "evidence_coverage"
    enabled: true
    params:
      STRONG_GROUNDING: 0.85
      PARTIAL_GROUNDING: 0.60

  - type: "consistency"
    enabled: true
    params:
      STABLE_CONSISTENCY: 0.85
      FRAGILE_CONSISTENCY: 0.60

  - type: "source_diversity"
    enabled: true
    params:
      HIGH_DIVERSITY: 0.85
      MODERATE_DIVERSITY: 0.60

  - type: "semantic_drift"
    enabled: true
    params:
      STRONG_ALIGNMENT: 0.85
      PARTIAL_ALIGNMENT: 0.60

  - type: "temporal_consistency"     # your new metric
    enabled: true
    params:
      TEMPORALLY_CONSISTENT: 0.80
      PARTIAL_TEMPORAL_ISSUES: 0.50

score_weights:
  - type: "evidence_coverage"
    params:
      weight: 0.35   # reduced to make room for the new metric
  - type: "consistency"
    params:
      weight: 0.20
  - type: "source_diversity"
    params:
      weight: 0.10
  - type: "semantic_drift"
    params:
      weight: 0.25
  - type: "temporal_consistency"     # your new metric
    params:
      weight: 0.10   # weights still sum to 1.0

MetricResult fields

Every calculate() implementation must return a MetricResult. The to_dict() method serializes it into the format consumed by the trust score aggregator.
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| score | float | Yes | Normalized metric score in the range [0.0, 1.0] |
| label | str | Yes | Human-readable classification label (e.g. "Temporally Consistent") |
| details | dict | Yes | Arbitrary diagnostic data included in the result payload |
| execution_metadata | dict \| None | No | Optional metadata, typically {"total_cost_usd": float} for LLM-backed metrics |

BaseMetric helpers

When you inherit from BaseMetric, your class automatically gets access to:
| Attribute | Type | Description |
| --- | --- | --- |
| self.service | ExternalService | Provides llm_call, embedding_call, embedding_call_batch, and reranker_call |
| self.config | Config | Full parsed configuration, including config.thresholds and config.weights |
| self.cosine_calc | CosineSimCalculator | Utility for computing cosine similarity between embedding vectors |
| self.threshold_evaluator | ThresholdEvaluator | Classifies a float score against the configured threshold pairs |
Use self.service.llm_call(prompt, system_prompt) if your metric needs an LLM inference step, and self.service.embedding_call(text) for additional embeddings beyond what the engine pre-computes.
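The cosine similarity that the calculator utility wraps is just the normalized dot product. A minimal standalone version, for reference (this is the standard formula, not CosineSimCalculator's actual code):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1.0, 1.0]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention: a zero vector has no direction
    return dot / (norm_a * norm_b)
```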

Async metrics

BaseMetric provides a default async implementation that calls calculate() synchronously:
async def a_calculate(self, context: MetricContext) -> MetricResult:
    return self.calculate(context)
Override a_calculate if your metric can benefit from native async I/O (for example, if it makes multiple LLM calls that can be parallelized):
class MyAsyncMetric(BaseMetric):
    async def a_calculate(self, context: MetricContext) -> MetricResult:
        # Native async logic here
        result = await some_async_call(context)
        return MetricResult(score=result, label="...", details={})

    def calculate(self, context: MetricContext) -> MetricResult:
        import asyncio
        return asyncio.run(self.a_calculate(context))
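For instance, a metric that issues several independent LLM calls can overlap them with asyncio.gather. In this sketch, fake_llm_call is a stand-in for a real async client call, not a TrustifAI API:

```python
import asyncio


async def fake_llm_call(prompt: str) -> float:
    # Stand-in for an awaitable LLM scoring call
    await asyncio.sleep(0.01)
    return 0.8


async def a_calculate_scores(prompts: list[str]) -> float:
    # Run all calls concurrently instead of one after another
    scores = await asyncio.gather(*(fake_llm_call(p) for p in prompts))
    return sum(scores) / len(scores)


score = asyncio.run(a_calculate_scores(["q1", "q2", "q3"]))
```

With three sequential calls this would take ~30 ms; gathered, it takes ~10 ms.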
When adding a new metric, make sure the total of all score_weights still sums to at most 1.0. TrustifAI raises a ValueError at startup if the sum exceeds this limit. Reduce existing weights proportionally to accommodate the new metric’s weight.
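The startup check described above can be sketched as follows — illustrative only, since TrustifAI's actual validation code may differ:

```python
def validate_score_weights(score_weights: list[dict]) -> float:
    """Raise ValueError if the configured weights exceed 1.0."""
    total = sum(entry["params"]["weight"] for entry in score_weights)
    if total > 1.0:
        raise ValueError(f"score_weights sum to {total:.2f}, which exceeds 1.0")
    return total


# Mirrors the score_weights list from the updated config_file.yaml above
weights = [
    {"type": "evidence_coverage", "params": {"weight": 0.35}},
    {"type": "consistency", "params": {"weight": 0.20}},
    {"type": "source_diversity", "params": {"weight": 0.10}},
    {"type": "semantic_drift", "params": {"weight": 0.25}},
    {"type": "temporal_consistency", "params": {"weight": 0.10}},
]
total = validate_score_weights(weights)
```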

Configuration

Learn the full config_file.yaml schema including metric thresholds and weights.

BaseMetric API

Full API reference for BaseMetric, MetricResult, and MetricContext.
