BatchResult

BatchResult is returned by evaluate_dataset after all contexts in a dataset have been processed. Successful evaluations are stored in order in .results, while any context that raised an unhandled exception is captured in .failed — a single failure never aborts the rest of the batch. Aggregate statistics are computed as properties on demand, so there is no overhead when you only need the raw results.
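The success/failure split is easy to verify from the documented fields. A minimal sketch, assuming batch is the BatchResult from a finished run:

# Every input context lands in exactly one of the two lists.
assert batch.total == len(batch.results) + len(batch.failed)
assert batch.succeeded == len(batch.results)

print(f"{batch.succeeded}/{batch.total} contexts in {batch.elapsed_seconds:.1f}s")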

Fields

results
List[dict]
required
Successful trust score dicts, in the original dataset order. Each dict is the output of Trustifai.get_trust_score() for one MetricContext. Suitable for direct conversion to a pandas DataFrame with pd.DataFrame(batch.results).
failed
List[dict]
required
List of evaluation failures, one entry per context whose evaluation raised an unhandled exception.
total
int
required
Total number of MetricContext objects passed to evaluate_dataset.
succeeded
int
required
Number of contexts that completed evaluation without raising an exception.
elapsed_seconds
float
required
Wall-clock duration of the full batch run in seconds, measured from the first task dispatch to the last task completion.

Computed properties

mean_score
float
Mean trust score across all successful evaluations, rounded to 4 decimal places. Returns 0.0 if no results are available.
score_distribution
dict
Min, median, and max trust scores across successful results.
label_distribution
dict
Count of each trust label across successful results, keyed by label string. Labels reflect your configured thresholds — typical values are "RELIABLE", "ACCEPTABLE", and "UNRELIABLE".
failure_rate
float
Fraction of contexts that failed, rounded to 4 decimal places. Computed as len(failed) / total. Returns 0.0 if total is zero.
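If you need the same aggregates outside a BatchResult (for example over a filtered subset), they are straightforward to reproduce. Below is a rough sketch of the documented behavior, not the library's actual implementation: BatchResultSketch is a hypothetical stand-in, and the "score" and "label" keys are assumed from the pandas example further down.

from collections import Counter
from dataclasses import dataclass
from statistics import median
from typing import Dict, List

@dataclass
class BatchResultSketch:
    # Hypothetical mirror of the documented BatchResult fields.
    results: List[dict]
    failed: List[dict]
    total: int
    succeeded: int
    elapsed_seconds: float

    @property
    def mean_score(self) -> float:
        # Mean trust score over successes, rounded to 4 places; 0.0 when empty.
        if not self.results:
            return 0.0
        scores = [r["score"] for r in self.results]
        return round(sum(scores) / len(scores), 4)

    @property
    def score_distribution(self) -> Dict[str, float]:
        # Min, median, and max over successful results.
        # (Empty-input behavior is an assumption; the reference does not specify it.)
        if not self.results:
            return {}
        scores = [r["score"] for r in self.results]
        return {"min": min(scores), "median": median(scores), "max": max(scores)}

    @property
    def label_distribution(self) -> Dict[str, int]:
        # Count of each trust label, keyed by label string.
        return dict(Counter(r["label"] for r in self.results))

    @property
    def failure_rate(self) -> float:
        # len(failed) / total, rounded to 4 places; 0.0 when total is zero.
        return round(len(self.failed) / self.total, 4) if self.total else 0.0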

.summary() method

Returns a formatted multi-line text summary of the batch run, including throughput and aggregate score statistics. Useful for quick inspection in notebooks or CLI output.
batch.summary() -> str

Example output

────────────────────────────────────────────────
  Batch Evaluation Summary
────────────────────────────────────────────────
  Total          : 120
  Succeeded      : 118
  Failed         : 2 (1.7%)
  Elapsed        : 43.2s
  Throughput     : 2.7 eval/s
────────────────────────────────────────────────
  Mean Score     : 0.7341
  Score Range    : {'min': 0.21, 'median': 0.7612, 'max': 0.98}
  Labels         : {'ACCEPTABLE': 34, 'RELIABLE': 71, 'UNRELIABLE': 13}
────────────────────────────────────────────────
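Throughput is not a stored field, but the sample figure is consistent with succeeded / elapsed_seconds (118 / 43.2 ≈ 2.7), so you can derive it yourself when you need the raw number:

print(batch.summary())

# Derive the throughput figure from the documented fields:
throughput = batch.succeeded / batch.elapsed_seconds  # 118 / 43.2 ≈ 2.7 eval/s
print(f"Throughput: {throughput:.1f} eval/s")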

pandas integration

batch.results is a list of plain dicts with a consistent schema, making it a natural fit for DataFrame analysis:
import pandas as pd
from trustifai.async_pipeline import AsyncTrustifai, evaluate_dataset

engine = AsyncTrustifai("config.yaml")

# `contexts` is your prepared List[MetricContext]. Top-level await works in
# notebooks; wrap the call in asyncio.run(...) inside a plain script.
batch = await evaluate_dataset(engine, contexts, concurrency=10)

df = pd.DataFrame(batch.results)
print(df[["score", "label"]].describe())

# Filter low-trust responses for review
unreliable = df[df["label"] == "UNRELIABLE"]

batch.results preserves the original dataset order: the async pipeline internally sorts by the input index before returning, so row i in the DataFrame corresponds to context i in your input list.
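Because the DataFrame view and the computed properties describe the same data, they can double as a sanity check (again assuming the "score" and "label" keys from the example above):

# pandas aggregates should agree with BatchResult's own properties.
assert df["label"].value_counts().to_dict() == batch.label_distribution
assert round(df["score"].mean(), 4) == batch.mean_score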
