BatchResult

BatchResult is returned by evaluate_dataset after all contexts in a dataset have been processed. Successful evaluations are stored in order in .results, while any context that raised an unhandled exception is captured in .failed — a single failure never aborts the rest of the batch. Aggregate statistics are computed as properties on demand, so there is no overhead when you only need the raw results.
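The success/failure split is easy to verify from the documented fields. A minimal sketch, assuming batch is the BatchResult from a finished run:

# Every input context lands in exactly one of the two lists.
assert batch.total == len(batch.results) + len(batch.failed)
assert batch.succeeded == len(batch.results)

print(f"{batch.succeeded}/{batch.total} contexts in {batch.elapsed_seconds:.1f}s")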

Fields

results
List[dict]
required
Successful trust score dicts, in the original dataset order. Each dict is the output of Trustifai.get_trust_score() for one MetricContext. Suitable for direct conversion to a pandas DataFrame with pd.DataFrame(batch.results).
failed
List[dict]
required
List of evaluation failures, one entry per context whose evaluation raised an unhandled exception.
total
int
required
Total number of MetricContext objects passed to evaluate_dataset.
succeeded
int
required
Number of contexts that completed evaluation without raising an exception.
elapsed_seconds
float
required
Wall-clock duration of the full batch run in seconds, measured from the first task dispatch to the last task completion.

Computed properties

mean_score
float
Mean trust score across all successful evaluations, rounded to 4 decimal places. Returns 0.0 if no results are available.
score_distribution
dict
Min, median, and max trust scores across successful results.
label_distribution
dict
Count of each trust label across successful results, keyed by label string. Labels reflect your configured thresholds — typical values are "RELIABLE", "ACCEPTABLE", and "UNRELIABLE".
failure_rate
float
Fraction of contexts that failed, rounded to 4 decimal places. Computed as len(failed) / total. Returns 0.0 if total is zero.
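If you need the same aggregates outside a BatchResult (for example over a filtered subset), they are straightforward to reproduce. Below is a rough sketch of the documented behavior, not the library's actual implementation: BatchResultSketch is a hypothetical stand-in, and the "score" and "label" keys are assumed from the pandas example further down.

from collections import Counter
from dataclasses import dataclass
from statistics import median
from typing import Dict, List

@dataclass
class BatchResultSketch:
    # Hypothetical mirror of the documented BatchResult fields.
    results: List[dict]
    failed: List[dict]
    total: int
    succeeded: int
    elapsed_seconds: float

    @property
    def mean_score(self) -> float:
        # Mean trust score over successes, rounded to 4 places; 0.0 when empty.
        if not self.results:
            return 0.0
        scores = [r["score"] for r in self.results]
        return round(sum(scores) / len(scores), 4)

    @property
    def score_distribution(self) -> Dict[str, float]:
        # Min, median, and max over successful results.
        # (Empty-input behavior is an assumption; the reference does not specify it.)
        if not self.results:
            return {}
        scores = [r["score"] for r in self.results]
        return {"min": min(scores), "median": median(scores), "max": max(scores)}

    @property
    def label_distribution(self) -> Dict[str, int]:
        # Count of each trust label, keyed by label string.
        return dict(Counter(r["label"] for r in self.results))

    @property
    def failure_rate(self) -> float:
        # len(failed) / total, rounded to 4 places; 0.0 when total is zero.
        return round(len(self.failed) / self.total, 4) if self.total else 0.0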

.summary() method

Returns a formatted multi-line text summary of the batch run, including throughput and aggregate score statistics. Useful for quick inspection in notebooks or CLI output.
batch.summary() -> str

Example output

────────────────────────────────────────────────
  Batch Evaluation Summary
────────────────────────────────────────────────
  Total          : 120
  Succeeded      : 118
  Failed         : 2 (1.7%)
  Elapsed        : 43.2s
  Throughput     : 2.7 eval/s
────────────────────────────────────────────────
  Mean Score     : 0.7341
  Score Range    : {'min': 0.21, 'median': 0.7612, 'max': 0.98}
  Labels         : {'ACCEPTABLE': 34, 'RELIABLE': 71, 'UNRELIABLE': 13}
────────────────────────────────────────────────
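Throughput is not a stored field, but the sample figure is consistent with succeeded / elapsed_seconds (118 / 43.2 ≈ 2.7), so you can derive it yourself when you need the raw number:

print(batch.summary())

# Derive the throughput figure from the documented fields:
throughput = batch.succeeded / batch.elapsed_seconds  # 118 / 43.2 ≈ 2.7 eval/s
print(f"Throughput: {throughput:.1f} eval/s")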

pandas integration

batch.results is a list of plain dicts with a consistent schema, making it a natural fit for DataFrame analysis:
import pandas as pd
from trustifai.async_pipeline import AsyncTrustifai, evaluate_dataset

engine = AsyncTrustifai("config.yaml")

# `contexts` is your prepared List[MetricContext]. Top-level await works in
# notebooks; wrap the call in asyncio.run(...) inside a plain script.
batch = await evaluate_dataset(engine, contexts, concurrency=10)

df = pd.DataFrame(batch.results)
print(df[["score", "label"]].describe())

# Filter low-trust responses for review
unreliable = df[df["label"] == "UNRELIABLE"]

batch.results preserves the original dataset order: the async pipeline internally sorts by the input index before returning, so row i in the DataFrame corresponds to context i in your input list.
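Because the DataFrame view and the computed properties describe the same data, they can double as a sanity check (again assuming the "score" and "label" keys from the example above):

# pandas aggregates should agree with BatchResult's own properties.
assert df["label"].value_counts().to_dict() == batch.label_distribution
assert round(df["score"].mean(), 4) == batch.mean_score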
