evaluate_dataset is the recommended way to score large collections of RAG contexts. It wraps AsyncTrustifai.get_trust_score with a concurrency semaphore, an optional token-bucket rate limiter, exponential-backoff retry on rate-limit errors, an optional tqdm progress bar, and a fail_fast escape hatch. Results are returned in the original dataset order regardless of completion order.
Signature
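The signature itself did not survive extraction; below is a plausible reconstruction assembled from the parameter descriptions on this page. Names not stated in the text (`client`, `dataset`, `show_progress`, `max_retries`, `retry_max_delay`) and the defaults marked "assumed" are guesses, not confirmed API.

```python
from typing import Any, Optional

async def evaluate_dataset(
    client: Any,                          # the AsyncTrustifai instance (name assumed)
    dataset: list,                        # list of MetricContext rows (name assumed)
    *,
    concurrency: int = 1,                 # default assumed
    show_progress: bool = False,          # tqdm toggle (name and default assumed)
    fail_fast: bool = False,
    requests_per_minute: Optional[float] = None,
    rate_limit_burst: int = 1,
    max_retries: int = 5,                 # name and default assumed
    retry_base_delay: float = 2.0,        # default assumed from the example schedule
    retry_max_delay: float = 60.0,        # name and default assumed
) -> "BatchResult":
    ...
```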
Parameters
- **Client**: an AsyncTrustifai instance configured with your YAML file. See AsyncTrustifai.
- **Dataset**: the dataset to evaluate, one MetricContext per row. Results are returned in the same order as this list.
- **concurrency**: maximum number of evaluations that run simultaneously. A higher value increases throughput but also increases API load. Must be >= 1. For free-tier LLM keys, set this to 1 to avoid bursting your quota before the rate limiter can react.
- **Progress bar**: when true, displays a tqdm progress bar in the terminal showing completed vs. total evaluations.
- **fail_fast**: when true, the first unhandled exception (any non-rate-limit error) cancels all remaining pending tasks and the function returns immediately with the results collected so far. Failed items are recorded in BatchResult.failed. When false (the default), failures are isolated per context and the rest of the batch continues.
- **requests_per_minute**: target request rate for the token-bucket rate limiter. Set this to approximately 80% of your actual API quota to leave headroom for retries. When None (the default), no rate limiting is applied.

| Tier | Typical API limit | Recommended setting |
|---|---|---|
| Free-tier (Gemini, Mistral) | 10–15 RPM | 8–12 |
| OpenAI Tier-1 | 500 RPM | 400 |
| Enterprise | 1 000+ RPM | ~800 |
| Local model | Unlimited | None |
- **rate_limit_burst**: token-bucket burst size, the maximum number of tokens that can accumulate while the limiter is idle. At 1 (the default), requests are evenly spaced with no bursting, which is safest for free-tier keys. Raise this for enterprise APIs that explicitly allow short bursts above their steady-state RPM.
- **Retry attempts**: maximum number of retry attempts on a rate-limit error (HTTP 429, or any exception message matching "rate limit", "too many requests", or "quota"). Non-rate-limit exceptions are re-raised immediately without retrying.
- **retry_base_delay**: initial backoff delay in seconds. The delay doubles on each subsequent attempt (exponential backoff), and each delay is jittered ±25% to prevent a thundering-herd effect when many contexts fail simultaneously.

Example schedule with retry_base_delay=2.0 and no jitter:

| Attempt | Wait before next attempt |
|---|---|
| 1 | 2 s |
| 2 | 4 s |
| 3 | 8 s |
| 4 | 16 s |
| 5 | raise |
- **Maximum backoff delay**: upper bound on the backoff delay in seconds. Prevents exponential growth from producing impractically long waits on APIs with very long outage windows.
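The retry schedule described above can be sketched as follows. `backoff_delays` is an illustrative helper, not a library function; it reproduces the documented behavior of doubling from the base delay, jittering ±25%, and capping at the maximum.

```python
import random

def backoff_delays(attempts: int, base: float = 2.0,
                   max_delay: float = 60.0, jitter: float = 0.25) -> list:
    """Exponential backoff with +/-25% jitter, capped at max_delay."""
    delays = []
    for attempt in range(attempts):
        delay = base * (2 ** attempt)                 # 2 s, 4 s, 8 s, ...
        delay *= 1 + random.uniform(-jitter, jitter)  # jitter +/-25%
        delays.append(min(delay, max_delay))          # hard upper bound
    return delays
```

With `base=2.0`, the nominal (unjittered) schedule matches the table above: 2 s, 4 s, 8 s, 16 s, then the final attempt raises.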
Return value
Returns a BatchResult dataclass with the following fields:
- Successful trust score dicts, in original dataset order. Each dict has the same shape as a single get_trust_score return value (score, label, details, execution_metadata).
- **failed**: one entry per context that raised an unhandled exception.
- Number of input contexts.
- Number of contexts that completed without error.
- Wall-clock time for the full batch run, rounded to three decimal places.
BatchResult also exposes computed properties: mean_score, score_distribution (min/median/max), label_distribution (count per label), failure_rate, and a summary() method that prints a formatted report.
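A minimal sketch of what such a container could look like. Field names other than `failed` are assumptions inferred from the descriptions above, and only a subset of the computed properties is shown:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class BatchResult:
    results: list           # successful score dicts, original dataset order
    failed: list            # one entry per context that raised
    total: int              # field name assumed
    succeeded: int          # field name assumed
    elapsed_seconds: float  # field name assumed

    @property
    def mean_score(self) -> float:
        scores = [r["score"] for r in self.results]
        return sum(scores) / len(scores) if scores else 0.0

    @property
    def label_distribution(self) -> dict:
        return dict(Counter(r["label"] for r in self.results))

    @property
    def failure_rate(self) -> float:
        return len(self.failed) / self.total if self.total else 0.0
```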
Rate limiting in depth
evaluate_dataset uses a token-bucket algorithm. A bucket starts with rate_limit_burst tokens and refills at a rate of requests_per_minute / 60 tokens per second. Each evaluation consumes one token. If the bucket is empty, the request waits until enough tokens have accumulated.
The concurrency semaphore and the rate limiter act independently:
- The semaphore caps how many evaluations run simultaneously.
- The rate limiter caps how fast new evaluations start.
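The limiter half of this pair can be sketched with plain asyncio. `TokenBucket` is an illustrative stand-in, not the library's internal class; the semaphore side is just `asyncio.Semaphore(concurrency)` applied independently around each evaluation.

```python
import asyncio
import time

class TokenBucket:
    """Starts with `burst` tokens and refills at requests_per_minute / 60
    tokens per second, matching the description above."""

    def __init__(self, requests_per_minute: float, burst: int = 1):
        self.rate = requests_per_minute / 60.0   # tokens per second
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    async def acquire(self) -> None:
        while True:
            now = time.monotonic()
            # Refill based on elapsed time, never exceeding the burst capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1   # each evaluation consumes one token
                return
            # Bucket empty: wait until roughly one token has accumulated.
            await asyncio.sleep((1 - self.tokens) / self.rate)
```

At `burst=1` the bucket can never hold more than one token, so requests end up evenly spaced at the steady-state rate.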
Set concurrency based on your available CPU/thread capacity and requests_per_minute based on your LLM API quota. For free-tier keys where the quota is the bottleneck, concurrency=1 keeps things simple.
Examples
evaluate_dataset returns results in original dataset order by sorting on an internal _index field after all tasks complete. The _index key is stripped from each result dict before the function returns.
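The ordering mechanism can be illustrated with plain asyncio. `evaluate_one` here is a stand-in for a real evaluation call, not the library API; the point is the `_index` bookkeeping around out-of-order completion:

```python
import asyncio
import random

async def evaluate_one(index: int, context: str) -> dict:
    # Stand-in for a real evaluation; tasks complete in arbitrary order.
    await asyncio.sleep(random.random() * 0.05)
    return {"_index": index, "score": round(len(context) / 10, 2)}

async def evaluate_all(dataset: list) -> list:
    tasks = [asyncio.ensure_future(evaluate_one(i, c))
             for i, c in enumerate(dataset)]
    results = []
    for fut in asyncio.as_completed(tasks):   # yields in completion order
        results.append(await fut)
    results.sort(key=lambda r: r["_index"])   # restore original dataset order
    for r in results:
        del r["_index"]                       # strip the internal key
    return results
```

However the sleeps interleave, the returned list always matches the input order and carries no `_index` key.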