

evaluate_dataset is the recommended way to score large collections of RAG contexts. It wraps AsyncTrustifai.get_trust_score with a concurrency semaphore, an optional token-bucket rate limiter, exponential-backoff retry on rate-limit errors, an optional tqdm progress bar, and a fail_fast escape hatch. Results are returned in the original dataset order regardless of completion order.

Signature

from trustifai import AsyncTrustifai, evaluate_dataset

batch = await evaluate_dataset(
    engine,
    contexts,
    concurrency=5,
    show_progress=True,
    fail_fast=False,
    requests_per_minute=None,
    rate_limit_burst=1,
    max_retries=5,
    retry_base_delay=2.0,
    retry_max_delay=120.0,
)

Parameters

engine
AsyncTrustifai
required
An AsyncTrustifai instance configured with your YAML file. See AsyncTrustifai.
contexts
List[MetricContext]
required
The dataset to evaluate — one MetricContext per row. Results are returned in the same order as this list.
concurrency
integer
default:"5"
Maximum number of evaluations that run simultaneously. A higher value increases throughput but also increases API load. Must be >= 1. For free-tier LLM keys set this to 1 to avoid bursting your quota before the rate limiter can react.
show_progress
boolean
default:"true"
When true, displays a tqdm progress bar in the terminal showing completed vs. total evaluations.
fail_fast
boolean
default:"false"
When true, the first unhandled exception (non-rate-limit error) cancels all remaining pending tasks and the function returns immediately with the results collected so far. Failed items are recorded in BatchResult.failed. When false (default), failures are isolated per context and the rest of the batch continues.
requests_per_minute
float
Target request rate for the token-bucket rate limiter. Set this to approximately 80% of your actual API quota to leave headroom for retries. When None (default), no rate limiting is applied.
| Tier | Typical API limit | Recommended setting |
| --- | --- | --- |
| Free-tier (Gemini, Mistral) | 10–15 RPM | 8–12 |
| OpenAI Tier-1 | 500 RPM | 400 |
| Enterprise | 1,000+ RPM | ~800 |
| Local model | Unlimited | None |
Setting requests_per_minute equal to your actual hard limit gives no retry headroom. If a retry fires while the bucket is empty, it will queue behind the limiter and delay recovery. The 80% rule prevents this.
rate_limit_burst
integer
default:"1"
Token-bucket burst size — the maximum number of tokens that can accumulate while the limiter is idle. At 1 (default), requests are evenly spaced with no burst allowed, which is safest for free-tier keys. Raise this for enterprise APIs that explicitly allow short bursts above their steady-state RPM.
max_retries
integer
default:"5"
Maximum number of retry attempts on a rate-limit error (HTTP 429 or any exception message matching "rate limit", "too many requests", or "quota"). Non-rate-limit exceptions are re-raised immediately without retrying.
retry_base_delay
float
default:"2.0"
Initial backoff delay in seconds. Doubles on each subsequent attempt (exponential backoff). Each delay is also jittered ±25% to prevent a thundering-herd effect when many contexts fail simultaneously.

Example schedule with retry_base_delay=2.0 and no jitter:
| Attempt | Wait before next attempt |
| --- | --- |
| 1 | 2 s |
| 2 | 4 s |
| 3 | 8 s |
| 4 | 16 s |
| 5 | raise |
retry_max_delay
float
default:"120.0"
Upper bound on backoff delay in seconds. Prevents exponential growth from producing impractically long waits on APIs with very long outage windows.
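Taken together, max_retries, retry_base_delay, and retry_max_delay imply a delay before the nth retry of roughly min(retry_base_delay * 2**(n-1), retry_max_delay), jittered by ±25%. A minimal sketch of that schedule (the helper name is ours, not part of the library):

import random

def backoff_delay(attempt: int, base: float = 2.0, cap: float = 120.0) -> float:
    # attempt is 1-based; base maps to retry_base_delay, cap to retry_max_delay.
    delay = min(base * 2 ** (attempt - 1), cap)   # exponential growth, capped
    return delay * random.uniform(0.75, 1.25)     # +/-25% jitter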

Return value

Returns a BatchResult dataclass.
results
List[Dict]
Successful trust score dicts in original dataset order. Each dict has the same shape as a single get_trust_score return value (score, label, details, execution_metadata).
failed
List[Dict]
One entry per context that raised an unhandled exception.
total
integer
Number of input contexts.
succeeded
integer
Number of contexts that completed without error.
elapsed_seconds
float
Wall-clock time for the full batch run, rounded to three decimal places.
BatchResult also exposes computed properties: mean_score, score_distribution (min/median/max), label_distribution (count per label), failure_rate, and a summary() method that prints a formatted report.
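For example, a finished batch might be inspected like this (a sketch using only the fields and properties listed above; run it inside an async function):

batch = await evaluate_dataset(engine, contexts, concurrency=5)

print(f"{batch.succeeded}/{batch.total} succeeded in {batch.elapsed_seconds}s")
print("mean score:", batch.mean_score)
print("labels:", batch.label_distribution)

for failure in batch.failed:          # one entry per context that raised
    print("failed:", failure)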

Rate limiting in depth

evaluate_dataset uses a token-bucket algorithm. A bucket starts with rate_limit_burst tokens and refills at a rate of requests_per_minute / 60 tokens per second. Each evaluation consumes one token. If the bucket is empty, the request waits until enough tokens have accumulated. The concurrency semaphore and the rate limiter act independently:
  • The semaphore caps how many evaluations run simultaneously.
  • The rate limiter caps how fast new evaluations start.
In practice, set concurrency based on your available CPU/thread capacity and requests_per_minute based on your LLM API quota. For free-tier keys where the quota is the bottleneck, concurrency=1 keeps things simple.
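For intuition, here is a minimal token-bucket limiter matching the refill math above (a sketch, not the library's implementation; the class name is ours):

import asyncio
import time

class TokenBucket:
    def __init__(self, requests_per_minute: float, burst: int = 1):
        self.rate = requests_per_minute / 60.0    # tokens added per second
        self.capacity = burst                     # rate_limit_burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    async def acquire(self) -> None:
        # Called once per evaluation; returns when a token is available.
        while True:
            now = time.monotonic()
            elapsed = now - self.last
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1                  # consume one token
                return
            # Bucket empty: sleep until roughly one token has accumulated.
            await asyncio.sleep((1 - self.tokens) / self.rate)

At requests_per_minute=8 with the default burst of 1, acquire() resolves at most once every 7.5 seconds once the bucket is drained.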

Examples

import asyncio
from trustifai import AsyncTrustifai, MetricContext, evaluate_dataset

async def main():
    engine = AsyncTrustifai("config_file.yaml")

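    # my_dataset: your own iterable of rows; the keys below are illustrative.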
    contexts = [
        MetricContext(
            query=row["question"],
            answer=row["answer"],
            documents=row["retrieved_docs"],
        )
        for row in my_dataset
    ]

    # 10 RPM free-tier: set concurrency=1 and rate limit to 8 RPM (80%)
    batch = await evaluate_dataset(
        engine,
        contexts,
        concurrency=1,
        requests_per_minute=8,
    )

    print(batch.summary())

asyncio.run(main())
evaluate_dataset returns results in original dataset order by sorting on an internal _index field after all tasks complete. The _index key is stripped from each result dict before the function returns.
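A simplified sketch of that ordering pattern (hypothetical, not the library's actual internals):

import asyncio

async def gather_in_order(engine, contexts):
    async def run_one(index, ctx):
        result = await engine.get_trust_score(ctx)
        result["_index"] = index                  # tag with dataset position
        return result

    tasks = [run_one(i, c) for i, c in enumerate(contexts)]
    results = [await t for t in asyncio.as_completed(tasks)]  # completion order
    results.sort(key=lambda r: r["_index"])       # restore dataset order
    for r in results:
        del r["_index"]                           # stripped before returning
    return results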
