evaluate_dataset is the recommended way to score large collections of RAG contexts. It wraps AsyncTrustifai.get_trust_score with a concurrency semaphore, an optional token-bucket rate limiter, exponential-backoff retry on rate-limit errors, an optional tqdm progress bar, and a fail_fast escape hatch. Results are returned in the original dataset order regardless of completion order.
Signature
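The signature itself did not survive extraction; below is a plausible reconstruction assembled from the parameter descriptions on this page. Names not stated in the text (`client`, `dataset`, `show_progress`, `max_retries`, `retry_max_delay`) and the defaults marked "assumed" are guesses, not confirmed API.

```python
from typing import Any, Optional

async def evaluate_dataset(
    client: Any,                          # the AsyncTrustifai instance (name assumed)
    dataset: list,                        # list of MetricContext rows (name assumed)
    *,
    concurrency: int = 1,                 # default assumed
    show_progress: bool = False,          # tqdm toggle (name and default assumed)
    fail_fast: bool = False,
    requests_per_minute: Optional[float] = None,
    rate_limit_burst: int = 1,
    max_retries: int = 5,                 # name and default assumed
    retry_base_delay: float = 2.0,        # default assumed from the example schedule
    retry_max_delay: float = 60.0,        # name and default assumed
) -> "BatchResult":
    ...
```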
Parameters
- **Client**: an AsyncTrustifai instance configured with your YAML file. See AsyncTrustifai.
- **Dataset**: the dataset to evaluate, one MetricContext per row. Results are returned in the same order as this list.
- **concurrency**: maximum number of evaluations that run simultaneously. A higher value increases throughput but also increases API load. Must be >= 1. For free-tier LLM keys, set this to 1 to avoid bursting your quota before the rate limiter can react.
- **Progress bar**: when true, displays a tqdm progress bar in the terminal showing completed vs. total evaluations.
- **fail_fast**: when true, the first unhandled exception (any non-rate-limit error) cancels all remaining pending tasks and the function returns immediately with the results collected so far. Failed items are recorded in BatchResult.failed. When false (the default), failures are isolated per context and the rest of the batch continues.
- **requests_per_minute**: target request rate for the token-bucket rate limiter. Set this to approximately 80% of your actual API quota to leave headroom for retries. When None (the default), no rate limiting is applied.

| Tier | Typical API limit | Recommended setting |
|---|---|---|
| Free-tier (Gemini, Mistral) | 10–15 RPM | 8–12 |
| OpenAI Tier-1 | 500 RPM | 400 |
| Enterprise | 1 000+ RPM | ~800 |
| Local model | Unlimited | None |
- **rate_limit_burst**: token-bucket burst size, the maximum number of tokens that can accumulate while the limiter is idle. At 1 (the default), requests are evenly spaced with no bursting, which is safest for free-tier keys. Raise this for enterprise APIs that explicitly allow short bursts above their steady-state RPM.
- **Retry attempts**: maximum number of retry attempts on a rate-limit error (HTTP 429, or any exception message matching "rate limit", "too many requests", or "quota"). Non-rate-limit exceptions are re-raised immediately without retrying.
- **retry_base_delay**: initial backoff delay in seconds. The delay doubles on each subsequent attempt (exponential backoff), and each delay is jittered ±25% to prevent a thundering-herd effect when many contexts fail simultaneously.

Example schedule with retry_base_delay=2.0 and no jitter:

| Attempt | Wait before next attempt |
|---|---|
| 1 | 2 s |
| 2 | 4 s |
| 3 | 8 s |
| 4 | 16 s |
| 5 | raise |
- **Maximum backoff delay**: upper bound on the backoff delay in seconds. Prevents exponential growth from producing impractically long waits on APIs with very long outage windows.
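The retry schedule described above can be sketched as follows. `backoff_delays` is an illustrative helper, not a library function; it reproduces the documented behavior of doubling from the base delay, jittering ±25%, and capping at the maximum.

```python
import random

def backoff_delays(attempts: int, base: float = 2.0,
                   max_delay: float = 60.0, jitter: float = 0.25) -> list:
    """Exponential backoff with +/-25% jitter, capped at max_delay."""
    delays = []
    for attempt in range(attempts):
        delay = base * (2 ** attempt)                 # 2 s, 4 s, 8 s, ...
        delay *= 1 + random.uniform(-jitter, jitter)  # jitter +/-25%
        delays.append(min(delay, max_delay))          # hard upper bound
    return delays
```

With `base=2.0`, the nominal (unjittered) schedule matches the table above: 2 s, 4 s, 8 s, 16 s, then the final attempt raises.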
Return value
Returns a BatchResult dataclass with the following fields:
- Successful trust score dicts, in original dataset order. Each dict has the same shape as a single get_trust_score return value (score, label, details, execution_metadata).
- **failed**: one entry per context that raised an unhandled exception.
- Number of input contexts.
- Number of contexts that completed without error.
- Wall-clock time for the full batch run, rounded to three decimal places.
BatchResult also exposes computed properties: mean_score, score_distribution (min/median/max), label_distribution (count per label), failure_rate, and a summary() method that prints a formatted report.
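A minimal sketch of what such a container could look like. Field names other than `failed` are assumptions inferred from the descriptions above, and only a subset of the computed properties is shown:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class BatchResult:
    results: list           # successful score dicts, original dataset order
    failed: list            # one entry per context that raised
    total: int              # field name assumed
    succeeded: int          # field name assumed
    elapsed_seconds: float  # field name assumed

    @property
    def mean_score(self) -> float:
        scores = [r["score"] for r in self.results]
        return sum(scores) / len(scores) if scores else 0.0

    @property
    def label_distribution(self) -> dict:
        return dict(Counter(r["label"] for r in self.results))

    @property
    def failure_rate(self) -> float:
        return len(self.failed) / self.total if self.total else 0.0
```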
Rate limiting in depth
evaluate_dataset uses a token-bucket algorithm. A bucket starts with rate_limit_burst tokens and refills at a rate of requests_per_minute / 60 tokens per second. Each evaluation consumes one token. If the bucket is empty, the request waits until enough tokens have accumulated.
The concurrency semaphore and the rate limiter act independently:
- The semaphore caps how many evaluations run simultaneously.
- The rate limiter caps how fast new evaluations start.
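The limiter half of this pair can be sketched with plain asyncio. `TokenBucket` is an illustrative stand-in, not the library's internal class; the semaphore side is just `asyncio.Semaphore(concurrency)` applied independently around each evaluation.

```python
import asyncio
import time

class TokenBucket:
    """Starts with `burst` tokens and refills at requests_per_minute / 60
    tokens per second, matching the description above."""

    def __init__(self, requests_per_minute: float, burst: int = 1):
        self.rate = requests_per_minute / 60.0   # tokens per second
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    async def acquire(self) -> None:
        while True:
            now = time.monotonic()
            # Refill based on elapsed time, never exceeding the burst capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1   # each evaluation consumes one token
                return
            # Bucket empty: wait until roughly one token has accumulated.
            await asyncio.sleep((1 - self.tokens) / self.rate)
```

At `burst=1` the bucket can never hold more than one token, so requests end up evenly spaced at the steady-state rate.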
Set concurrency based on your available CPU/thread capacity and requests_per_minute based on your LLM API quota. For free-tier keys where the quota is the bottleneck, concurrency=1 keeps things simple.
Examples
evaluate_dataset returns results in original dataset order by sorting on an internal _index field after all tasks complete. The _index key is stripped from each result dict before the function returns.
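The ordering mechanism can be illustrated with plain asyncio. `evaluate_one` here is a stand-in for a real evaluation call, not the library API; the point is the `_index` bookkeeping around out-of-order completion:

```python
import asyncio
import random

async def evaluate_one(index: int, context: str) -> dict:
    # Stand-in for a real evaluation; tasks complete in arbitrary order.
    await asyncio.sleep(random.random() * 0.05)
    return {"_index": index, "score": round(len(context) / 10, 2)}

async def evaluate_all(dataset: list) -> list:
    tasks = [asyncio.ensure_future(evaluate_one(i, c))
             for i, c in enumerate(dataset)]
    results = []
    for fut in asyncio.as_completed(tasks):   # yields in completion order
        results.append(await fut)
    results.sort(key=lambda r: r["_index"])   # restore original dataset order
    for r in results:
        del r["_index"]                       # strip the internal key
    return results
```

However the sleeps interleave, the returned list always matches the input order and carries no `_index` key.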