T for your workload. The threshold controls whether the gateway accepts the drafter’s response or escalates to the heavyweight model. Setting it too low wastes money on unnecessary escalations. Setting it too high risks serving bad drafts.
The calibration tool sweeps a range of candidate thresholds over a labeled benchmark dataset and computes accuracy and cost metrics at each value. You then pick the threshold at the knee of the accuracy-cost curve: the point where cost savings are high but accuracy remains acceptable.
The default threshold in config.yaml — entropy.threshold: 2.0 — was selected this way, calibrated on 518 prompts across four categories using gpt-4.1-nano as the drafter and gpt-4.1 as the heavyweight.
How the sweep works
The sweep tool in benchmarks/cmd/sweep/ does not call any live API. It replays pre-collected token records offline, simulating what the routing engine would have decided at each threshold value, and compares those decisions against ground-truth LLM-as-judge labels.
For each candidate threshold, the sweep iterates over every record in the dataset and runs the entropy window algorithm against the stored token logprob sequence. It produces a routing decision — accept or escalate — for each record, then compares that decision against whether the draft response was actually acceptable.
1. Collect token records

Run benchmarks/cmd/collect/ to call the drafter and heavyweight on every prompt in your dataset and record the per-token logprob sequences and judge verdicts to a JSONL file.

2. Run the threshold sweep

Run benchmarks/cmd/sweep/ against the collected JSONL. The sweep replays all records at each candidate threshold, computes metrics, and writes a CSV summary.

Confusion matrix
Each record produces one of four outcomes, depending on the routing decision and the judge verdict:

| | Would escalate | Would accept |
|---|---|---|
| Draft unacceptable | TP — correct escalation | FN — bad draft served |
| Draft acceptable | FP — unnecessary escalation (cost waste) | TN — correct acceptance |
- TP (true positive): The drafter was uncertain and the gateway correctly escalated to the heavyweight. Good outcome — the bad draft was caught.
- TN (true negative): The drafter was confident and the draft was acceptable. The gateway correctly served the draft without paying for the heavyweight.
- FP (false positive): The drafter was uncertain, so the gateway escalated — but the draft was actually acceptable. The escalation was wasted cost.
- FN (false negative): The drafter was confident, so the gateway accepted the draft — but the draft was not acceptable. The bad draft was served to the user.
Metrics computed per threshold
The sweep computes these metrics for each candidate threshold:

| Metric | Definition |
|---|---|
| Escalation rate | Fraction of requests routed to the heavyweight (EscalatedCount / TotalPrompts) |
| Draft accuracy | Fraction of accepted drafts that were acceptable (TN / (TN + FN)) |
| Cost reduction | 1 - estimated_cost / baseline_cost, where baseline is all-heavyweight routing |
| Precision | TP / (TP + FP) — of all escalations, how many were justified |
| Recall | TP / (TP + FN) — of all bad drafts, how many were caught |
| F1 | Harmonic mean of precision and recall |
Calibration results
The table below shows the full sweep results from the reference calibration run: 518 valid prompts, gpt-4.1-nano drafter, gpt-4.1 heavyweight.
| Threshold | Escalation rate | Draft accuracy | Cost reduction | F1 |
|---|---|---|---|---|
| 1.00 | 68.9% | 100.0% | 8.2% | 0.06 |
| 1.25 | 49.2% | 99.6% | 31.0% | 0.08 |
| 1.50 | 30.9% | 98.6% | 56.2% | 0.07 |
| 1.75 | 13.9% | 98.4% | 81.2% | 0.10 |
| 2.00 | 6.0% | 98.2% | 91.6% | 0.10 |
| 2.25 | 0.4% | 97.9% | 99.0% | 0.00 |
| 2.50 | 0.0% | 97.9% | 99.2% | 0.00 |
Selected threshold: T=2.0
T=2.0 was selected as the best threshold: it achieves the highest F1 score (0.10) among all thresholds where draft accuracy is at or above 95%. At this value:

- 94% draft acceptance — the heavyweight is called on only 6% of requests.
- 98.2% draft accuracy — almost all accepted drafts are acceptable to serve.
- 91.6% cost reduction — compared to routing every request to the heavyweight.
Tradeoff guidance
Lower thresholds are more conservative and more expensive. Higher thresholds are cheaper but riskier.

Lower T (e.g. T=1.5)
More escalations. Higher accuracy. Higher cost. Use when incorrect answers are expensive — customer-facing support, medical, legal, financial workloads.
Higher T (e.g. T=2.25)
Fewer escalations. Lower accuracy. Lower cost. Use when drafts are low-stakes and you are primarily optimizing for throughput cost.
Known failure mode: confident hallucination. Entropy calibration cannot catch cases where the drafter generates a wrong answer with low uncertainty. If the drafter concentrates probability mass on a plausible-but-incorrect token sequence, entropy stays low, the response is accepted, and the error is served. This is a documented limitation, not a bug. Mitigations include periodic accuracy audits against a held-out labeled set, downstream user feedback loops, and a more conservative threshold. See the README for the full discussion.
How to re-calibrate for your workload
The reference calibration was run on a general-purpose prompt set. If your workload is domain-specific — legal documents, code review, medical Q&A — you should re-calibrate against prompts representative of your traffic.

1. Prepare your prompt dataset
Edit or replace benchmarks/testdata/prompts.json. Each entry requires an id, a category (simple_factual, multi_step_reasoning, code_generation, or ambiguous_creative), and a text field.

2. Collect responses
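A minimal entry might look like this; the id and text are made up, and only the three fields named above are assumed to be required:

```json
{
  "id": "prompt-0001",
  "category": "simple_factual",
  "text": "What is the capital of Australia?"
}
```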
Run the collector against your dataset. Set OPENAI_API_KEY before running. The collector calls both the drafter (gpt-4.1-nano) and heavyweight (gpt-4.1) for each prompt, runs LLM-as-judge evaluation, and writes the results to a JSONL file. The collection is resumable — already-collected prompt IDs are skipped on restart.

3. Run the threshold sweep

Sweep a range of thresholds over the collected data. Adjust the --thresholds list to include values relevant to your use case. The sweep prints a summary table to stdout and writes full per-threshold metrics to the CSV file.