Calibration is the process of finding the right entropy threshold T for your workload. The threshold controls whether the gateway accepts the drafter's response or escalates to the heavyweight model. Setting it too low wastes money on unnecessary escalations; setting it too high risks serving bad drafts. The calibration tool sweeps a range of candidate thresholds over a labeled benchmark dataset and computes accuracy and cost metrics at each value. You then pick the threshold at the knee of the accuracy-cost curve: the point where cost savings are high but accuracy remains acceptable. The default threshold in config.yaml (entropy.threshold: 2.0) was selected this way, calibrated on 518 prompts across four categories using gpt-4.1-nano as the drafter and gpt-4.1 as the heavyweight.

How the sweep works

The sweep tool in benchmarks/cmd/sweep/ does not call any live API. It replays pre-collected token records offline, simulating what the routing engine would have decided at each threshold value, and compares those decisions against ground-truth LLM-as-judge labels. For each candidate threshold, the sweep iterates over every record in the dataset and runs the entropy window algorithm against the stored token logprob sequence. It produces a routing decision — accept or escalate — for each record, then compares that decision against whether the draft response was actually acceptable.
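In code, the replay at a single threshold looks roughly like the sketch below. The record type and the window rule here are illustrative assumptions, not the actual implementation in benchmarks/cmd/sweep/: the stored records hold raw logprob sequences, and the exact windowing logic lives in the routing engine.

```go
// Illustrative sketch of the offline replay. Real record types in
// benchmarks/cmd/sweep/ may differ.
type TokenRecord struct {
	Entropies       []float64 // per-token entropy, derived from stored logprobs
	DraftAcceptable bool      // ground-truth LLM-as-judge verdict
}

// wouldEscalate replays the routing decision at threshold t. Assumption:
// the engine escalates when the mean entropy over a sliding window of
// windowSize tokens exceeds t; see the routing engine for the exact rule.
func wouldEscalate(entropies []float64, t float64, windowSize int) bool {
	if len(entropies) < windowSize {
		return false // too few tokens to fill a window; treat as accept
	}
	for i := 0; i+windowSize <= len(entropies); i++ {
		sum := 0.0
		for _, e := range entropies[i : i+windowSize] {
			sum += e
		}
		if sum/float64(windowSize) > t {
			return true // sustained uncertainty: would escalate
		}
	}
	return false // confident throughout: would serve the draft
}
```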
The workflow has three stages:

1. Collect token records: run benchmarks/cmd/collect/ to call the drafter and heavyweight on every prompt in your dataset and record the per-token logprob sequences and judge verdicts to a JSONL file.
2. Run the threshold sweep: run benchmarks/cmd/sweep/ against the collected JSONL. The sweep replays all records at each candidate threshold, computes metrics, and writes a CSV summary.
3. Select the best threshold: the sweep auto-selects the threshold with the highest F1 score among those where draft accuracy ≥ 95% (sketched just below). Update entropy.threshold in config.yaml with the selected value.
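The selection rule itself is small enough to state directly. A sketch with illustrative field names, assuming a 95% accuracy floor as in the reference run:

```go
// Result holds per-threshold sweep metrics. Field names are illustrative,
// not the sweep's actual struct.
type Result struct {
	Threshold     float64
	DraftAccuracy float64 // TN / (TN + FN), as defined below
	F1            float64
}

// selectThreshold picks the highest-F1 result among those whose draft
// accuracy clears the floor. ok is false if nothing clears it.
func selectThreshold(results []Result, accuracyFloor float64) (best Result, ok bool) {
	for _, r := range results {
		if r.DraftAccuracy >= accuracyFloor && (!ok || r.F1 > best.F1) {
			best, ok = r, true
		}
	}
	return
}
```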

Confusion matrix

Each record produces one of four outcomes, depending on the routing decision and the judge verdict:
|  | Would escalate | Would accept |
| --- | --- | --- |
| Draft unacceptable | TP — correct escalation | FN — bad draft served |
| Draft acceptable | FP — unnecessary escalation (cost waste) | TN — correct acceptance |
  • TP (true positive): The drafter was uncertain and the gateway correctly escalated to the heavyweight. Good outcome — the bad draft was caught.
  • TN (true negative): The drafter was confident and the draft was acceptable. The gateway correctly served the draft without paying for the heavyweight.
  • FP (false positive): The drafter was uncertain, so the gateway escalated — but the draft was actually acceptable. The escalation was wasted cost.
  • FN (false negative): The drafter was confident, so the gateway accepted the draft — but the draft was not acceptable. The bad draft was served to the user.
FN is the dangerous outcome. FP is the expensive outcome. Calibration finds the threshold that minimizes FN while keeping FP manageable.
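In code, the four outcomes reduce to a switch on two booleans. A sketch; the sweep's actual counter names may differ:

```go
// Counts accumulates the confusion matrix for one candidate threshold.
type Counts struct{ TP, TN, FP, FN int }

// tally buckets one record by routing decision and judge verdict.
func (c *Counts) tally(wouldEscalate, draftAcceptable bool) {
	switch {
	case wouldEscalate && !draftAcceptable:
		c.TP++ // correct escalation: bad draft caught
	case !wouldEscalate && draftAcceptable:
		c.TN++ // correct acceptance: good draft served cheaply
	case wouldEscalate && draftAcceptable:
		c.FP++ // unnecessary escalation: wasted cost
	default:
		c.FN++ // bad draft served: the dangerous outcome
	}
}
```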

Metrics computed per threshold

The sweep computes these metrics for each candidate threshold:
| Metric | Definition |
| --- | --- |
| Escalation rate | Fraction of requests routed to the heavyweight (EscalatedCount / TotalPrompts) |
| Draft accuracy | Fraction of accepted drafts that were acceptable (TN / (TN + FN)) |
| Cost reduction | 1 - estimated_cost / baseline_cost, where baseline is all-heavyweight routing |
| Precision | TP / (TP + FP) — of all escalations, how many were justified |
| Recall | TP / (TP + FN) — of all bad drafts, how many were caught |
| F1 | Harmonic mean of precision and recall |
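These definitions translate mechanically into code. A sketch building on the Counts type above, with guards for the zero denominators that occur at extreme thresholds:

```go
// metrics derives the table's definitions from confusion-matrix counts.
func metrics(c Counts) (escRate, draftAcc, precision, recall, f1 float64) {
	total := c.TP + c.TN + c.FP + c.FN
	escRate = ratio(c.TP+c.FP, total)  // escalations / all requests
	draftAcc = ratio(c.TN, c.TN+c.FN)  // acceptable / accepted
	precision = ratio(c.TP, c.TP+c.FP) // justified / all escalations
	recall = ratio(c.TP, c.TP+c.FN)    // caught / all bad drafts
	if precision+recall > 0 {
		f1 = 2 * precision * recall / (precision + recall)
	}
	return
}

// ratio guards against zero denominators, e.g. zero escalations at T=2.50.
func ratio(num, den int) float64 {
	if den == 0 {
		return 0
	}
	return float64(num) / float64(den)
}
```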

Calibration results

The table below shows the full sweep results from the reference calibration run: 518 valid prompts, gpt-4.1-nano drafter, gpt-4.1 heavyweight.
| Threshold | Escalation rate | Draft accuracy | Cost reduction | F1 |
| --- | --- | --- | --- | --- |
| 1.00 | 68.9% | 100.0% | 8.2% | 0.06 |
| 1.25 | 49.2% | 99.6% | 31.0% | 0.08 |
| 1.50 | 30.9% | 98.6% | 56.2% | 0.07 |
| 1.75 | 13.9% | 98.4% | 81.2% | 0.10 |
| 2.00 | 6.0% | 98.2% | 91.6% | 0.10 |
| 2.25 | 0.4% | 97.9% | 99.0% | 0.00 |
| 2.50 | 0.0% | 97.9% | 99.2% | 0.00 |

Selected threshold: T=2.0

T=2.0 was selected as the best threshold: it achieves the highest F1 score (0.10) among all thresholds where draft accuracy is at or above 95%. At this value:
  • 94% draft acceptance — the heavyweight is called on only 6% of requests.
  • 98.2% draft accuracy — almost all accepted drafts are acceptable to serve.
  • 91.6% cost reduction — compared to routing every request to the heavyweight.
At T=2.25 and T=2.50, cost reduction approaches 99%, but F1 collapses to 0.00: almost nothing is escalated, so bad drafts that do exist are never caught. At T=1.00, accuracy is 100% but cost reduction drops to 8.2% because nearly 70% of requests hit the heavyweight. T=2.0 sits at the knee of the curve: the last threshold with meaningful escalation before the routing engine effectively stops escalating anything.
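The absolute stakes are easier to see with an explicit cost model. The sketch below uses a simplified model that is an assumption, not necessarily the sweep's exact accounting: every request pays for the drafter, and escalated requests additionally pay for the heavyweight.

```go
// costReduction under a simplified model (an assumption, not the sweep's
// exact accounting). Costs are normalized so one heavyweight call = 1:
// every request pays drafterCost, escalated requests also pay 1.
func costReduction(drafterCost, escalationRate float64) float64 {
	estimatedPerRequest := drafterCost + escalationRate
	return 1 - estimatedPerRequest // baseline: all-heavyweight = 1 per request
}
```

Under this model, the reference run's 6% escalation and 91.6% reduction imply a drafter overhead of roughly 2.4% of baseline, and even a free drafter would cap reduction at 94%. Squeezing out the last few points of cost therefore requires nearly abandoning escalation, which is exactly what the F1 collapse at T≥2.25 shows.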

Tradeoff guidance

Lower thresholds are more conservative and more expensive. Higher thresholds are cheaper but riskier.

Lower T (e.g. T=1.5)

More escalations. Higher accuracy. Higher cost. Use when incorrect answers are expensive — customer-facing support, medical, legal, financial workloads.

Higher T (e.g. T=2.25)

Fewer escalations. Lower accuracy. Lower cost. Use when drafts are low-stakes and you are primarily optimizing for throughput cost.
Start conservative in production. Deploy at T=1.75 or T=2.0, monitor draftthinker_routing_decisions_total and your downstream accuracy signal for two weeks, then re-calibrate if the data supports a higher threshold. Pushing cost reduction from 91.6% to ~99% saves little in absolute terms once you are already below 10% escalation, but the accuracy risk of doing so is harder to quantify without production data.
Known failure mode: confident hallucination. Entropy calibration cannot catch cases where the drafter generates a wrong answer with low uncertainty. If the drafter concentrates probability mass on a plausible-but-incorrect token sequence, entropy stays low, the response is accepted, and the error is served. This is a documented limitation, not a bug. Mitigations include periodic accuracy audits against a held-out labeled set, downstream user feedback loops, and a conservative threshold. See the README for the full discussion.
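The blind spot follows directly from the math: entropy measures how spread out the next-token distribution is, not whether the top token is correct, so a confidently wrong token is indistinguishable from a confidently right one. A self-contained illustration (the gateway derives probabilities from the API's top_logprobs; here they are passed in directly):

```go
package main

import (
	"fmt"
	"math"
)

// shannonEntropy computes -sum(p * ln p) over a next-token distribution.
func shannonEntropy(probs []float64) float64 {
	h := 0.0
	for _, p := range probs {
		if p > 0 {
			h -= p * math.Log(p)
		}
	}
	return h
}

func main() {
	// A peaked distribution has low entropy whether or not its top token
	// is correct; entropy alone cannot tell those two cases apart.
	confident := []float64{0.97, 0.01, 0.01, 0.005, 0.005}
	uncertain := []float64{0.30, 0.25, 0.20, 0.15, 0.10}
	fmt.Printf("peaked: %.2f nats\n", shannonEntropy(confident)) // ~0.17
	fmt.Printf("spread: %.2f nats\n", shannonEntropy(uncertain)) // ~1.54
}
```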

How to re-calibrate for your workload

The reference calibration was run on a general-purpose prompt set. If your workload is domain-specific — legal documents, code review, medical Q&A — you should re-calibrate against prompts representative of your traffic.
Step 1: Prepare your prompt dataset

Edit or replace benchmarks/testdata/prompts.json. Each entry requires an id, a category (simple_factual, multi_step_reasoning, code_generation, or ambiguous_creative), and a text field.
```json
{
  "prompts": [
    {
      "id": "my-prompt-001",
      "category": "simple_factual",
      "text": "What is the standard port for HTTPS?"
    }
  ]
}
```
Step 2: Collect responses

Run the collector against your dataset. Set OPENAI_API_KEY before running.
```bash
go run ./benchmarks/cmd/collect/ \
  --prompts benchmarks/testdata/prompts.json \
  --output benchmarks/results/collected.jsonl \
  --concurrency 3
```
The collector calls both the drafter (gpt-4.1-nano) and heavyweight (gpt-4.1) for each prompt, runs LLM-as-judge evaluation, and writes the results to a JSONL file. The collection is resumable — already-collected prompt IDs are skipped on restart.
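One common way to implement that kind of resumability, shown here as a sketch rather than the collector's actual code, is to scan the existing JSONL output for IDs before starting:

```go
import (
	"bufio"
	"encoding/json"
	"os"
)

// collectedIDs returns the set of prompt IDs already present in the JSONL
// output. Illustrative: the collector's real record format may differ.
func collectedIDs(path string) (map[string]bool, error) {
	seen := make(map[string]bool)
	f, err := os.Open(path)
	if err != nil {
		if os.IsNotExist(err) {
			return seen, nil // no output yet: nothing to skip
		}
		return nil, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		var rec struct {
			ID string `json:"id"`
		}
		if json.Unmarshal(sc.Bytes(), &rec) == nil && rec.ID != "" {
			seen[rec.ID] = true
		}
	}
	return seen, sc.Err()
}
```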
Step 3: Run the threshold sweep

Sweep a range of thresholds over the collected data. Adjust the --thresholds list to include values relevant to your use case.
```bash
go run ./benchmarks/cmd/sweep/ \
  --input benchmarks/results/collected.jsonl \
  --output benchmarks/results/sweep.csv \
  --thresholds 0.5,0.75,1.0,1.25,1.5,1.75,2.0,2.25,2.5 \
  --window-size 10 \
  --early-exit-count 10
```
The sweep prints a summary table to stdout and writes full per-threshold metrics to the CSV file.
Step 4: Update config.yaml

Set entropy.threshold to the selected threshold value. The sweep's stdout summary prints the selected threshold directly.
config.yaml:

```yaml
entropy:
  threshold: 2.0   # replace with your selected value
  window_size: 10
  early_exit_count: 10
  top_logprobs: 5
```
Restart the gateway to apply the change.
