The problem with prompt classifiers
A prompt classifier labels requests as “easy” or “hard” before any model touches them. This works on the training distribution but fails in production, because questions that look syntactically simple can require complex reasoning depending on context. “What’s the current Fed rate?” is only a handful of words, yet the answer may be stale or jurisdiction-specific, or the question may require numerical calculation, and the classifier can detect none of these. Entropy-based routing avoids this entirely. Rather than predicting difficulty before generation, it measures actual model confidence during generation. A drafter that is certain produces narrow, peaked token distributions; a drafter that is confused spreads probability mass across many candidates. That signal is available token by token and is robust to novel query patterns.

Shannon entropy over logprobs
For each generated token, the drafter returns logprobs for the top-k candidates. The per-token entropy is the Shannon entropy of the renormalised candidate distribution:

H = −Σᵢ pᵢ ln pᵢ,  where pᵢ = exp(logprobᵢ) / Σⱼ exp(logprobⱼ)

ComputeEntropy in internal/entropy/entropy.go converts the logprobs to probabilities, normalises them, and applies the Shannon formula:
The renormalisation step (p /= sum) handles cases where the exponentiated top-k logprobs do not sum to 1.0 — a common property of truncated distributions returned by hosted APIs.
Sliding window smoothing
Individual token entropy is noisy. Rare proper nouns, punctuation, and code identifiers all produce momentary entropy spikes that do not indicate reasoning failure. Computing a routing decision on every token would cause spurious escalations. The solution is a sliding window average over the last 10 tokens (WindowConfig.Size = 10). The window tracks a running sum and uses a circular buffer to evict the oldest value on each addition:
Escalate is returned only when the windowed average exceeds the calibrated threshold T. Before the window fills, the early-exit path applies (see below).
Early exit
If any of the first EarlyExitCount tokens individually exceeds T, the drafter is aborted immediately: there is no point completing a response that will be discarded. The router drains the drafter channel in a background goroutine to avoid blocking the upstream connection:
Routing decisions
The router produces one of two outcomes, recorded in the draftthinker_routing_decisions_total counter under the decision label:
| Decision | Meaning |
|---|---|
| accept | Windowed entropy stayed below T for the full response. Draft is served to the client. |
| escalate | Windowed entropy exceeded T (or early exit fired). Request is forwarded to the heavyweight model. |
Calibrated threshold T = 2.0
T is not a guess — it is selected empirically. The sweep tool in benchmarks/cmd/sweep/ runs the drafter across a labelled benchmark set at each candidate threshold and records escalation rate, draft accuracy, cost reduction, and F1.
Calibration results — 518 prompts, gpt-4.1-nano drafter, gpt-4.1 heavyweight:
| Threshold | Escalation rate | Draft accuracy | Cost reduction | F1 |
|---|---|---|---|---|
| 1.00 | 68.9% | 100.0% | 8.2% | 0.06 |
| 1.25 | 49.2% | 99.6% | 31.0% | 0.08 |
| 1.50 | 30.9% | 98.6% | 56.2% | 0.07 |
| 1.75 | 13.9% | 98.4% | 81.2% | 0.10 |
| 2.00 | 6.0% | 98.2% | 91.6% | 0.10 |
| 2.25 | 0.4% | 97.9% | 99.0% | 0.00 |
| 2.50 | 0.0% | 97.9% | 99.2% | 0.00 |