What is Draft Thinker?
Draft Thinker is a cost-aware LLM gateway written in Go. It sits between your application and LLM providers, routing each request through a fast, cheap model first and escalating to an expensive frontier model only when necessary. The result: a 91.6% reduction in total cost of ownership (TCO) compared to sending all traffic to a heavyweight model, while maintaining 98.2% accuracy on the draft path.

The problem it solves
LLM-powered applications typically send 100% of traffic to frontier models regardless of query complexity. A question like “What are your hours?” costs the same as “Explain the tradeoffs between B-tree and LSM-tree storage engines.” This is wasteful in three ways:

- Cost: 70%+ of queries are answerable by models costing 10–50x less.
- Latency: Frontier models have 2–5x higher time-to-first-token than small models.
- Scale: At high throughput, frontier model rate limits become the bottleneck, not your application.
The core insight
Draft Thinker solves this by analyzing the drafter model’s own confidence signals during generation. Every token a model produces comes with log-probabilities for its top candidates. High entropy (uncertainty) across those candidates means the model is guessing. Low entropy means it’s confident. The gateway watches these signals in real time as the drafter generates. If confidence stays high throughout, it ships the draft. If confidence drops, it escalates to the heavyweight. This makes routing decisions based on actual model behavior, not predicted query difficulty.

Three core mechanisms
Entropy-based routing
Computes Shannon entropy over the drafter’s token log-probabilities using a sliding window of 10 tokens. If windowed entropy exceeds the calibrated threshold T = 2.0 bits at any point, the request is escalated. If the first 10 tokens already exceed T, the draft is aborted immediately to avoid wasting compute.
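A minimal sketch of what the entropy check could look like in Go. It assumes the drafter streams natural-log top-candidate logprobs per token (as the OpenAI API returns them) and that “windowed entropy” means the mean per-token entropy over the last 10 tokens; the names and structure are illustrative, not Draft Thinker’s actual internals:

```go
package entropy

import "math"

// Calibrated values from the docs: 10-token window, T = 2.0 bits.
const (
	WindowSize = 10
	Threshold  = 2.0 // bits
)

// TokenEntropy computes Shannon entropy H = -Σ p·log2(p) over the
// top-candidate log-probabilities for a single token. Inputs are
// natural-log probabilities.
func TokenEntropy(logprobs []float64) float64 {
	h := 0.0
	for _, lp := range logprobs {
		p := math.Exp(lp)
		if p > 0 {
			h -= p * math.Log2(p)
		}
	}
	return h
}

// Window tracks a rolling mean of per-token entropies.
type Window struct {
	vals []float64
	sum  float64
}

// Push adds one token's entropy and reports whether the windowed mean
// exceeds the threshold once the window is full.
func (w *Window) Push(h float64) (escalate bool) {
	w.vals = append(w.vals, h)
	w.sum += h
	if len(w.vals) > WindowSize {
		w.sum -= w.vals[0]
		w.vals = w.vals[1:]
	}
	return len(w.vals) == WindowSize && w.sum/WindowSize > Threshold
}
```

Because Push starts reporting as soon as the window first fills, the early-abort case (first 10 tokens already above T) falls out of the same check.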
Speculative execution
When early tokens show elevated but not yet critical uncertainty (entropy > 0.8 × T), Draft Thinker fires a parallel request to the heavyweight model. If the drafter recovers, the heavyweight call is canceled. If not, the heavyweight already has a head start — eliminating the full double-latency penalty of naive serial draft-then-verify.
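This cancellation pattern is natural in Go. A simplified sketch, assuming a hypothetical Completer client interface and an ErrEscalate sentinel; unlike the real design, it launches the heavyweight immediately rather than waiting for entropy to cross 0.8 × T mid-stream:

```go
package gateway

import (
	"context"
	"errors"
)

// ErrEscalate is returned by the drafter when windowed entropy exceeds
// the threshold mid-generation. Illustrative only.
var ErrEscalate = errors.New("entropy threshold exceeded")

// Completer abstracts a model client; assumed for this sketch.
type Completer interface {
	Complete(ctx context.Context, prompt string) (string, error)
}

type result struct {
	out string
	err error
}

// speculate runs the drafter while a heavyweight request proceeds in
// parallel. If the drafter finishes confidently, the heavyweight call is
// canceled; if the drafter escalates, the heavyweight has a head start.
func speculate(ctx context.Context, draft, heavy Completer, prompt string) (string, error) {
	hctx, cancel := context.WithCancel(ctx)
	defer cancel()

	heavyCh := make(chan result, 1)
	go func() {
		out, err := heavy.Complete(hctx, prompt)
		heavyCh <- result{out, err}
	}()

	out, err := draft.Complete(ctx, prompt)
	if err == nil {
		return out, nil // drafter recovered; deferred cancel stops the heavyweight
	}
	if !errors.Is(err, ErrEscalate) {
		return "", err
	}
	r := <-heavyCh
	return r.out, r.err
}
```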
Semantic cache
Previously verified prompt–response pairs are stored as embeddings in Qdrant. If an incoming prompt is semantically similar (cosine similarity > 0.95) to a cached entry, the response is returned directly — bypassing the entire draft-verify cycle. Only draft-accepted responses are cached; escalated responses are not.
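A sketch of the lookup gate, with the embedder and vector store behind assumed interfaces; this does not reproduce the Qdrant client API, only the decision logic:

```go
package cache

import "context"

// Embedder and VectorStore are assumed interfaces standing in for the
// embedding model and the Qdrant client; method names are illustrative.
type Embedder interface {
	Embed(ctx context.Context, text string) ([]float32, error)
}

type VectorStore interface {
	// Nearest returns the best-matching cached response and its
	// cosine similarity to the query vector.
	Nearest(ctx context.Context, vec []float32) (response string, sim float32, err error)
}

const simThreshold = 0.95 // cosine similarity cutoff from the docs

// Lookup returns a cached response when an incoming prompt is
// semantically close enough to a previously draft-accepted one.
func Lookup(ctx context.Context, e Embedder, s VectorStore, prompt string) (string, bool, error) {
	vec, err := e.Embed(ctx, prompt)
	if err != nil {
		return "", false, err
	}
	resp, sim, err := s.Nearest(ctx, vec)
	if err != nil || sim <= simThreshold {
		return "", false, err // miss: fall through to the draft-verify cycle
	}
	return resp, true, nil
}
```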
OpenAI-compatible API
The gateway exposes a POST /v1/chat/completions endpoint that is a drop-in replacement for the OpenAI API. The model field in the request is overridden internally; your application does not need to know which model handled the request.
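Pointing an existing OpenAI client at the gateway is all that’s required. A minimal Go call against a locally running instance (the localhost address is an assumption):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Standard OpenAI chat-completions payload; the gateway overrides
	// the model field, so its value here is effectively ignored.
	body := []byte(`{
		"model": "gpt-4.1",
		"messages": [{"role": "user", "content": "What are your hours?"}]
	}`)

	// Assumes Draft Thinker is listening locally; adjust host/port as needed.
	resp, err := http.Post("http://localhost:8080/v1/chat/completions",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```

The response comes back in the standard OpenAI completion shape, regardless of which model ultimately answered.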
Key results
Calibrated on 518 prompts across four categories — simple factual, multi-step reasoning, code generation, and ambiguous/creative — using LLM-as-judge evaluation:

| Metric | Value |
|---|---|
| TCO reduction vs. all-heavyweight | 91.6% (at T=2.0) |
| Draft acceptance rate | 94% of requests served by drafter |
| Accuracy on draft path | 98.2% acceptable (LLM-as-judge) |
| P99 latency (draft path) | 109 ms at 50 req/s |
| Proxy overhead | < 5 ms P99 |
| Calibrated threshold | T = 2.0 (Shannon entropy in bits, 10-token sliding window) |
Tech stack
| Component | Technology |
|---|---|
| Gateway | Go net/http — goroutines for concurrent I/O, no framework overhead |
| Entropy engine | Go math — pure math, no cross-language boundary |
| Drafter model | OpenAI gpt-4.1-nano — fast, cheap, returns logprobs |
| Heavyweight model | OpenAI gpt-4.1 — escalation target |
| Vector cache | Qdrant — nearest-neighbor lookup for semantic cache |
| KV store | Redis — TTLs, metadata, rate counters |
| Observability | Prometheus + Grafana — cost/request, entropy distributions, cache hit rate |
| Deployment | Docker Compose — single command spins up all services |
Known limitation: confident hallucination. The drafter can produce a wrong answer with low entropy — meaning the routing decision is “accept” but the output is incorrect. This is the fundamental limitation of entropy-based routing. It is mitigated by periodic accuracy audits, downstream feedback loops, and a conservative initial threshold. It is a documented tradeoff, not a bug.
Architecture overview
Next steps
Ready to run Draft Thinker locally?

Quick start
Get Draft Thinker running in under five minutes.