What is Draft Thinker?
Draft Thinker is a cost-aware LLM gateway written in Go. It sits between your application and LLM providers, routing each request through a fast, cheap model first and escalating to an expensive frontier model only when necessary. The result: a 91.6% reduction in total cost of ownership (TCO) compared to sending all traffic to a heavyweight model, while maintaining 98.2% accuracy on the draft path.

The problem it solves
LLM-powered applications typically send 100% of traffic to frontier models regardless of query complexity. A question like “What are your hours?” costs the same as “Explain the tradeoffs between B-tree and LSM-tree storage engines.” This is wasteful in three ways:

- Cost: 70%+ of queries are answerable by models costing 10–50x less.
- Latency: Frontier models have 2–5x higher time-to-first-token than small models.
- Scale: At high throughput, frontier model rate limits become the bottleneck, not your application.
The core insight
Draft Thinker solves this by analyzing the drafter model’s own confidence signals during generation. Every token a model produces comes with log-probabilities for its top candidates. High entropy (uncertainty) across those candidates means the model is guessing. Low entropy means it’s confident. The gateway watches these signals in real time as the drafter generates. If confidence stays high throughout, it ships the draft. If confidence drops, it escalates to the heavyweight. This makes routing decisions based on actual model behavior, not predicted query difficulty.

Three core mechanisms
Entropy-based routing
Computes Shannon entropy over the drafter’s token log-probabilities using a sliding window of 10 tokens. If windowed entropy exceeds the calibrated threshold T = 2.0 bits at any point, the request is escalated. If the first 10 tokens already exceed T, the draft is aborted immediately to avoid wasting compute.
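A minimal sketch of what the entropy check could look like in Go. It assumes the drafter streams natural-log top-candidate logprobs per token (as the OpenAI API returns them) and that “windowed entropy” means the mean per-token entropy over the last 10 tokens; the names and structure are illustrative, not Draft Thinker’s actual internals:

```go
package entropy

import "math"

// Calibrated values from the docs: 10-token window, T = 2.0 bits.
const (
	WindowSize = 10
	Threshold  = 2.0 // bits
)

// TokenEntropy computes Shannon entropy H = -Σ p·log2(p) over the
// top-candidate log-probabilities for a single token. Inputs are
// natural-log probabilities.
func TokenEntropy(logprobs []float64) float64 {
	h := 0.0
	for _, lp := range logprobs {
		p := math.Exp(lp)
		if p > 0 {
			h -= p * math.Log2(p)
		}
	}
	return h
}

// Window tracks a rolling mean of per-token entropies.
type Window struct {
	vals []float64
	sum  float64
}

// Push adds one token's entropy and reports whether the windowed mean
// exceeds the threshold once the window is full.
func (w *Window) Push(h float64) (escalate bool) {
	w.vals = append(w.vals, h)
	w.sum += h
	if len(w.vals) > WindowSize {
		w.sum -= w.vals[0]
		w.vals = w.vals[1:]
	}
	return len(w.vals) == WindowSize && w.sum/WindowSize > Threshold
}
```

Because Push starts reporting as soon as the window first fills, the early-abort case (first 10 tokens already above T) falls out of the same check.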
Speculative execution
When early tokens show elevated but not yet critical uncertainty (entropy > 0.8 × T), Draft Thinker fires a parallel request to the heavyweight model. If the drafter recovers, the heavyweight call is canceled. If not, the heavyweight already has a head start — eliminating the full double-latency penalty of naive serial draft-then-verify.
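This cancellation pattern is natural in Go. A simplified sketch, assuming a hypothetical Completer client interface and an ErrEscalate sentinel; unlike the real design, it launches the heavyweight immediately rather than waiting for entropy to cross 0.8 × T mid-stream:

```go
package gateway

import (
	"context"
	"errors"
)

// ErrEscalate is returned by the drafter when windowed entropy exceeds
// the threshold mid-generation. Illustrative only.
var ErrEscalate = errors.New("entropy threshold exceeded")

// Completer abstracts a model client; assumed for this sketch.
type Completer interface {
	Complete(ctx context.Context, prompt string) (string, error)
}

type result struct {
	out string
	err error
}

// speculate runs the drafter while a heavyweight request proceeds in
// parallel. If the drafter finishes confidently, the heavyweight call is
// canceled; if the drafter escalates, the heavyweight has a head start.
func speculate(ctx context.Context, draft, heavy Completer, prompt string) (string, error) {
	hctx, cancel := context.WithCancel(ctx)
	defer cancel()

	heavyCh := make(chan result, 1)
	go func() {
		out, err := heavy.Complete(hctx, prompt)
		heavyCh <- result{out, err}
	}()

	out, err := draft.Complete(ctx, prompt)
	if err == nil {
		return out, nil // drafter recovered; deferred cancel stops the heavyweight
	}
	if !errors.Is(err, ErrEscalate) {
		return "", err
	}
	r := <-heavyCh
	return r.out, r.err
}
```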
Semantic cache
Previously verified prompt–response pairs are stored as embeddings in Qdrant. If an incoming prompt is semantically similar (cosine similarity > 0.95) to a cached entry, the response is returned directly — bypassing the entire draft-verify cycle. Only draft-accepted responses are cached; escalated responses are not.
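A sketch of the lookup gate, with the embedder and vector store behind assumed interfaces; this does not reproduce the Qdrant client API, only the decision logic:

```go
package cache

import "context"

// Embedder and VectorStore are assumed interfaces standing in for the
// embedding model and the Qdrant client; method names are illustrative.
type Embedder interface {
	Embed(ctx context.Context, text string) ([]float32, error)
}

type VectorStore interface {
	// Nearest returns the best-matching cached response and its
	// cosine similarity to the query vector.
	Nearest(ctx context.Context, vec []float32) (response string, sim float32, err error)
}

const simThreshold = 0.95 // cosine similarity cutoff from the docs

// Lookup returns a cached response when an incoming prompt is
// semantically close enough to a previously draft-accepted one.
func Lookup(ctx context.Context, e Embedder, s VectorStore, prompt string) (string, bool, error) {
	vec, err := e.Embed(ctx, prompt)
	if err != nil {
		return "", false, err
	}
	resp, sim, err := s.Nearest(ctx, vec)
	if err != nil || sim <= simThreshold {
		return "", false, err // miss: fall through to the draft-verify cycle
	}
	return resp, true, nil
}
```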
OpenAI-compatible API
The gateway exposes a POST /v1/chat/completions endpoint that is a drop-in replacement for the OpenAI API. The model field in the request is overridden internally; your application does not need to know which model handled the request.
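Pointing an existing OpenAI client at the gateway is all that’s required. A minimal Go call against a locally running instance (the localhost address is an assumption):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Standard OpenAI chat-completions payload; the gateway overrides
	// the model field, so its value here is effectively ignored.
	body := []byte(`{
		"model": "gpt-4.1",
		"messages": [{"role": "user", "content": "What are your hours?"}]
	}`)

	// Assumes Draft Thinker is listening locally; adjust host/port as needed.
	resp, err := http.Post("http://localhost:8080/v1/chat/completions",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```

The response comes back in the standard OpenAI completion shape, regardless of which model ultimately answered.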
Key results
Calibrated on 518 prompts across four categories — simple factual, multi-step reasoning, code generation, and ambiguous/creative — using LLM-as-judge evaluation:

| Metric | Value |
|---|---|
| TCO reduction vs. all-heavyweight | 91.6% (at T=2.0) |
| Draft acceptance rate | 94% of requests served by drafter |
| Accuracy on draft path | 98.2% acceptable (LLM-as-judge) |
| P99 latency (draft path) | 109 ms at 50 req/s |
| Proxy overhead | < 5 ms P99 |
| Calibrated threshold | T = 2.0 (Shannon entropy in bits, 10-token sliding window) |
Tech stack
| Component | Technology |
|---|---|
| Gateway | Go net/http — goroutines for concurrent I/O, no framework overhead |
| Entropy engine | Go math — pure math, no cross-language boundary |
| Drafter model | OpenAI gpt-4.1-nano — fast, cheap, returns logprobs |
| Heavyweight model | OpenAI gpt-4.1 — escalation target |
| Vector cache | Qdrant — nearest-neighbor lookup for semantic cache |
| KV store | Redis — TTLs, metadata, rate counters |
| Observability | Prometheus + Grafana — cost/request, entropy distributions, cache hit rate |
| Deployment | Docker Compose — single command spins up all services |
Known limitation: confident hallucination. The drafter can produce a wrong answer with low entropy — meaning the routing decision is “accept” but the output is incorrect. This is the fundamental limitation of entropy-based routing. It is mitigated by periodic accuracy audits, downstream feedback loops, and a conservative initial threshold. It is a documented tradeoff, not a bug.
Architecture overview
Next steps
Ready to run Draft Thinker locally?

Quick start
Get Draft Thinker running in under five minutes.