The problem
Consider a request that the drafter cannot answer well:
- Drafter runs for 800 ms before entropy exceeds T
- Drafter is aborted, heavyweight is started
- Heavyweight runs for 1,200 ms
- Total latency: 2,000 ms
The solution: a soft threshold
Speculative execution introduces a soft threshold at 0.8 * T. When any token’s entropy exceeds this lower value while the window is still filling, Draft Thinker fires a parallel heavyweight request immediately, before committing to escalation.
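The check itself is simple arithmetic. A minimal Go sketch, with illustrative function names rather than the gateway's actual API:

```go
package main

import "fmt"

// softThreshold derives the early-fire threshold from the hard
// threshold T and the configured multiplier (0.8 by default).
func softThreshold(hardThreshold, mult float64) float64 {
	return hardThreshold * mult
}

// exceedsSoft reports whether a single token's raw entropy crosses
// the soft threshold while the window is still filling.
func exceedsSoft(tokenEntropy, hardThreshold, mult float64) bool {
	return tokenEntropy > softThreshold(hardThreshold, mult)
}

func main() {
	fmt.Println(softThreshold(2.0, 0.8)) // 1.6 with the default configuration
	fmt.Println(exceedsSoft(1.7, 2.0, 0.8))
}
```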
Three outcomes are then possible:
- Entropy drops back below T: the drafter recovers. The heavyweight call is cancelled. The draft is accepted at no extra latency cost.
- Entropy stays high and exceeds T: the hard threshold fires. The heavyweight already has a head start. Additional user-facing latency = heavyweight_total - drafter_abort_time, not heavyweight_total.
- Entropy never exceeded 0.8 * T: speculative execution never triggered. Normal draft-accept path.
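Using the numbers from the example at the top (800 ms drafter run, 1,200 ms heavyweight run) and a hypothetical soft-threshold fire time of 600 ms, the head-start arithmetic works out as follows. The fire time is an assumed figure for illustration:

```go
package main

import "fmt"

// totalWithoutSpeculation: the heavyweight only starts once the
// drafter aborts, so the two runs are strictly sequential.
func totalWithoutSpeculation(abortMs, heavyMs float64) float64 {
	return abortMs + heavyMs
}

// totalWithSpeculation: the heavyweight started at the soft-threshold
// fire time, so the head start (abort - fire) comes off the total.
func totalWithSpeculation(fireMs, abortMs, heavyMs float64) float64 {
	return abortMs + heavyMs - (abortMs - fireMs)
}

func main() {
	fmt.Println(totalWithoutSpeculation(800, 1200))   // 2000 ms, as in the problem statement
	fmt.Println(totalWithSpeculation(600, 800, 1200)) // 1800 ms with a 200 ms head start
}
```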
The head start is recorded in a new metric, draftthinker_speculative_latency_saved_seconds: the duration between when the speculative call was fired and when the hard escalation was confirmed.
State machine
internal/speculative/state.go:
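The file itself is not reproduced here. A minimal sketch of the states named in this document (SPECULATING and DRAFT_ACCEPTED appear in the prose; the remaining state names and the transition table are assumptions about what `internal/speculative/state.go` contains):

```go
package main

import "fmt"

// State models the request lifecycle. Drafting and Escalated are
// assumed names; Speculating and DraftAccepted come from the text.
type State int

const (
	Drafting      State = iota // drafter streaming, below the soft threshold
	Speculating                // soft threshold exceeded, heavyweight in flight
	DraftAccepted              // window average stayed below T
	Escalated                  // hard threshold fired, heavyweight response used
)

// legal lists the permitted transitions out of each non-terminal state.
var legal = map[State][]State{
	Drafting:    {Speculating, DraftAccepted, Escalated},
	Speculating: {DraftAccepted, Escalated},
}

// Transition returns the new state, or an error for an illegal move.
func Transition(from, to State) (State, error) {
	for _, next := range legal[from] {
		if next == to {
			return to, nil
		}
	}
	return from, fmt.Errorf("illegal transition %v -> %v", from, to)
}

func main() {
	s, _ := Transition(Drafting, Speculating)
	fmt.Println(s == Speculating)
}
```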
Request lifecycle
Drafter starts streaming
The gateway forwards the request to gpt-4.1-nano with
logprobs=true, stream=true. Tokens arrive and are fed to the entropy window one by one.
Soft threshold check per token
For each token, the executor checks whether the raw token entropy exceeds
softThreshold = WindowCfg.Threshold * SoftThresholdMult. The default SoftThresholdMult is 0.8, giving a soft threshold of 1.6 when T = 2.0.
Speculative call fired (if soft threshold exceeded)
On the first token that exceeds the soft threshold,
Execute transitions to Speculating, records the start time, increments speculative_triggers_total, and opens a cancellable parallel request to the heavyweight model:
Hard threshold check continues
The drafter keeps generating. The window continues accumulating entropy. The hard threshold check runs as normal.
Outcome A — draft accepted
The drafter finishes without the window average exceeding
T. If a speculative call is running, it is cancelled and the channel is drained. speculative_cancellations_total is incremented. The draft response is returned.
Configuration
The soft threshold multiplier is set via ExecutorConfig.SoftThresholdMult; the default is 0.8. Lowering it fires the speculative call earlier (more head start, higher cancellation rate). Raising it fires later (less wasted compute, smaller latency savings).
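A sketch of the config field and its defaulting, assuming a zero value means "use the default" (the withDefaults helper is hypothetical; only the field name and the 0.8 default come from the text):

```go
package main

import "fmt"

// ExecutorConfig holds the field named in the text; other executor
// settings are omitted.
type ExecutorConfig struct {
	SoftThresholdMult float64
}

// withDefaults fills in the documented default of 0.8 when unset.
func (c ExecutorConfig) withDefaults() ExecutorConfig {
	if c.SoftThresholdMult == 0 {
		c.SoftThresholdMult = 0.8
	}
	return c
}

func main() {
	cfg := ExecutorConfig{}.withDefaults()
	fmt.Println(cfg.SoftThresholdMult) // 0.8
}
```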
Cost tradeoff
Speculative execution wastes compute when the drafter recovers: the heavyweight call is cancelled before it produces output, but the provider has already received and begun processing the request. From the system design:

> Speculative execution trades a small amount of wasted compute (~5–10% of escalated request costs) for significantly better tail latency. Net TCO impact is minimal: only ~30% of requests trigger escalation, and of those, the speculative call is canceled ~40% of the time (drafter recovers).

The wasted compute is bounded: only requests in the SPECULATING state that resolve to DRAFT_ACCEPTED generate a cancelled speculative call. Requests that never hit the soft threshold, and requests that do escalate, incur no waste.
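Plugging in the quoted figures, the fraction of all requests that ever produce a cancelled speculative call is small. A quick illustrative calculation using the approximate rates from the design quote:

```go
package main

import "fmt"

// wastedFraction: share of all requests that fire a speculative call
// and then cancel it (trigger rate times cancellation rate).
func wastedFraction(triggerRate, cancelRate float64) float64 {
	return triggerRate * cancelRate
}

func main() {
	// ~30% of requests reach the escalation path, ~40% of those cancel.
	fmt.Printf("~%.0f%% of requests waste a speculative call\n", wastedFraction(0.30, 0.40)*100)
}
```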
Phase 4 metrics
| Metric | Type | Description |
|---|---|---|
| draftthinker_speculative_triggers_total | Counter | Total speculative heavyweight calls fired (soft threshold exceeded) |
| draftthinker_speculative_cancellations_total | Counter | Speculative calls cancelled (drafter recovered before hard threshold) |
| draftthinker_speculative_latency_saved_seconds | Histogram | Head-start time saved on escalated requests that had a running speculative call |
speculative_latency_saved_seconds uses the same bucket boundaries as upstream_latency_seconds: 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30.
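As a sanity check on the shared boundaries, a small sketch that maps an observation to its bucket. bucketFor is a hypothetical helper for illustration, not the metrics library's API:

```go
package main

import "fmt"

// Shared bucket boundaries (in seconds) for
// speculative_latency_saved_seconds and upstream_latency_seconds.
var latencyBuckets = []float64{0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30}

// bucketFor returns the index of the first bucket whose upper bound
// is >= v; len(buckets) stands for the implicit +Inf bucket.
func bucketFor(v float64, buckets []float64) int {
	for i, ub := range buckets {
		if v <= ub {
			return i
		}
	}
	return len(buckets)
}

func main() {
	fmt.Println(bucketFor(1.2, latencyBuckets)) // a 1.2 s head start lands in the 2.5 s bucket
}
```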
Interpreting the metrics
- Trigger rate = speculative_triggers_total / requests_total. Higher means more requests are hitting the soft threshold. This is workload-dependent and does not indicate a problem on its own.
- Latency saved = the histogram shows the head-start duration per escalated request. Higher p50 and p99 values mean speculative execution is eliminating more tail latency.