When a hard question must be escalated, a naive draft-then-verify pipeline serializes the two models: the drafter runs to completion (or abort), and only then does the heavyweight start from zero, so the user pays both latencies back to back. Speculative execution eliminates most of that penalty by starting the heavyweight before the escalation decision is final.

The problem

Consider a request that the drafter cannot answer well:
  1. Drafter runs for 800 ms before entropy exceeds T
  2. Drafter is aborted, heavyweight is started
  3. Heavyweight runs for 1,200 ms
  4. Total latency: 2,000 ms
The user waited for a draft that was never used. If the heavyweight had started earlier, the 800 ms of drafter time would overlap with heavyweight time rather than precede it.

The solution: a soft threshold

Speculative execution introduces a soft threshold at 0.8 * T. When any token’s entropy exceeds this lower value while the window is still filling, Draft Thinker fires a parallel heavyweight request immediately — before committing to escalation. Three outcomes are then possible:
  • Entropy drops back below T: the drafter recovers. The heavyweight call is cancelled. The draft is accepted at no extra latency cost.
  • Entropy stays high and exceeds T: the hard threshold fires. The heavyweight already has a head start. Additional user-facing latency = heavyweight_total - drafter_abort_time, not heavyweight_total.
  • Entropy never exceeds 0.8 * T: speculative execution never triggers. Normal draft-accept path.
The saved latency is measured and recorded as draftthinker_speculative_latency_saved_seconds — the duration between when the speculative call was fired and when the hard escalation was confirmed.

State machine

         ┌───────────┐
         │ DRAFTING  │
         └────┬──────┘

     entropy > 0.8*T?
        ┌─────┴─────┐
        │ NO        │ YES
        ▼           ▼
  ┌──────────┐ ┌─────────────┐
  │ DRAFT    │ │ SPECULATING │
  │ ACCEPTED │ └──────┬──────┘
  └──────────┘        │
               entropy < T?
              ┌───────┴───────┐
              │ YES           │ NO
              ▼               ▼
        ┌──────────┐   ┌───────────┐
        │ DRAFT    │   │ ESCALATED │
        │ ACCEPTED │   │ (heavy    │
        │ (cancel  │   │  response)│
        │  heavy)  │   └───────────┘
        └──────────┘
The four states are defined in internal/speculative/state.go:
const (
	Drafting State = iota
	Speculating
	Escalated
	DraftAccepted
)
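The legal transitions between these states follow the diagram above. A minimal sketch (the nextStates helper is illustrative, not part of the package):

```go
package main

import "fmt"

type State int

const (
	Drafting State = iota
	Speculating
	Escalated
	DraftAccepted
)

// nextStates returns the states reachable from s, per the diagram:
// DRAFTING goes to DRAFT ACCEPTED (entropy stays low) or SPECULATING
// (soft threshold exceeded); SPECULATING resolves to DRAFT ACCEPTED
// (drafter recovers, heavyweight cancelled) or ESCALATED.
func nextStates(s State) []State {
	switch s {
	case Drafting:
		return []State{DraftAccepted, Speculating}
	case Speculating:
		return []State{DraftAccepted, Escalated}
	default:
		return nil // Escalated and DraftAccepted are terminal
	}
}

func main() {
	fmt.Println(nextStates(Drafting))    // [3 1]
	fmt.Println(nextStates(Speculating)) // [3 2]
}
```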

Request lifecycle

1. Drafter starts streaming

The gateway forwards the request to gpt-4.1-nano with logprobs=true, stream=true. Tokens arrive and are fed to the entropy window one by one.
2. Soft threshold check per token

For each token, the executor checks whether the raw token entropy exceeds softThreshold = WindowCfg.Threshold * SoftThresholdMult. The default SoftThresholdMult is 0.8, giving a soft threshold of 1.6 when T = 2.0.
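The check reduces to a pure function of the two config values; a sketch, with the per-token comparison shown for the T = 2.0 default:

```go
package main

import "fmt"

// softThreshold derives the speculative trigger level from the hard
// threshold T and the multiplier (default 0.8).
func softThreshold(hardT, mult float64) float64 {
	return hardT * mult
}

func main() {
	soft := softThreshold(2.0, 0.8)
	fmt.Println(soft) // 1.6
	// A token with entropy 1.7 fires the speculative call; 1.5 does not.
	fmt.Println(1.7 > soft, 1.5 > soft) // true false
}
```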
3. Speculative call fired (if soft threshold exceeded)

On the first token that exceeds the soft threshold, Execute transitions to Speculating, records the start time, increments speculative_triggers_total, and opens a cancellable parallel request to the heavyweight model:
if !softTriggered && te.Entropy > softThreshold {
    softTriggered = true
    specStart = time.Now()
    e.recorder.RecordSpeculativeTrigger()

    hvCtx, cancel := context.WithCancel(ctx)
    ch, err := e.heavyweight.Stream(hvCtx, req)
    // ...
    heavyCh = ch
    heavyCancel = cancel
    result.State = Speculating
}
4. Hard threshold check continues

The drafter keeps generating. The window continues accumulating entropy. The hard threshold check runs as normal.
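Because the hard check runs on the window average rather than on single tokens, one spiky token can trigger speculation without forcing escalation. A sketch, assuming a simple arithmetic mean over the window:

```go
package main

import "fmt"

// windowAvg returns the mean entropy over the current window.
func windowAvg(entropies []float64) float64 {
	if len(entropies) == 0 {
		return 0
	}
	sum := 0.0
	for _, e := range entropies {
		sum += e
	}
	return sum / float64(len(entropies))
}

func main() {
	const T = 2.0
	// One spiky token among calm ones: the soft threshold (1.6) may have
	// fired, but the average stays below T, so the drafter recovers.
	fmt.Println(windowAvg([]float64{1.1, 2.3, 0.9, 1.2}) > T) // false
	// Sustained high entropy pushes the average past T: hard escalation.
	fmt.Println(windowAvg([]float64{2.4, 2.2, 2.5, 2.1}) > T) // true
}
```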
5. Outcome A: draft accepted

The drafter finishes without the window average exceeding T. If a speculative call is running, it is cancelled and the channel is drained. speculative_cancellations_total is incremented. The draft response is returned.
6. Outcome B: escalated with head start

The window average exceeds T. The executor transitions to Escalated, records speculative_latency_saved_seconds as time.Since(specStart), and returns the already-running heavyweight channel to the handler, which streams it directly to the client.
if decision == entropy.Escalate {
    result.State = Escalated
    if softTriggered && heavyCh != nil {
        result.HeavyCh = heavyCh
        result.HeavyCancel = heavyCancel
        e.recorder.RecordSpeculativeLatencySaved(time.Since(specStart))
    }
    // ...
    return result, nil
}

Configuration

The soft threshold multiplier is set via ExecutorConfig.SoftThresholdMult:
type ExecutorConfig struct {
	WindowCfg         entropy.WindowConfig
	SoftThresholdMult float64
}
The default is 0.8. Lowering it fires the speculative call earlier (more head start, higher cancellation rate). Raising it fires later (less wasted compute, smaller latency savings).
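A sketch of applying that default; the WindowConfig fields here are illustrative stand-ins for entropy.WindowConfig, not its real definition:

```go
package main

import "fmt"

// WindowConfig is an assumed stand-in for entropy.WindowConfig.
type WindowConfig struct {
	Size      int
	Threshold float64
}

type ExecutorConfig struct {
	WindowCfg         WindowConfig
	SoftThresholdMult float64
}

// withDefaults fills in the documented default multiplier when unset.
func withDefaults(c ExecutorConfig) ExecutorConfig {
	if c.SoftThresholdMult == 0 {
		c.SoftThresholdMult = 0.8
	}
	return c
}

func main() {
	cfg := withDefaults(ExecutorConfig{WindowCfg: WindowConfig{Size: 8, Threshold: 2.0}})
	fmt.Println(cfg.SoftThresholdMult)                           // 0.8
	fmt.Println(cfg.WindowCfg.Threshold * cfg.SoftThresholdMult) // effective soft threshold: 1.6
}
```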

Cost tradeoff

Speculative execution wastes compute when the drafter recovers: the heavyweight call is cancelled before it produces output, but the provider has already received and begun processing the request. From the system design:
Speculative execution trades a small amount of wasted compute (~5–10% of escalated request costs) for significantly better tail latency. Net TCO impact is minimal: only ~30% of requests trigger escalation, and of those, the speculative call is canceled ~40% of the time (drafter recovers).
The wasted compute is bounded: only requests in the SPECULATING state that resolve to DRAFT_ACCEPTED generate a cancelled speculative call. Requests that never hit the soft threshold, and requests that do escalate, incur no waste.

Phase 4 metrics

| Metric | Type | Description |
| --- | --- | --- |
| draftthinker_speculative_triggers_total | Counter | Total speculative heavyweight calls fired (soft threshold exceeded) |
| draftthinker_speculative_cancellations_total | Counter | Speculative calls cancelled (drafter recovered before hard threshold) |
| draftthinker_speculative_latency_saved_seconds | Histogram | Head-start time saved on escalated requests that had a running speculative call |
speculative_latency_saved_seconds uses the same bucket boundaries as upstream_latency_seconds: 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30.

Interpreting the metrics

  • Trigger rate = speculative_triggers_total / requests_total. Higher means more requests are hitting the soft threshold. This is workload-dependent and does not indicate a problem on its own.
  • Latency saved = the histogram shows the head-start duration per escalated request. Higher p50 and p99 values mean speculative execution is eliminating more tail latency.
Monitor the cancellation ratio (speculative_cancellations_total / speculative_triggers_total). This is your wasted-compute gauge. The design target is that cancelled speculative calls cost less than 10% of total escalation cost. If the ratio climbs above ~60–70%, consider raising SoftThresholdMult to fire the speculative call later and reduce false triggers.
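Both derived ratios can be computed straight from counter samples; the sample values below are illustrative, not measurements:

```go
package main

import "fmt"

// ratio guards against a zero denominator (e.g. no speculative triggers yet).
func ratio(numerator, denominator float64) float64 {
	if denominator == 0 {
		return 0
	}
	return numerator / denominator
}

func main() {
	// Illustrative counter samples.
	requests, triggers, cancels := 1000.0, 300.0, 120.0

	fmt.Println(ratio(triggers, requests)) // 0.3  trigger rate
	fmt.Println(ratio(cancels, triggers))  // 0.4  cancellation ratio (wasted-compute gauge)
}
```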
