Autoregressive LLM inference is slow because every token requires a full forward pass through the model. Speculative decoding breaks this bottleneck by generating several candidate tokens cheaply with a small draft model, then verifying them in a single parallel forward pass through the large target model. Lecture 22, presented by Cade Daniel, is a hands-on guide to how this works inside vLLM. This page covers the algorithm, vLLM’s implementation, and practical configuration.
The LLM inference bottleneck
Standard autoregressive decoding generates one token per step:
- Run a forward pass through the full model (e.g. 70B parameters).
- Sample the next token from the output logits.
- Append the token to the context and repeat.

At small batch sizes, each step is dominated by reading the model weights from GPU memory rather than by arithmetic, so decoding is memory-bandwidth bound:
- Compute utilization is low (often 10–30% of peak FLOPS).
- Latency per token scales with model size regardless of sequence length.
- The GPU is underutilized: there are spare FLOPs that could do useful work.
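
As a rough back-of-envelope check on the points above, the sketch below estimates a lower bound on per-token latency when every decoding step has to stream all model weights from GPU memory. The 2 bytes per parameter (FP16/BF16) and the 2 TB/s memory bandwidth are illustrative assumptions, not figures from the lecture, and the estimate ignores KV-cache reads and multi-GPU sharding.

```python
def min_decode_latency_ms(num_params: float,
                          bytes_per_param: float,
                          mem_bandwidth_bytes_per_s: float) -> float:
    """Lower bound on time per token if each step reads every weight once."""
    weight_bytes = num_params * bytes_per_param
    return weight_bytes / mem_bandwidth_bytes_per_s * 1e3


# Assumed values: 70B parameters, 2 bytes each, 2 TB/s of memory bandwidth.
latency = min_decode_latency_ms(70e9, 2, 2e12)
print(f"{latency:.0f} ms per token")
```

Under these assumptions the weights alone set a floor of tens of milliseconds per token, no matter how little arithmetic the step performs. That idle compute is what speculative decoding tries to use.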
Speculative decoding algorithm
The algorithm has three phases per decoding step:
Draft phase
A small, fast draft model (or heuristic) generates k candidate tokens autoregressively. This is cheap because the draft model is much smaller than the target.
Verify phase
The large target model processes the original context plus all k draft tokens in a single forward pass, producing k+1 output distributions in parallel.
Accept or reject
Each draft token is accepted or rejected by comparing the draft model’s probability to the target model’s probability. Accepted tokens are kept; the first rejected token is resampled from a corrected distribution, and any draft tokens after it are discarded. If all k draft tokens are accepted, one additional token is sampled from the final target distribution. A simple optimistic estimate is α * k + 1 tokens per step, where α is the acceptance rate.
Speculative decoding is lossless: the output distribution of the target model is exactly preserved. A rejected draft token is replaced by a sample from the corrected distribution, not discarded silently.
For example, if k=4 and the acceptance rate is 80%, this estimate gives roughly 4.2 tokens per target forward pass instead of 1. Because acceptance stops at the first rejection, the realized average is lower: assuming independent acceptances, it is (1 − α^(k+1)) / (1 − α) ≈ 3.4 tokens. Either way, this is a severalfold throughput gain on memory-bandwidth-bound hardware, before subtracting the cost of running the draft model.
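
The accept/reject rule can be sketched in a few lines of Python. This is a toy version of the standard speculative sampling rule (accept a draft token with probability min(1, p/q), otherwise resample from the normalized positive part of p − q), not vLLM’s implementation. The function names are hypothetical, and for simplicity the example uses the same toy distributions at every position.

```python
import random


def residual(p, q):
    """Corrected distribution after a rejection: normalized max(0, p - q)."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    # If p == q, the draft token is always accepted and this is never sampled.
    return [d / total for d in diff] if total > 0 else list(p)


def verify(draft_tokens, q_dists, p_dists, rng):
    """draft_tokens: k ids; q_dists: k draft dists; p_dists: k+1 target dists."""
    out = []
    for tok, q, p in zip(draft_tokens, q_dists, p_dists):
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)  # accepted: keep the draft token
        else:
            # First rejection: resample here and drop the remaining drafts.
            out.append(rng.choices(range(len(p)), weights=residual(p, q))[0])
            return out
    # All k drafts accepted: sample one bonus token from the last target dist.
    bonus = p_dists[len(draft_tokens)]
    out.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return out


rng = random.Random(0)
p = [0.6, 0.3, 0.1]  # toy target distribution (assumed)
q = [0.2, 0.5, 0.3]  # toy draft distribution (assumed)
draft = [rng.choices(range(3), weights=q)[0] for _ in range(4)]
print(draft, "->", verify(draft, [q] * 4, [p] * 5, rng))
```

Each call returns between 1 and k+1 tokens. Combining acceptance and resampling gives min(p, q) + (p − min(p, q)) = p for every token, which is why the output still follows the target distribution.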
Acceptance rate and when speculative decoding helps
The acceptance rate α measures how often the draft model’s predictions match the target model. A higher α means more tokens accepted per step.
Speculative decoding helps most when:
- The task has predictable token sequences (code, structured outputs, repetitive text).
- The draft model is well-matched to the target model’s distribution.
- Batch sizes are small (1–4 requests), keeping the target model memory-bandwidth bound.

Speculative decoding can hurt performance when:
- The acceptance rate is low (the overhead of running the draft model exceeds the savings).
- Batch sizes are large: the target model is already compute-bound and parallel verification adds overhead.
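
To get a feel for how α, k, and draft cost interact, here is a simplified cost model. It assumes each draft token is accepted independently with probability α, that acceptance stops at the first rejection, and that verifying k tokens costs about the same as one ordinary target step (the memory-bound regime). The `draft_cost` parameter, the cost of one draft step relative to one target step, is a hypothetical input; none of this is measured vLLM behavior.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens per target pass, counting the resampled/bonus token."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Tokens per step divided by step cost (1 target pass + k draft passes)."""
    return expected_tokens(alpha, k) / (1 + k * draft_cost)


print(round(expected_tokens(0.8, 4), 2))          # well-matched draft model
print(round(estimated_speedup(0.8, 4, 0.05), 2))  # cheap draft, high alpha
print(round(estimated_speedup(0.3, 4, 0.30), 2))  # costly draft, low alpha
```

In this model the last case comes out below 1, showing how a low acceptance rate combined with a relatively expensive draft model turns speculation into a slowdown.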
vLLM’s speculative decoding implementation
vLLM integrates speculative decoding as a first-class scheduling mode. The key components:
- `SpecDecodeWorker`: wraps a draft worker and a target worker, orchestrates the propose/verify loop.
- `BatchExpansionTop1Scorer`: expands each request’s KV cache with the draft tokens, runs the target model, and scores each position.
- `SpecDecodeScheduler`: a modified scheduler that groups requests by speculative width k.
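
As a toy illustration of what "scoring each position" can look like, the sketch below expands one request with k draft tokens into k+1 scoring rows, one per prefix, so that a single batched target pass yields a next-token distribution for every position. The function name and data layout are hypothetical and heavily simplified; this is not vLLM’s actual code.

```python
def expand_for_scoring(context, draft_tokens):
    """Return k+1 token sequences: the context plus 0..k draft tokens."""
    return [context + draft_tokens[:i] for i in range(len(draft_tokens) + 1)]


rows = expand_for_scoring([11, 12], [7, 8, 9])
for row in rows:
    print(row)
```

The final position of row i supplies the target distribution used to accept or reject draft token i. The last row supplies the distribution for the bonus token when every draft is accepted.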
Draft model options
vLLM supports several draft strategies:
Small language model
A smaller version of the target model (e.g. Llama-3-8B drafting for Llama-3-70B). Best acceptance rates but requires a second model in memory.
N-gram matching
Proposes tokens by looking for matching n-grams in the prompt. Zero extra memory, works well for tasks with verbatim repetition (e.g. summarization, RAG).
Medusa heads
Extra decoding heads attached to the target model that predict future tokens. Single model, no draft model overhead, but requires fine-tuning.
Draft model from same family
E.g. using a quantized or pruned variant of the target model. Balances memory cost and acceptance rate.
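
Of the strategies above, n-gram matching is the easiest to sketch. The function below looks for an earlier occurrence of the most recent n tokens and proposes whatever followed it. The name `ngram_propose`, the first-match policy, and the default window sizes are assumptions made for illustration (the `min_n`/`max_n` arguments loosely mirror the n-gram lookup bounds); vLLM’s own proposer may differ in detail.

```python
def ngram_propose(tokens, k, min_n=1, max_n=3):
    """Propose up to k tokens by matching the trailing n-gram earlier on."""
    # Try the longest n-gram first; longer matches are more specific.
    for n in range(min(max_n, len(tokens) - 1), min_n - 1, -1):
        suffix = tokens[-n:]
        # Scan earlier start positions, excluding the suffix itself.
        for start in range(len(tokens) - n):
            if tokens[start:start + n] == suffix:
                follow = tokens[start + n:start + n + k]
                if follow:
                    return follow
    return []  # no match: nothing to speculate on this step


history = [1, 2, 3, 4, 5, 1, 2, 3]
print(ngram_propose(history, k=4))
```

Here the trailing `[1, 2, 3]` also appears at the start of `history`, so the tokens that followed it there become the draft. No model is run, which is why this strategy has no extra memory cost.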
Batch speculative decoding challenges
Speculative decoding is straightforward for a single request, but vLLM must serve many requests simultaneously. This creates several complications:
Variable acceptance lengths: Each request accepts a different number of draft tokens, so the batch becomes ragged after verification. vLLM must handle requests that accepted 0 tokens alongside requests that accepted all k.
KV cache overhead: Draft tokens must be appended to the KV cache before verification. If the draft tokens are mostly rejected, these cache writes are wasted. vLLM manages this with speculative KV cache slots that are only committed after acceptance.
Scheduling constraints: Requests using speculative decoding and requests not using it must be scheduled separately, as they have different memory footprints and compute patterns.
vLLM’s implementation uses a “token budget” approach: the scheduler caps total draft tokens across the batch so that the target model’s verification pass does not exceed a memory or compute budget.
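
A minimal sketch of the token budget idea: if each request contributes its draft tokens plus one extra position to the verification pass, the per-request speculative width can be capped so that the batch total stays within the budget. The function name and the uniform capping policy are hypothetical, chosen for illustration rather than taken from vLLM’s scheduler.

```python
def cap_draft_tokens(num_requests: int, k: int, token_budget: int) -> int:
    """Largest per-request draft length (at most k) that fits the budget."""
    if num_requests <= 0:
        return 0
    per_request = token_budget // num_requests  # verify-pass tokens per request
    # Each request needs one position beyond its drafts, so reserve one slot.
    return max(0, min(k, per_request - 1))


print(cap_draft_tokens(num_requests=8, k=4, token_budget=64))
print(cap_draft_tokens(num_requests=8, k=4, token_budget=24))
print(cap_draft_tokens(num_requests=8, k=4, token_budget=8))
```

As the budget shrinks relative to the batch, the cap falls from the requested k toward zero, at which point the step is ordinary one-token-per-request decoding.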
Performance benchmarks and tradeoffs
| Scenario | Expected speedup | Notes |
|---|---|---|
| Code generation, bs=1 | 2–4× | High acceptance rate, memory-BW bound |
| Chat, bs=1 | 1.5–2.5× | Moderate acceptance rate |
| Summarization with n-gram | 1.5–3× | No draft model memory cost |
| Large batch (bs≥16) | <1× (regression) | Target model becomes compute-bound |
Configuration in vLLM
Enable speculative decoding when launching the vLLM server by passing the parameters listed below.
Key configuration parameters
| Parameter | Description |
|---|---|
| `--speculative-model` | Draft model name or `[ngram]` |
| `--num-speculative-tokens` | Draft tokens per step (k) |
| `--ngram-prompt-lookup-min` | Min n-gram length for matching |
| `--ngram-prompt-lookup-max` | Max n-gram length for matching |
| `--speculative-max-model-len` | Max context length for draft model |
| `--speculative-disable-by-batch-size` | Disable spec decoding above this batch size |
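
The sketch below shows how flags from the table might be combined into a launch command. It only assembles and prints the command line. The `vllm serve` entry point, the model identifiers, and the values are illustrative assumptions; the flag names come from the table, and newer vLLM releases may configure speculative decoding differently, so check the documentation for your installed version.

```python
import shlex

# Assumed example values; flag names are the ones listed in the table.
spec_flags = {
    "--speculative-model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "--num-speculative-tokens": "4",
    "--speculative-disable-by-batch-size": "16",
}

argv = ["vllm", "serve", "meta-llama/Meta-Llama-3-70B-Instruct"]
for flag, value in spec_flags.items():
    argv += [flag, value]

command = shlex.join(argv)  # shell-safe string for copy/paste
print(command)
```

For the n-gram strategy, the same pattern would pass `[ngram]` as the speculative model together with the two lookup bounds. `shlex.join` quotes the brackets so the shell does not treat them as a glob pattern.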
Lecture 22 slides
Cade Daniel’s original Hacker’s Guide to Speculative Decoding in vLLM
GPU Mode Discord
Ask questions and discuss speculative decoding and LLM inference