Autoregressive LLM inference is slow because every token requires a full forward pass through the model. Speculative decoding breaks this bottleneck by generating several candidate tokens cheaply with a small draft model, then verifying them in a single parallel forward pass through the large target model. Lecture 22, presented by Cade Daniel, is a hands-on guide to how this works inside vLLM. This page covers the algorithm, vLLM’s implementation, and practical configuration.

The LLM inference bottleneck

Standard autoregressive decoding generates one token per step:
  1. Run a forward pass through the full model (e.g. 70B parameters).
  2. Sample the next token from the output logits.
  3. Append the token to the context and repeat.
Each forward pass is memory-bandwidth bound at typical serving batch sizes: the GPU spends most of its time loading model weights from HBM, not doing matrix multiplications. This means:
  • Compute utilization is low (often 10–30% of peak FLOPS).
  • Latency per token scales with model size, because every weight is read from HBM once per step.
  • The GPU is underutilized — there are spare FLOPs that could do useful work.
Speculative decoding exploits those spare FLOPs by running a small draft model to propose tokens, then verifying several proposals in one big-model pass.
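To make the memory-bandwidth argument concrete, here is a back-of-the-envelope sketch. It assumes every weight is read from HBM exactly once per decoding step and ignores KV cache reads and compute time. The 2 TB/s bandwidth figure and 16-bit weights are illustrative assumptions, not numbers from the lecture.

```python
def min_decode_latency_s(num_params: float,
                         bytes_per_param: float,
                         hbm_bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token latency when each weight is read once per step."""
    weight_bytes = num_params * bytes_per_param
    return weight_bytes / hbm_bandwidth_bytes_per_s


# Assumed example: 70B parameters, 2 bytes per weight, 2 TB/s aggregate bandwidth.
latency = min_decode_latency_s(70e9, 2, 2e12)
print(f"{latency * 1e3:.0f} ms per token, {1 / latency:.1f} tokens/s")
```

Under these assumptions the bound does not depend on how many tokens are scored in the pass, which is why verifying several draft tokens at once costs about as much as generating one.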

Speculative decoding algorithm

The algorithm has three stages per decoding step:

1. Draft phase

A small, fast draft model (or heuristic) generates k candidate tokens autoregressively. This is cheap because the draft model is much smaller than the target.

2. Verify phase

The large target model processes the original context plus all k draft tokens in a single forward pass, producing k+1 output distributions in parallel.

3. Accept or reject

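Each draft token x is accepted with probability min(1, p(x)/q(x)), where p is the target model’s probability and q is the draft model’s probability. Accepted tokens are kept; at the first rejection, a replacement token is resampled from a corrected distribution and the remaining draft tokens are discarded. If all k draft tokens are accepted, the target model’s last output distribution yields one bonus token. Assuming each draft token is accepted independently with rate α, the expected number of tokens per step is (1 − α^(k+1)) / (1 − α), which lies between 1 and k + 1.
Speculative decoding is lossless: the output distribution of the target model is exactly preserved. A rejected draft token is not discarded silently; it is replaced by a sample from a corrected distribution chosen so that the emitted token is distributed exactly as a sample from the target model.
The speedup comes from amortizing the target model’s memory-bandwidth cost over multiple tokens. If the draft model proposes k=4 tokens and the acceptance rate is 80%, you produce roughly 3.4 tokens per target forward pass instead of 1. On memory-bandwidth-bound hardware that is an upper bound of about 3.4× on the throughput gain, before subtracting the cost of running the draft model.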
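The accept-or-reject stage can be sketched in a few lines of plain Python. This is a minimal illustration of the standard rejection-sampling rule (accept with probability min(1, p/q), resample from the normalized residual max(0, p − q) on rejection), not vLLM’s implementation. The function name `verify_draft` and the list-of-lists representation of distributions are assumptions made for the example.

```python
import random


def verify_draft(draft_tokens, draft_probs, target_probs, rng=random):
    """Return the tokens emitted by one speculative step.

    draft_tokens: k proposed token ids.
    draft_probs:  k distributions q_i (lists over the vocabulary) that the
                  draft tokens were sampled from.
    target_probs: k + 1 distributions p_i from the single verification pass.
    """
    emitted = []
    for i, token in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # Accept with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p[token] / q[token]):
            emitted.append(token)
            continue
        # First rejection: resample from max(0, p - q), renormalized by
        # random.choices, and drop the remaining draft tokens.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        emitted.append(rng.choices(range(len(p)), weights=residual)[0])
        return emitted
    # Every draft token was accepted: sample one bonus token from the
    # target distribution at position k.
    last = target_probs[len(draft_tokens)]
    emitted.append(rng.choices(range(len(last)), weights=last)[0])
    return emitted
```

Each call emits between 1 and k + 1 tokens, so a step never produces fewer tokens than ordinary decoding would.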

Acceptance rate and when speculative decoding helps

The acceptance rate α measures how often the draft model’s predictions match the target model. A higher α means more tokens accepted per step. Speculative decoding helps most when:
  • The task has predictable token sequences (code, structured outputs, repetitive text).
  • The draft model is well-matched to the target model’s distribution.
  • Batch sizes are small (1–4 requests), keeping the target model memory-bandwidth bound.
It hurts when:
  • The acceptance rate is low (the overhead of running the draft model exceeds the savings).
  • Batch sizes are large — the target model is already compute-bound and parallel verification adds overhead.
Speculative decoding typically hurts throughput at large batch sizes. It is primarily a latency optimization for interactive, low-batch serving.
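A simple cost model shows where the break-even point lies. The sketch below assumes each draft token is accepted independently with rate α, that one verification pass costs the same as one ordinary target pass (the memory-bandwidth-bound case), and that `draft_cost_ratio` is the time of one draft step divided by the time of one target pass. All names are illustrative.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per speculative step with i.i.d. acceptance."""
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Tokens per unit time relative to plain decoding (one token per pass)."""
    step_cost = k * draft_cost_ratio + 1  # k draft steps plus one target pass
    return expected_tokens_per_step(alpha, k) / step_cost


for alpha in (0.5, 0.8):
    for k in (2, 4, 8):
        print(f"alpha={alpha} k={k} "
              f"speedup={estimated_speedup(alpha, k, draft_cost_ratio=0.1):.2f}")
```

A result below 1 means the draft overhead outweighs the accepted tokens. The model does not capture the large-batch case, where the verification pass itself becomes more expensive than a normal decoding pass.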

vLLM’s speculative decoding implementation

vLLM integrates speculative decoding as a first-class scheduling mode. The key components:
  • SpecDecodeWorker — wraps a draft worker and a target worker, orchestrates the propose/verify loop.
  • BatchExpansionTop1Scorer — expands each request’s KV cache with the draft tokens, runs the target model, and scores each position.
  • SpecDecodeScheduler — a modified scheduler that groups requests by speculative width k.
The verification step uses vLLM’s existing PagedAttention infrastructure: draft tokens are appended to the KV cache pages, and the target model attends over them in a single pass.
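The propose/score/verify loop can be illustrated with toy models. The sketch below is not vLLM code: the "models" are hypothetical functions that map a token context to a greedy next token, and the target is called once per position where a real engine would score all k + 1 positions in one batched forward pass.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]  # hypothetical: context -> next token


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out[len(prompt):]


def speculative_greedy_decode(draft: NextToken, target: NextToken,
                              prompt: List[int], max_new: int, k: int) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # Propose: k cheap autoregressive draft steps.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # Score: the target's greedy token at each of the k + 1 positions.
        scores = [target(out + proposal[:i]) for i in range(k + 1)]
        # Verify: keep the longest agreeing prefix, then append the target's
        # own token (the correction, or the bonus token if all k matched).
        n = 0
        while n < k and proposal[n] == scores[n]:
            n += 1
        out.extend(proposal[:n] + [scores[n]])
    return out[len(prompt):len(prompt) + max_new]


def toy_target(ctx: Sequence[int]) -> int:
    return (ctx[-1] * 3 + 1) % 7


def toy_draft(ctx: Sequence[int]) -> int:
    # Agrees with the target except after token 4.
    return 0 if ctx[-1] == 4 else toy_target(ctx)


assert (speculative_greedy_decode(toy_draft, toy_target, [1], 20, k=4)
        == greedy_decode(toy_target, [1], 20))
```

With greedy decoding, verification reduces to an equality check, so the output matches what the target would produce on its own.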

Draft model options

vLLM supports several draft strategies:

Small language model

A smaller version of the target model (e.g. Llama-3-8B drafting for Llama-3-70B). Best acceptance rates but requires a second model in memory.

N-gram matching

Proposes tokens by looking for matching n-grams in the prompt. Zero extra memory, works well for tasks with verbatim repetition (e.g. summarization, RAG).

Medusa heads

Extra decoding heads attached to the target model that predict future tokens. Single model, no draft model overhead, but requires fine-tuning.

Draft model from same family

E.g. using a quantized or pruned variant of the target model. Balances memory cost and acceptance rate.
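The n-gram matching option above is simple enough to sketch directly. The function below is an illustrative prompt-lookup proposer, not vLLM’s code: it takes the last n tokens of the context, looks for the most recent earlier occurrence, and proposes the tokens that followed it. The parameters `min_n` and `max_n` are named to echo the n-gram length bounds, but the function itself is hypothetical.

```python
from typing import List


def ngram_propose(tokens: List[int], min_n: int, max_n: int, k: int) -> List[int]:
    """Propose up to k tokens by matching the trailing n-gram earlier in the context."""
    for n in range(max_n, min_n - 1, -1):  # prefer longer matches
        if n <= 0 or n >= len(tokens):
            continue
        suffix = tokens[-n:]
        # Scan right to left so the most recent earlier occurrence wins.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                return tokens[start + n:start + n + k]
    return []  # no match: fall back to ordinary decoding for this step


context = [5, 6, 7, 8, 9, 5, 6, 7]
print(ngram_propose(context, min_n=2, max_n=3, k=2))  # [8, 9]
```

The proposer needs no model weights, which is why it adds no memory cost; its acceptance rate depends entirely on how much of the output repeats the context verbatim.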

Batch speculative decoding challenges

Speculative decoding is straightforward for a single request, but vLLM must serve many requests simultaneously. This creates several complications:
  • Variable acceptance lengths: Each request accepts a different number of draft tokens, so the batch becomes ragged after verification. vLLM must handle requests that accepted 0 tokens alongside requests that accepted all k.
  • KV cache overhead: Draft tokens must be appended to the KV cache before verification. If the draft tokens are mostly rejected, these cache writes are wasted. vLLM manages this with speculative KV cache slots that are only committed after acceptance.
  • Scheduling constraints: Requests using speculative decoding and requests not using it must be scheduled separately, as they have different memory footprints and compute patterns.
vLLM’s implementation uses a “token budget” approach: the scheduler caps total draft tokens across the batch so that the target model’s verification pass does not exceed a memory or compute budget.
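One way to picture a token budget is as a cap on the number of positions the verification pass scores, where a request with k draft tokens contributes k + 1 positions. The sketch below trims the longest drafts first until the batch fits. The function and its trimming policy are hypothetical illustrations, not vLLM’s scheduler logic.

```python
from typing import List


def cap_draft_lengths(requested_k: List[int], token_budget: int) -> List[int]:
    """Shrink per-request draft lengths so verification fits a token budget."""
    ks = list(requested_k)
    # Even with k = 0, every request needs one position for plain decoding.
    if len(ks) > token_budget:
        raise ValueError("budget cannot cover one token per request")
    # Trim the longest draft by one token at a time until the batch fits.
    while sum(k + 1 for k in ks) > token_budget:
        longest = max(range(len(ks)), key=ks.__getitem__)
        ks[longest] -= 1
    return ks


print(cap_draft_lengths([4, 4, 4, 4], token_budget=16))  # [3, 3, 3, 3]
```

Under this policy, a growing batch gradually pushes every request toward k = 0, which matches the observation that speculation stops paying off at large batch sizes.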

Performance benchmarks and tradeoffs

Scenario | Expected speedup | Notes
Code generation, bs=1 | 2–4× | High acceptance rate, memory-BW bound
Chat, bs=1 | 1.5–2.5× | Moderate acceptance rate
Summarization with n-gram | 1.5–3× | No draft model memory cost
Large batch (bs≥16) | <1× (regression) | Target model becomes compute-bound
The key metric is tokens per second per request (TPOR), not total throughput. Speculative decoding trades batch throughput for per-request latency.

Configuration in vLLM

Enable speculative decoding when launching the vLLM server:
# Using a small draft model
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-70b-instruct \
    --speculative-model meta-llama/Llama-3-8b-instruct \
    --num-speculative-tokens 5

# Using n-gram draft (no extra model required)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-70b-instruct \
    --speculative-model "[ngram]" \
    --ngram-prompt-lookup-min 4 \
    --ngram-prompt-lookup-max 8 \
    --num-speculative-tokens 5

# Programmatic configuration via LLM class
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3-70b-instruct",
    speculative_model="meta-llama/Llama-3-8b-instruct",
    num_speculative_tokens=5,
)

outputs = llm.generate(
    ["Write a Python function that sorts a list"],
    SamplingParams(temperature=0.8, max_tokens=256),
)
Start with num_speculative_tokens=3 and profile the acceptance rate before increasing it. Each additional draft token is less likely to be reached and accepted, so very high k values add draft and verification work that can negate the speedup.
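When profiling, two numbers are worth deriving from whatever counters your setup exposes: the acceptance rate and the mean tokens emitted per target pass. The helper below is a small sketch with hypothetical counter names; vLLM’s own metrics may be named and defined differently. It assumes each step emits the accepted draft tokens plus one token from the target.

```python
def acceptance_stats(num_steps: int,
                     num_draft_tokens: int,
                     num_accepted_tokens: int) -> dict:
    """Summarize speculative decoding counters (names are illustrative)."""
    return {
        # Fraction of proposed draft tokens that survived verification.
        "acceptance_rate": num_accepted_tokens / num_draft_tokens,
        # Accepted tokens plus one target-sampled token per step.
        "tokens_per_step": (num_accepted_tokens + num_steps) / num_steps,
    }


# Example: 100 steps with k=3, of which 210 of 300 draft tokens were accepted.
print(acceptance_stats(num_steps=100, num_draft_tokens=300, num_accepted_tokens=210))
```

If tokens per step barely rises when k is increased, the extra draft tokens are mostly being rejected and a smaller k is likely the better setting.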

Key configuration parameters

Parameter | Description
--speculative-model | Draft model name or [ngram]
--num-speculative-tokens | Draft tokens per step (k)
--ngram-prompt-lookup-min | Min n-gram length for matching
--ngram-prompt-lookup-max | Max n-gram length for matching
--speculative-max-model-len | Max context length for draft model
--speculative-disable-by-batch-size | Disable spec decoding above this batch size

Lecture 22 slides

Cade Daniel’s original Hacker’s Guide to Speculative Decoding in vLLM

GPU Mode Discord

Ask questions and discuss speculative decoding and LLM inference
