Speculative decoding accelerates token generation by predicting multiple tokens ahead of the main model, then verifying them in a single batch. Because batch-processing tokens (as in prompt processing) is faster than generating them sequentially, correct draft predictions result in a net speedup. The higher the acceptance rate, the greater the gain. llama-server supports several speculative decoding implementations. A draft model can also be combined with a draftless implementation — when combined, the draftless type takes precedence.
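For example, combining a draft model with a draftless type (the model filenames are placeholders; the flags are those documented below) might look like:

```shell
llama-server \
  --model main-model.gguf \
  --model-draft draft-model.gguf \
  --spec-type ngram-simple \
  --draft-max 16
```

Here the draftless ngram-simple implementation takes precedence when it can produce a draft, with the draft model handling the remaining steps.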

Implementations

Draft model (draft)

A small secondary model (the draft model) generates candidate tokens that the main model then verifies in a batch. This is the most widely used speculative decoding approach and works well across all kinds of content.

When to use: general-purpose acceleration where a suitable small draft model exists for your target model family.
llama-server \
  --model main-model.gguf \
  --model-draft draft-model.gguf \
  --draft-max 16
Key flags:
  • --model-draft — path to the draft model GGUF
  • --draft-max / --draft — maximum tokens to draft per step (default: 16)
  • --draft-min — minimum draft length before the main model verifies
  • --draft-p-min — minimum probability threshold for greedy draft selection (default: 0.8)

Key command-line flags

Flag                      Default  Description
--spec-type TYPE          none     Speculative decoding type (see table below)
--draft-max N             16       Maximum tokens to draft per verification step
--draft-min N             0        Minimum draft length before the main model verifies
--draft-p-min P           0.8      Minimum probability threshold for greedy draft selection
--spec-ngram-size-n N     12       Length of the lookup n-gram
--spec-ngram-size-m M     48       Length of the draft m-gram
--spec-ngram-min-hits H   1        Minimum occurrences before a pattern is used as a draft

--spec-type values

Value          Description
none           No speculative decoding (default)
ngram-cache    N-gram cache with probability statistics
ngram-simple   Simple n-gram pattern matching
ngram-map-k    N-gram map with key-based hash map
ngram-map-k4v  N-gram map with up to four tracked values per key (experimental)
ngram-mod      Rolling LCG hash pool, shared across server slots
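A draftless type needs no second model. As an illustration (model filename is a placeholder; flags are those documented above), enabling ngram-mod alone could look like:

```shell
llama-server \
  --model main-model.gguf \
  --spec-type ngram-mod \
  --draft-max 16
```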

Statistics output

Each speculative decoding implementation prints statistics at the end of each request. Use them to tune your configuration.

Draft model + ngram-simple combined:
draft acceptance rate = 0.57576 (  171 accepted /   297 generated)
statistics ngram_simple: #calls = 15, #gen drafts = 5, #acc drafts = 5, #gen tokens = 187, #acc tokens = 73
statistics draft: #calls = 10, #gen drafts = 10, #acc drafts = 10, #gen tokens = 110, #acc tokens = 98
ngram-mod:
draft acceptance rate = 0.70312 (   90 accepted /   128 generated)
statistics ngram_mod: #calls = 810, #gen drafts = 15, #acc drafts = 15, #gen tokens = 960, #acc tokens = 730, dur(b,g,a) = 0.149, 0.347, 0.005 ms
ngram-map-k:
statistics ngram_map_k: #calls(b,g,a) = 6 1690 26, #gen drafts = 26, #acc drafts = 26, #gen tokens = 1248, #acc tokens = 968, dur(b,g,a) = 2.234, 1.427, 0.016 ms
Field definitions:

Field                  Meaning
draft acceptance rate  Fraction of draft tokens accepted by the main model
#calls(b,g,a)          Number of calls: begin (new prompt), generation, accumulation
#gen drafts            Number of draft batches generated
#acc drafts            Number of draft batches at least partially accepted
#gen tokens            Total tokens generated (including rejected)
#acc tokens            Total tokens accepted by the main model
dur(b,g,a)             Duration in ms for the begin, generation, and accumulation phases
A high #acc tokens / #gen tokens ratio means your draft configuration is well-suited to the content. If the ratio is low, try a different --spec-type, adjust --spec-ngram-size-n, or reduce --draft-max.
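That ratio can be pulled straight out of a statistics line. A minimal sketch, assuming the log format shown above stays stable (the `acceptance_ratio` helper is hypothetical, not part of llama-server):

```python
import re

def acceptance_ratio(stats_line: str) -> float:
    """Return #acc tokens / #gen tokens from a statistics line."""
    gen = int(re.search(r"#gen tokens = (\d+)", stats_line).group(1))
    acc = int(re.search(r"#acc tokens = (\d+)", stats_line).group(1))
    return acc / gen

line = ("statistics ngram_mod: #calls = 810, #gen drafts = 15, "
        "#acc drafts = 15, #gen tokens = 960, #acc tokens = 730, "
        "dur(b,g,a) = 0.149, 0.347, 0.005 ms")
print(f"{acceptance_ratio(line):.5f}")  # 730 / 960
```

A ratio around 0.7, as in the ngram-mod example above, indicates the draft configuration fits the content well.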
