llama-server supports several speculative decoding implementations. A draft model can also be combined with a draftless implementation; in that case, the draftless type takes precedence.
Implementations
- Draft model
- ngram-simple
- ngram-map-k
- ngram-map-k4v
- ngram-mod
- ngram-cache
Draft model (draft)
A small secondary model (the draft model) generates candidate tokens that the main model then verifies in a batch. This is the most widely used speculative decoding approach and works well across all kinds of content.

When to use: general-purpose acceleration where a suitable small draft model exists for your target model family.

- --model-draft: path to the draft model GGUF
- --draft-max / --draft: maximum tokens to draft per step (default: 16)
- --draft-min: minimum draft length before the main model verifies
- --draft-p-min: minimum probability threshold for greedy draft selection (default: 0.8)
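
The draft-and-verify cycle described above can be sketched in a few lines. This is a conceptual illustration only, not llama.cpp's implementation: it assumes greedy decoding, treats each "model" as a plain Python function from a token context to the next token, and checks drafted tokens one at a time where a real server would verify them in a single batch. The names `speculative_step`, `toy_main`, and `toy_draft` are hypothetical. The sketch also assumes that the main model contributes its own token at the first mismatch, or one extra token when every drafted token is accepted.

```python
from typing import Callable, List

# A "model" here is any function that maps a token context to the next token.
Model = Callable[[List[int]], int]


def speculative_step(main: Model, draft: Model, context: List[int],
                     draft_max: int = 16) -> List[int]:
    """Return the tokens produced by one draft-and-verify step (greedy)."""
    # 1. The draft model proposes up to draft_max tokens autoregressively.
    proposed: List[int] = []
    for _ in range(draft_max):
        proposed.append(draft(context + proposed))

    # 2. The main model checks each proposed token in order.
    accepted: List[int] = []
    for tok in proposed:
        expected = main(context + accepted)
        if tok != expected:
            # First mismatch: keep the main model's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every drafted token matched; the main model adds one more token.
    accepted.append(main(context + accepted))
    return accepted


# Toy models: the main model counts upward modulo 10;
# the draft model agrees except right after token 4.
def toy_main(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


print(speculative_step(toy_main, toy_draft, [1], draft_max=5))  # [2, 3, 4, 5]
```

In this toy run three drafted tokens are accepted and the fourth is replaced by the main model's token, so one verification step yields four tokens.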
Key command-line flags
| Flag | Default | Description |
|---|---|---|
| --spec-type TYPE | none | Speculative decoding type (see table below) |
| --draft-max N | 16 | Maximum tokens to draft per verification step |
| --draft-min N | 0 | Minimum draft length before the main model verifies |
| --draft-p-min P | 0.8 | Minimum probability threshold for greedy draft selection |
| --spec-ngram-size-n N | 12 | Length of the lookup n-gram |
| --spec-ngram-size-m M | 48 | Length of the draft m-gram |
| --spec-ngram-min-hits H | 1 | Minimum occurrences before a pattern is used as a draft |
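
To make the three n-gram parameters concrete, here is a minimal sketch of draftless lookup over the token history. It is an assumption-laden illustration of the general idea (match the last n tokens against earlier text, then propose the m tokens that followed) and does not reproduce any of the llama.cpp ngram implementations. The function name `ngram_draft` is hypothetical.

```python
from typing import List


def ngram_draft(tokens: List[int], n: int, m: int, min_hits: int = 1) -> List[int]:
    """Propose up to m tokens by looking up the last n tokens in the history."""
    if len(tokens) <= n:
        return []
    key = tokens[-n:]
    # Start positions of earlier occurrences of the key
    # (the trailing occurrence at len(tokens) - n is excluded).
    matches = [i for i in range(len(tokens) - n) if tokens[i:i + n] == key]
    if len(matches) < min_hits:
        return []  # pattern not seen often enough to trust
    # Draft the tokens that followed the most recent earlier occurrence.
    start = matches[-1] + n
    return tokens[start:start + m]


history = [1, 2, 3, 4, 5, 1, 2, 3]
print(ngram_draft(history, n=3, m=2))              # [4, 5]
print(ngram_draft(history, n=3, m=2, min_hits=2))  # []
```

A longer lookup n-gram makes matches rarer but more specific, while a larger m proposes longer drafts that are only worthwhile if the text really does repeat.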
--spec-type values
| Value | Description |
|---|---|
| none | No speculative decoding (default) |
| ngram-cache | N-gram cache with probability statistics |
| ngram-simple | Simple n-gram pattern matching |
| ngram-map-k | N-gram map with key-based hash map |
| ngram-map-k4v | N-gram map with up to four tracked values per key (experimental) |
| ngram-mod | Rolling LCG hash pool, shared across server slots |
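
When launching the server from a script, it can help to validate the --spec-type value before starting the process. The sketch below is one possible way to do that; the helper name `build_command` and the file name `main.gguf` are hypothetical, and the set of accepted values is simply copied from the table above, so it may not match every llama-server build.

```python
from typing import List

# Values copied from the --spec-type table; adjust if your build differs.
SPEC_TYPES = {
    "none", "ngram-cache", "ngram-simple",
    "ngram-map-k", "ngram-map-k4v", "ngram-mod",
}


def build_command(model_path: str, spec_type: str = "none",
                  draft_max: int = 16) -> List[str]:
    """Build an argument list for llama-server with a checked --spec-type."""
    if spec_type not in SPEC_TYPES:
        raise ValueError(f"unknown --spec-type value: {spec_type!r}")
    return [
        "llama-server",
        "-m", model_path,            # main model GGUF
        "--spec-type", spec_type,
        "--draft-max", str(draft_max),
    ]


print(" ".join(build_command("main.gguf", "ngram-mod", draft_max=8)))
# llama-server -m main.gguf --spec-type ngram-mod --draft-max 8
```

The resulting list can be passed directly to `subprocess.run` without shell quoting concerns.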
Statistics output
Each speculative decoding implementation prints statistics at the end of each request. Use them to tune your configuration.

Draft model + ngram-simple combined:

| Field | Meaning |
|---|---|
| draft acceptance rate | Fraction of draft tokens accepted by the main model |
| #calls(b,g,a) | Number of calls: begin (new prompt), generation, accumulation |
| #gen drafts | Number of draft batches generated |
| #acc drafts | Number of draft batches at least partially accepted |
| #gen tokens | Total tokens generated (including rejected) |
| #acc tokens | Total tokens accepted by the main model |
| dur(b,g,a) | Duration in ms for begin, generation, and accumulation phases |
A high #acc tokens / #gen tokens ratio means your draft configuration is well-suited to the content. If the ratio is low, try a different --spec-type, adjust --spec-ngram-size-n, or reduce --draft-max.
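
The ratio itself is simple arithmetic on two of the reported counters. The sketch below uses made-up numbers purely for illustration; the function name `acceptance_ratio` is hypothetical and nothing here parses real server output.

```python
def acceptance_ratio(gen_tokens: int, acc_tokens: int) -> float:
    """Fraction of drafted tokens that the main model accepted."""
    if gen_tokens == 0:
        return 0.0  # nothing was drafted, so there is nothing to accept
    return acc_tokens / gen_tokens


# Illustrative counters: 200 tokens drafted, 150 accepted.
print(acceptance_ratio(gen_tokens=200, acc_tokens=150))  # 0.75
```

Comparing this value across runs with different settings is one way to see whether a change to the draft configuration actually helped.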