How Speculative Decoding Works
Speculative decoding follows this process:

- Draft Phase: A lightweight draft model quickly generates multiple candidate tokens
- Verification Phase: The target model verifies these candidates in parallel
- Acceptance: Verified tokens are accepted; rejected tokens trigger re-generation

This works because:

- The draft model is much smaller and faster than the target
- Verification is more efficient than token-by-token generation
- Multiple tokens can be verified in a single forward pass
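The process above can be sketched in a few lines of Python. This is a toy illustration, not the slime or SGLang implementation; `draft_model` and `target_model` stand in for real LMs as functions that map a context to a next-token probability distribution:

```python
import random

# Toy sketch of the draft-then-verify loop -- an illustration only.
# "Models" here are functions: context -> {token: probability}.

def sample(dist):
    tokens, probs = zip(*dist.items())
    return random.choices(tokens, weights=probs)[0]

def speculative_step(prefix, draft_model, target_model, num_draft_tokens=4):
    # Draft phase: the small model proposes tokens autoregressively.
    drafts, ctx = [], list(prefix)
    for _ in range(num_draft_tokens):
        tok = sample(draft_model(ctx))
        drafts.append(tok)
        ctx.append(tok)

    # Verification phase: the target scores every draft position (in a real
    # system this is a single batched forward pass). A draft token is
    # accepted with probability min(1, p_target / p_draft).
    accepted, ctx = [], list(prefix)
    for tok in drafts:
        p_target = target_model(ctx).get(tok, 0.0)
        p_draft = draft_model(ctx)[tok]
        if random.random() < min(1.0, p_target / p_draft):
            accepted.append(tok)
            ctx.append(tok)
        else:
            # On rejection, the full algorithm resamples from the residual
            # distribution max(0, p_target - p_draft), normalized -- that is
            # what makes the scheme exactly equivalent to target-only
            # sampling. Sampling from the target here keeps the sketch short.
            accepted.append(sample(target_model(ctx)))
            break
    return accepted
```

When every draft token is accepted, `num_draft_tokens` tokens are produced for a single verification pass, which is where the speedup comes from.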
Quick Start with EAGLE
For models with MTP (Medusa-style) layers (e.g., GLM-4.7, DeepSeek-V3/R1), simply add the `--sglang-speculative-*` flags listed under Configuration Parameters below.

Using a Separate Draft Model

If you want to use a separately trained draft model (e.g., trained with SpecForge), also set `--sglang-speculative-draft-model-path`. For detailed parameter meanings and configuration options, see SGLang’s speculative decoding documentation.
Configuration Parameters
Here are the key parameters for configuring speculative decoding:

| Parameter | Description | Default |
|---|---|---|
| `--sglang-speculative-algorithm` | Algorithm to use (e.g., EAGLE, Medusa) | None |
| `--sglang-speculative-num-steps` | Number of speculative steps | 3 |
| `--sglang-speculative-eagle-topk` | Top-k value for EAGLE | 1 |
| `--sglang-speculative-num-draft-tokens` | Number of draft tokens to generate | 4 |
| `--sglang-speculative-draft-model-path` | Path to external draft model | None |
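Putting these parameters together, a launch command might look like the sketch below. The script name and model path are placeholders, not slime’s actual entry point; the values mirror the defaults above:

```shell
# Illustrative sketch -- script name and paths are placeholders.
python train.py \
  --sglang-speculative-algorithm EAGLE \
  --sglang-speculative-num-steps 3 \
  --sglang-speculative-eagle-topk 1 \
  --sglang-speculative-num-draft-tokens 4 \
  --sglang-speculative-draft-model-path /path/to/draft-model  # only when using a separate draft model
```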
Online SFT for the Draft Model
A key challenge in RL with speculative decoding is distribution drift: as RL training progresses, the sampling distributions of the draft and target models diverge. This causes:

- Fewer draft tokens passing verification
- Reduced speedup, or even negative returns
- Wasted computation on rejected drafts
The Solution: Online MTP Training
Slime supports online training of MTP layers during RL, updating the draft model in sync with the target model to maintain sampling speed. This approach is detailed in this blog post.

Enabling Online MTP Training

Add `--enable-mtp-training` to your training configuration; the weight of the MTP loss is controlled by `--mtp-loss-scaling-factor`.

How It Works
During training:

- Forward Pass: Both the target model and the MTP layers process the input
- Loss Computation: The MTP loss is computed with the scaling factor `--mtp-loss-scaling-factor`
- Backward Pass: Gradients flow through both the target and MTP layers
- Weight Update: Both models are updated simultaneously
- Inference: The updated MTP layers are used immediately for the next rollout
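The Loss Computation step above can be sketched as follows. This is a hand-rolled illustration in plain Python with invented helper names, not slime’s actual training code; only the role of the scaling factor is taken from this page:

```python
import math

# Sketch of folding a scaled MTP auxiliary loss into the main LM loss.

def cross_entropy(logits, label):
    # Numerically stable log-softmax cross-entropy for one position.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]

def combined_loss(target_logits, mtp_logits, labels, mtp_labels,
                  mtp_loss_scaling_factor=0.1):
    # target_logits / mtp_logits: one logit vector per position.
    # mtp_loss_scaling_factor plays the role of --mtp-loss-scaling-factor:
    # it keeps the auxiliary MTP loss from dominating the main objective.
    lm_loss = sum(cross_entropy(l, y)
                  for l, y in zip(target_logits, labels)) / len(labels)
    mtp_loss = sum(cross_entropy(l, y)
                   for l, y in zip(mtp_logits, mtp_labels)) / len(mtp_labels)
    return lm_loss + mtp_loss_scaling_factor * mtp_loss
```

Because both terms come from the same forward pass, the backward pass naturally sends gradients through both the target and the MTP layers, matching the steps above.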
Example Configuration
A complete example for training with speculative decoding combines the `--sglang-speculative-*` serving flags from the Quick Start with the online MTP training flags above.

Checkpoint Conversion with MTP

When converting checkpoints from HuggingFace to torch_dist format, include the MTP layers by passing `--mtp-num-layers 1`. Without `--mtp-num-layers 1` during conversion, the checkpoint will not contain MTP weights and online training will fail.

Performance Benefits
Speculative decoding with online MTP training can provide significant speedups:

- 2-3x Throughput: typical speedup for large models with well-tuned draft models
- Maintained Speedup: online training prevents degradation over the course of RL training
- No Quality Loss: speculative decoding is mathematically equivalent to standard sampling
Monitoring Acceptance Rates
Monitor these metrics to ensure speculative decoding is effective:

- Acceptance Rate: the percentage of draft tokens accepted by the target model
- Effective Speedup: the actual speedup, accounting for draft cost and acceptance rate
- Distribution Drift: the KL divergence between the draft and target distributions

Advanced: Training External Draft Models
For external draft models trained separately (e.g., with SpecForge):

- Train a small draft model to mimic the target’s output distribution
- Save the draft model separately from the target
- Load both models during inference
- Periodically retrain the draft model as the target evolves
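The metrics listed under Monitoring Acceptance Rates can be estimated with simple helpers like these. The speedup expression is one rough cost model, not a figure slime reports, and all names here are illustrative:

```python
import math

# Illustrative helpers for the three monitoring metrics.

def acceptance_rate(num_accepted, num_drafted):
    # Fraction of draft tokens the target model accepted.
    return num_accepted / num_drafted

def effective_speedup(acc_rate, num_draft_tokens, draft_cost_ratio):
    # Expected tokens per target forward pass (1 guaranteed token plus the
    # accepted drafts), discounted by the relative cost of drafting.
    expected_tokens = 1 + acc_rate * num_draft_tokens
    relative_cost = 1 + draft_cost_ratio * num_draft_tokens
    return expected_tokens / relative_cost

def kl_divergence(p, q):
    # KL(p || q) between draft and target next-token distributions;
    # a rising value indicates distribution drift.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

A cheap draft model (low `draft_cost_ratio`) with a high acceptance rate maximizes the effective speedup, which is why draft-model size matters so much in the best practices below.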
Best Practices
Choose the Right Draft Size
The draft model should be 4-8x smaller than the target for optimal speedup. Too small and acceptance rate drops; too large and draft overhead dominates.
Tune Draft Token Count
Start with `--sglang-speculative-num-draft-tokens 4` and increase to 8 or 16 if acceptance rates are high (>80%). More draft tokens mean more potential speedup.

Monitor Continuously
Track acceptance rates in your monitoring system. Sudden drops indicate distribution drift or other issues requiring attention.
Enable Online Training Early
Start online MTP training from the beginning of RL rather than adding it later. This prevents initial drift and maintains consistent speedup.
Supported Models
Speculative decoding with MTP layers is currently supported for:

- GLM-4.7: Built-in Medusa heads
- DeepSeek-V3: EAGLE-style MTP layers
- DeepSeek-R1: EAGLE-style MTP layers
- Any model with Medusa or EAGLE heads
Troubleshooting
Low Acceptance Rates (<50%)
Causes:

- Draft model too different from the target
- Distribution drift from RL training
- Incorrect draft model configuration

Solutions:

- Enable `--enable-mtp-training` for online updates
- Reduce `--sglang-speculative-num-draft-tokens`
- Retrain or fine-tune the draft model
No Speedup or Slowdown
Causes:

- Draft model too large
- Verification overhead too high
- Too many draft tokens with low acceptance

Solutions:

- Use a smaller draft model
- Reduce `--sglang-speculative-num-steps`
- Profile to identify bottlenecks
Missing MTP Weights
Error: “MTP layers not found in checkpoint”

Solution: convert the checkpoint with `--mtp-num-layers 1` (see Checkpoint Conversion with MTP above).

Future Improvements
Planned enhancements to speculative decoding in slime:

- Support for training external draft models during RL
- Adaptive draft token count based on acceptance rates
- Multi-level speculative decoding with multiple draft models
- Integration with mixture-of-depths for even greater efficiency