Speculative decoding is a key optimization for speeding up rollouts during RL training. Instead of having the expensive target model decode token by token, a lightweight draft model first decodes ahead to produce several tokens, which the target model then verifies in a batch. This can significantly improve throughput, especially for large models.

How Speculative Decoding Works

Speculative decoding follows this process:
  1. Draft Phase: A lightweight draft model quickly generates multiple candidate tokens
  2. Verification Phase: The target model verifies these candidates in parallel
  3. Acceptance: Accepted tokens are kept; at the first rejected token, the target model supplies a replacement token and drafting resumes from there
This approach is faster because:
  • The draft model is much smaller and faster than the target
  • Verification is more efficient than token-by-token generation
  • Multiple tokens can be verified in a single forward pass
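The three phases above can be sketched as a toy loop. This is a minimal greedy sketch with "models" as plain Python functions, not the actual slime/SGLang implementation:

```python
# Toy sketch of the draft -> verify -> accept loop (greedy decoding).
# Both "models" are plain functions mapping a token sequence to the next
# token; real systems compare probability distributions, not exact tokens.

def speculative_decode(target, draft, prompt, num_draft_tokens, max_len):
    tokens = list(prompt)
    while len(tokens) < max_len:
        # Draft phase: the cheap model proposes several tokens ahead.
        proposal = []
        for _ in range(num_draft_tokens):
            proposal.append(draft(tokens + proposal))
        # Verification phase: the target checks each position (in a real
        # engine this is one batched forward pass, not a Python loop).
        accepted = 0
        for i, tok in enumerate(proposal):
            if target(tokens + proposal[:i]) == tok:
                accepted += 1
            else:
                break
        tokens.extend(proposal[:accepted])
        # Whether or not a draft token was rejected, the target always
        # contributes one token, so progress is at least one token per round.
        tokens.append(target(tokens))
    return tokens[:max_len]
```

In a real engine, verification is a single batched forward pass and sampled tokens are compared via rejection sampling rather than exact match, but the control flow is the same.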

Quick Start with EAGLE

For models with MTP (multi-token prediction) layers (e.g., GLM-4.7, DeepSeek-V3/R1), simply add these flags:
--sglang-speculative-algorithm EAGLE
--sglang-speculative-num-steps 3
--sglang-speculative-eagle-topk 1
--sglang-speculative-num-draft-tokens 4

Using a Separate Draft Model

If you want to use a separately trained draft model (e.g., trained with SpecForge), also set:
--sglang-speculative-draft-model-path /your/draft/model/path
For detailed parameter meanings and configuration options, see SGLang’s speculative decoding documentation.

Configuration Parameters

Here are the key parameters for configuring speculative decoding:
| Parameter | Description | Default |
| --- | --- | --- |
| --sglang-speculative-algorithm | Algorithm to use (e.g., EAGLE, Medusa) | None |
| --sglang-speculative-num-steps | Number of speculative steps | 3 |
| --sglang-speculative-eagle-topk | Top-k value for EAGLE | 1 |
| --sglang-speculative-num-draft-tokens | Number of draft tokens to generate | 4 |
| --sglang-speculative-draft-model-path | Path to external draft model | None |

Online SFT for the Draft Model

A key challenge in RL with speculative decoding is distribution drift: as RL training progresses, the sampling distributions of the draft and target models diverge. This causes:
  • Fewer draft tokens pass verification
  • Reduced speedup, or even a net slowdown
  • Wasted computation on rejected drafts

The Solution: Online MTP Training

Slime supports online training of MTP layers during RL, updating the draft model in sync with the target model to maintain sampling speed. This approach is detailed in this blog post.

Enabling Online MTP Training

Add these flags to your training configuration:
--mtp-num-layers 1
--enable-mtp-training
--mtp-loss-scaling-factor 0.2
This requires a torch_dist checkpoint with MTP weights. You need to add --mtp-num-layers 1 during checkpoint conversion from HuggingFace to torch_dist format.

How It Works

During training:
  1. Forward Pass: Both target model and MTP layers process the input
  2. Loss Computation: MTP loss is computed with scaling factor mtp-loss-scaling-factor
  3. Backward Pass: Gradients flow through both target and MTP layers
  4. Weight Update: Both models are updated simultaneously
  5. Inference: Updated MTP layers are used immediately for next rollout
This keeps the draft model aligned with the evolving target model, maintaining high acceptance rates throughout training.
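The loss computation in step 2 amounts to adding the scaled MTP loss to the main language-modeling loss, so one backward pass updates both sets of weights. A schematic sketch, with illustrative names (lm_loss, mtp_loss) rather than slime's actual internals:

```python
# Schematic of the combined objective when MTP layers are trained online:
# the draft (MTP) loss is added to the main LM loss with a scaling factor,
# so gradients from a single backward pass reach both target and MTP layers.
# Names and values are illustrative, not slime internals.

MTP_LOSS_SCALING_FACTOR = 0.2  # matches --mtp-loss-scaling-factor 0.2

def combined_loss(lm_loss, mtp_loss, scale=MTP_LOSS_SCALING_FACTOR):
    """Total loss whose gradients flow through both target and MTP layers."""
    return lm_loss + scale * mtp_loss
```

A small scaling factor keeps the MTP objective from dominating the RL update while still pulling the draft distribution toward the target's.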

Example Configuration

Here’s a complete example for training with speculative decoding:
MODEL_ARGS=(
  --model-type deepseek-v3
  --num-layers 61
  --hidden-size 7168
  --num-attention-heads 128
  --mtp-num-layers 1  # Enable MTP layers
)

SGLANG_ARGS=(
  # Speculative decoding config
  --sglang-speculative-algorithm EAGLE
  --sglang-speculative-num-steps 3
  --sglang-speculative-eagle-topk 1
  --sglang-speculative-num-draft-tokens 4
  
  # Rollout config
  --rollout-num-gpus-per-engine 2
  --sglang-mem-fraction-static 0.7
)

TRAINING_ARGS=(
  # Enable online MTP training
  --enable-mtp-training
  --mtp-loss-scaling-factor 0.2
)

ray job submit --address="http://127.0.0.1:8265" \
  -- python3 train.py \
  ${MODEL_ARGS[@]} \
  ${SGLANG_ARGS[@]} \
  ${TRAINING_ARGS[@]}

Checkpoint Conversion with MTP

When converting checkpoints from HuggingFace to torch_dist format, include MTP layers:
source scripts/models/deepseek-v3.sh

PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
  ${MODEL_ARGS[@]} \
  --mtp-num-layers 1 \
  --hf-checkpoint /path/to/deepseek-v3 \
  --save /path/to/deepseek-v3_torch_dist
Without --mtp-num-layers 1 during conversion, the checkpoint will not contain MTP weights and online training will fail.

Performance Benefits

Speculative decoding with online MTP training can provide significant speedups:

2-3x Throughput

Typical speedup for large models with well-tuned draft models

Maintained Speedup

Online training prevents degradation over the course of RL training

No Quality Loss

The verification step uses rejection sampling, so the output distribution is mathematically identical to sampling directly from the target model
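The equivalence comes from the modified rejection-sampling rule of speculative sampling: accept a draft token x with probability min(1, p_target(x)/p_draft(x)); on rejection, resample from the normalized residual max(0, p_target − p_draft). A self-contained sketch over small categorical distributions (lists of probabilities), independent of any framework:

```python
def acceptance_prob(p_target, p_draft, token):
    """Probability of accepting a token proposed by the draft model."""
    q = p_draft[token]
    return min(1.0, p_target[token] / q) if q > 0 else 0.0

def residual_distribution(p_target, p_draft):
    """Distribution the target resamples from after a rejection."""
    residual = [max(0.0, p - q) for p, q in zip(p_target, p_draft)]
    total = sum(residual)
    return [r / total for r in residual]
```

Combining accepted draft samples with residual resamples recovers p_target exactly, which is why acceptance rate affects speed but never output quality.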

Monitoring Acceptance Rates

Monitor these metrics to ensure speculative decoding is effective:

Acceptance Rate

The percentage of draft tokens accepted by the target model:
acceptance_rate = accepted_tokens / total_draft_tokens
Target: >70% for good performance

Effective Speedup

The actual speedup considering draft cost and acceptance rate:
effective_speedup = (accepted_tokens * target_cost) / 
                    (draft_tokens * draft_cost + verify_cost)
Target: >2.0x for large models
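Plugging hypothetical numbers into the two formulas above (all counts and per-token costs below are made up for illustration, not measurements):

```python
# Worked example of the two metrics above, with made-up numbers.
accepted_tokens = 320
total_draft_tokens = 400
acceptance_rate = accepted_tokens / total_draft_tokens  # 0.8, above the 70% target

# Per-token costs in arbitrary time units (illustrative only).
target_cost = 10.0   # cost for the target model to decode one token
draft_cost = 1.0     # cost for the draft model to propose one token
verify_cost = 800.0  # total cost of the batched verification passes

effective_speedup = (accepted_tokens * target_cost) / (
    total_draft_tokens * draft_cost + verify_cost
)  # ~2.67x, above the 2.0x target
```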

Distribution Drift

KL divergence between draft and target distributions:
kl_drift = KL(p_target || p_draft)
Without online training, this typically increases over time. Online MTP training keeps it stable.
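For monitoring, the drift metric above can be computed directly over the two models' next-token distributions at sampled positions. A minimal sketch over categorical distributions represented as probability lists (not tied to slime's logging):

```python
import math

def kl_divergence(p_target, p_draft):
    """KL(p_target || p_draft) over a shared categorical vocabulary."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(p_target, p_draft)
        if p > 0
    )
```

A value near zero means the draft still tracks the target; a steadily rising value is the drift signature that online MTP training is meant to suppress.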

Advanced: Training External Draft Models

Training external draft models (not MTP layers) is currently a work in progress. The API and functionality may change in future releases.
For external draft models trained separately (e.g., with SpecForge):
  1. Train a small draft model to mimic the target’s output distribution
  2. Save the draft model separately from the target
  3. Load both models during inference
  4. Periodically retrain the draft model as the target evolves
This approach requires more infrastructure but can work with any target model architecture.

Best Practices

The draft model should be 4-8x smaller than the target for optimal speedup. Too small and acceptance rate drops; too large and draft overhead dominates.
Start with --sglang-speculative-num-draft-tokens 4 and increase to 8 or 16 if acceptance rates are high (>80%). More draft tokens = more potential speedup.
Track acceptance rates in your monitoring system. Sudden drops indicate distribution drift or other issues requiring attention.
Start online MTP training from the beginning of RL rather than adding it later. This prevents initial drift and maintains consistent speedup.

Supported Models

Speculative decoding with MTP layers is currently supported for:
  • GLM-4.7: Built-in Medusa heads
  • DeepSeek-V3: EAGLE-style MTP layers
  • DeepSeek-R1: EAGLE-style MTP layers
  • Any model with Medusa or EAGLE heads
For other models, you can train external draft models using SpecForge or similar approaches.

Troubleshooting

Low Acceptance Rate

Causes:
  • Draft model too different from target
  • Distribution drift from RL training
  • Incorrect draft model configuration
Solutions:
  • Enable --enable-mtp-training for online updates
  • Reduce --sglang-speculative-num-draft-tokens
  • Retrain or fine-tune draft model
Little or No Speedup

Causes:
  • Draft model too large
  • Verification overhead too high
  • Too many draft tokens with low acceptance
Solutions:
  • Use a smaller draft model
  • Reduce --sglang-speculative-num-steps
  • Profile to identify bottlenecks
Error: “MTP layers not found in checkpoint”
Solution: Convert the checkpoint with --mtp-num-layers 1:
python tools/convert_hf_to_torch_dist.py \
  --mtp-num-layers 1 \
  --hf-checkpoint /path/to/model \
  --save /path/to/output

Future Improvements

Planned enhancements to speculative decoding in slime:
  • Support for training external draft models during RL
  • Adaptive draft token count based on acceptance rates
  • Multi-level speculative decoding with multiple draft models
  • Integration with mixture-of-depths for even greater efficiency
