Speculative decoding is a key optimization for speeding up rollouts during RL training. Instead of having the expensive target model decode token by token, a lightweight draft model first decodes ahead to produce several tokens, which the target model then verifies in a batch. This can significantly improve throughput, especially for large models.

How Speculative Decoding Works

Speculative decoding follows this process:
  1. Draft Phase: A lightweight draft model quickly generates multiple candidate tokens
  2. Verification Phase: The target model verifies these candidates in parallel
  3. Acceptance: Accepted tokens are kept; at the first rejected token, the target model supplies a replacement token and drafting resumes from there
This approach is faster because:
  • The draft model is much smaller and faster than the target
  • Verification is more efficient than token-by-token generation
  • Multiple tokens can be verified in a single forward pass
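The three phases above can be sketched as a toy loop. This is a minimal greedy sketch with "models" as plain Python functions, not the actual slime/SGLang implementation:

```python
# Toy sketch of the draft -> verify -> accept loop (greedy decoding).
# Both "models" are plain functions mapping a token sequence to the next
# token; real systems compare probability distributions, not exact tokens.

def speculative_decode(target, draft, prompt, num_draft_tokens, max_len):
    tokens = list(prompt)
    while len(tokens) < max_len:
        # Draft phase: the cheap model proposes several tokens ahead.
        proposal = []
        for _ in range(num_draft_tokens):
            proposal.append(draft(tokens + proposal))
        # Verification phase: the target checks each position (in a real
        # engine this is one batched forward pass, not a Python loop).
        accepted = 0
        for i, tok in enumerate(proposal):
            if target(tokens + proposal[:i]) == tok:
                accepted += 1
            else:
                break
        tokens.extend(proposal[:accepted])
        # Whether or not a draft token was rejected, the target always
        # contributes one token, so progress is at least one token per round.
        tokens.append(target(tokens))
    return tokens[:max_len]
```

In a real engine, verification is a single batched forward pass and sampled tokens are compared via rejection sampling rather than exact match, but the control flow is the same.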

Quick Start with EAGLE

For models with MTP (multi-token prediction) layers (e.g., GLM-4.7, DeepSeek-V3/R1), simply add these flags:
--sglang-speculative-algorithm EAGLE
--sglang-speculative-num-steps 3
--sglang-speculative-eagle-topk 1
--sglang-speculative-num-draft-tokens 4

Using a Separate Draft Model

If you want to use a separately trained draft model (e.g., trained with SpecForge), also set:
--sglang-speculative-draft-model-path /your/draft/model/path
For detailed parameter meanings and configuration options, see SGLang’s speculative decoding documentation.

Configuration Parameters

Here are the key parameters for configuring speculative decoding:
| Parameter | Description | Default |
| --- | --- | --- |
| --sglang-speculative-algorithm | Algorithm to use (e.g., EAGLE, Medusa) | None |
| --sglang-speculative-num-steps | Number of speculative steps | 3 |
| --sglang-speculative-eagle-topk | Top-k value for EAGLE | 1 |
| --sglang-speculative-num-draft-tokens | Number of draft tokens to generate | 4 |
| --sglang-speculative-draft-model-path | Path to external draft model | None |

Online SFT for the Draft Model

A key challenge in RL with speculative decoding is distribution drift: as RL training progresses, the sampling distributions of the draft and target models diverge. This causes:
  • Fewer draft tokens pass verification
  • Reduced speedup, or even a net slowdown
  • Wasted computation on rejected drafts

The Solution: Online MTP Training

Slime supports online training of MTP layers during RL, updating the draft model in sync with the target model to maintain sampling speed. This approach is detailed in this blog post.

Enabling Online MTP Training

Add these flags to your training configuration:
--mtp-num-layers 1
--enable-mtp-training
--mtp-loss-scaling-factor 0.2
This requires a torch_dist checkpoint with MTP weights. You need to add --mtp-num-layers 1 during checkpoint conversion from HuggingFace to torch_dist format.

How It Works

During training:
  1. Forward Pass: Both target model and MTP layers process the input
  2. Loss Computation: MTP loss is computed with scaling factor mtp-loss-scaling-factor
  3. Backward Pass: Gradients flow through both target and MTP layers
  4. Weight Update: Both models are updated simultaneously
  5. Inference: Updated MTP layers are used immediately for next rollout
This keeps the draft model aligned with the evolving target model, maintaining high acceptance rates throughout training.
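The loss computation in step 2 amounts to adding the scaled MTP loss to the main language-modeling loss, so one backward pass updates both sets of weights. A schematic sketch, with illustrative names (lm_loss, mtp_loss) rather than slime's actual internals:

```python
# Schematic of the combined objective when MTP layers are trained online:
# the draft (MTP) loss is added to the main LM loss with a scaling factor,
# so gradients from a single backward pass reach both target and MTP layers.
# Names and values are illustrative, not slime internals.

MTP_LOSS_SCALING_FACTOR = 0.2  # matches --mtp-loss-scaling-factor 0.2

def combined_loss(lm_loss, mtp_loss, scale=MTP_LOSS_SCALING_FACTOR):
    """Total loss whose gradients flow through both target and MTP layers."""
    return lm_loss + scale * mtp_loss
```

A small scaling factor keeps the MTP objective from dominating the RL update while still pulling the draft distribution toward the target's.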

Example Configuration

Here’s a complete example for training with speculative decoding:
MODEL_ARGS=(
  --model-type deepseek-v3
  --num-layers 61
  --hidden-size 7168
  --num-attention-heads 128
  --mtp-num-layers 1  # Enable MTP layers
)

SGLANG_ARGS=(
  # Speculative decoding config
  --sglang-speculative-algorithm EAGLE
  --sglang-speculative-num-steps 3
  --sglang-speculative-eagle-topk 1
  --sglang-speculative-num-draft-tokens 4
  
  # Rollout config
  --rollout-num-gpus-per-engine 2
  --sglang-mem-fraction-static 0.7
)

TRAINING_ARGS=(
  # Enable online MTP training
  --enable-mtp-training
  --mtp-loss-scaling-factor 0.2
)

ray job submit --address="http://127.0.0.1:8265" \
  -- python3 train.py \
  ${MODEL_ARGS[@]} \
  ${SGLANG_ARGS[@]} \
  ${TRAINING_ARGS[@]}

Checkpoint Conversion with MTP

When converting checkpoints from HuggingFace to torch_dist format, include MTP layers:
source scripts/models/deepseek-v3.sh

PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
  ${MODEL_ARGS[@]} \
  --mtp-num-layers 1 \
  --hf-checkpoint /path/to/deepseek-v3 \
  --save /path/to/deepseek-v3_torch_dist
Without --mtp-num-layers 1 during conversion, the checkpoint will not contain MTP weights and online training will fail.

Performance Benefits

Speculative decoding with online MTP training can provide significant speedups:

2-3x Throughput

Typical speedup for large models with well-tuned draft models

Maintained Speedup

Online training prevents degradation over the course of RL training

No Quality Loss

The verification step uses rejection sampling, so the output distribution is mathematically identical to sampling directly from the target model
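The equivalence comes from the modified rejection-sampling rule of speculative sampling: accept a draft token x with probability min(1, p_target(x)/p_draft(x)); on rejection, resample from the normalized residual max(0, p_target − p_draft). A self-contained sketch over small categorical distributions (lists of probabilities), independent of any framework:

```python
def acceptance_prob(p_target, p_draft, token):
    """Probability of accepting a token proposed by the draft model."""
    q = p_draft[token]
    return min(1.0, p_target[token] / q) if q > 0 else 0.0

def residual_distribution(p_target, p_draft):
    """Distribution the target resamples from after a rejection."""
    residual = [max(0.0, p - q) for p, q in zip(p_target, p_draft)]
    total = sum(residual)
    return [r / total for r in residual]
```

Combining accepted draft samples with residual resamples recovers p_target exactly, which is why acceptance rate affects speed but never output quality.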

Monitoring Acceptance Rates

Monitor these metrics to ensure speculative decoding is effective:

Acceptance Rate

The percentage of draft tokens accepted by the target model:
acceptance_rate = accepted_tokens / total_draft_tokens
Target: >70% for good performance

Effective Speedup

The actual speedup considering draft cost and acceptance rate:
effective_speedup = (accepted_tokens * target_cost) / 
                    (draft_tokens * draft_cost + verify_cost)
Target: >2.0x for large models
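Plugging hypothetical numbers into the two formulas above (all counts and per-token costs below are made up for illustration, not measurements):

```python
# Worked example of the two metrics above, with made-up numbers.
accepted_tokens = 320
total_draft_tokens = 400
acceptance_rate = accepted_tokens / total_draft_tokens  # 0.8, above the 70% target

# Per-token costs in arbitrary time units (illustrative only).
target_cost = 10.0   # cost for the target model to decode one token
draft_cost = 1.0     # cost for the draft model to propose one token
verify_cost = 800.0  # total cost of the batched verification passes

effective_speedup = (accepted_tokens * target_cost) / (
    total_draft_tokens * draft_cost + verify_cost
)  # ~2.67x, above the 2.0x target
```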

Distribution Drift

KL divergence between draft and target distributions:
kl_drift = KL(p_target || p_draft)
Without online training, this typically increases over time. Online MTP training keeps it stable.
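For monitoring, the drift metric above can be computed directly over the two models' next-token distributions at sampled positions. A minimal sketch over categorical distributions represented as probability lists (not tied to slime's logging):

```python
import math

def kl_divergence(p_target, p_draft):
    """KL(p_target || p_draft) over a shared categorical vocabulary."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(p_target, p_draft)
        if p > 0
    )
```

A value near zero means the draft still tracks the target; a steadily rising value is the drift signature that online MTP training is meant to suppress.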

Advanced: Training External Draft Models

Training external draft models (not MTP layers) is currently a work in progress. The API and functionality may change in future releases.
For external draft models trained separately (e.g., with SpecForge):
  1. Train a small draft model to mimic the target’s output distribution
  2. Save the draft model separately from the target
  3. Load both models during inference
  4. Periodically retrain the draft model as the target evolves
This approach requires more infrastructure but can work with any target model architecture.

Best Practices

The draft model should be 4-8x smaller than the target for optimal speedup. Too small and acceptance rate drops; too large and draft overhead dominates.
Start with --sglang-speculative-num-draft-tokens 4 and increase to 8 or 16 if acceptance rates are high (>80%). More draft tokens = more potential speedup.
Track acceptance rates in your monitoring system. Sudden drops indicate distribution drift or other issues requiring attention.
Start online MTP training from the beginning of RL rather than adding it later. This prevents initial drift and maintains consistent speedup.

Supported Models

Speculative decoding with MTP layers is currently supported for:
  • GLM-4.7: Built-in Medusa heads
  • DeepSeek-V3: EAGLE-style MTP layers
  • DeepSeek-R1: EAGLE-style MTP layers
  • Any model with Medusa or EAGLE heads
For other models, you can train external draft models using SpecForge or similar approaches.

Troubleshooting

Low Acceptance Rate

Causes:
  • Draft model too different from target
  • Distribution drift from RL training
  • Incorrect draft model configuration
Solutions:
  • Enable --enable-mtp-training for online updates
  • Reduce --sglang-speculative-num-draft-tokens
  • Retrain or fine-tune draft model
Little or No Speedup

Causes:
  • Draft model too large
  • Verification overhead too high
  • Too many draft tokens with low acceptance
Solutions:
  • Use a smaller draft model
  • Reduce --sglang-speculative-num-steps
  • Profile to identify bottlenecks
Error: “MTP layers not found in checkpoint”
Solution: Convert the checkpoint with --mtp-num-layers 1:
python tools/convert_hf_to_torch_dist.py \
  --mtp-num-layers 1 \
  --hf-checkpoint /path/to/model \
  --save /path/to/output

Future Improvements

Planned enhancements to speculative decoding in slime:
  • Support for training external draft models during RL
  • Adaptive draft token count based on acceptance rates
  • Multi-level speculative decoding with multiple draft models
  • Integration with mixture-of-depths for even greater efficiency
