How It Works
OPD modifies the advantage computation by subtracting a KL penalty term that encourages the student to match the teacher's output distribution:

A'_t = A_t - β · KL(π_student ‖ π_teacher)

Where:
- A_t is the original advantage from the base estimator (e.g., GRPO)
- β is the `--opd-kl-coef` coefficient
- KL is the token-level reverse KL divergence
The student model generates its own rollouts, not the teacher. This is “on-policy” because we’re distilling on the student’s own data distribution, not the teacher’s.
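The adjustment above can be sketched in a few lines. This is an illustrative sketch, not slime's actual implementation: it uses the standard single-sample Monte Carlo estimate of the reverse KL at each sampled token, `log p_student - log p_teacher`.

```python
def opd_advantages(advantages, student_log_probs, teacher_log_probs, kl_coef=1.0):
    """Subtract a per-token reverse-KL penalty from the base advantages.

    Uses the single-sample estimate of reverse KL at each sampled token:
    log p_student(token) - log p_teacher(token).
    """
    return [
        a - kl_coef * (s - t)
        for a, s, t in zip(advantages, student_log_probs, teacher_log_probs)
    ]

# Token 1: student assigns higher log-prob than the teacher -> positive KL
# estimate, so the advantage is pushed down. Token 2: they agree -> unchanged.
adj = opd_advantages([1.0, 0.5], [-1.0, -2.0], [-2.0, -2.0], kl_coef=1.0)
```

When the student already matches the teacher, the penalty vanishes and the base advantage passes through untouched.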
Key Arguments
| Argument | Description | Default |
|---|---|---|
| `--use-opd` | Enable on-policy distillation | Required flag |
| `--opd-type` | Type of OPD: `sglang` or `megatron` | Required |
| `--opd-kl-coef` | OPD KL penalty coefficient | 1.0 |
| `--opd-teacher-load` | Path to teacher Megatron checkpoint | Required for `megatron` type |
| `--opd-teacher-ckpt-step` | Optional checkpoint step for teacher | None |
Two Teacher Modes
SGLang Mode
The teacher runs on an external SGLang server. Teacher log-probs are obtained during the rollout phase.

When to use:
- The teacher has a different architecture from the student
- The teacher is too large to load alongside the training model
- An external SGLang server runs the teacher model

How it works:
- During rollout, a custom reward function (`slime.rollout.on_policy_distillation.reward_func`) sends each sample to the teacher server to obtain token-level log-probs
- A custom post-processing function (`slime.rollout.on_policy_distillation.post_process_rewards`) trims the teacher log-probs to the response span and stores them in `sample.teacher_log_probs`
- During training, the KL penalty is computed from the stored teacher log-probs and applied to advantages
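The trimming step can be illustrated with a hypothetical helper (the real logic lives in `slime.rollout.on_policy_distillation.post_process_rewards`; the `Sample` class and field names here are stand-ins):

```python
from dataclasses import dataclass, field

@dataclass
class Sample:
    # Stand-in for slime's sample object; field names are illustrative.
    prompt_len: int
    response_len: int
    teacher_log_probs: list = field(default_factory=list)

def store_teacher_log_probs(sample, full_log_probs):
    """Keep only the log-probs for response tokens: the KL penalty
    should never be applied to prompt tokens."""
    start = sample.prompt_len
    sample.teacher_log_probs = full_log_probs[start : start + sample.response_len]
    return sample

# 10 token positions total: 3 prompt tokens, 4 response tokens, 3 padding.
s = store_teacher_log_probs(
    Sample(prompt_len=3, response_len=4),
    [float(i) for i in range(10)],
)
```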
Megatron Mode
The teacher model is loaded directly into Megatron via `--opd-teacher-load`. Teacher log-probs are computed during the training forward pass.
When to use:
- The teacher has the same architecture as the student/reference model
- The teacher fits in GPU memory alongside the student

How it works:
- The teacher model is loaded as an additional Megatron model during initialization
- During the training forward pass, the teacher model computes log-probs for each sample
- The KL penalty is computed inline and applied to advantages
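The teacher's forward pass ultimately reduces to turning logits into per-token log-probs at the sampled token ids. A pure-Python sketch of that gather (slime performs the equivalent inside Megatron on GPU tensors):

```python
import math

def token_log_probs(logits, token_ids):
    """Per-position log-softmax over the vocabulary, then pick the
    log-prob of the sampled token at each position."""
    out = []
    for row, tok in zip(logits, token_ids):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        out.append(row[tok] - log_z)
    return out

# Uniform logits over a 4-token vocabulary -> each token has log-prob -log(4).
lp = token_log_probs([[0.0, 0.0, 0.0, 0.0]], [2])
```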
Running the Examples
Complete example scripts are provided in `examples/on_policy_distillation/`.
SGLang Teacher Example
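A hedged sketch of what an SGLang-teacher launch might look like, using only the flags documented above. The `train.py` entry point and the port are assumptions; consult the script in `examples/on_policy_distillation/` for the real invocation.

```shell
# Sketch only - see examples/on_policy_distillation/ for the actual script.
# Assumes a teacher SGLang server is already listening at teacher:10090.
python train.py \
  --use-opd \
  --opd-type sglang \
  --opd-kl-coef 1.0 \
  --rm-url http://teacher:10090 \
  --custom-rm-path slime.rollout.on_policy_distillation.reward_func
```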
Megatron Teacher Example
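The Megatron-teacher variant swaps the server flags for a checkpoint path. Again a sketch under the same assumptions (`train.py` and the paths are placeholders):

```shell
# Sketch only - see examples/on_policy_distillation/ for the actual script.
python train.py \
  --use-opd \
  --opd-type megatron \
  --opd-kl-coef 1.0 \
  --opd-teacher-load /path/to/teacher/checkpoint \
  --opd-teacher-ckpt-step 2000   # optional; omit to use the latest step
```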
Preliminary Results
Using a Qwen3-8B-Base model SFT-ed on part of the OpenThoughts3-1.2M dataset, on-policy distillation with a Qwen3-32B teacher on the remaining data yields:

| Configuration | Pass@1 |
|---|---|
| Qwen3-8B-Base + SFT | 76% |
| Qwen3-8B-Base + SFT + On-Policy Distillation | 94% |
This represents an 18 percentage point improvement from distillation, demonstrating the effectiveness of learning from a larger teacher during RL training.
Combining with Other Techniques
OPD is orthogonal to other RL techniques and can be combined freely:
- OPD + GRPO
- OPD + PPO
- OPD + REINFORCE++
Tuning the KL Coefficient
The `--opd-kl-coef` parameter controls the strength of the distillation signal:
Low (0.1-0.5)
Weak distillation signal. Student focuses more on the RL reward signal. Use when teacher and student are similar.
Medium (0.5-2.0)
Balanced distillation. Good default for most settings. Student learns from both teacher and environment.
High (2.0+)
Strong distillation signal. Student strongly mimics teacher. Use when teacher is much better than student.
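A toy calculation shows how the coefficient shifts the balance between the two signals. The numbers are purely illustrative (a base advantage of 1.0 and a per-token KL estimate of 0.4):

```python
def opd_advantage(advantage, kl_estimate, kl_coef):
    # Same adjustment as in the formula above: subtract the weighted KL penalty.
    return advantage - kl_coef * kl_estimate

low = opd_advantage(1.0, 0.4, 0.1)    # ~0.96: the RL reward dominates
high = opd_advantage(1.0, 0.4, 2.5)   # 0.0: the distillation penalty dominates
```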
Recommended Tuning Process
- Start with `--opd-kl-coef 1.0`
- Monitor both the RL reward and the teacher-student KL divergence
- If the student diverges from the teacher too quickly, increase the coefficient
- If the student doesn't learn from the environment, decrease the coefficient
- Fine-tune based on validation performance
Architecture Comparison
Best Practices
Choose the Right Teacher
Select a teacher that is:
- Significantly better than the student (2-4x larger is typical)
- Trained on similar or the same domain
- Compatible with your architecture (for Megatron mode)
Monitor KL Divergence
Track the KL divergence between teacher and student over time:
- Should decrease initially as student learns
- Should stabilize at a low value
- Sudden increases indicate issues
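A minimal way to implement the "sudden increase" check, assuming you log a scalar mean KL each training step (the window size and spike threshold here are arbitrary starting points):

```python
def kl_spiked(kl_history, window=5, spike_factor=2.0):
    """Flag a sudden KL increase: the latest value exceeds spike_factor
    times the average of the preceding `window` values."""
    if len(kl_history) <= window:
        return False  # not enough history to judge
    baseline = sum(kl_history[-window - 1 : -1]) / window
    return kl_history[-1] > spike_factor * baseline

healthy = kl_spiked([0.50, 0.40, 0.40, 0.30, 0.30, 0.28])  # steady decrease
spiked = kl_spiked([0.50, 0.40, 0.40, 0.30, 0.30, 1.50])   # sudden jump
```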
Use SGLang Mode for Different Architectures
If your teacher has a different architecture (e.g., different tokenizer, different layer count), you must use SGLang mode. Megatron mode requires identical architectures.
Balance Memory and Speed
SGLang mode uses more network bandwidth but less GPU memory. Megatron mode uses more GPU memory but is faster. Choose based on your constraints.
Troubleshooting
Teacher logprobs are all zeros (SGLang mode)
Cause: Teacher server not configured correctly, or the custom reward function is not being called.

Solution:
- Verify the teacher server is running: `curl http://teacher:10090/health`
- Check that `--rm-url` points to the teacher server
- Ensure `--custom-rm-path` is set correctly
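A quick sanity check you can drop into a debug hook. The `teacher_log_probs` field name comes from the SGLang-mode description above; the helper itself is illustrative:

```python
def teacher_log_probs_missing(teacher_log_probs):
    """All-zero or empty teacher log-probs almost always mean the teacher
    server was never queried - real log-probs are negative floats."""
    return len(teacher_log_probs) == 0 or all(lp == 0.0 for lp in teacher_log_probs)

symptom = teacher_log_probs_missing([0.0, 0.0, 0.0])   # teacher never queried
ok = teacher_log_probs_missing([-1.2, -0.3, -0.05])    # healthy values
```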
OOM when loading teacher (Megatron mode)
Cause: Not enough GPU memory for both the student and teacher models.

Solutions:
- Switch to SGLang mode with external teacher server
- Increase GPU count or use larger GPUs
- Enable CPU offloading for teacher (if supported)
- Use a smaller teacher model
Student not learning from teacher
Cause: `--opd-kl-coef` too low, or the teacher and student are too similar.

Solutions:
- Increase `--opd-kl-coef` to 2.0 or higher
- Verify the teacher is actually better on the validation set
- Check that teacher log-probs are being computed correctly
Student copying teacher too much
Cause: `--opd-kl-coef` too high, overwhelming the RL signal.

Solutions:
- Decrease `--opd-kl-coef` to 0.5 or lower
- Ensure the RL reward signal is strong enough
- Check that advantages are being computed correctly
Related Work
On-policy distillation builds on several key ideas:
- Distillation: Learning from a teacher model (Hinton et al., 2015)
- On-policy learning: Training on the student's own distribution
- KL penalties in RL: Constraining policy updates (Schulman et al., 2017)