On-policy distillation (OPD) enables a student model to learn from a larger teacher model by training on its own rollouts while matching the teacher’s token-level log-probabilities. OPD is orthogonal to the choice of advantage estimator: it is applied as an additive KL penalty on top of any estimator (GRPO, PPO, REINFORCE++, etc.).

How It Works

OPD modifies the advantage computation by subtracting a KL penalty term that encourages the student to match the teacher’s output distribution:

$$\hat{A}_t = A_t - \lambda_{\text{opd}} \cdot D_{\text{KL}}(P_{\text{teacher}} \| P_{\text{student}})_t$$

Where:
  • $A_t$ is the original advantage from the base estimator (e.g., GRPO)
  • $\lambda_{\text{opd}}$ is set by --opd-kl-coef
  • $D_{\text{KL}}$ is the token-level reverse KL divergence
This means OPD can be combined with any advantage estimator, including GRPO, PPO, REINFORCE++, and GSPO.
The student model generates its own rollouts, not the teacher. This is “on-policy” because we’re distilling on the student’s own data distribution, not the teacher’s.
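
The per-token adjustment can be sketched in a few lines. This is an illustrative snippet, not slime's actual implementation: it uses the simple k1 per-token estimate of the KL divergence computed from sampled-token log-probs (the real code may use a different estimator or the full distributions).

```python
def apply_opd_penalty(advantages, student_log_probs, teacher_log_probs,
                      opd_kl_coef=1.0):
    """Apply the OPD KL penalty to per-token advantages (illustrative).

    Uses the simple k1 estimate (log p_student - log p_teacher) of the
    per-token KL on the sampled tokens: tokens where the teacher assigns
    higher probability than the student get a boosted advantage.
    """
    adjusted = []
    for adv, s_lp, t_lp in zip(advantages, student_log_probs, teacher_log_probs):
        kl_est = s_lp - t_lp  # per-token KL estimate on the sampled token
        adjusted.append(adv - opd_kl_coef * kl_est)
    return adjusted

# Token 2: the teacher assigns a higher log-prob than the student, so
# its advantage is increased, pulling the student toward the teacher.
adv_hat = apply_opd_penalty(
    advantages=[1.0, 1.0, 1.0],
    student_log_probs=[-1.0, -2.0, -0.5],
    teacher_log_probs=[-1.0, -0.5, -0.5],
)
```

Where student and teacher agree, the penalty vanishes and the base advantage passes through unchanged.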

Key Arguments

| Argument | Description | Default |
|---|---|---|
| --use-opd | Enable on-policy distillation | Required flag |
| --opd-type | Type of OPD: sglang or megatron | Required |
| --opd-kl-coef | OPD KL penalty coefficient | 1.0 |
| --opd-teacher-load | Path to teacher Megatron checkpoint | Required for megatron type |
| --opd-teacher-ckpt-step | Optional checkpoint step for teacher | None |

Two Teacher Modes

SGLang Mode

The teacher runs on an external SGLang server. Teacher log-probs are obtained during the rollout phase. When to use:
  • The teacher has a different architecture from the student
  • The teacher is too large to load alongside the training model
How it works:
  1. An external SGLang server runs the teacher model
  2. During rollout, a custom reward function (slime.rollout.on_policy_distillation.reward_func) sends each sample to the teacher server to obtain token-level log-probs
  3. A custom post-processing function (slime.rollout.on_policy_distillation.post_process_rewards) trims the teacher log-probs to the response span and stores them in sample.teacher_log_probs
  4. During training, the KL penalty is computed from the stored teacher log-probs and applied to advantages
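
The reward-function step above boils down to posting each sample to the teacher server and reading back token-level log-probs. A minimal sketch, assuming the teacher exposes SGLang's /generate endpoint; the exact field names (return_logprob, logprob_start_len) should be verified against your server version:

```python
import json
from urllib import request

def build_teacher_request(prompt: str, response: str) -> dict:
    """Build a scoring request for the teacher's /generate endpoint.

    Illustrative payload: generate no new tokens, just return per-token
    log-probs over the full prompt+response.
    """
    return {
        "text": prompt + response,
        "sampling_params": {"max_new_tokens": 0, "temperature": 0.0},
        "return_logprob": True,
        "logprob_start_len": 0,
    }

def query_teacher_logprobs(url: str, prompt: str, response: str) -> dict:
    # POST the sample to the teacher server and return the parsed JSON,
    # which carries the token-level log-probs in its meta information.
    data = json.dumps(build_teacher_request(prompt, response)).encode()
    req = request.Request(url, data=data,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)
```

The post-processing step then trims the returned log-probs to the response span before storing them on the sample.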
Configuration:
OPD_ARGS=(
  --use-opd
  --opd-type sglang
  --opd-kl-coef 1.0
  --custom-rm-path slime.rollout.on_policy_distillation.reward_func
  --custom-reward-post-process-path slime.rollout.on_policy_distillation.post_process_rewards
  --rm-url http://<TEACHER_IP>:<TEACHER_PORT>/generate
)

ray job submit --address="http://127.0.0.1:8265" \
  -- python3 train.py \
  "${MODEL_ARGS[@]}" \
  "${OPD_ARGS[@]}"

Megatron Mode

The teacher model is loaded directly into Megatron via --opd-teacher-load. Teacher log-probs are computed during the training forward pass. When to use:
  • The teacher has the same architecture as the student/reference model
  • The teacher fits in GPU memory alongside the student
How it works:
  1. The teacher model is loaded as an additional Megatron model during initialization
  2. During the training forward pass, the teacher model computes log-probs for each sample
  3. The KL penalty is computed inline and applied to advantages
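
Because the teacher's full logits are available during the training forward pass, Megatron mode can compute the exact per-token reverse KL rather than a sampled estimate. A minimal sketch over a toy vocabulary in plain Python (the real code operates on batched GPU tensors):

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - lse for l in logits]

def token_kl(teacher_logits, student_logits):
    """Exact per-token D_KL(P_teacher || P_student) from full logits."""
    t_lp = log_softmax(teacher_logits)
    s_lp = log_softmax(student_logits)
    return sum(math.exp(t) * (t - s) for t, s in zip(t_lp, s_lp))

# Identical distributions give zero KL; any mismatch gives a positive
# penalty that is subtracted from the advantage, scaled by --opd-kl-coef.
```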
Configuration:
OPD_ARGS=(
  --use-opd
  --opd-type megatron
  --opd-kl-coef 1.0
  --opd-teacher-load /path/to/teacher_torch_dist
)

ray job submit --address="http://127.0.0.1:8265" \
  -- python3 train.py \
  "${MODEL_ARGS[@]}" \
  "${OPD_ARGS[@]}"
The teacher checkpoint must be in Megatron format (torch_dist or torch). Convert from HuggingFace format using:
python tools/convert_hf_to_torch_dist.py \
  --hf-checkpoint /path/to/teacher_hf \
  --save /path/to/teacher_torch_dist

Running the Examples

Complete example scripts are provided in examples/on_policy_distillation/.

SGLang Teacher Example

Step 1: Download Models and Data

hf download Qwen/Qwen3-32B --local-dir /root/Qwen3-32B
hf download Qwen/Qwen3-8B --local-dir /root/Qwen3-8B
hf download --repo-type dataset zhuzilin/dapo-math-17k --local-dir /root/dapo-math-17k
Step 2: Convert Student Model

cd /root/slime
source scripts/models/qwen3-8B.sh

PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
    "${MODEL_ARGS[@]}" \
    --hf-checkpoint /root/Qwen3-8B \
    --save /root/Qwen3-8B_torch_dist
Step 3: Run Training

bash examples/on_policy_distillation/run-qwen3-8B-opd.sh

Megatron Teacher Example

Step 1: Convert Both Models

Convert both student and teacher models to Megatron format:
# Convert student
python tools/convert_hf_to_torch_dist.py \
    --hf-checkpoint /root/Qwen3-8B \
    --save /root/Qwen3-8B_torch_dist

# Convert teacher
python tools/convert_hf_to_torch_dist.py \
    --hf-checkpoint /root/Qwen3-32B \
    --save /root/Qwen3-32B_torch_dist
Step 2: Run Training

bash examples/on_policy_distillation/run-qwen3-8B-opd-megatron.sh

Preliminary Results

Starting from a Qwen3-8B-Base model SFT-ed on part of the OpenThoughts3-1.2M dataset, on-policy distillation with a Qwen3-32B teacher on the remaining data yields:
| Configuration | Pass@1 |
|---|---|
| Qwen3-8B-Base + SFT | 76% |
| Qwen3-8B-Base + SFT + On-Policy Distillation | 94% |
This represents an 18 percentage point improvement from distillation, demonstrating the effectiveness of learning from a larger teacher during RL training.

Combining with Other Techniques

OPD is orthogonal to other RL techniques and can be combined freely:

OPD + GRPO

ADVANTAGE_ARGS=(
  --advantage-estimator grpo
  --use-kl-loss
  --kl-loss-coef 0.01
)

OPD_ARGS=(
  --use-opd
  --opd-type megatron
  --opd-kl-coef 1.0
  --opd-teacher-load /path/to/teacher
)

OPD + PPO

ADVANTAGE_ARGS=(
  --advantage-estimator ppo
  --value-model-load /path/to/value_model
  --eps-clip 0.2
)

OPD_ARGS=(
  --use-opd
  --opd-type sglang
  --opd-kl-coef 1.0
  --rm-url http://teacher:10090/generate
)

OPD + REINFORCE++

ADVANTAGE_ARGS=(
  --advantage-estimator reinforce_pp
  --use-advantage-normalization
)

OPD_ARGS=(
  --use-opd
  --opd-type megatron
  --opd-kl-coef 0.5
  --opd-teacher-load /path/to/teacher
)

Tuning the KL Coefficient

The --opd-kl-coef parameter controls the strength of the distillation signal:

Low (0.1-0.5)

Weak distillation signal. Student focuses more on the RL reward signal. Use when teacher and student are similar.

Medium (0.5-2.0)

Balanced distillation. Good default for most settings. Student learns from both teacher and environment.

High (2.0+)

Strong distillation signal. Student strongly mimics teacher. Use when teacher is much better than student.
  1. Start with --opd-kl-coef 1.0
  2. Monitor both RL reward and teacher-student KL divergence
  3. If student diverges from teacher too quickly, increase coefficient
  4. If student doesn’t learn from environment, decrease coefficient
  5. Fine-tune based on validation performance
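
Steps 3 and 4 of this loop can be sketched as a toy adjustment rule (all names and thresholds here are illustrative, not slime options):

```python
def adjust_opd_coef(coef, kl_trend, reward_trend, step=2.0):
    """Toy coefficient-adjustment rule mirroring the tuning steps above.

    kl_trend > 0 means teacher-student KL is rising (student diverging);
    reward_trend <= 0 means the RL reward is flat or falling.
    """
    if kl_trend > 0:
        return coef * step  # diverging from teacher: strengthen distillation
    if reward_trend <= 0:
        return coef / step  # not learning from environment: weaken it
    return coef
```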

Architecture Comparison

┌─────────────────────────────────────┐
│  Rollout Phase                      │
├─────────────────────────────────────┤
│  Student Model (SGLang)             │
│    ↓ generates samples              │
│  Teacher Model (SGLang)             │
│    ↓ computes logprobs              │
│  Store teacher_log_probs in sample  │
└─────────────────────────────────────┘

┌─────────────────────────────────────┐
│  Training Phase                     │
├─────────────────────────────────────┤
│  Student Model (Megatron)           │
│    ↓ forward pass                   │
│  Compute KL(teacher || student)     │
│    ↓ from stored teacher_log_probs  │
│  Apply OPD penalty to advantages    │
│    ↓                                │
│  Backward pass & update             │
└─────────────────────────────────────┘

Best Practices

Select a teacher that is:
  • Significantly better than the student (2-4x larger is typical)
  • Trained on similar or the same domain
  • Compatible with your architecture (for Megatron mode)
A teacher that’s too small won’t provide benefit; one that’s too large may be impractical.
Track the KL divergence between teacher and student over time:
  • Should decrease initially as student learns
  • Should stabilize at a low value
  • Sudden increases indicate issues
Log this metric alongside your RL rewards.
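
A simple way to catch the "sudden increases" mentioned above is to compare the latest KL value against a recent moving average (an illustrative monitor, not a slime built-in):

```python
def kl_health_check(kl_history, window=10, spike_factor=2.0):
    """Return "spike" if the latest KL jumps above spike_factor times
    the average of the previous `window` values, else "ok"."""
    if len(kl_history) <= window:
        return "ok"  # not enough history yet
    baseline = sum(kl_history[-window - 1:-1]) / window
    return "spike" if kl_history[-1] > spike_factor * baseline else "ok"
```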
If your teacher has a different architecture (e.g., different tokenizer, different layer count), you must use SGLang mode. Megatron mode requires identical architectures.
SGLang mode uses more network bandwidth but less GPU memory. Megatron mode uses more GPU memory but is faster. Choose based on your constraints.

Troubleshooting

Teacher log-probs are missing (SGLang mode)
Cause: Teacher server not configured correctly, or the custom reward function is not being called.
Solution:
  • Verify the teacher server is running: curl http://teacher:10090/health
  • Check that --rm-url points to the teacher server
  • Ensure --custom-rm-path is set correctly
Out-of-memory errors (Megatron mode)
Cause: Not enough GPU memory for both student and teacher models.
Solutions:
  • Switch to SGLang mode with an external teacher server
  • Increase GPU count or use larger GPUs
  • Enable CPU offloading for the teacher (if supported)
  • Use a smaller teacher model
No improvement from distillation
Cause: --opd-kl-coef too low, or teacher and student too similar.
Solutions:
  • Increase --opd-kl-coef to 2.0 or higher
  • Verify the teacher is actually better on a validation set
  • Check that teacher log-probs are being computed correctly
Student ignores the RL reward
Cause: --opd-kl-coef too high, overwhelming the RL signal.
Solutions:
  • Decrease --opd-kl-coef to 0.5 or lower
  • Ensure the RL reward signal is strong enough
  • Check that advantages are being computed correctly

Background

On-policy distillation builds on several key ideas:
  • Distillation: Learning from a teacher model (Hinton et al., 2015)
  • On-policy learning: Training on the student’s own distribution
  • KL penalties in RL: Constraining policy updates (Schulman et al., 2017)
For more details, see the slime paper and related publications on policy distillation in RL.
