On-policy distillation (OPD) enables a student model to learn from a larger teacher model by training on its own rollouts while matching the teacher’s token-level log-probabilities. OPD is orthogonal to the choice of advantage estimator: it is applied as an additive KL penalty on top of any estimator (GRPO, PPO, REINFORCE++, etc.).

How It Works

OPD modifies the advantage computation by subtracting a KL penalty term that encourages the student to match the teacher’s output distribution:

$$\hat{A}_t = A_t - \lambda_{\text{opd}} \cdot D_{\text{KL}}(P_{\text{teacher}} \| P_{\text{student}})_t$$

Where:
  • $A_t$ is the original advantage from the base estimator (e.g., GRPO)
  • $\lambda_{\text{opd}}$ is set by --opd-kl-coef
  • $D_{\text{KL}}$ is the token-level reverse KL divergence
This means OPD can be combined with any advantage estimator, including GRPO, PPO, REINFORCE++, and GSPO.
The student model generates its own rollouts, not the teacher. This is “on-policy” because we’re distilling on the student’s own data distribution, not the teacher’s.
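
The per-token adjustment can be sketched in a few lines. This is an illustrative snippet, not slime's actual implementation: it uses the simple k1 per-token estimate of the KL divergence computed from sampled-token log-probs (the real code may use a different estimator or the full distributions).

```python
def apply_opd_penalty(advantages, student_log_probs, teacher_log_probs,
                      opd_kl_coef=1.0):
    """Apply the OPD KL penalty to per-token advantages (illustrative).

    Uses the simple k1 estimate (log p_student - log p_teacher) of the
    per-token KL on the sampled tokens: tokens where the teacher assigns
    higher probability than the student get a boosted advantage.
    """
    adjusted = []
    for adv, s_lp, t_lp in zip(advantages, student_log_probs, teacher_log_probs):
        kl_est = s_lp - t_lp  # per-token KL estimate on the sampled token
        adjusted.append(adv - opd_kl_coef * kl_est)
    return adjusted

# Token 2: the teacher assigns a higher log-prob than the student, so
# its advantage is increased, pulling the student toward the teacher.
adv_hat = apply_opd_penalty(
    advantages=[1.0, 1.0, 1.0],
    student_log_probs=[-1.0, -2.0, -0.5],
    teacher_log_probs=[-1.0, -0.5, -0.5],
)
```

Where student and teacher agree, the penalty vanishes and the base advantage passes through unchanged.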

Key Arguments

| Argument | Description | Default |
|---|---|---|
| --use-opd | Enable on-policy distillation | Required flag |
| --opd-type | Type of OPD: sglang or megatron | Required |
| --opd-kl-coef | OPD KL penalty coefficient | 1.0 |
| --opd-teacher-load | Path to teacher Megatron checkpoint | Required for megatron type |
| --opd-teacher-ckpt-step | Optional checkpoint step for teacher | None |

Two Teacher Modes

SGLang Mode

The teacher runs on an external SGLang server. Teacher log-probs are obtained during the rollout phase. When to use:
  • The teacher has a different architecture from the student
  • The teacher is too large to load alongside the training model
How it works:
  1. An external SGLang server runs the teacher model
  2. During rollout, a custom reward function (slime.rollout.on_policy_distillation.reward_func) sends each sample to the teacher server to obtain token-level log-probs
  3. A custom post-processing function (slime.rollout.on_policy_distillation.post_process_rewards) trims the teacher log-probs to the response span and stores them in sample.teacher_log_probs
  4. During training, the KL penalty is computed from the stored teacher log-probs and applied to advantages
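
The reward-function step above boils down to posting each sample to the teacher server and reading back token-level log-probs. A minimal sketch, assuming the teacher exposes SGLang's /generate endpoint; the exact field names (return_logprob, logprob_start_len) should be verified against your server version:

```python
import json
from urllib import request

def build_teacher_request(prompt: str, response: str) -> dict:
    """Build a scoring request for the teacher's /generate endpoint.

    Illustrative payload: generate no new tokens, just return per-token
    log-probs over the full prompt+response.
    """
    return {
        "text": prompt + response,
        "sampling_params": {"max_new_tokens": 0, "temperature": 0.0},
        "return_logprob": True,
        "logprob_start_len": 0,
    }

def query_teacher_logprobs(url: str, prompt: str, response: str) -> dict:
    # POST the sample to the teacher server and return the parsed JSON,
    # which carries the token-level log-probs in its meta information.
    data = json.dumps(build_teacher_request(prompt, response)).encode()
    req = request.Request(url, data=data,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)
```

The post-processing step then trims the returned log-probs to the response span before storing them on the sample.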
Configuration:
OPD_ARGS=(
  --use-opd
  --opd-type sglang
  --opd-kl-coef 1.0
  --custom-rm-path slime.rollout.on_policy_distillation.reward_func
  --custom-reward-post-process-path slime.rollout.on_policy_distillation.post_process_rewards
  --rm-url http://<TEACHER_IP>:<TEACHER_PORT>/generate
)

ray job submit --address="http://127.0.0.1:8265" \
  -- python3 train.py \
  "${MODEL_ARGS[@]}" \
  "${OPD_ARGS[@]}"

Megatron Mode

The teacher model is loaded directly into Megatron via --opd-teacher-load. Teacher log-probs are computed during the training forward pass. When to use:
  • The teacher has the same architecture as the student/reference model
  • The teacher fits in GPU memory alongside the student
How it works:
  1. The teacher model is loaded as an additional Megatron model during initialization
  2. During the training forward pass, the teacher model computes log-probs for each sample
  3. The KL penalty is computed inline and applied to advantages
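
Because the teacher's full logits are available during the training forward pass, Megatron mode can compute the exact per-token reverse KL rather than a sampled estimate. A minimal sketch over a toy vocabulary in plain Python (the real code operates on batched GPU tensors):

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - lse for l in logits]

def token_kl(teacher_logits, student_logits):
    """Exact per-token D_KL(P_teacher || P_student) from full logits."""
    t_lp = log_softmax(teacher_logits)
    s_lp = log_softmax(student_logits)
    return sum(math.exp(t) * (t - s) for t, s in zip(t_lp, s_lp))

# Identical distributions give zero KL; any mismatch gives a positive
# penalty that is subtracted from the advantage, scaled by --opd-kl-coef.
```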
Configuration:
OPD_ARGS=(
  --use-opd
  --opd-type megatron
  --opd-kl-coef 1.0
  --opd-teacher-load /path/to/teacher_torch_dist
)

ray job submit --address="http://127.0.0.1:8265" \
  -- python3 train.py \
  "${MODEL_ARGS[@]}" \
  "${OPD_ARGS[@]}"
The teacher checkpoint must be in Megatron format (torch_dist or torch). Convert from HuggingFace format using:
python tools/convert_hf_to_torch_dist.py \
  --hf-checkpoint /path/to/teacher_hf \
  --save /path/to/teacher_torch_dist

Running the Examples

Complete example scripts are provided in examples/on_policy_distillation/.

SGLang Teacher Example

Step 1: Download Models and Data

hf download Qwen/Qwen3-32B --local-dir /root/Qwen3-32B
hf download Qwen/Qwen3-8B --local-dir /root/Qwen3-8B
hf download --repo-type dataset zhuzilin/dapo-math-17k --local-dir /root/dapo-math-17k
Step 2: Convert Student Model

cd /root/slime
source scripts/models/qwen3-8B.sh

PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
    "${MODEL_ARGS[@]}" \
    --hf-checkpoint /root/Qwen3-8B \
    --save /root/Qwen3-8B_torch_dist
Step 3: Run Training

bash examples/on_policy_distillation/run-qwen3-8B-opd.sh

Megatron Teacher Example

Step 1: Convert Both Models

Convert both student and teacher models to Megatron format:
# Convert student
python tools/convert_hf_to_torch_dist.py \
    --hf-checkpoint /root/Qwen3-8B \
    --save /root/Qwen3-8B_torch_dist

# Convert teacher
python tools/convert_hf_to_torch_dist.py \
    --hf-checkpoint /root/Qwen3-32B \
    --save /root/Qwen3-32B_torch_dist
Step 2: Run Training

bash examples/on_policy_distillation/run-qwen3-8B-opd-megatron.sh

Preliminary Results

Starting from a Qwen3-8B-Base model SFT-ed on part of the OpenThoughts3-1.2M dataset, on-policy distillation with a Qwen3-32B teacher on the remaining data yields:
| Configuration | Pass@1 |
|---|---|
| Qwen3-8B-Base + SFT | 76% |
| Qwen3-8B-Base + SFT + On-Policy Distillation | 94% |
This represents an 18 percentage point improvement from distillation, demonstrating the effectiveness of learning from a larger teacher during RL training.

Combining with Other Techniques

OPD is orthogonal to other RL techniques and can be combined freely:

OPD + GRPO

ADVANTAGE_ARGS=(
  --advantage-estimator grpo
  --use-kl-loss
  --kl-loss-coef 0.01
)

OPD_ARGS=(
  --use-opd
  --opd-type megatron
  --opd-kl-coef 1.0
  --opd-teacher-load /path/to/teacher
)

OPD + PPO

ADVANTAGE_ARGS=(
  --advantage-estimator ppo
  --value-model-load /path/to/value_model
  --eps-clip 0.2
)

OPD_ARGS=(
  --use-opd
  --opd-type sglang
  --opd-kl-coef 1.0
  --rm-url http://teacher:10090/generate
)

OPD + REINFORCE++

ADVANTAGE_ARGS=(
  --advantage-estimator reinforce_pp
  --use-advantage-normalization
)

OPD_ARGS=(
  --use-opd
  --opd-type megatron
  --opd-kl-coef 0.5
  --opd-teacher-load /path/to/teacher
)

Tuning the KL Coefficient

The --opd-kl-coef parameter controls the strength of the distillation signal:

Low (0.1-0.5)

Weak distillation signal. Student focuses more on the RL reward signal. Use when teacher and student are similar.

Medium (0.5-2.0)

Balanced distillation. Good default for most settings. Student learns from both teacher and environment.

High (2.0+)

Strong distillation signal. Student strongly mimics teacher. Use when teacher is much better than student.
  1. Start with --opd-kl-coef 1.0
  2. Monitor both RL reward and teacher-student KL divergence
  3. If student diverges from teacher too quickly, increase coefficient
  4. If student doesn’t learn from environment, decrease coefficient
  5. Fine-tune based on validation performance
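
Steps 3 and 4 of this loop can be sketched as a toy adjustment rule (all names and thresholds here are illustrative, not slime options):

```python
def adjust_opd_coef(coef, kl_trend, reward_trend, step=2.0):
    """Toy coefficient-adjustment rule mirroring the tuning steps above.

    kl_trend > 0 means teacher-student KL is rising (student diverging);
    reward_trend <= 0 means the RL reward is flat or falling.
    """
    if kl_trend > 0:
        return coef * step  # diverging from teacher: strengthen distillation
    if reward_trend <= 0:
        return coef / step  # not learning from environment: weaken it
    return coef
```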

Architecture Comparison

┌─────────────────────────────────────┐
│  Rollout Phase                      │
├─────────────────────────────────────┤
│  Student Model (SGLang)             │
│    ↓ generates samples              │
│  Teacher Model (SGLang)             │
│    ↓ computes logprobs              │
│  Store teacher_log_probs in sample  │
└─────────────────────────────────────┘

┌─────────────────────────────────────┐
│  Training Phase                     │
├─────────────────────────────────────┤
│  Student Model (Megatron)           │
│    ↓ forward pass                   │
│  Compute KL(teacher || student)     │
│    ↓ from stored teacher_log_probs  │
│  Apply OPD penalty to advantages    │
│    ↓                                │
│  Backward pass & update             │
└─────────────────────────────────────┘

Best Practices

Select a teacher that is:
  • Significantly better than the student (2-4x larger is typical)
  • Trained on similar or the same domain
  • Compatible with your architecture (for Megatron mode)
A teacher that’s too small won’t provide benefit; one that’s too large may be impractical.
Track the KL divergence between teacher and student over time:
  • Should decrease initially as student learns
  • Should stabilize at a low value
  • Sudden increases indicate issues
Log this metric alongside your RL rewards.
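
A simple way to catch the "sudden increases" mentioned above is to compare the latest KL value against a recent moving average (an illustrative monitor, not a slime built-in):

```python
def kl_health_check(kl_history, window=10, spike_factor=2.0):
    """Return "spike" if the latest KL jumps above spike_factor times
    the average of the previous `window` values, else "ok"."""
    if len(kl_history) <= window:
        return "ok"  # not enough history yet
    baseline = sum(kl_history[-window - 1:-1]) / window
    return "spike" if kl_history[-1] > spike_factor * baseline else "ok"
```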
If your teacher has a different architecture (e.g., different tokenizer, different layer count), you must use SGLang mode. Megatron mode requires identical architectures.
SGLang mode uses more network bandwidth but less GPU memory. Megatron mode uses more GPU memory but is faster. Choose based on your constraints.

Troubleshooting

Teacher log-probs are missing (SGLang mode)
Cause: Teacher server not configured correctly, or the custom reward function is not being called.
Solution:
  • Verify the teacher server is running: curl http://teacher:10090/health
  • Check that --rm-url points to the teacher server
  • Ensure --custom-rm-path is set correctly
Out-of-memory errors (Megatron mode)
Cause: Not enough GPU memory for both student and teacher models.
Solutions:
  • Switch to SGLang mode with an external teacher server
  • Increase GPU count or use larger GPUs
  • Enable CPU offloading for the teacher (if supported)
  • Use a smaller teacher model
No improvement from distillation
Cause: --opd-kl-coef too low, or teacher and student too similar.
Solutions:
  • Increase --opd-kl-coef to 2.0 or higher
  • Verify the teacher is actually better on a validation set
  • Check that teacher log-probs are being computed correctly
Student ignores the RL reward
Cause: --opd-kl-coef too high, overwhelming the RL signal.
Solutions:
  • Decrease --opd-kl-coef to 0.5 or lower
  • Ensure the RL reward signal is strong enough
  • Check that advantages are being computed correctly

Background

On-policy distillation builds on several key ideas:
  • Distillation: Learning from a teacher model (Hinton et al., 2015)
  • On-policy learning: Training on the student’s own distribution
  • KL penalties in RL: Constraining policy updates (Schulman et al., 2017)
For more details, see the slime paper and related publications on policy distillation in RL.
