Overview
slime implements multiple RL algorithms for LLM post-training, each with different trade-offs between simplicity, sample efficiency, and stability. All algorithms share the same infrastructure but differ in how they compute advantages and returns.
Algorithm Selection
# Select algorithm via advantage estimator
--advantage-estimator {grpo,gspo,ppo,reinforce_plus_plus,reinforce_plus_plus_baseline}
GRPO (Group Relative Policy Optimization)
Algorithm Description
GRPO computes advantages by comparing rewards within groups of responses generated from the same prompt. This eliminates the need for a value network.
From ppo_utils.py:201-208:
def get_grpo_returns(rewards: torch.Tensor, kl: list[torch.Tensor]):
    """Compute GRPO returns by broadcasting scalar rewards"""
    returns = []
    for i in range(len(rewards)):
        # Each token gets the same reward value
        returns.append(torch.ones_like(kl[i]) * rewards[i])
    return returns
From loss.py:448-452:
if args.advantage_estimator == "grpo":
    rewards = torch.tensor(rewards, dtype=torch.float32)
    returns = get_grpo_returns(rewards, kl)
    advantages = [r for r in returns]  # Returns = Advantages
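In isolation, the broadcast step can be sketched as a self-contained toy (this is an illustrative re-implementation, not slime's actual module):

```python
import torch

def get_grpo_returns(rewards: torch.Tensor, kl: list[torch.Tensor]) -> list[torch.Tensor]:
    # Each response gets one scalar reward, broadcast over all of its tokens
    return [torch.ones_like(k) * r for r, k in zip(rewards, kl)]

rewards = torch.tensor([1.0, -0.5])    # two responses, one scalar reward each
kl = [torch.zeros(3), torch.zeros(2)]  # per-token KL placeholders (lengths 3 and 2)
returns = get_grpo_returns(rewards, kl)
# returns[0] is [1.0, 1.0, 1.0]; returns[1] is [-0.5, -0.5]
```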
Configuration
Basic GRPO
GRPO + KL Loss
GRPO_ARGS=(
--advantage-estimator grpo
# Number of responses per prompt (required for grouping)
--n-samples-per-prompt 8
# PPO clipping
--eps-clip 0.2
--eps-clip-high 0.28
# No KL penalty (rewards already account for divergence)
--kl-loss-coef 0.00
)
GRPO_ARGS=(
--advantage-estimator grpo
--n-samples-per-prompt 8
# Enable KL divergence monitoring
--use-kl-loss
--kl-loss-coef 0.00 # 0 = monitor only, don't add to loss
--kl-loss-type low_var_kl
# Entropy regularization
--entropy-coef 0.00
)
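The grouping that `--n-samples-per-prompt` enables is what makes rewards comparable: each response is scored relative to the other responses for the same prompt. A minimal sketch of that group-relative normalization (the function name is hypothetical; slime's actual normalization lives elsewhere in the codebase):

```python
import torch

def group_normalize(rewards: torch.Tensor, n_samples_per_prompt: int, eps: float = 1e-6):
    # View the flat reward tensor as (num_prompts, n_samples_per_prompt)
    groups = rewards.view(-1, n_samples_per_prompt)
    mean = groups.mean(dim=1, keepdim=True)
    std = groups.std(dim=1, keepdim=True)
    # Each response is scored relative to its own prompt group
    return ((groups - mean) / (std + eps)).view(-1)
```

Because the baseline is the group mean, no value network is needed to estimate expected reward.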
When to Use GRPO
Use GRPO when:
You want a simple setup without value networks
You can sample multiple responses per prompt efficiently
Your reward function can meaningfully compare responses
You want faster iteration (no critic training)
Avoid GRPO when:
You need dense per-token rewards
Your reward function is noisy across groups
You want maximum sample efficiency
PPO (Proximal Policy Optimization)
Algorithm Description
PPO uses a value network (critic) to estimate expected returns and computes advantages via Generalized Advantage Estimation (GAE).
From loss.py:455-466:
elif args.advantage_estimator == "ppo":
    # Construct per-token rewards from KL and terminal reward
    old_rewards = rewards
    rewards = []
    kl_coef = -args.kl_coef
    for reward, k in zip(old_rewards, kl):
        k *= kl_coef
        if cp_rank == 0:
            k[-1] += reward  # Add terminal reward to last token
        rewards.append(k)
    # GAE for advantages
    advantages, returns = get_advantages_and_returns_batch(
        total_lengths, response_lengths, values, rewards,
        args.gamma, args.lambd
    )
Configuration
Actor-Critic PPO
Critic Warm-Start
PPO_ARGS=(
--advantage-estimator ppo
# Enable critic (value network)
--use-critic
--critic-num-nodes 1
--critic-num-gpus-per-node 4
# GAE parameters
--gamma 1.0 # Discount factor
--lambd 0.95 # GAE lambda
# PPO clipping
--eps-clip 0.2
--value-clip 0.2
# KL penalty
--kl-coef 0.01
)
PPO_ARGS=(
--advantage-estimator ppo
--use-critic
# Train only critic for first N rollouts
--num-critic-only-steps 10
# Then train both actor and critic
--gamma 1.0
--lambd 0.95
)
When to Use PPO
Use PPO when:
You want maximum sample efficiency
You have enough GPU resources for actor + critic
You need stable training for long runs
You want per-token credit assignment
Avoid PPO when:
You’re resource-constrained (need extra GPUs for critic)
You want faster iteration speed
Your task doesn’t benefit from dense rewards
GSPO (Group Sequence Policy Optimization)
Paper: GSPO. Best for: balancing GRPO’s simplicity with sequence-level credit assignment.
Algorithm Description
GSPO extends GRPO by using sequence-level KL divergence instead of per-token KL.
From ppo_utils.py:95-121:
def compute_gspo_kl(
    full_log_probs: list[torch.Tensor],
    full_old_log_probs: list[torch.Tensor],
    local_log_probs: list[torch.Tensor],
    loss_masks: list[torch.Tensor],
) -> torch.Tensor:
    """Compute sequence-level KL and broadcast to all tokens"""
    ppo_kl = [
        ((old_logprob - log_prob) * loss_mask).sum() /
        torch.clamp_min(loss_mask.sum(), 1)
        for log_prob, old_logprob, loss_mask in
        zip(full_log_probs, full_old_log_probs, loss_masks)
    ]
    # Expand sequence-level KL to all tokens
    ppo_kl = [
        kl.expand_as(log_prob)
        for kl, log_prob in zip(ppo_kl, local_log_probs)
    ]
    return torch.cat(ppo_kl, dim=0)
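The core pattern, shown for a single sequence (toy sketch with hypothetical names):

```python
import torch

def sequence_level_kl(log_probs: torch.Tensor, old_log_probs: torch.Tensor, mask: torch.Tensor):
    # One scalar KL estimate per sequence: masked mean of the token log-ratios
    seq_kl = ((old_log_probs - log_probs) * mask).sum() / mask.sum().clamp_min(1)
    # Broadcast that scalar back to every token position
    return seq_kl.expand_as(log_probs)

kl = sequence_level_kl(
    log_probs=torch.tensor([-2.0, -2.0]),
    old_log_probs=torch.tensor([-1.0, -1.0]),
    mask=torch.tensor([1.0, 1.0]),
)
# Every token carries the same sequence-level value
```

Because every token in a sequence shares one KL value, the clipping decision is effectively made per sequence rather than per token.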
Configuration
GSPO_ARGS=(
--advantage-estimator gspo
# Grouping (like GRPO)
--n-samples-per-prompt 8
# PPO clipping
--eps-clip 0.2
--eps-clip-high 0.28
# Sequence-level KL
--kl-loss-type low_var_kl
)
When to Use GSPO
You want GRPO’s simplicity but better credit assignment
You care about sequence-level coherence
You want to reduce variance compared to GRPO
Reinforce++
Paper: Reinforce++. Best for: discount-aware training with temporal credit assignment.
Algorithm Description
Reinforce++ computes discounted returns with per-token KL penalties.
From ppo_utils.py:211-278:
def get_reinforce_plus_plus_returns(
    rewards: torch.Tensor,
    kl: list[torch.Tensor],
    loss_masks: list[torch.Tensor],
    response_lengths: list[int],
    total_lengths: list[int],
    kl_coef: float,
    gamma: float,
) -> list[torch.Tensor]:
    """Compute discounted returns for Reinforce++"""
    final_returns = []
    for i in range(len(rewards)):
        # 1. Gather full response (handle CP)
        full_kl_response = all_gather_with_cp(kl[i], total_lengths[i], response_lengths[i])
        # 2. Construct per-token rewards
        full_mask = loss_masks[i]
        masked_kl = full_kl_response * full_mask
        token_level_rewards = -kl_coef * masked_kl
        # 3. Add terminal reward
        last_idx = full_mask.nonzero(as_tuple=True)[0][-1]
        token_level_rewards[last_idx] += rewards[i]
        # 4. Compute discounted returns
        returns_for_seq = torch.zeros_like(token_level_rewards)
        running_return = 0.0
        for t in reversed(range(token_level_rewards.size(0))):
            running_return = token_level_rewards[t] + gamma * running_return
            returns_for_seq[t] = running_return
        # 5. Slice back to local chunk (for CP)
        local_returns = slice_log_prob_with_cp(returns_for_seq, ...)
        final_returns.append(local_returns)
    return final_returns
Configuration
Reinforce++
Reinforce++ Baseline
REINFORCE_ARGS=(
--advantage-estimator reinforce_plus_plus
# Discount factor (crucial for credit assignment)
--gamma 0.99
# KL penalty
--kl-coef 0.01
# PPO clipping
--eps-clip 0.2
)
REINFORCE_ARGS=(
--advantage-estimator reinforce_plus_plus_baseline
# Group-wise baseline subtraction
--n-samples-per-prompt 8
# KL penalty
--kl-coef 0.01
)
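The baseline variant adds group-wise baseline subtraction on top of the discounted returns. Unlike GRPO, only the group mean is subtracted, with no division by the group standard deviation (illustrative sketch; the function name is hypothetical):

```python
import torch

def subtract_group_baseline(rewards: torch.Tensor, n_samples_per_prompt: int):
    # View flat rewards as (num_prompts, n_samples_per_prompt) groups
    groups = rewards.view(-1, n_samples_per_prompt)
    # The group mean acts as a critic-free baseline, reducing gradient variance
    return (groups - groups.mean(dim=1, keepdim=True)).view(-1)
```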
When to Use Reinforce++
You want temporal credit assignment
Your task benefits from discounting (e.g., multi-step reasoning)
You want to balance exploration and exploitation over time
Algorithm Comparison
| Algorithm | Value Network | Sample Efficiency | Complexity | Best Use Case |
|---|---|---|---|---|
| GRPO | No | Medium | Low | Fast iteration, simple setup |
| PPO | Yes | High | High | Long training runs, maximum efficiency |
| GSPO | No | Medium-High | Medium | Sequence-level coherence |
| Reinforce++ | No | Medium-High | Medium | Multi-step reasoning tasks |
| Reinforce++ Baseline | No | Medium | Medium | GRPO + discounting |
Advanced Features
Advantage Normalization
Advantage normalization reduces variance and stabilizes training.
# Enable advantage whitening across data-parallel group
--normalize-advantages
From loss.py:504-557:
if args.normalize_advantages:
    all_advs = torch.cat(advantages)
    all_masks = torch.cat(loss_masks)
    dp_group = mpu.get_data_parallel_group()
    whitened_advs_flat = distributed_masked_whiten(
        all_advs,
        all_masks,
        process_group=dp_group,
        shift_mean=True,  # Center to zero mean
    )
    advantages = list(torch.split(whitened_advs_flat, chunk_lengths))
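Ignoring the cross-rank reduction, the whitening itself amounts to the following sketch (the real `distributed_masked_whiten` additionally all-reduces the statistics across the data-parallel group):

```python
import torch

def masked_whiten(values: torch.Tensor, mask: torch.Tensor,
                  shift_mean: bool = True, eps: float = 1e-8):
    # Mean and variance are computed only over unmasked (response) tokens
    n = mask.sum().clamp_min(1)
    mean = (values * mask).sum() / n
    var = (((values - mean) ** 2) * mask).sum() / n
    whitened = (values - mean) * torch.rsqrt(var + eps)
    # shift_mean=True centers to zero mean; False keeps the original mean
    return whitened if shift_mean else whitened + mean
```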
KL Loss Types
From ppo_utils.py:12-51:
K1: Direct Log-Ratio
if kl_loss_type == "k1":
    kl = log_ratio
Simplest approximation, fastest to compute.
K2: Quadratic Approximation
elif kl_loss_type == "k2":
    kl = log_ratio ** 2 / 2.0
Better approximation, still fast.
K3 / Low-Variance KL (Recommended)
elif kl_loss_type in ["k3", "low_var_kl"]:
    log_ratio = -log_ratio
    kl = log_ratio.exp() - 1 - log_ratio
    kl = torch.clamp(kl, min=-10, max=10)  # Numerical stability
Non-negative, unbiased, lower variance. Recommended for most use cases.
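The three estimators can be compared side by side in a self-contained sketch of the logic above:

```python
import torch

def kl_estimate(log_ratio: torch.Tensor, kind: str) -> torch.Tensor:
    if kind == "k1":
        return log_ratio                 # unbiased, but can go negative
    if kind == "k2":
        return log_ratio ** 2 / 2.0      # always >= 0, but biased
    # k3 / low_var_kl: exp(-x) - 1 + x, always >= 0
    neg = -log_ratio
    return torch.clamp(neg.exp() - 1 - neg, min=-10, max=10)

x = torch.tensor([-0.5, 0.0, 0.5])
# k1 goes negative where the ratio flips sign; k3 stays non-negative everywhere
```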
PPO Clipping Strategies
From ppo_utils.py:124-148:
def compute_policy_loss(ppo_kl, advantages, eps_clip, eps_clip_high, eps_clip_c=None):
    ratio = (-ppo_kl).exp()
    pg_losses1 = -ratio * advantages
    pg_losses2 = -ratio.clamp(1 - eps_clip, 1 + eps_clip_high) * advantages
    clip_pg_losses1 = torch.maximum(pg_losses1, pg_losses2)
    clipfrac = torch.gt(pg_losses2, pg_losses1).float()  # fraction of clipped tokens
    # Dual-clip PPO (optional)
    if eps_clip_c is not None:
        pg_losses3 = -eps_clip_c * advantages
        clip_pg_losses2 = torch.min(pg_losses3, clip_pg_losses1)
        pg_losses = torch.where(advantages < 0, clip_pg_losses2, clip_pg_losses1)
    else:
        pg_losses = clip_pg_losses1
    return pg_losses, clipfrac
Standard clipping:
--eps-clip 0.2 # Lower bound: 1 - 0.2 = 0.8
--eps-clip-high 0.28 # Upper bound: 1 + 0.28 = 1.28
Dual-clip PPO (aggressive clipping for negative advantages):
--eps-clip 0.2
--eps-clip-high 0.28
--eps-clip-c 3.0 # Extra clipping for advantages < 0
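A standalone numeric check of the dual-clip behavior (a sketch mirroring compute_policy_loss above, with the probability ratio passed in directly rather than derived from ppo_kl):

```python
import torch

def dual_clip_loss(ratio: torch.Tensor, adv: torch.Tensor,
                   eps: float = 0.2, eps_high: float = 0.28, eps_c: float = 3.0):
    l1 = -ratio * adv
    l2 = -ratio.clamp(1 - eps, 1 + eps_high) * adv
    clipped = torch.maximum(l1, l2)
    # For negative advantages, additionally cap the loss at -eps_c * adv
    l3 = -eps_c * adv
    return torch.where(adv < 0, torch.minimum(l3, clipped), clipped)

# ratio = 5 with adv = -1: standard clipping yields loss 5; dual-clip caps it at 3,
# preventing a single far-off-policy token from dominating the gradient
loss = dual_clip_loss(torch.tensor([5.0]), torch.tensor([-1.0]))
```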
OPSM (Off-Policy Sequence Masking)
OPSM masks sequences that have diverged too far from the behavior policy.
# Enable OPSM
--use-opsm
--opsm-delta 0.1 # KL threshold for masking
From ppo_utils.py:54-92:
def compute_opsm_mask(args, full_log_probs, full_old_log_probs,
                      advantages, loss_masks):
    opsm_mask_list = []
    for full_log_prob, full_old_log_prob, advantage, loss_mask in zip(...):
        # Sequence-level KL
        seq_kl = ((full_old_log_prob - full_log_prob) * loss_mask).sum() / \
            torch.clamp_min(loss_mask.sum(), 1)
        # Mask if: advantage < 0 AND seq_kl > threshold
        mask = ((advantage < 0) & (seq_kl > args.opsm_delta)).float()
        opsm_mask_list.append(1 - mask)
    return torch.cat(opsm_mask_list)
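The masking rule, vectorized over precomputed sequence-level statistics (a toy sketch; the function and argument names are hypothetical):

```python
import torch

def opsm_keep_mask(advantages: torch.Tensor, seq_kls: torch.Tensor, delta: float = 0.1):
    # Drop sequences that are both disadvantaged and far off-policy; 1 = keep, 0 = drop
    drop = (advantages < 0) & (seq_kls > delta)
    return (~drop).float()

keep = opsm_keep_mask(
    advantages=torch.tensor([-1.0, -1.0, 1.0]),
    seq_kls=torch.tensor([0.5, 0.05, 0.5]),
)
# Only the first sequence (negative advantage AND KL above threshold) is masked out
```

Note that positive-advantage sequences are never masked, however far they drift: the rule only suppresses punishing responses the current policy would no longer produce.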
On-Policy Distillation (OPD)
OPD allows distilling from a teacher model while training with RL.
OPD_ARGS=(
--use-opd
--opd-type megatron # or sglang
--opd-kl-coef 0.1 # Weight for reverse KL penalty
)
From loss.py:359-397:
def apply_opd_kl_to_advantages(args, rollout_data, advantages, student_log_probs):
    """Add reverse KL penalty to advantages"""
    teacher_log_probs = rollout_data.get("teacher_log_probs")
    reverse_kls = []
    for i, adv in enumerate(advantages):
        # Reverse KL: D_KL(π_student || π_teacher)
        reverse_kl = student_log_probs[i] - teacher_log_probs[i]
        advantages[i] = adv - args.opd_kl_coef * reverse_kl
        reverse_kls.append(reverse_kl)
    rollout_data["opd_reverse_kl"] = reverse_kls
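Concretely, the per-token penalty is just the student-minus-teacher log-prob gap, scaled by the coefficient (toy numbers, not from slime):

```python
import torch

opd_kl_coef = 0.1
student_log_probs = torch.tensor([-1.0, -2.0])
teacher_log_probs = torch.tensor([-1.5, -1.0])
advantages = torch.tensor([1.0, 1.0])

# Per-token reverse-KL estimate: positive where the student assigns more
# probability than the teacher, negative where it assigns less
reverse_kl = student_log_probs - teacher_log_probs   # [0.5, -1.0]
penalized = advantages - opd_kl_coef * reverse_kl     # [0.95, 1.10]
```

Tokens where the student already exceeds the teacher get their advantage reduced, nudging the policy back toward the teacher distribution during the RL update.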
Choosing an Algorithm
Start with GRPO
For most tasks, start with GRPO for fast iteration and simplicity.
--advantage-estimator grpo
--n-samples-per-prompt 8
Try GSPO for Better Credit Assignment
If GRPO works but you want better sequence-level coherence:
--advantage-estimator gspo
Scale Up with PPO
If you need maximum sample efficiency and have resources:
--advantage-estimator ppo
--use-critic
--gamma 1.0
--lambd 0.95
Use Reinforce++ for Multi-Step Tasks
If your task involves multi-step reasoning:
--advantage-estimator reinforce_plus_plus
--gamma 0.99
Training Loop: learn how algorithms fit into the training cycle.
Rollout & Reward: understand reward model integration.