
Overview

slime implements multiple RL algorithms for LLM post-training, each with different trade-offs between simplicity, sample efficiency, and stability. All algorithms share the same infrastructure but differ in how they compute advantages and returns.

Algorithm Selection

# Select algorithm via advantage estimator
--advantage-estimator {grpo,gspo,ppo,reinforce_plus_plus,reinforce_plus_plus_baseline}

GRPO (Group Relative Policy Optimization)

Paper: GRPO: Group Relative Policy Optimization
Best for: Simple setup, no value network required, works well with group-based rewards

Algorithm Description

GRPO computes advantages by comparing rewards within groups of responses generated from the same prompt. This eliminates the need for a value network. From ppo_utils.py:201-208:
def get_grpo_returns(rewards: torch.Tensor, kl: list[torch.Tensor]):
    """Compute GRPO returns by broadcasting scalar rewards"""
    returns = []
    for i in range(len(rewards)):
        # Each token gets the same reward value
        returns.append(torch.ones_like(kl[i]) * rewards[i])
    return returns
From loss.py:448-452:
if args.advantage_estimator == "grpo":
    rewards = torch.tensor(rewards, dtype=torch.float32)
    returns = get_grpo_returns(rewards, kl)
    advantages = [r for r in returns]  # Returns = Advantages
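In the snippet above, the scalar rewards are assumed to be group-normalized before they reach `get_grpo_returns`. As an illustrative sketch (the function name and placement are assumptions, not slime's code), the group-relative normalization that gives GRPO its name looks like:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, n_samples_per_prompt: int) -> torch.Tensor:
    """Normalize rewards within each group of responses to the same prompt.

    rewards: flat tensor of shape [num_prompts * n_samples_per_prompt],
    ordered so that consecutive entries belong to the same prompt.
    """
    groups = rewards.view(-1, n_samples_per_prompt)
    mean = groups.mean(dim=-1, keepdim=True)
    std = groups.std(dim=-1, keepdim=True)
    # z-score within each group; epsilon guards against zero-variance groups
    return ((groups - mean) / (std + 1e-6)).view(-1)

# Two prompts, 4 samples each: advantages are within-group z-scores,
# so a group where every response got the same reward contributes nothing
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 3.0, 3.0, 3.0, 3.0])
advs = group_relative_advantages(rewards, n_samples_per_prompt=4)
```

This is why the reward function only needs to rank responses meaningfully within a group, not across the whole batch.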

Configuration

GRPO_ARGS=(
    --advantage-estimator grpo
    
    # Number of responses per prompt (required for grouping)
    --n-samples-per-prompt 8
    
    # PPO clipping
    --eps-clip 0.2
    --eps-clip-high 0.28
    
    # No KL penalty (rewards already account for divergence)
    --kl-loss-coef 0.00
)

When to Use GRPO

Use GRPO when:
  • You want a simple setup without value networks
  • You can sample multiple responses per prompt efficiently
  • Your reward function can meaningfully compare responses
  • You want faster iteration (no critic training)
Avoid GRPO when:
  • You need dense per-token rewards
  • Your reward function is noisy across groups
  • You want maximum sample efficiency

PPO (Proximal Policy Optimization)

Paper: Proximal Policy Optimization Algorithms
Best for: Maximum sample efficiency, stable training with value network

Algorithm Description

PPO uses a value network (critic) to estimate expected returns and computes advantages via Generalized Advantage Estimation (GAE). From loss.py:455-466:
elif args.advantage_estimator == "ppo":
    # Construct per-token rewards from KL and terminal reward
    old_rewards = rewards
    rewards = []
    kl_coef = -args.kl_coef
    for reward, k in zip(old_rewards, kl):
        k *= kl_coef
        if cp_rank == 0:
            k[-1] += reward  # Add terminal reward to last token
        rewards.append(k)
    
    # GAE for advantages
    advantages, returns = get_advantages_and_returns_batch(
        total_lengths, response_lengths, values, rewards, 
        args.gamma, args.lambd
    )
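The batched helper above hides the GAE recursion itself: delta_t = r_t + gamma * V(t+1) - V(t), then A_t = delta_t + gamma * lambda * A(t+1), accumulated right to left. A minimal single-sequence sketch (illustrative, not slime's batched implementation):

```python
import torch

def gae(rewards: torch.Tensor, values: torch.Tensor, gamma: float, lambd: float):
    """Single-sequence GAE. `values` carries one extra bootstrap entry V_T
    (zero at episode end, as in LLM RL where the response terminates)."""
    T = rewards.size(0)
    advantages = torch.zeros(T)
    last_adv = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last_adv = delta + gamma * lambd * last_adv
        advantages[t] = last_adv
    returns = advantages + values[:T]  # targets for the value loss
    return advantages, returns

# 3-token response with a terminal reward on the last token
rewards = torch.tensor([0.0, 0.0, 1.0])
values = torch.tensor([0.2, 0.4, 0.6, 0.0])  # includes bootstrap V_T = 0
advs, rets = gae(rewards, values, gamma=1.0, lambd=0.95)
```

With gamma = 1.0 (the default above), there is no temporal discounting; lambda trades bias against variance in the advantage estimate.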

Configuration

PPO_ARGS=(
    --advantage-estimator ppo
    
    # Enable critic (value network)
    --use-critic
    --critic-num-nodes 1
    --critic-num-gpus-per-node 4
    
    # GAE parameters
    --gamma 1.0      # Discount factor
    --lambd 0.95     # GAE lambda
    
    # PPO clipping
    --eps-clip 0.2
    --value-clip 0.2
    
    # KL penalty
    --kl-coef 0.01
)

When to Use PPO

Use PPO when:
  • You want maximum sample efficiency
  • You have enough GPU resources for actor + critic
  • You need stable training for long runs
  • You want per-token credit assignment
Avoid PPO when:
  • You’re resource-constrained (need extra GPUs for critic)
  • You want faster iteration speed
  • Your task doesn’t benefit from dense rewards

GSPO (Group-Based Sequence-Level Policy Optimization)

Paper: GSPO
Best for: Balancing GRPO’s simplicity with sequence-level credit assignment

Algorithm Description

GSPO extends GRPO by using sequence-level KL divergence instead of per-token KL. From ppo_utils.py:95-121:
def compute_gspo_kl(
    full_log_probs: list[torch.Tensor],
    full_old_log_probs: list[torch.Tensor],
    local_log_probs: list[torch.Tensor],
    loss_masks: list[torch.Tensor],
) -> torch.Tensor:
    """Compute sequence-level KL and broadcast to all tokens"""
    ppo_kl = [
        ((old_logprob - log_prob) * loss_mask).sum() / 
        torch.clamp_min(loss_mask.sum(), 1)
        for log_prob, old_logprob, loss_mask in 
        zip(full_log_probs, full_old_log_probs, loss_masks)
    ]
    # Expand sequence-level KL to all tokens
    ppo_kl = [
        kl.expand_as(log_prob) 
        for kl, log_prob in zip(ppo_kl, local_log_probs)
    ]
    return torch.cat(ppo_kl, dim=0)
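The practical effect is that every token in a response shares a single importance ratio, which equals the geometric mean of the per-token ratios. A toy check of that identity (values are illustrative):

```python
import torch

# Token log-probs for one 3-token response (illustrative values)
log_probs = torch.tensor([-1.0, -2.0, -0.5])       # current policy
old_log_probs = torch.tensor([-1.2, -1.8, -0.7])   # behavior policy
loss_mask = torch.ones(3)

# Sequence-level KL: masked mean of (old - new), as in compute_gspo_kl
seq_kl = ((old_log_probs - log_probs) * loss_mask).sum() / torch.clamp_min(loss_mask.sum(), 1)

# Every token in the sequence shares this one value
per_token_kl = seq_kl.expand_as(log_probs)

# The implied importance ratio exp(-seq_kl) is the geometric mean
# of the per-token ratios exp(new - old)
seq_ratio = (-seq_kl).exp()
geo_mean = (log_probs - old_log_probs).mean().exp()
```

Because the ratio is length-normalized, one badly mispredicted token cannot blow up the ratio for the whole sequence the way it can with per-token ratios.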

Configuration

GSPO_ARGS=(
    --advantage-estimator gspo
    
    # Grouping (like GRPO)
    --n-samples-per-prompt 8
    
    # PPO clipping
    --eps-clip 0.2
    --eps-clip-high 0.28
    
    # Sequence-level KL
    --kl-loss-type low_var_kl
)

When to Use GSPO

Use GSPO when:
  • You want GRPO’s simplicity but better credit assignment
  • You care about sequence-level coherence
  • You want to reduce variance compared to GRPO

Reinforce++

Paper: Reinforce++
Best for: Discount-aware training with temporal credit assignment

Algorithm Description

Reinforce++ computes discounted returns with per-token KL penalties. From ppo_utils.py:211-278:
def get_reinforce_plus_plus_returns(
    rewards: torch.Tensor,
    kl: list[torch.Tensor],
    loss_masks: list[torch.Tensor],
    response_lengths: list[int],
    total_lengths: list[int],
    kl_coef: float,
    gamma: float,
) -> list[torch.Tensor]:
    """Compute discounted returns for Reinforce++"""
    
    final_returns = []
    for i in range(len(rewards)):
        # 1. Gather full response (handle CP)
        full_kl_response = all_gather_with_cp(kl[i], total_lengths[i], response_lengths[i])
        
        # 2. Construct per-token rewards
        full_mask = loss_masks[i]
        masked_kl = full_kl_response * full_mask
        token_level_rewards = -kl_coef * masked_kl
        
        # 3. Add terminal reward
        last_idx = full_mask.nonzero(as_tuple=True)[0][-1]
        token_level_rewards[last_idx] += rewards[i]
        
        # 4. Compute discounted returns
        returns_for_seq = torch.zeros_like(token_level_rewards)
        running_return = 0.0
        for t in reversed(range(token_level_rewards.size(0))):
            running_return = token_level_rewards[t] + gamma * running_return
            returns_for_seq[t] = running_return
        
        # 5. Slice back to local chunk (for CP)
        local_returns = slice_log_prob_with_cp(returns_for_seq, ...)
        final_returns.append(local_returns)
    
    return final_returns
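Isolating step 4, the discounted-return recursion behaves as follows (a minimal sketch, not slime's CP-aware code):

```python
import torch

def discounted_returns(token_rewards: torch.Tensor, gamma: float) -> torch.Tensor:
    """Right-to-left discounted cumulative sum, as in step 4 above."""
    returns = torch.zeros_like(token_rewards)
    running = 0.0
    for t in reversed(range(token_rewards.size(0))):
        running = token_rewards[t] + gamma * running
        returns[t] = running
    return returns

# A single terminal reward of 1.0 on a 4-token response, no KL penalty:
# earlier tokens receive geometrically discounted credit (0.99^3, 0.99^2, 0.99, 1)
rets = discounted_returns(torch.tensor([0.0, 0.0, 0.0, 1.0]), gamma=0.99)
```

This is the mechanism behind temporal credit assignment: with gamma < 1, tokens closer to the reward receive more credit, whereas GRPO gives every token identical credit.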

Configuration

REINFORCE_ARGS=(
    --advantage-estimator reinforce_plus_plus
    
    # Discount factor (crucial for credit assignment)
    --gamma 0.99
    
    # KL penalty
    --kl-coef 0.01
    
    # PPO clipping
    --eps-clip 0.2
)

When to Use Reinforce++

Use Reinforce++ when:
  • You want temporal credit assignment
  • Your task benefits from discounting (e.g., multi-step reasoning)
  • You want to balance exploration and exploitation over time

Algorithm Comparison

| Algorithm | Value Network | Sample Efficiency | Complexity | Best Use Case |
|---|---|---|---|---|
| GRPO | No | Medium | Low | Fast iteration, simple setup |
| PPO | Yes | High | High | Long training runs, maximum efficiency |
| GSPO | No | Medium-High | Medium | Sequence-level coherence |
| Reinforce++ | No | Medium-High | Medium | Multi-step reasoning tasks |
| Reinforce++ Baseline | No | Medium | Medium | GRPO + discounting |

Advanced Features

Advantage Normalization

Advantage normalization reduces variance and stabilizes training.
# Enable advantage whitening across data-parallel group
--normalize-advantages
From loss.py:504-557:
if args.normalize_advantages:
    all_advs = torch.cat(advantages)
    all_masks = torch.cat(loss_masks)
    
    dp_group = mpu.get_data_parallel_group()
    whitened_advs_flat = distributed_masked_whiten(
        all_advs,
        all_masks,
        process_group=dp_group,
        shift_mean=True,  # Center to zero mean
    )
    
    advantages = list(torch.split(whitened_advs_flat, chunk_lengths))
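`distributed_masked_whiten` all-reduces the masked statistics across the data-parallel group; a single-process sketch of the same whitening (illustrative names, no distributed reduction):

```python
import torch

def masked_whiten(values: torch.Tensor, mask: torch.Tensor, shift_mean: bool = True) -> torch.Tensor:
    """Whiten `values` using only positions where mask == 1.

    Single-process sketch: the distributed version computes the same
    mean/variance but reduces them over the data-parallel group first.
    """
    n = torch.clamp_min(mask.sum(), 1)
    mean = (values * mask).sum() / n
    var = ((values - mean) ** 2 * mask).sum() / n
    whitened = (values - mean) * torch.rsqrt(var + 1e-8)
    if not shift_mean:
        whitened = whitened + mean  # rescale variance but keep the mean
    return whitened

advs = torch.tensor([1.0, 2.0, 3.0, 100.0])
mask = torch.tensor([1.0, 1.0, 1.0, 0.0])  # last position is padding, excluded
white = masked_whiten(advs, mask)
```

The mask matters: without it, padding tokens (or the outlier above) would corrupt the statistics used to whiten the real advantages.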

KL Loss Types

From ppo_utils.py:12-51:
if kl_loss_type == "k1":
    kl = log_ratio
Simplest approximation, fastest to compute.
elif kl_loss_type == "k2":
    kl = log_ratio**2 / 2.0
Better approximation, still fast.
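The low_var_kl type used in the GSPO recipe above commonly refers to the k3 estimator, exp(log_ratio) - 1 - log_ratio, which is non-negative with low bias and variance; treat the exact formula here as an assumption about slime rather than a quote from it. A quick side-by-side of the three estimators:

```python
import torch

log_ratio = torch.tensor([0.5, -0.3, 0.1])  # per-token log(pi_new / pi_old)

k1 = log_ratio                              # unbiased, high variance, can be negative
k2 = log_ratio ** 2 / 2.0                   # always non-negative, biased
k3 = log_ratio.exp() - 1.0 - log_ratio      # non-negative, low bias and variance
```

A KL estimate that can go negative (k1) can push the penalty term in the wrong direction on individual tokens, which is why the non-negative estimators are often preferred as loss terms.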

PPO Clipping Strategies

From ppo_utils.py:124-148:
def compute_policy_loss(ppo_kl, advantages, eps_clip, eps_clip_high, eps_clip_c=None):
    ratio = (-ppo_kl).exp()
    pg_losses1 = -ratio * advantages
    pg_losses2 = -ratio.clamp(1 - eps_clip, 1 + eps_clip_high) * advantages
    clip_pg_losses1 = torch.maximum(pg_losses1, pg_losses2)
    # Fraction of tokens where clipping was active (logged as a diagnostic)
    clipfrac = torch.gt(pg_losses2, pg_losses1).float()
    
    # Dual-clip PPO (optional)
    if eps_clip_c is not None:
        pg_losses3 = -eps_clip_c * advantages
        clip_pg_losses2 = torch.min(pg_losses3, clip_pg_losses1)
        pg_losses = torch.where(advantages < 0, clip_pg_losses2, clip_pg_losses1)
    else:
        pg_losses = clip_pg_losses1
    
    return pg_losses, clipfrac
Standard clipping:
--eps-clip 0.2        # Lower bound: 1 - 0.2 = 0.8
--eps-clip-high 0.28  # Upper bound: 1 + 0.28 = 1.28
Dual-clip PPO (aggressive clipping for negative advantages):
--eps-clip 0.2
--eps-clip-high 0.28
--eps-clip-c 3.0  # Extra clipping for advantages < 0

OPSM (Off-Policy Sequence Masking)

OPSM masks sequences that have diverged too far from the behavior policy.
# Enable OPSM
--use-opsm
--opsm-delta 0.1  # KL threshold for masking
From ppo_utils.py:54-92:
def compute_opsm_mask(args, full_log_probs, full_old_log_probs, 
                       advantages, loss_masks):
    opsm_mask_list = []
    
    for full_log_prob, full_old_log_prob, advantage, loss_mask in zip(...):
        # Sequence-level KL
        seq_kl = ((full_old_log_prob - full_log_prob) * loss_mask).sum() / \
                 torch.clamp_min(loss_mask.sum(), 1)
        
        # Mask if: advantage < 0 AND seq_kl > threshold
        mask = ((advantage < 0) & (seq_kl > args.opsm_delta)).float()
        opsm_mask_list.append(1 - mask)
    
    return torch.cat(opsm_mask_list)
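The rule in `compute_opsm_mask` reduces to a per-sequence predicate: drop a sequence only when it is both off-policy (seq_kl > delta) and has a negative advantage. A toy check (values are illustrative):

```python
# Per-sequence (advantage, seq_kl) pairs and the OPSM keep-mask rule
opsm_delta = 0.1
sequences = [(-1.0, 0.20),   # harmful and off-policy -> masked out
             (-1.0, 0.05),   # harmful but near-policy -> kept
             ( 1.0, 0.20)]   # off-policy but helpful  -> kept
keep = [0.0 if (adv < 0 and kl > opsm_delta) else 1.0 for adv, kl in sequences]
```

Off-policy sequences with positive advantages are deliberately kept: the clipped objective already bounds how much they can move the policy, while masking them would waste useful signal.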

On-Policy Distillation (OPD)

OPD allows distilling from a teacher model while training with RL.
OPD_ARGS=(
    --use-opd
    --opd-type megatron  # or sglang
    --opd-kl-coef 0.1    # Weight for reverse KL penalty
)
From loss.py:359-397:
def apply_opd_kl_to_advantages(args, rollout_data, advantages, student_log_probs):
    """Add reverse KL penalty to advantages"""
    teacher_log_probs = rollout_data.get("teacher_log_probs")
    
    reverse_kls = []
    for i, adv in enumerate(advantages):
        # Reverse KL: D_KL(π_student || π_teacher)
        reverse_kl = student_log_probs[i] - teacher_log_probs[i]
        advantages[i] = adv - args.opd_kl_coef * reverse_kl
        reverse_kls.append(reverse_kl)
    
    rollout_data["opd_reverse_kl"] = reverse_kls

Choosing an Algorithm

1. Start with GRPO

For most tasks, start with GRPO for fast iteration and simplicity.
--advantage-estimator grpo
--n-samples-per-prompt 8
2. Try GSPO for Better Credit Assignment

If GRPO works but you want better sequence-level coherence:
--advantage-estimator gspo
3. Scale Up with PPO

If you need maximum sample efficiency and have resources:
--advantage-estimator ppo
--use-critic
--gamma 1.0
--lambd 0.95
4. Use Reinforce++ for Multi-Step Tasks

If your task involves multi-step reasoning:
--advantage-estimator reinforce_plus_plus
--gamma 0.99

Training Loop

Learn how algorithms fit into the training cycle

Rollout & Reward

Understand reward model integration
