Overview
slime implements multiple RL algorithms for LLM post-training, each with different trade-offs between simplicity, sample efficiency, and stability. All algorithms share the same infrastructure but differ in how they compute advantages and returns.
Algorithm Selection
# Select algorithm via advantage estimator
--advantage-estimator {grpo,gspo,ppo,reinforce_plus_plus,reinforce_plus_plus_baseline}
GRPO (Group Relative Policy Optimization)
Algorithm Description
GRPO computes advantages by comparing rewards within groups of responses generated from the same prompt. This eliminates the need for a value network.
From ppo_utils.py:201-208:
def get_grpo_returns(rewards: torch.Tensor, kl: list[torch.Tensor]):
    """Compute GRPO returns by broadcasting scalar rewards"""
    returns = []
    for i in range(len(rewards)):
        # Each token gets the same reward value
        returns.append(torch.ones_like(kl[i]) * rewards[i])
    return returns
From loss.py:448-452:
if args.advantage_estimator == "grpo":
    rewards = torch.tensor(rewards, dtype=torch.float32)
    returns = get_grpo_returns(rewards, kl)
    advantages = [r for r in returns]  # Returns = Advantages
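In isolation, the broadcast step can be sketched as a self-contained toy (this is an illustrative re-implementation, not slime's actual module):

```python
import torch

def get_grpo_returns(rewards: torch.Tensor, kl: list[torch.Tensor]) -> list[torch.Tensor]:
    # Each response gets one scalar reward, broadcast over all of its tokens
    return [torch.ones_like(k) * r for r, k in zip(rewards, kl)]

rewards = torch.tensor([1.0, -0.5])    # two responses, one scalar reward each
kl = [torch.zeros(3), torch.zeros(2)]  # per-token KL placeholders (lengths 3 and 2)
returns = get_grpo_returns(rewards, kl)
# returns[0] is [1.0, 1.0, 1.0]; returns[1] is [-0.5, -0.5]
```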
Configuration
Basic GRPO
GRPO + KL Loss
GRPO_ARGS=(
--advantage-estimator grpo
# Number of responses per prompt (required for grouping)
--n-samples-per-prompt 8
# PPO clipping
--eps-clip 0.2
--eps-clip-high 0.28
# No KL penalty (rewards already account for divergence)
--kl-loss-coef 0.00
)
GRPO_ARGS=(
--advantage-estimator grpo
--n-samples-per-prompt 8
# Enable KL divergence monitoring
--use-kl-loss
--kl-loss-coef 0.00 # 0 = monitor only, don't add to loss
--kl-loss-type low_var_kl
# Entropy regularization
--entropy-coef 0.00
)
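The grouping that `--n-samples-per-prompt` enables is what makes rewards comparable: each response is scored relative to the other responses for the same prompt. A minimal sketch of that group-relative normalization (the function name is hypothetical; slime's actual normalization lives elsewhere in the codebase):

```python
import torch

def group_normalize(rewards: torch.Tensor, n_samples_per_prompt: int, eps: float = 1e-6):
    # View the flat reward tensor as (num_prompts, n_samples_per_prompt)
    groups = rewards.view(-1, n_samples_per_prompt)
    mean = groups.mean(dim=1, keepdim=True)
    std = groups.std(dim=1, keepdim=True)
    # Each response is scored relative to its own prompt group
    return ((groups - mean) / (std + eps)).view(-1)
```

Because the baseline is the group mean, no value network is needed to estimate expected reward.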
When to Use GRPO
Use GRPO when:
You want a simple setup without value networks
You can sample multiple responses per prompt efficiently
Your reward function can meaningfully compare responses
You want faster iteration (no critic training)
Avoid GRPO when:
You need dense per-token rewards
Your reward function is noisy across groups
You want maximum sample efficiency
PPO (Proximal Policy Optimization)
Algorithm Description
PPO uses a value network (critic) to estimate expected returns and computes advantages via Generalized Advantage Estimation (GAE).
From loss.py:455-466:
elif args.advantage_estimator == "ppo":
    # Construct per-token rewards from KL and terminal reward
    old_rewards = rewards
    rewards = []
    kl_coef = -args.kl_coef
    for reward, k in zip(old_rewards, kl):
        k *= kl_coef
        if cp_rank == 0:
            k[-1] += reward  # Add terminal reward to last token
        rewards.append(k)
    # GAE for advantages
    advantages, returns = get_advantages_and_returns_batch(
        total_lengths, response_lengths, values, rewards,
        args.gamma, args.lambd
    )
Configuration
Actor-Critic PPO
Critic Warm-Start
PPO_ARGS=(
--advantage-estimator ppo
# Enable critic (value network)
--use-critic
--critic-num-nodes 1
--critic-num-gpus-per-node 4
# GAE parameters
--gamma 1.0 # Discount factor
--lambd 0.95 # GAE lambda
# PPO clipping
--eps-clip 0.2
--value-clip 0.2
# KL penalty
--kl-coef 0.01
)
PPO_ARGS=(
--advantage-estimator ppo
--use-critic
# Train only critic for first N rollouts
--num-critic-only-steps 10
# Then train both actor and critic
--gamma 1.0
--lambd 0.95
)
When to Use PPO
Use PPO when:
You want maximum sample efficiency
You have enough GPU resources for actor + critic
You need stable training for long runs
You want per-token credit assignment
Avoid PPO when:
You’re resource-constrained (need extra GPUs for critic)
You want faster iteration speed
Your task doesn’t benefit from dense rewards
GSPO (Group Sequence Policy Optimization)
Paper: GSPO. Best for: balancing GRPO’s simplicity with sequence-level credit assignment.
Algorithm Description
GSPO extends GRPO by using sequence-level KL divergence instead of per-token KL.
From ppo_utils.py:95-121:
def compute_gspo_kl(
    full_log_probs: list[torch.Tensor],
    full_old_log_probs: list[torch.Tensor],
    local_log_probs: list[torch.Tensor],
    loss_masks: list[torch.Tensor],
) -> torch.Tensor:
    """Compute sequence-level KL and broadcast to all tokens"""
    ppo_kl = [
        ((old_logprob - log_prob) * loss_mask).sum() /
        torch.clamp_min(loss_mask.sum(), 1)
        for log_prob, old_logprob, loss_mask in
        zip(full_log_probs, full_old_log_probs, loss_masks)
    ]
    # Expand sequence-level KL to all tokens
    ppo_kl = [
        kl.expand_as(log_prob)
        for kl, log_prob in zip(ppo_kl, local_log_probs)
    ]
    return torch.cat(ppo_kl, dim=0)
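The core pattern, shown for a single sequence (toy sketch with hypothetical names):

```python
import torch

def sequence_level_kl(log_probs: torch.Tensor, old_log_probs: torch.Tensor, mask: torch.Tensor):
    # One scalar KL estimate per sequence: masked mean of the token log-ratios
    seq_kl = ((old_log_probs - log_probs) * mask).sum() / mask.sum().clamp_min(1)
    # Broadcast that scalar back to every token position
    return seq_kl.expand_as(log_probs)

kl = sequence_level_kl(
    log_probs=torch.tensor([-2.0, -2.0]),
    old_log_probs=torch.tensor([-1.0, -1.0]),
    mask=torch.tensor([1.0, 1.0]),
)
# Every token carries the same sequence-level value
```

Because every token in a sequence shares one KL value, the clipping decision is effectively made per sequence rather than per token.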
Configuration
GSPO_ARGS=(
--advantage-estimator gspo
# Grouping (like GRPO)
--n-samples-per-prompt 8
# PPO clipping
--eps-clip 0.2
--eps-clip-high 0.28
# Sequence-level KL
--kl-loss-type low_var_kl
)
When to Use GSPO
You want GRPO’s simplicity but better credit assignment
You care about sequence-level coherence
You want to reduce variance compared to GRPO
Reinforce++
Paper: Reinforce++. Best for: discount-aware training with temporal credit assignment.
Algorithm Description
Reinforce++ computes discounted returns with per-token KL penalties.
From ppo_utils.py:211-278:
def get_reinforce_plus_plus_returns(
    rewards: torch.Tensor,
    kl: list[torch.Tensor],
    loss_masks: list[torch.Tensor],
    response_lengths: list[int],
    total_lengths: list[int],
    kl_coef: float,
    gamma: float,
) -> list[torch.Tensor]:
    """Compute discounted returns for Reinforce++"""
    final_returns = []
    for i in range(len(rewards)):
        # 1. Gather full response (handle CP)
        full_kl_response = all_gather_with_cp(kl[i], total_lengths[i], response_lengths[i])
        # 2. Construct per-token rewards
        full_mask = loss_masks[i]
        masked_kl = full_kl_response * full_mask
        token_level_rewards = -kl_coef * masked_kl
        # 3. Add terminal reward
        last_idx = full_mask.nonzero(as_tuple=True)[0][-1]
        token_level_rewards[last_idx] += rewards[i]
        # 4. Compute discounted returns
        returns_for_seq = torch.zeros_like(token_level_rewards)
        running_return = 0.0
        for t in reversed(range(token_level_rewards.size(0))):
            running_return = token_level_rewards[t] + gamma * running_return
            returns_for_seq[t] = running_return
        # 5. Slice back to local chunk (for CP)
        local_returns = slice_log_prob_with_cp(returns_for_seq, ...)
        final_returns.append(local_returns)
    return final_returns
Configuration
Reinforce++
Reinforce++ Baseline
REINFORCE_ARGS=(
--advantage-estimator reinforce_plus_plus
# Discount factor (crucial for credit assignment)
--gamma 0.99
# KL penalty
--kl-coef 0.01
# PPO clipping
--eps-clip 0.2
)
REINFORCE_ARGS=(
--advantage-estimator reinforce_plus_plus_baseline
# Group-wise baseline subtraction
--n-samples-per-prompt 8
# KL penalty
--kl-coef 0.01
)
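The baseline variant adds group-wise baseline subtraction on top of the discounted returns. Unlike GRPO, only the group mean is subtracted, with no division by the group standard deviation (illustrative sketch; the function name is hypothetical):

```python
import torch

def subtract_group_baseline(rewards: torch.Tensor, n_samples_per_prompt: int):
    # View flat rewards as (num_prompts, n_samples_per_prompt) groups
    groups = rewards.view(-1, n_samples_per_prompt)
    # The group mean acts as a critic-free baseline, reducing gradient variance
    return (groups - groups.mean(dim=1, keepdim=True)).view(-1)
```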
When to Use Reinforce++
You want temporal credit assignment
Your task benefits from discounting (e.g., multi-step reasoning)
You want to balance exploration and exploitation over time
Algorithm Comparison
| Algorithm | Value Network | Sample Efficiency | Complexity | Best Use Case |
|---|---|---|---|---|
| GRPO | No | Medium | Low | Fast iteration, simple setup |
| PPO | Yes | High | High | Long training runs, maximum efficiency |
| GSPO | No | Medium-High | Medium | Sequence-level coherence |
| Reinforce++ | No | Medium-High | Medium | Multi-step reasoning tasks |
| Reinforce++ Baseline | No | Medium | Medium | GRPO + discounting |
Advanced Features
Advantage Normalization
Advantage normalization reduces variance and stabilizes training.
# Enable advantage whitening across data-parallel group
--normalize-advantages
From loss.py:504-557:
if args.normalize_advantages:
    all_advs = torch.cat(advantages)
    all_masks = torch.cat(loss_masks)
    dp_group = mpu.get_data_parallel_group()
    whitened_advs_flat = distributed_masked_whiten(
        all_advs,
        all_masks,
        process_group=dp_group,
        shift_mean=True,  # Center to zero mean
    )
    advantages = list(torch.split(whitened_advs_flat, chunk_lengths))
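Ignoring the cross-rank reduction, the whitening itself amounts to the following sketch (the real `distributed_masked_whiten` additionally all-reduces the statistics across the data-parallel group):

```python
import torch

def masked_whiten(values: torch.Tensor, mask: torch.Tensor,
                  shift_mean: bool = True, eps: float = 1e-8):
    # Mean and variance are computed only over unmasked (response) tokens
    n = mask.sum().clamp_min(1)
    mean = (values * mask).sum() / n
    var = (((values - mean) ** 2) * mask).sum() / n
    whitened = (values - mean) * torch.rsqrt(var + eps)
    # shift_mean=True centers to zero mean; False keeps the original mean
    return whitened if shift_mean else whitened + mean
```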
KL Loss Types
From ppo_utils.py:12-51:
K1: Direct Log-Ratio
if kl_loss_type == "k1":
    kl = log_ratio
Simplest approximation, fastest to compute.
K2: Quadratic Approximation
elif kl_loss_type == "k2":
    kl = log_ratio ** 2 / 2.0
Better approximation, still fast.
K3 / Low-Variance KL (Recommended)
elif kl_loss_type in ["k3", "low_var_kl"]:
    log_ratio = -log_ratio
    kl = log_ratio.exp() - 1 - log_ratio
    kl = torch.clamp(kl, min=-10, max=10)  # Numerical stability
Non-negative, unbiased, lower variance. Recommended for most use cases.
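The three estimators can be compared side by side in a self-contained sketch of the logic above:

```python
import torch

def kl_estimate(log_ratio: torch.Tensor, kind: str) -> torch.Tensor:
    if kind == "k1":
        return log_ratio                 # unbiased, but can go negative
    if kind == "k2":
        return log_ratio ** 2 / 2.0      # always >= 0, but biased
    # k3 / low_var_kl: exp(-x) - 1 + x, always >= 0
    neg = -log_ratio
    return torch.clamp(neg.exp() - 1 - neg, min=-10, max=10)

x = torch.tensor([-0.5, 0.0, 0.5])
# k1 goes negative where the ratio flips sign; k3 stays non-negative everywhere
```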
PPO Clipping Strategies
From ppo_utils.py:124-148:
def compute_policy_loss(ppo_kl, advantages, eps_clip, eps_clip_high, eps_clip_c=None):
    ratio = (-ppo_kl).exp()
    pg_losses1 = -ratio * advantages
    pg_losses2 = -ratio.clamp(1 - eps_clip, 1 + eps_clip_high) * advantages
    clip_pg_losses1 = torch.maximum(pg_losses1, pg_losses2)
    clipfrac = torch.gt(pg_losses2, pg_losses1).float()  # fraction of clipped tokens
    # Dual-clip PPO (optional)
    if eps_clip_c is not None:
        pg_losses3 = -eps_clip_c * advantages
        clip_pg_losses2 = torch.min(pg_losses3, clip_pg_losses1)
        pg_losses = torch.where(advantages < 0, clip_pg_losses2, clip_pg_losses1)
    else:
        pg_losses = clip_pg_losses1
    return pg_losses, clipfrac
Standard clipping:
--eps-clip 0.2 # Lower bound: 1 - 0.2 = 0.8
--eps-clip-high 0.28 # Upper bound: 1 + 0.28 = 1.28
Dual-clip PPO (aggressive clipping for negative advantages):
--eps-clip 0.2
--eps-clip-high 0.28
--eps-clip-c 3.0 # Extra clipping for advantages < 0
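A standalone numeric check of the dual-clip behavior (a sketch mirroring compute_policy_loss above, with the probability ratio passed in directly rather than derived from ppo_kl):

```python
import torch

def dual_clip_loss(ratio: torch.Tensor, adv: torch.Tensor,
                   eps: float = 0.2, eps_high: float = 0.28, eps_c: float = 3.0):
    l1 = -ratio * adv
    l2 = -ratio.clamp(1 - eps, 1 + eps_high) * adv
    clipped = torch.maximum(l1, l2)
    # For negative advantages, additionally cap the loss at -eps_c * adv
    l3 = -eps_c * adv
    return torch.where(adv < 0, torch.minimum(l3, clipped), clipped)

# ratio = 5 with adv = -1: standard clipping yields loss 5; dual-clip caps it at 3,
# preventing a single far-off-policy token from dominating the gradient
loss = dual_clip_loss(torch.tensor([5.0]), torch.tensor([-1.0]))
```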
OPSM (Off-Policy Sequence Masking)
OPSM masks sequences that have diverged too far from the behavior policy.
# Enable OPSM
--use-opsm
--opsm-delta 0.1 # KL threshold for masking
From ppo_utils.py:54-92:
def compute_opsm_mask(args, full_log_probs, full_old_log_probs,
                      advantages, loss_masks):
    opsm_mask_list = []
    for full_log_prob, full_old_log_prob, advantage, loss_mask in zip(...):
        # Sequence-level KL
        seq_kl = ((full_old_log_prob - full_log_prob) * loss_mask).sum() / \
            torch.clamp_min(loss_mask.sum(), 1)
        # Mask if: advantage < 0 AND seq_kl > threshold
        mask = ((advantage < 0) & (seq_kl > args.opsm_delta)).float()
        opsm_mask_list.append(1 - mask)
    return torch.cat(opsm_mask_list)
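The masking rule, vectorized over precomputed sequence-level statistics (a toy sketch; the function and argument names are hypothetical):

```python
import torch

def opsm_keep_mask(advantages: torch.Tensor, seq_kls: torch.Tensor, delta: float = 0.1):
    # Drop sequences that are both disadvantaged and far off-policy; 1 = keep, 0 = drop
    drop = (advantages < 0) & (seq_kls > delta)
    return (~drop).float()

keep = opsm_keep_mask(
    advantages=torch.tensor([-1.0, -1.0, 1.0]),
    seq_kls=torch.tensor([0.5, 0.05, 0.5]),
)
# Only the first sequence (negative advantage AND KL above threshold) is masked out
```

Note that positive-advantage sequences are never masked, however far they drift: the rule only suppresses punishing responses the current policy would no longer produce.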
On-Policy Distillation (OPD)
OPD allows distilling from a teacher model while training with RL.
OPD_ARGS=(
--use-opd
--opd-type megatron # or sglang
--opd-kl-coef 0.1 # Weight for reverse KL penalty
)
From loss.py:359-397:
def apply_opd_kl_to_advantages(args, rollout_data, advantages, student_log_probs):
    """Add reverse KL penalty to advantages"""
    teacher_log_probs = rollout_data.get("teacher_log_probs")
    reverse_kls = []
    for i, adv in enumerate(advantages):
        # Reverse KL: D_KL(π_student || π_teacher)
        reverse_kl = student_log_probs[i] - teacher_log_probs[i]
        advantages[i] = adv - args.opd_kl_coef * reverse_kl
        reverse_kls.append(reverse_kl)
    rollout_data["opd_reverse_kl"] = reverse_kls
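Concretely, the per-token penalty is just the student-minus-teacher log-prob gap, scaled by the coefficient (toy numbers, not from slime):

```python
import torch

opd_kl_coef = 0.1
student_log_probs = torch.tensor([-1.0, -2.0])
teacher_log_probs = torch.tensor([-1.5, -1.0])
advantages = torch.tensor([1.0, 1.0])

# Per-token reverse-KL estimate: positive where the student assigns more
# probability than the teacher, negative where it assigns less
reverse_kl = student_log_probs - teacher_log_probs   # [0.5, -1.0]
penalized = advantages - opd_kl_coef * reverse_kl     # [0.95, 1.10]
```

Tokens where the student already exceeds the teacher get their advantage reduced, nudging the policy back toward the teacher distribution during the RL update.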
Choosing an Algorithm
Start with GRPO
For most tasks, start with GRPO for fast iteration and simplicity.
--advantage-estimator grpo
--n-samples-per-prompt 8
Try GSPO for Better Credit Assignment
If GRPO works but you want better sequence-level coherence:
--advantage-estimator gspo
Scale Up with PPO
If you need maximum sample efficiency and have resources:
--advantage-estimator ppo
--use-critic
--gamma 1.0
--lambd 0.95
Use Reinforce++ for Multi-Step Tasks
If your task involves multi-step reasoning:
--advantage-estimator reinforce_plus_plus
--gamma 0.99
Training Loop: learn how algorithms fit into the training cycle.
Rollout & Reward: understand reward model integration.