
Overview

The training API provides the main entry points for running RL training loops, creating training models (actor and critic), and managing the training lifecycle.

Main Training Function

train()

The main training loop that orchestrates rollout generation, model training, and evaluation.
from slime.train import train
from slime.utils.arguments import parse_args

args = parse_args()
train(args)
Training Loop Steps:
  1. Configure logger and allocate GPUs via placement groups
  2. Initialize tracking (wandb/tensorboard)
  3. Create rollout manager with inference engines
  4. Create actor and critic models
  5. Execute training loop:
    • Generate rollout data
    • Train critic model (if enabled)
    • Train actor model
    • Update weights in inference engines
    • Periodically save checkpoints and run evaluation
Key Features:
  • Supports synchronous and asynchronous training modes
  • Optional weight offloading for memory efficiency
  • Configurable evaluation intervals
  • Checkpoint saving with async support
Parameters:
  • args (Namespace, required): Parsed arguments containing all training configuration. Use parse_args() to create.
Source: train.py:9, train_async.py:10
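The loop structure described above can be sketched in plain Python. This is an illustrative stand-in, not slime's actual implementation: the helper event names ("rollout", "train_actor", etc.) and the interval parameters are assumptions made for the sketch.

```python
# Hedged sketch of the train() control flow: rollout -> optional critic
# update -> actor update -> weight sync -> periodic save/eval.
def train_loop(num_rollouts, use_critic=False, save_interval=5, eval_interval=5):
    history = []
    for rollout_id in range(num_rollouts):
        history.append(("rollout", rollout_id))           # 1. generate rollout data
        if use_critic:
            history.append(("train_critic", rollout_id))  # 2. critic update (if enabled)
        history.append(("train_actor", rollout_id))       # 3. policy update
        history.append(("update_weights", rollout_id))    # 4. push weights to inference engines
        if (rollout_id + 1) % save_interval == 0:
            history.append(("save", rollout_id))          # 5a. periodic checkpoint
        if (rollout_id + 1) % eval_interval == 0:
            history.append(("eval", rollout_id))          # 5b. periodic evaluation
    return history
```

The key ordering constraint is that weights are pushed to the inference engines only after the actor update, so the next rollout samples from the freshest policy.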

Model Creation

create_training_models()

Creates and initializes actor and critic models with proper placement groups.
from slime.ray.placement_group import create_training_models

actor_model, critic_model = create_training_models(args, pgs, rollout_manager)
Parameters:
  • args (Namespace, required): Training arguments
  • pgs (dict, required): Dictionary containing placement groups for actor, critic, and rollout:
{
    "actor": (pg, bundle_indices, gpu_ids),
    "critic": (pg, bundle_indices, gpu_ids),  # if args.use_critic
    "rollout": (pg, bundle_indices, gpu_ids)
}
  • rollout_manager (RolloutManager, required): Initialized rollout manager instance
Returns: tuple[RayTrainGroup, RayTrainGroup | None]
  • actor_model: Ray actor group for policy training
  • critic_model: Ray actor group for value function training (None if not using critic)
Source: slime/ray/placement_group.py:132
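A minimal sketch of the pgs shape the documentation describes, with placeholder strings standing in for real Ray placement group objects. The validate_pgs helper is hypothetical, written here only to make the expected dict contract concrete.

```python
# Illustrative pgs dict matching the documented (pg, bundle_indices, gpu_ids)
# tuple shape; "pg0"/"pg1" are placeholders for Ray PlacementGroup objects.
pgs = {
    "actor": ("pg0", [0, 1, 2, 3], [0, 1, 2, 3]),
    "critic": None,  # when args.use_critic is False
    "rollout": ("pg1", [4, 5], [4, 5]),
}

def validate_pgs(pgs, use_critic):
    """Check that every required placement group is present."""
    required = ["actor", "rollout"] + (["critic"] if use_critic else [])
    return all(pgs.get(key) is not None for key in required)
```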

Training Model Classes

MegatronTrainRayActor

Ray actor for Megatron-based distributed training.
class MegatronTrainRayActor(TrainRayActor):
    def init(self, args, role, with_ref=False, with_opd_teacher=False):
        """Initialize training actor"""
    
    def async_train(self, rollout_id, rollout_data_ref):
        """Asynchronously train on rollout data"""
    
    def save_model(self, rollout_id, force_sync=False):
        """Save model checkpoint"""
    
    def update_weights(self):
        """Update weights in inference engines"""
Key Methods:
init
method
Initialize the training actor with model, optimizer, and optional reference/teacher models.
Parameters:
  • args (Namespace): Training configuration
  • role (str): Either "actor" or "critic"
  • with_ref (bool): Whether to load reference model for KL penalty
  • with_opd_teacher (bool): Whether to load teacher model for on-policy distillation
Returns: int - Starting rollout ID
async_train
method
Train the model on rollout data asynchronously.
Parameters:
  • rollout_id (int): Current rollout step
  • rollout_data_ref (ray.ObjectRef): Reference to rollout data in Ray object store
Returns: Ray ObjectRef for async execution
save_model
method
Save model checkpoint to disk.
Parameters:
  • rollout_id (int): Current rollout step
  • force_sync (bool): Whether to force synchronous save (default: False)
update_weights
method
Update weights in rollout inference engines from training model. Uses efficient weight transfer mechanisms (tensor-based for colocated, distributed for separate).
Source: slime/backends/megatron_utils/actor.py:45
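The actor lifecycle above can be sketched as a plain-Python driver. This stand-in class is hypothetical: in slime these are remote Ray calls returning ObjectRefs, and the real init loads models and optimizer state; here the methods only record what they would do.

```python
# Hedged sketch of the actor lifecycle: init returns a starting rollout id
# (useful when resuming from a checkpoint), then each step trains on rollout
# data and pushes updated weights to the inference engines.
class TrainActorSketch:
    def init(self, start_rollout_id=0):
        self.rollout_id = start_rollout_id
        self.log = []
        return self.rollout_id

    def async_train(self, rollout_id, rollout_data):
        self.log.append(("train", rollout_id, len(rollout_data)))

    def update_weights(self):
        self.log.append(("update_weights",))

actor = TrainActorSketch()
start = actor.init()
for rid in range(start, start + 2):
    actor.async_train(rid, rollout_data=[1, 2, 3])
    actor.update_weights()
```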

FSDPTrainRayActor

Ray actor for FSDP (Fully Sharded Data Parallel) based training using pure HuggingFace models.
class FSDPTrainRayActor(TrainRayActor):
    def init(self, args, role, with_ref=False, with_opd_teacher=False):
        """Initialize FSDP training actor"""
    
    def train(self, rollout_id, rollout_data):
        """Train on rollout data"""
    
    def save(self, rollout_id, force_sync=False):
        """Save FSDP checkpoint"""
Features:
  • Native HuggingFace model support with FSDP2 wrapping
  • Optional CPU offloading for memory efficiency
  • Compatible with gradient checkpointing
  • Supports AdamW optimizer
Source: slime/backends/fsdp_utils/actor.py:34

Placement Groups

create_placement_groups()

Create Ray placement groups for distributing actor, critic, and rollout engines across GPUs.
from slime.ray.placement_group import create_placement_groups

pgs = create_placement_groups(args)
Parameters:
  • args (Namespace, required): Must contain:
  • actor_num_nodes: Number of nodes for actor
  • actor_num_gpus_per_node: GPUs per node for actor
  • critic_num_nodes: Number of nodes for critic (if using)
  • critic_num_gpus_per_node: GPUs per node for critic
  • rollout_num_gpus: Total GPUs for rollout engines
  • colocate: Whether to colocate training and inference
Returns: dict
Dictionary with keys:
  • "actor": Tuple of (placement_group, bundle_indices, gpu_ids)
  • "critic": Tuple of (placement_group, bundle_indices, gpu_ids) or None
  • "rollout": Tuple of (placement_group, bundle_indices, gpu_ids)
Source: slime/ray/placement_group.py:79
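The effect of the colocate flag on GPU layout can be sketched as follows. This mirrors the documented behavior (colocated training and inference share GPUs; otherwise rollout gets its own disjoint range), not slime's exact bundle-assignment logic, and plan_gpu_layout is a hypothetical helper name.

```python
# Sketch: with colocate=True, rollout engines reuse the actor's GPUs;
# with colocate=False, rollout GPUs come after the actor's range.
def plan_gpu_layout(actor_num_nodes, actor_num_gpus_per_node, rollout_num_gpus, colocate):
    actor_gpus = list(range(actor_num_nodes * actor_num_gpus_per_node))
    if colocate:
        rollout_gpus = actor_gpus[:rollout_num_gpus]
    else:
        start = len(actor_gpus)
        rollout_gpus = list(range(start, start + rollout_num_gpus))
    return {"actor": actor_gpus, "rollout": rollout_gpus}
```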

Training Configuration

Key Arguments

Model Architecture:
  • --actor-num-nodes: Number of nodes for actor training
  • --actor-num-gpus-per-node: GPUs per node for actor
  • --critic-num-nodes: Number of nodes for critic (optional)
  • --use-critic: Enable critic model for PPO
Training Strategy:
  • --colocate: Colocate training and inference on same GPUs
  • --offload-train: Offload training model to CPU during rollout
  • --offload-rollout: Offload rollout model during training
  • --true-on-policy-mode: Enable strict on-policy training
Optimization:
  • --lr: Learning rate (default: 1e-6)
  • --clip-grad: Gradient clipping value (default: 1.0)
  • --global-batch-size: Global batch size across all ranks
  • --micro-batch-size: Micro batch size per GPU
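A minimal argparse sketch of the flags listed above. slime's real parse_args() defines many more options and may wire these differently; the defaults shown for --lr and --clip-grad follow the documentation text, and the batch-size defaults are illustrative assumptions.

```python
import argparse

def build_parser():
    # Illustrative subset of slime's training arguments.
    p = argparse.ArgumentParser("slime-train-sketch")
    p.add_argument("--actor-num-nodes", type=int, default=1)
    p.add_argument("--actor-num-gpus-per-node", type=int, default=8)
    p.add_argument("--use-critic", action="store_true")
    p.add_argument("--colocate", action="store_true")
    p.add_argument("--lr", type=float, default=1e-6)
    p.add_argument("--clip-grad", type=float, default=1.0)
    p.add_argument("--global-batch-size", type=int, default=256)
    p.add_argument("--micro-batch-size", type=int, default=1)
    return p

args = build_parser().parse_args(["--colocate", "--lr", "2e-6"])
```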