High-Level Architecture

slime is built around a three-module architecture that separates training, inference, and data management concerns. This design enables efficient RL scaling by connecting Megatron-LM for training with SGLang for high-throughput rollout generation.

Core Modules

Training Module (Megatron)

The training module handles the main RL training process using Megatron-LM as the backend.
Key Responsibilities:
  • Parameter updates using actor/critic models
  • Data consumption from the Data Buffer
  • Weight synchronization to rollout engines
  • Checkpoint saving and loading
Implementation Details: From train.py:64-94, the training loop follows this pattern:
for rollout_id in range(args.start_rollout_id, args.num_rollout):
    # 1. Generate rollout data
    rollout_data_ref = ray.get(rollout_manager.generate.remote(rollout_id))
    
    # 2. Train on the data
    if args.use_critic:
        critic_train_handle = critic_model.async_train(rollout_id, rollout_data_ref)
        if rollout_id >= args.num_critic_only_steps:
            ray.get(actor_model.async_train(rollout_id, rollout_data_ref))
        ray.get(critic_train_handle)
    else:
        ray.get(actor_model.async_train(rollout_id, rollout_data_ref))
    
    # 3. Save checkpoints periodically
    if should_run_periodic_action(rollout_id, args.save_interval, ...):
        save(rollout_id)
    
    # 4. Sync weights to rollout engines
    actor_model.update_weights()
Parallelism Support: The training module supports multiple parallelism strategies:
  • Tensor Parallelism (TP): Split model tensors across GPUs
  • Pipeline Parallelism (PP): Split model layers across GPUs
  • Data Parallelism (DP): Replicate model across GPUs
  • Context Parallelism (CP): Split long sequences across GPUs
  • Expert Parallelism (EP): For MoE models, split experts across GPUs
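These degrees compose multiplicatively: the GPUs not consumed by model parallelism form the data-parallel dimension. A minimal sketch of that arithmetic (the function and argument names are illustrative, not slime's API):

```python
def data_parallel_size(world_size, tp, pp, cp=1):
    """Compute the data-parallel degree left after TP/PP/CP.

    Assumes Megatron-style parallelism, where world_size must be an
    exact multiple of tp * pp * cp.
    """
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by tp*pp*cp={model_parallel}"
        )
    return world_size // model_parallel

# e.g. 32 GPUs with TP=4 and PP=2 leave DP=4
print(data_parallel_size(32, tp=4, pp=2))  # -> 4
```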

Rollout Module (SGLang + Router)

The rollout module generates new training data by running inference on the current policy.
Key Responsibilities:
  • High-throughput text generation using SGLang
  • Multi-engine load balancing via sgl-router
  • Reward model evaluation
  • Dynamic sampling and filtering
Architecture Components: From placement_group.py:79-119, the rollout module uses Ray placement groups:
def create_placement_groups(args):
    """Create placement groups for actor and rollout engines."""
    
    if args.colocate:
        # Training and inference share GPUs
        num_gpus = args.actor_num_nodes * args.actor_num_gpus_per_node
        rollout_offset = 0
    else:
        # Separate GPU allocation
        num_gpus = (args.actor_num_nodes * args.actor_num_gpus_per_node + 
                   args.rollout_num_gpus)
        rollout_offset = args.actor_num_nodes * args.actor_num_gpus_per_node
SGLang Router: slime uses sgl-router to schedule requests across multiple SGLang servers:
  • DP Size: Calculated as rollout_num_gpus / rollout_num_gpus_per_engine
  • Load Balancing: Supports round-robin, consistent hashing, and custom policies
  • Session Affinity: Maintains KV cache across multi-turn interactions
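For example, 8 rollout GPUs with 2 GPUs per engine yield a DP size of 4, and the router distributes requests across those 4 servers. The round-robin policy can be sketched as follows (the class name and worker URLs are hypothetical; the real sgl-router is a separate service with additional cache-aware policies):

```python
import itertools

class RoundRobinRouter:
    """Minimal round-robin scheduler over SGLang worker URLs.

    Illustrative only -- shows the policy, not sgl-router's implementation.
    """
    def __init__(self, worker_urls):
        self._cycle = itertools.cycle(worker_urls)

    def pick(self):
        # Each call returns the next worker in a fixed rotation.
        return next(self._cycle)

router = RoundRobinRouter(["http://w0:30000", "http://w1:30000"])
print([router.pick() for _ in range(4)])
# -> ['http://w0:30000', 'http://w1:30000', 'http://w0:30000', 'http://w1:30000']
```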

Data Buffer

The data buffer bridges the rollout and training modules, decoupling data generation from data consumption. It is responsible for:
  • Managing prompt datasets and sampling strategies
  • Storing custom data and metadata
  • Implementing dynamic sampling filters
  • Caching partial rollouts for efficiency
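A minimal sketch of such a buffer, with a cache that replays aborted partial rollouts before drawing fresh prompts (the class and method names are hypothetical, not slime's actual Buffer API):

```python
import random
from collections import deque

class DataBuffer:
    """Toy prompt buffer with a partial-rollout cache."""

    def __init__(self, prompts):
        self._prompts = list(prompts)
        self._partial = deque()  # aborted rollouts, reused first

    def get_samples(self, n):
        # Drain cached partial rollouts before sampling fresh prompts,
        # so interrupted generations are not wasted.
        out = [self._partial.popleft() for _ in range(min(n, len(self._partial)))]
        out += random.sample(self._prompts, n - len(out))
        return out

    def add_partial(self, samples):
        self._partial.extend(samples)

buf = DataBuffer(["p1", "p2", "p3", "p4"])
buf.add_partial(["cached_rollout"])
print(buf.get_samples(3)[0])  # -> cached_rollout
```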
Data Flow:
# From sglang_rollout.py:351-442
async def generate_rollout_async(args, rollout_id, data_source):
    target_data_size = args.rollout_batch_size
    data = []
    
    while len(data) < target_data_size:
        # 1. Fetch samples from buffer
        samples = data_source(args.over_sampling_batch_size)
        state.submit_generate_tasks(samples)
        
        # 2. Wait for generation to finish
        done, state.pendings = await asyncio.wait(
            state.pendings, return_when=asyncio.FIRST_COMPLETED
        )
        
        # 3. Apply dynamic filters
        for task in done:
            group = task.result()
            filter_output = call_dynamic_filter(dynamic_filter, args, group)
            if filter_output.keep:
                data.append(group)
    
    # 4. Return data and aborted samples
    aborted_samples = await abort(args, rollout_id)
    return RolloutFnTrainOutput(samples=data), aborted_samples
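A dynamic filter is a callable that inspects a finished group and decides whether to keep it. As a hypothetical example (the FilterOutput shape and reward field are assumptions based on the snippet above, not slime's exact interface), a filter that drops zero-signal groups whose samples all received the same reward:

```python
from dataclasses import dataclass

@dataclass
class FilterOutput:
    keep: bool

def zero_advantage_filter(args, group):
    """Drop groups in which every sample got the same reward.

    Such groups carry no learning signal for group-relative advantage
    estimators, so filtering them avoids wasted training steps.
    """
    rewards = [sample["reward"] for sample in group]
    return FilterOutput(keep=len(set(rewards)) > 1)

print(zero_advantage_filter(None, [{"reward": 1.0}, {"reward": 1.0}]).keep)  # -> False
print(zero_advantage_filter(None, [{"reward": 0.0}, {"reward": 1.0}]).keep)  # -> True
```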

Deployment Modes

Disaggregated Mode

Training and rollout use separate GPU pools:
python3 train.py \
    --actor-num-nodes 1 \
    --actor-num-gpus-per-node 4 \
    --rollout-num-gpus 4
Advantages:
  • Maximum throughput (training and rollout run in parallel)
  • Better GPU utilization
  • Easier scaling of individual components

Colocated Mode

Training and rollout share the same GPUs:
python3 train.py \
    --actor-num-nodes 1 \
    --actor-num-gpus-per-node 8 \
    --colocate \
    --sglang-mem-fraction-static 0.8
Advantages:
  • Reduced GPU requirements
  • Lower memory transfer overhead
  • Suitable for smaller models or limited GPU clusters
In colocated mode, set --sglang-mem-fraction-static 0.8 to avoid GPU OOM, since Megatron still occupies GPU memory when SGLang initializes, before its state is offloaded.
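The arithmetic behind that setting can be sketched as follows (the 80 GB device size and function name are illustrative assumptions, not slime defaults):

```python
def sglang_static_gb(gpu_mem_gb, mem_fraction_static):
    """GB that SGLang statically reserves for weights and KV cache."""
    return gpu_mem_gb * mem_fraction_static

# On an 80 GB GPU, a fraction of 0.8 reserves 64 GB for SGLang,
# leaving ~16 GB of headroom while Megatron's memory is released.
print(sglang_static_gb(80, 0.8))  # -> 64.0
```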

Resource Management

slime uses Ray placement groups to manage GPU allocation:
# From placement_group.py:41-76
def _create_placement_group(num_gpus):
    bundles = [{"GPU": 1, "CPU": 1} for _ in range(num_gpus)]
    pg = placement_group(bundles, strategy="PACK")
    
    # Get GPU IDs and sort by node IP and GPU ID
    gpu_ids = ray.get([actor.get_ip_and_gpu_id.remote() for actor in info_actors])
    sorted_bundle_infos = sorted(bundle_infos, key=sort_key)
    
    return pg, reordered_bundle_indices, reordered_gpu_ids
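The sorting step matters because it makes GPU ranks contiguous per node. A sketch of a possible sort key (the tuple layout and helper name are assumptions, not slime's code):

```python
import socket
import struct

def sort_key(info):
    """Order bundles so GPUs on the same node stay contiguous.

    `info` is assumed to be (bundle_index, node_ip, gpu_id). The IP is
    converted to an integer so lexicographic artifacts
    ('10.0.0.9' > '10.0.0.10' as strings) cannot reorder nodes.
    """
    bundle_index, node_ip, gpu_id = info
    ip_as_int = struct.unpack("!I", socket.inet_aton(node_ip))[0]
    return (ip_as_int, gpu_id)

infos = [(0, "10.0.0.10", 1), (1, "10.0.0.9", 0), (2, "10.0.0.10", 0)]
print(sorted(infos, key=sort_key))
# -> [(1, '10.0.0.9', 0), (2, '10.0.0.10', 0), (0, '10.0.0.10', 1)]
```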

Multi-Node Training

For large-scale MoE models, slime supports multi-node distributed training:
# Node 0 (HEAD)
ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 8

# Other Nodes
ray start --address=${MASTER_ADDR}:6379 --num-gpus 8

# Submit job from node 0
ray job submit --address="http://127.0.0.1:8265" \
    --runtime-env-json='{"env_vars": {"PYTHONPATH": "/root/Megatron-LM/"}}' \
    -- python3 train.py --actor-num-nodes 8 ...

Training Loop

Learn about the Data Sampling → Weight Update cycle

Rollout & Reward

Understand rollout generation and reward models
