Overview

Slime supports multiple training backends for different model architectures and scaling requirements.

Backend Selection

Backends are selected via the --train-backend argument:
# Megatron-LM backend (default)
python train.py --train-backend megatron

# FSDP backend (HuggingFace native)
python train.py --train-backend fsdp

Megatron Backend

Overview

A Megatron-LM-based backend for large-scale distributed training with advanced parallelism strategies. Features:
  • Tensor Parallelism (TP)
  • Pipeline Parallelism (PP)
  • Data Parallelism (DP)
  • Context Parallelism (CP)
  • Sequence Parallelism (SP)
  • Virtual Pipeline Parallelism
  • Expert Parallelism for MoE models
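These degrees compose multiplicatively: the GPUs left over after tensor, pipeline, and context parallelism claim their groups form the data-parallel dimension. A minimal sketch of that arithmetic (illustrative only, not slime code):

```python
def data_parallel_size(world_size: int, tp: int, pp: int, cp: int = 1) -> int:
    """Remaining data-parallel degree once TP/PP/CP claim their GPU groups."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by tp * pp * cp")
    return world_size // model_parallel

# e.g. 32 GPUs with TP=4, PP=2 leaves DP=4
print(data_parallel_size(32, tp=4, pp=2))
```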

MegatronTrainRayActor

Main training actor for Megatron backend.
from slime.backends.megatron_utils.actor import MegatronTrainRayActor

class MegatronTrainRayActor(TrainRayActor):
    def init(self, args, role, with_ref=False, with_opd_teacher=False):
        """Initialize Megatron training"""
    
    def async_train(self, rollout_id, rollout_data_ref):
        """Train on rollout data"""
    
    def save_model(self, rollout_id, force_sync=False):
        """Save checkpoint"""
    
    def update_weights(self):
        """Update inference engine weights"""
Source: slime/backends/megatron_utils/actor.py:45

Key Configuration

--tensor-model-parallel-size (int, default: 1): Tensor parallelism degree
--pipeline-model-parallel-size (int, default: 1): Pipeline parallelism degree
--context-parallel-size (int, default: 1): Context parallelism for long sequences
--sequence-parallel (bool): Enable sequence parallelism
--num-layers (int, required): Number of transformer layers
--hidden-size (int, required): Hidden dimension size
--num-attention-heads (int, required): Number of attention heads
--qkv-format (str, default: "thd"): QKV tensor layout, "thd" or "bshd"
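Putting the flags above together, a launch for a hypothetical 32-layer dense model with 2-way tensor parallelism might look like this (model dimensions are illustrative, not a recommended configuration):

```shell
python train.py \
  --train-backend megatron \
  --tensor-model-parallel-size 2 \
  --pipeline-model-parallel-size 1 \
  --sequence-parallel \
  --num-layers 32 \
  --hidden-size 4096 \
  --num-attention-heads 32
```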

Supported Models

Dense Models:
  • LLaMA (1, 2, 3, 3.1)
  • Qwen2, Qwen2.5, Qwen3
  • GPT (GPT-OSS)
  • GLM4
MoE Models:
  • Qwen3-MoE
  • DeepSeekV3
  • GLM4-MoE
  • MIMO
Multimodal:
  • Qwen3-VL
  • Qwen3-Next (Vision)
Source: slime/backends/megatron_utils/

FSDP Backend

Overview

PyTorch FSDP2-based backend using native HuggingFace models. Features:
  • Fully Sharded Data Parallel (FSDP2)
  • Native HuggingFace model support
  • Optional CPU offloading
  • Gradient checkpointing
  • Mixed precision training

FSDPTrainRayActor

Training actor for FSDP backend.
from slime.backends.fsdp_utils.actor import FSDPTrainRayActor

class FSDPTrainRayActor(TrainRayActor):
    def init(self, args, role, with_ref=False, with_opd_teacher=False):
        """Initialize FSDP training"""
    
    def train(self, rollout_id, rollout_data):
        """Train on rollout data"""
    
    def save(self, rollout_id, force_sync=False):
        """Save FSDP checkpoint"""
Source: slime/backends/fsdp_utils/actor.py:34

Key Configuration

--fsdp-cpu-offload (bool, default: False): Enable FSDP CPU offloading for memory efficiency
--gradient-checkpointing (bool, default: False): Enable activation checkpointing
--optimizer (str, default: "adam"): Optimizer type (currently supports "adam")
--attn-implementation (str): Attention implementation (e.g., "flash_attention_2")
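Combining the flags above, a memory-lean FSDP launch might look like the following (illustrative values, not a recommended configuration):

```shell
python train.py \
  --train-backend fsdp \
  --fsdp-cpu-offload \
  --gradient-checkpointing \
  --attn-implementation flash_attention_2 \
  --optimizer adam
```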

Supported Models

Dense Models:
  • Any HuggingFace AutoModel compatible model
MoE Models:
  • Qwen3-MoE (with custom kernel support)
Source: slime/backends/fsdp_utils/

SGLang Integration

Overview

SGLang provides high-performance inference engines for rollout generation. Features:
  • RadixAttention prefix caching
  • Multi-engine data parallelism
  • Speculative decoding
  • FP8 quantization
  • Continuous batching

Configuration

--sglang-tp-size (int, default: 1): SGLang tensor parallelism size (same as --rollout-num-gpus-per-engine)
--sglang-dp-size (int): SGLang data parallelism size (number of engines)
--sglang-enable-torch-compile (bool, default: False): Enable torch.compile for SGLang
--sglang-speculative-algorithm (str): Speculative decoding algorithm (e.g., "eagle")
--sglang-speculative-num-draft-tokens (int, default: 4): Number of draft tokens for speculation
--sglang-enable-deterministic-inference (bool, default: False): Enable deterministic sampling
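As an example, the flags above can be combined to run a 4-GPU engine with EAGLE speculative decoding (values illustrative):

```shell
python train.py \
  --sglang-tp-size 4 \
  --sglang-speculative-algorithm eagle \
  --sglang-speculative-num-draft-tokens 4 \
  --sglang-enable-torch-compile
```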

External SGLang

Use external SGLang instances instead of auto-launching:
python train.py \
  --rollout-external \
  --rollout-external-engine-addrs \
    http://10.0.0.1:30000 \
    http://10.0.0.2:30000
Source: slime/backends/sglang_utils/, slime/utils/arguments.py:459-470

Weight Update Mechanisms

UpdateWeightFromDistributed

Distributed weight transfer for non-colocated setups.
from slime.backends.megatron_utils.update_weight import UpdateWeightFromDistributed

updater = UpdateWeightFromDistributed(
    args,
    model,
    weights_getter=lambda: model.state_dict(),
    model_name="qwen2"
)

updater.update_weights()  # Transfer to SGLang
Mechanism:
  1. Gather weights from training model
  2. Convert Megatron format to HuggingFace format
  3. Send to SGLang via HTTP chunked transfer
  4. Apply optional quantization (FP8)
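The steps above can be sketched as follows; `to_hf_name` and `send_chunk` are hypothetical stand-ins for slime's key-mapping and HTTP-transfer helpers, and the optional FP8 quantization step is omitted:

```python
def push_weights(state_dict, to_hf_name, send_chunk, chunk_size=2):
    """Toy version of the distributed update flow (not slime's API)."""
    # Steps 1-2: gather weights and rename Megatron keys to HuggingFace keys
    hf_state = {to_hf_name(name): tensor for name, tensor in state_dict.items()}
    # Step 3: send in fixed-size chunks (stands in for HTTP chunked transfer)
    items = sorted(hf_state.items())
    for i in range(0, len(items), chunk_size):
        send_chunk(dict(items[i:i + chunk_size]))

# Usage with stub tensors and a recording sender:
sent = []
push_weights(
    {"decoder.layers.0.weight": [1.0], "decoder.layers.1.weight": [2.0]},
    to_hf_name=lambda n: n.replace("decoder", "model"),
    send_chunk=sent.append,
    chunk_size=1,
)
```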
Source: slime/backends/megatron_utils/update_weight/update_weight_from_distributed.py

UpdateWeightFromTensor

Direct tensor transfer for colocated setups.
from slime.backends.megatron_utils.update_weight import UpdateWeightFromTensor

updater = UpdateWeightFromTensor(
    args,
    model,
    weights_getter=lambda: model.state_dict(),
    model_name="qwen2"
)

updater.update_weights()  # In-memory transfer
Mechanism:
  1. Get weight tensors from training model
  2. Convert to HuggingFace format
  3. Directly load into SGLang (shared memory)
Source: slime/backends/megatron_utils/update_weight/update_weight_from_tensor.py

Backend-Specific Features

Megatron Features

Virtual Pipeline Parallelism:
python train.py \
  --num-layers-per-virtual-pipeline-stage 2
Expert Parallelism (MoE):
python train.py \
  --expert-model-parallel-size 8 \
  --moe-router-topk 6
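Conceptually, --moe-router-topk controls how many experts the router selects per token from its score vector. A toy illustration of top-k routing (not slime's router implementation):

```python
def route_topk(scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

# With router scores for 4 experts and k=2, experts 3 and 1 are selected
print(route_topk([0.1, 0.7, 0.2, 0.9], k=2))
```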
Routing Replay:
# Training-time routing replay
python train.py --use-routing-replay

# Rollout-time routing replay
python train.py --use-rollout-routing-replay
Custom Model Provider:
python train.py \
  --custom-model-provider-path my_module.my_model_provider

FSDP Features

CPU Offloading:
python train.py \
  --train-backend fsdp \
  --fsdp-cpu-offload
Gradient Checkpointing:
python train.py \
  --train-backend fsdp \
  --gradient-checkpointing

Backend Comparison

| Feature | Megatron | FSDP |
| --- | --- | --- |
| Tensor Parallelism | Yes | No |
| Pipeline Parallelism | Yes | No |
| Context Parallelism | Yes | No |
| MoE Support | Yes | Limited |
| Native HF Models | Via conversion | Yes |
| Memory Efficiency | High (SP/CP) | Medium |
| Setup Complexity | High | Low |
| Best For | Large models (>70B) | Medium models (<70B) |

Choosing a Backend

Use Megatron when:
  • Training models >70B parameters
  • Need tensor/pipeline parallelism
  • Training MoE models with expert parallelism
  • Require maximum scalability
Use FSDP when:
  • Training models <70B parameters
  • Want simple HuggingFace integration
  • Prefer lower setup complexity
  • Don’t need advanced parallelism
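The guidance above can be condensed into a simple rule of thumb; the function below is illustrative only, not part of slime:

```python
def choose_backend(params_billion: float,
                   needs_expert_parallelism: bool = False,
                   needs_model_parallelism: bool = False) -> str:
    """Rough heuristic following the guidance above."""
    if params_billion > 70 or needs_expert_parallelism or needs_model_parallelism:
        return "megatron"
    return "fsdp"

print(choose_backend(7))    # small dense model -> fsdp
print(choose_backend(120))  # large model -> megatron
```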