Overview

Slime supports multiple training backends for different model architectures and scaling requirements.

Backend Selection

Backends are selected via the --train-backend argument:
# Megatron-LM backend (default)
python train.py --train-backend megatron

# FSDP backend (HuggingFace native)
python train.py --train-backend fsdp
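
As an illustration of how a flag like this is typically wired up, here is a minimal argparse sketch. It is not Slime's actual argument code: `build_parser` and `select_backend` are hypothetical names, and the only details taken from this page are the flag name, its two values, and Megatron being the default.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser for illustration; Slime's real definitions may differ.
    parser = argparse.ArgumentParser(description="backend selection sketch")
    parser.add_argument(
        "--train-backend",
        choices=["megatron", "fsdp"],  # the two backends described on this page
        default="megatron",            # this page lists Megatron as the default
        help="Training backend to use",
    )
    return parser


def select_backend(argv: list[str]) -> str:
    """Return the backend name chosen on the command line."""
    args = build_parser().parse_args(argv)
    return args.train_backend


if __name__ == "__main__":
    print(select_backend(["--train-backend", "fsdp"]))
```

Restricting the flag with `choices` makes a mistyped backend name fail at parse time rather than later during actor setup.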

Megatron Backend

Overview

Megatron-LM based backend for large-scale distributed training with advanced parallelism strategies. Features:
  • Tensor Parallelism (TP)
  • Pipeline Parallelism (PP)
  • Data Parallelism (DP)
  • Context Parallelism (CP)
  • Sequence Parallelism (SP)
  • Virtual Pipeline Parallelism
  • Expert Parallelism for MoE models

MegatronTrainRayActor

Main training actor for the Megatron backend.
from slime.backends.megatron_utils.actor import MegatronTrainRayActor

class MegatronTrainRayActor(TrainRayActor):
    def init(self, args, role, with_ref=False, with_opd_teacher=False):
        """Initialize Megatron training"""
    
    def async_train(self, rollout_id, rollout_data_ref):
        """Train on rollout data"""
    
    def save_model(self, rollout_id, force_sync=False):
        """Save checkpoint"""
    
    def update_weights(self):
        """Update inference engine weights"""
Source: slime/backends/megatron_utils/actor.py:45

Key Configuration

  • --tensor-model-parallel-size (int, default: 1): Tensor parallelism degree
  • --pipeline-model-parallel-size (int, default: 1): Pipeline parallelism degree
  • --context-parallel-size (int, default: 1): Context parallelism for long sequences
  • --sequence-parallel (bool): Enable sequence parallelism
  • --num-layers (int, required): Number of transformer layers
  • --hidden-size (int, required): Hidden dimension size
  • --num-attention-heads (int, required): Number of attention heads
  • --qkv-format (str, default: "thd"): QKV tensor layout, either "thd" or "bshd"
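
The parallelism degrees above have to divide the GPU count evenly. The sketch below works through that arithmetic, assuming the usual Megatron-style layout in which the world size is the product of the tensor, pipeline, context, and data parallel sizes. The function name is hypothetical and the check is a simplification, not Slime's own validation code.

```python
def data_parallel_size(world_size: int, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """Derive the data-parallel degree left over after TP, PP and CP.

    Assumes world_size = tp * pp * cp * dp, the usual Megatron-style layout.
    """
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by tp*pp*cp = {model_parallel}"
        )
    return world_size // model_parallel


if __name__ == "__main__":
    # 64 GPUs with TP=4, PP=2, CP=2 leaves 64 / 16 = 4 data-parallel replicas.
    print(data_parallel_size(64, tp=4, pp=2, cp=2))
```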

Supported Models

Dense Models:
  • LLaMA (1, 2, 3, 3.1)
  • Qwen2, Qwen2.5, Qwen3
  • GPT (GPT-OSS)
  • GLM4
MoE Models:
  • Qwen3-MoE
  • DeepSeekV3
  • GLM4-MoE
  • MIMO
Multimodal:
  • Qwen3-VL
  • Qwen3-Next (Vision)
Source: slime/backends/megatron_utils/

FSDP Backend

Overview

PyTorch FSDP2-based backend using native HuggingFace models. Features:
  • Fully Sharded Data Parallel (FSDP2)
  • Native HuggingFace model support
  • Optional CPU offloading
  • Gradient checkpointing
  • Mixed precision training
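
To give a feel for why full sharding helps, here is a rough per-GPU memory estimate. It assumes about 16 bytes per parameter for weights, gradients, and Adam state under mixed precision, which is a common rule of thumb rather than a number measured on Slime, and it ignores activations and temporary all-gather buffers. The function name is hypothetical.

```python
def sharded_state_gib(num_params: float, world_size: int, bytes_per_param: int = 16) -> float:
    """Estimate per-GPU memory (GiB) for weights, gradients and optimizer state.

    Assumption: state is sharded evenly across world_size ranks, as full
    sharding aims to do. Activations and communication buffers are not counted.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    return num_params * bytes_per_param / world_size / 2**30


if __name__ == "__main__":
    # A 7B-parameter model on 8 GPUs: 7e9 * 16 / 8 bytes per GPU.
    print(f"{sharded_state_gib(7e9, 8):.2f} GiB per GPU")
```

When an estimate like this exceeds the available device memory, that is the situation in which CPU offloading or gradient checkpointing becomes worth its throughput cost.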

FSDPTrainRayActor

Training actor for the FSDP backend.
from slime.backends.fsdp_utils.actor import FSDPTrainRayActor

class FSDPTrainRayActor(TrainRayActor):
    def init(self, args, role, with_ref=False, with_opd_teacher=False):
        """Initialize FSDP training"""
    
    def train(self, rollout_id, rollout_data):
        """Train on rollout data"""
    
    def save(self, rollout_id, force_sync=False):
        """Save FSDP checkpoint"""
Source: slime/backends/fsdp_utils/actor.py:34

Key Configuration

  • --fsdp-cpu-offload (bool, default: False): Enable FSDP CPU offloading for memory efficiency
  • --gradient-checkpointing (bool, default: False): Enable activation checkpointing
  • --optimizer (str, default: "adam"): Optimizer type (currently supports "adam")
  • --attn-implementation (str): Attention implementation (e.g., "flash_attention_2")

Supported Models

Dense Models:
  • Any HuggingFace AutoModel compatible model
MoE Models:
  • Qwen3-MoE (with custom kernel support)
Source: slime/backends/fsdp_utils/

SGLang Integration

Overview

SGLang provides high-performance inference engines for rollout generation. Features:
  • RadixAttention prefix caching
  • Multi-engine data parallelism
  • Speculative decoding
  • FP8 quantization
  • Continuous batching

Configuration

  • --sglang-tp-size (int, default: 1): SGLang tensor parallelism size (same as --rollout-num-gpus-per-engine)
  • --sglang-dp-size (int): SGLang data parallelism size (number of engines)
  • --sglang-enable-torch-compile (bool, default: False): Enable torch.compile for SGLang
  • --sglang-speculative-algorithm (str): Speculative decoding algorithm (e.g., "eagle")
  • --sglang-speculative-num-draft-tokens (int, default: 4): Number of draft tokens for speculation
  • --sglang-enable-deterministic-inference (bool, default: False): Enable deterministic sampling

External SGLang

Use external SGLang instances instead of auto-launching:
python train.py \
  --rollout-external \
  --rollout-external-engine-addrs \
    http://10.0.0.1:30000 \
    http://10.0.0.2:30000
Source: slime/backends/sglang_utils/, slime/utils/arguments.py:459-470
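
Before launching a run against external engines, it can help to sanity-check the address list. The following is a small sketch using only the standard library. `parse_engine_addrs` is a hypothetical helper, and whether Slime itself requires an explicit scheme and port is an assumption based on the example command above.

```python
from urllib.parse import urlsplit


def parse_engine_addrs(addrs: list[str]) -> list[tuple[str, int]]:
    """Split http(s)://host:port strings into (host, port) pairs."""
    endpoints = []
    for addr in addrs:
        parts = urlsplit(addr)
        # Require an explicit scheme, host and port, as in the example command.
        if parts.scheme not in ("http", "https") or parts.hostname is None or parts.port is None:
            raise ValueError(f"expected http(s)://host:port, got {addr!r}")
        endpoints.append((parts.hostname, parts.port))
    return endpoints


if __name__ == "__main__":
    print(parse_engine_addrs(["http://10.0.0.1:30000", "http://10.0.0.2:30000"]))
```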

Weight Update Mechanisms

UpdateWeightFromDistributed

Distributed weight transfer for non-colocated setups.
from slime.backends.megatron_utils.update_weight import UpdateWeightFromDistributed

updater = UpdateWeightFromDistributed(
    args,
    model,
    weights_getter=lambda: model.state_dict(),
    model_name="qwen2"
)

updater.update_weights()  # Transfer to SGLang
Mechanism:
  1. Gather weights from training model
  2. Convert Megatron format to HuggingFace format
  3. Send to SGLang via HTTP chunked transfer
  4. Apply optional quantization (FP8)
Source: slime/backends/megatron_utils/update_weight/update_weight_from_distributed.py
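
The chunked transfer in step 3 can be pictured as grouping parameters into size-bounded batches. The sketch below shows one generic way to do that over (name, byte size) pairs. It is not taken from Slime's source: `chunk_by_bytes` is a hypothetical helper, and the real chunking policy may differ.

```python
from typing import Iterable, Iterator


def chunk_by_bytes(
    named_sizes: Iterable[tuple[str, int]], max_bytes: int
) -> Iterator[list[str]]:
    """Group parameter names into chunks whose total size stays under max_bytes.

    A single parameter larger than max_bytes still gets a chunk of its own.
    """
    chunk: list[str] = []
    used = 0
    for name, nbytes in named_sizes:
        # Close the current chunk if adding this parameter would overflow it.
        if chunk and used + nbytes > max_bytes:
            yield chunk
            chunk, used = [], 0
        chunk.append(name)
        used += nbytes
    if chunk:
        yield chunk


if __name__ == "__main__":
    sizes = [("embed", 4), ("attn.q", 4), ("attn.k", 6), ("mlp", 20)]
    for group in chunk_by_bytes(sizes, max_bytes=10):
        print(group)
```

Bounding each chunk keeps peak memory on both sides predictable, at the cost of more round trips per update.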

UpdateWeightFromTensor

Direct tensor transfer for colocated setups.
from slime.backends.megatron_utils.update_weight import UpdateWeightFromTensor

updater = UpdateWeightFromTensor(
    args,
    model,
    weights_getter=lambda: model.state_dict(),
    model_name="qwen2"
)

updater.update_weights()  # In-memory transfer
Mechanism:
  1. Get weight tensors from training model
  2. Convert to HuggingFace format
  3. Directly load into SGLang (shared memory)
Source: slime/backends/megatron_utils/update_weight/update_weight_from_tensor.py

Backend-Specific Features

Megatron Features

Virtual Pipeline Parallelism:
python train.py \
  --num-layers-per-virtual-pipeline-stage 2
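
The flag above sets how many layers go into each virtual stage; the number of model chunks each pipeline rank holds follows from it. The sketch below works through that arithmetic under the usual Megatron-LM convention (layers per rank divided by layers per virtual stage). The function name is hypothetical and this is not Slime's validation code.

```python
def virtual_pipeline_size(num_layers: int, pp_size: int, layers_per_virtual_stage: int) -> int:
    """Return how many model chunks (virtual stages) each pipeline rank holds."""
    if num_layers % pp_size != 0:
        raise ValueError("num_layers must be divisible by the pipeline parallel size")
    layers_per_rank = num_layers // pp_size
    if layers_per_rank % layers_per_virtual_stage != 0:
        raise ValueError("layers per rank must be divisible by layers per virtual stage")
    return layers_per_rank // layers_per_virtual_stage


if __name__ == "__main__":
    # 32 layers over 4 pipeline ranks is 8 layers per rank;
    # with 2 layers per virtual stage, each rank holds 4 chunks.
    print(virtual_pipeline_size(32, 4, 2))
```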
Expert Parallelism (MoE):
python train.py \
  --expert-model-parallel-size 8 \
  --moe-router-topk 6
Routing Replay:
# --use-routing-replay: training-time routing replay
# --use-rollout-routing-replay: rollout-time routing replay
python train.py \
  --use-routing-replay \
  --use-rollout-routing-replay
Custom Model Provider:
python train.py \
  --custom-model-provider-path my_module.my_model_provider
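
A dotted path such as `my_module.my_model_provider` is conventionally resolved by importing the module part and looking up the final attribute. The sketch below shows that pattern with the standard library. `load_object` is a hypothetical name, and Slime's own loader may behave differently (for example in how it reports errors).

```python
import importlib
from typing import Any


def load_object(dotted_path: str) -> Any:
    """Import 'package.module.attr' and return the attribute."""
    module_name, _, attr = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.attribute', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


if __name__ == "__main__":
    # Resolve a standard-library function the same way a provider path would be resolved.
    sqrt = load_object("math.sqrt")
    print(sqrt(9.0))
```

For this to work with a custom provider, the module has to be importable from the training process, typically by being on `PYTHONPATH`.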

FSDP Features

CPU Offloading:
python train.py \
  --train-backend fsdp \
  --fsdp-cpu-offload
Gradient Checkpointing:
python train.py \
  --train-backend fsdp \
  --gradient-checkpointing

Backend Comparison

Feature | Megatron | FSDP
Tensor Parallelism | Yes | No
Pipeline Parallelism | Yes | No
Context Parallelism | Yes | No
MoE Support | Yes | Limited
Native HF Models | Via conversion | Yes
Memory Efficiency | High (SP/CP) | Medium
Setup Complexity | High | Low
Best For | Large models (>70B) | Medium models (<70B)

Choosing a Backend

Use Megatron when:
  • Training models >70B parameters
  • Need tensor/pipeline parallelism
  • Training MoE models with expert parallelism
  • Require maximum scalability
Use FSDP when:
  • Training models <70B parameters
  • Want simple HuggingFace integration
  • Prefer lower setup complexity
  • Don’t need advanced parallelism
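
The guidance above can be condensed into a small decision helper. This is only a sketch of the rules of thumb listed here, with hypothetical names (`TrainingPlan`, `suggest_backend`); it is not part of Slime, and the 70B threshold is the rough cut-off this page uses rather than a hard limit.

```python
from dataclasses import dataclass


@dataclass
class TrainingPlan:
    params_billion: float
    needs_tensor_or_pipeline_parallel: bool = False
    moe_with_expert_parallel: bool = False


def suggest_backend(plan: TrainingPlan) -> str:
    """Apply the rules of thumb from this page and return a --train-backend value."""
    if (
        plan.params_billion > 70
        or plan.needs_tensor_or_pipeline_parallel
        or plan.moe_with_expert_parallel
    ):
        return "megatron"
    return "fsdp"


if __name__ == "__main__":
    print(suggest_backend(TrainingPlan(params_billion=7)))
    print(suggest_backend(TrainingPlan(params_billion=30, moe_with_expert_parallel=True)))
```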