Overview
Slime supports multiple training backends for different model architectures and scaling requirements.
Backend Selection
Backends are selected via the `--train-backend` argument.
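As a hedged sketch of backend selection (only the `--train-backend` flag is documented on this page; the launch script name and trailing arguments are assumptions):

```shell
# Hypothetical launch commands; only --train-backend appears in this page.
# Megatron backend for large-scale parallel training:
python train.py --train-backend megatron ...

# FSDP backend for native HuggingFace models:
python train.py --train-backend fsdp ...
```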
Megatron Backend
Overview
Megatron-LM based backend for large-scale distributed training with advanced parallelism strategies.
Features:
- Tensor Parallelism (TP)
- Pipeline Parallelism (PP)
- Data Parallelism (DP)
- Context Parallelism (CP)
- Sequence Parallelism (SP)
- Virtual Pipeline Parallelism
- Expert Parallelism for MoE models
MegatronTrainRayActor
Main training actor for the Megatron backend. Source: `slime/backends/megatron_utils/actor.py:45`
Key Configuration
- Tensor parallelism degree
- Pipeline parallelism degree
- Context parallelism for long sequences
- Enable sequence parallelism
- Number of transformer layers
- Hidden dimension size
- Number of attention heads
- QKV tensor layout: `"thd"` or `"bshd"`
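The options above correspond to standard Megatron-LM argument names; whether slime forwards these spellings verbatim is an assumption, so treat this as a sketch rather than a verified invocation:

```shell
# Hypothetical invocation using standard Megatron-LM argument names;
# slime forwarding these verbatim is an assumption.
python train.py --train-backend megatron \
  --tensor-model-parallel-size 4 \
  --pipeline-model-parallel-size 2 \
  --context-parallel-size 1 \
  --sequence-parallel \
  --num-layers 32 \
  --hidden-size 4096 \
  --num-attention-heads 32
```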
Supported Models
Dense Models:
- LLaMA (1, 2, 3, 3.1)
- Qwen2, Qwen2.5, Qwen3
- GPT (GPT-OSS)
- GLM4
MoE Models:
- Qwen3-MoE
- DeepSeekV3
- GLM4-MoE
Other Models:
- MIMO
- Qwen3-VL (vision)
- Qwen3-Next
slime/backends/megatron_utils/
FSDP Backend
Overview
PyTorch FSDP2-based backend using native HuggingFace models.
Features:
- Fully Sharded Data Parallel (FSDP2)
- Native HuggingFace model support
- Optional CPU offloading
- Gradient checkpointing
- Mixed precision training
FSDPTrainRayActor
Training actor for the FSDP backend. Source: `slime/backends/fsdp_utils/actor.py:34`
Key Configuration
- Enable FSDP CPU offloading for memory efficiency
- Enable activation checkpointing
- Optimizer type (currently supports `"adam"`)
- Attention implementation (e.g., `"flash_attention_2"`)
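The descriptions above come from this page, but the exact argument spellings do not; the flag names below are assumptions chosen to illustrate how the options might be passed:

```shell
# Hypothetical flag names for the FSDP options above; the descriptions are
# from this page, but these exact spellings are assumptions.
python train.py --train-backend fsdp \
  --fsdp-cpu-offload \
  --gradient-checkpointing \
  --optimizer adam \
  --attn-implementation flash_attention_2
```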
Supported Models
Dense Models:
- Any HuggingFace AutoModel-compatible model
MoE Models:
- Qwen3-MoE (with custom kernel support)
slime/backends/fsdp_utils/
SGLang Integration
Overview
SGLang provides high-performance inference engines for rollout generation.
Features:
- RadixAttention prefix caching
- Multi-engine data parallelism
- Speculative decoding
- FP8 quantization
- Continuous batching
Configuration
- SGLang tensor parallelism size (matches `--rollout-num-gpus-per-engine`)
- SGLang data parallelism size (number of engines)
- Enable torch.compile for SGLang
- Speculative decoding algorithm (e.g., `"eagle"`)
- Number of draft tokens for speculation
- Enable deterministic sampling
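Of the options above, only `--rollout-num-gpus-per-engine` is named on this page; the `--sglang-*` spellings below are assumptions, shown only to sketch a plausible invocation:

```shell
# Hypothetical invocation; --rollout-num-gpus-per-engine is documented here,
# the remaining --sglang-* spellings are assumptions. Data parallelism size
# is the number of engines launched.
python train.py \
  --rollout-num-gpus-per-engine 8 \
  --sglang-enable-torch-compile \
  --sglang-speculative-algorithm eagle \
  --sglang-speculative-num-draft-tokens 4 \
  --sglang-enable-deterministic-sampling
```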
External SGLang
Use external SGLang instances instead of auto-launching. Source: `slime/backends/sglang_utils/`, `slime/utils/arguments.py:459-470`
Weight Update Mechanisms
UpdateWeightFromDistributed
Distributed weight transfer for non-colocated setups.
- Gather weights from the training model
- Convert Megatron format to HuggingFace format
- Send to SGLang via HTTP chunked transfer
- Apply optional quantization (FP8)
slime/backends/megatron_utils/update_weight/update_weight_from_distributed.py
UpdateWeightFromTensor
Direct tensor transfer for colocated setups.
- Get weight tensors from the training model
- Convert to HuggingFace format
- Directly load into SGLang (shared memory)
slime/backends/megatron_utils/update_weight/update_weight_from_tensor.py
Backend-Specific Features
Megatron Features
Virtual Pipeline Parallelism: assigns multiple non-contiguous pipeline stages to each GPU, reducing the pipeline bubble.
FSDP Features
CPU Offloading: optionally moves parameters and optimizer state to CPU memory, trading transfer overhead for lower GPU memory pressure.
Backend Comparison
| Feature | Megatron | FSDP |
|---|---|---|
| Tensor Parallelism | ✓ | ✗ |
| Pipeline Parallelism | ✓ | ✗ |
| Context Parallelism | ✓ | ✗ |
| MoE Support | ✓ | Limited |
| Native HF Models | Via conversion | ✓ |
| Memory Efficiency | High (SP/CP) | Medium |
| Setup Complexity | High | Low |
| Best For | Large models (>70B) | Medium models (<70B) |
Choosing a Backend
Use Megatron when:
- Training models >70B parameters
- Need tensor/pipeline parallelism
- Training MoE models with expert parallelism
- Require maximum scalability
Use FSDP when:
- Training models <70B parameters
- Want simple HuggingFace integration
- Prefer lower setup complexity
- Don’t need advanced parallelism
Related Pages
- Training API - Training functions
- Arguments API - Backend configuration