Overview
While Megatron-LM is highly efficient for parallel training, it can lack the flexibility to support rapidly evolving model architectures. slime provides two approaches for handling cutting-edge models:
- HuggingFace Module Wrapping - Import and wrap official HF implementations into Megatron’s pipeline
- FSDP Backend - Use PyTorch’s Fully Sharded Data Parallel for maximum flexibility
Approach 1: HuggingFace Module Wrapping
Instead of deeply re-engineering Megatron, slime can directly import and wrap a model’s official HuggingFace implementation, embedding it as a “black-box” module in Megatron’s parallel training pipeline.

How It Works
Megatron’s model instantiation is a two-step process:
- Generate a “layer specification” (`ModuleSpec`) based on the configuration
- Instantiate the actual PyTorch modules according to that spec
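The two steps above can be sketched in plain Python. The classes below are simplified stand-ins for illustration (the real `ModuleSpec` lives in Megatron-Core), but they show the hook slime exploits: because the spec names a class rather than an instance, the class can be swapped before instantiation.

```python
from dataclasses import dataclass, field

# Minimal stand-in for Megatron's ModuleSpec (illustrative only).
@dataclass
class ModuleSpec:
    module: type                          # class to instantiate in step 2
    params: dict = field(default_factory=dict)

class SelfAttention:
    """Stand-in for Megatron's default attention implementation."""
    def __init__(self, num_heads):
        self.num_heads = num_heads

class CustomAttention(SelfAttention):
    """A drop-in replacement wired in by editing the spec."""

def build_module(spec: ModuleSpec):
    # Step 2: instantiate whatever class the spec currently names.
    return spec.module(**spec.params)

# Step 1: build the spec, then swap the attention class before instantiation.
spec = ModuleSpec(module=SelfAttention, params={"num_heads": 8})
spec.module = CustomAttention         # the hook slime's plugins use
attn = build_module(spec)
```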
Core Components
Replace the Megatron Module Spec
Use a custom function (e.g., `get_qwen3_next_spec`) to modify the standard `ModuleSpec`:
- Retrieve the standard decoder block spec
- Point its `self_attention` field to a custom wrapper module
- Enable model-specific configurations such as `qk_layernorm`
`slime_plugins/models/qwen3_next.py`

Wrap the HuggingFace Implementation
The modified spec points to a wrapper layer (e.g., `HuggingfaceAttention`) that:
- Inherits from Megatron’s `MegatronModule`
- Handles data alignment for parallelism strategies (such as sequence parallelism)
- Internally calls the native attention module loaded from HuggingFace
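The adapter pattern behind the wrapper can be sketched as follows. All names here are simplified stand-ins, not slime’s actual API; the stub “attention” just doubles its inputs so the delegation is visible.

```python
class MegatronModule:
    """Stand-in for megatron.core's MegatronModule base class."""

class HFAttentionStub:
    """Stand-in for an attention module loaded from a HuggingFace model."""
    def forward(self, hidden_states):
        return [h * 2 for h in hidden_states]   # placeholder computation

class HuggingfaceAttention(MegatronModule):
    """Black-box adapter: Megatron calls it like a native attention layer,
    but the actual math is delegated to the HF implementation."""
    def __init__(self, hf_attention):
        self.hf_attention = hf_attention

    def forward(self, hidden_states):
        # In the real wrapper, sequence-parallel shards would be gathered
        # here so the HF module sees the full sequence, then re-scattered.
        return self.hf_attention.forward(hidden_states)

layer = HuggingfaceAttention(HFAttentionStub())
```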
`slime_plugins/models/hf_attention.py`

Align Model Weights
Use the mbridge library to establish a naming map between HuggingFace checkpoints and Megatron parameters:
- Enables seamless bidirectional conversion
- Handles parameter name mapping automatically
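The idea behind such a name map can be illustrated with a toy converter. The patterns below are invented for illustration and are not mbridge’s actual tables; `{i}` marks the layer index.

```python
import re

# Toy HF -> Megatron parameter-name map (illustrative patterns only).
NAME_MAP = {
    "model.embed_tokens.weight": "embedding.word_embeddings.weight",
    "model.layers.{i}.self_attn.q_proj.weight":
        "decoder.layers.{i}.self_attention.linear_q.weight",
}

def hf_to_megatron(hf_name: str) -> str:
    """Translate one HF parameter name via the map, preserving the layer index."""
    for hf_pat, meg_pat in NAME_MAP.items():
        # Turn the template into a regex, with "{i}" capturing the layer number.
        regex = "^" + re.escape(hf_pat).replace(r"\{i\}", r"(\d+)") + "$"
        m = re.match(regex, hf_name)
        if m:
            if m.groups():
                return meg_pat.replace("{i}", m.group(1))
            return meg_pat
    raise KeyError(f"no mapping for {hf_name}")
```

Running both directions of such a map over a full checkpoint is what makes the bidirectional conversion “seamless”: every parameter has exactly one counterpart name.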
`slime_plugins/mbridge/qwen3_next.py`

Example: Qwen3Next 80B-A3B
Capabilities
With this approach, you can run complex model architectures (such as Gated DeltaNet) while retaining:
- Model parallelism (PP, EP)
- MoE acceleration
- Pipeline scheduling
- Sequence parallelism
Current Limitations
Approach 2: FSDP Backend
For maximum flexibility with modern architectures, slime provides a native FSDP (Fully Sharded Data Parallel) backend that works directly with HuggingFace models.

Architecture
The FSDP backend (FSDPTrainRayActor) provides:
- Direct HuggingFace model loading
- PyTorch FSDP2 for efficient distributed training
- Optional CPU offloading for large models
- Full compatibility with slime’s RL training pipeline
Key Features
- Native HF Support: load models directly from HuggingFace without conversion
- Memory Efficient: CPU offloading and mixed-precision support
- Flexible Parallelism: data parallelism with FSDP2 sharding strategies
- RL Compatible: full integration with slime’s PPO/GRPO algorithms
Configuration
The FSDP backend supports several configuration options.

Usage Example
FSDP Implementation Details
Model Initialization
The FSDP backend uses an efficient initialization strategy.

Memory Optimization: non-rank-0 processes are initialized on the meta device (no memory allocated) unless `tie_word_embeddings=True`, which requires full CPU initialization to avoid hangs.

Data Packing
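Sequence packing concatenates variable-length samples into one flat token stream and records cumulative offsets (often called `cu_seqlens`) so attention can still be computed per sample. A minimal, framework-free illustration (not slime’s actual implementation):

```python
def pack_sequences(samples):
    """Concatenate variable-length token lists into one flat stream.

    Returns the flat token list plus cumulative sequence boundaries
    (cu_seqlens), the layout varlen attention kernels consume.
    """
    flat, cu_seqlens = [], [0]
    for tokens in samples:
        flat.extend(tokens)
        cu_seqlens.append(cu_seqlens[-1] + len(tokens))
    return flat, cu_seqlens

def unpack(flat, cu_seqlens):
    """Recover the original per-sample token lists from the packed layout."""
    return [flat[s:e] for s, e in zip(cu_seqlens, cu_seqlens[1:])]

packed, offsets = pack_sequences([[1, 2, 3], [4], [5, 6]])
# packed  -> [1, 2, 3, 4, 5, 6]
# offsets -> [0, 3, 4, 6]
```

Packing avoids padding waste: every position in the batch carries a real token, which matters for RL rollouts whose lengths vary widely.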
The FSDP backend packs variable-length sequences efficiently for training.

Reference Model Management
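For context, one standard role of the frozen reference model in PPO/GRPO-style training is a per-token KL penalty between the policy’s and the reference’s log-probabilities. A toy, framework-free sketch using the common low-variance “k3” estimator (this is a typical choice, not necessarily slime’s exact formula):

```python
import math

def kl_penalty(policy_logprobs, ref_logprobs):
    """Per-token KL estimate between the policy and a frozen reference.

    Uses the "k3" estimator exp(r) - r - 1, where r = ref_lp - policy_lp.
    It is non-negative by construction and zero when the two agree.
    """
    penalties = []
    for lp, ref_lp in zip(policy_logprobs, ref_logprobs):
        r = ref_lp - lp
        penalties.append(math.exp(r) - r - 1.0)
    return penalties
```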
For PPO/GRPO training, the FSDP backend maintains a separate, frozen reference model.

Training Workflow
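At a high level, one on-policy step orders rollout generation, log-prob computation (policy and reference), advantage estimation, and the gradient update. A schematic skeleton with the stages stubbed out as callables (illustrative only; slime’s actual actor methods and advantage formula differ, and the reward-minus-KL advantage below is a toy stand-in):

```python
def train_step(rollout_fn, policy_logprob_fn, ref_logprob_fn, update_fn):
    """One schematic on-policy RL step; stages are injected as callables."""
    batch = rollout_fn()                                  # 1. generate rollouts
    batch["logprobs"] = policy_logprob_fn(batch)          # 2. policy log-probs
    batch["ref_logprobs"] = ref_logprob_fn(batch)         # 3. reference log-probs
    batch["advantages"] = [                               # 4. toy advantage: reward minus KL term
        r - (lp - rlp)
        for r, lp, rlp in zip(batch["rewards"], batch["logprobs"], batch["ref_logprobs"])
    ]
    update_fn(batch)                                      # 5. gradient update
    return batch
```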
The FSDP training loop follows slime’s standard RL training pipeline.

Advantages Over Megatron
| Feature | FSDP Backend | Megatron-LM |
|---|---|---|
| HF Model Support | Native, no conversion | Requires torch_dist conversion |
| New Architectures | Immediate support | Manual implementation required |
| Parallelism | Data parallel only | TP, PP, EP support |
| Memory Efficiency | CPU offload, FSDP2 sharding | Gradient checkpointing, ZeRO |
| Development Speed | Fast prototyping | Production-grade performance |
When to Use FSDP
Choose the FSDP backend when:
- Working with cutting-edge model architectures (Qwen3Next, Gemma2, etc.)
- Rapid prototyping and experimentation are the priority
- Models fit within data-parallel scaling limits
- You need native HuggingFace compatibility

Choose the Megatron-LM backend when:
- Training massive models that require tensor parallelism
- Maximum training throughput is critical
- Using well-supported architectures (GPT, LLaMA, Qwen)
- Deploying production training at scale
Supported Backends Comparison
Megatron-LM
Best for: Production training
- Full 3D parallelism (TP/PP/DP)
- Maximum throughput
- Battle-tested at scale
HF Wrapping
Best for: New architectures with Megatron
- Custom attention mechanisms
- Partial Megatron parallelism
- Quick integration
FSDP
Best for: Flexible experimentation
- Native HF models
- Fast iteration
- Data-parallel scaling
Getting Started
Using HF Wrapping
- Implement a custom spec generator in `slime_plugins/models/`
- Create the HF wrapper module in `slime_plugins/models/hf_attention.py`
- Add the weight mapping in `slime_plugins/mbridge/`