slime is built on Ray for distributed execution, enabling training of large models across multiple nodes. This guide covers Ray cluster setup, multi-node configuration, and optimization for large-scale MOE models.
Overview
slime’s distributed architecture:
- Ray Cluster: Manages resources and job scheduling across nodes
- Training (Actor): Megatron-based model training
- Inference (Rollout): SGLang-based response generation
- Coordination: Ray handles communication and synchronization
Single Node Setup
For single-node training (one machine with multiple GPUs), start a Ray head node and then submit the training job.
Start Ray Head Node
Submit Training Job
Resource Allocation
Number of nodes for training actor.
GPUs per node allocated to training.
Total GPUs for inference. Ignored when using --colocate.
Total GPUs available per node. Important when using --colocate with fewer than 8 GPUs.
Multi-Node Setup
To train large models across multiple machines, start a Ray cluster, verify it, and then submit the job.
Step 1: Start Ray Cluster
On Head Node (Node 0)
On Worker Nodes (Node 1, 2, …)
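The commands for the head and worker nodes are not reproduced in this section. As a rough sketch only, the snippet below assembles plausible `ray start` invocations for both roles. The flags (`--head`, `--node-ip-address`, `--port`, `--num-gpus`, `--address`) are standard Ray CLI options, and port 6379 is the one named under Troubleshooting. The IP address, the count of 8 GPUs per node, and the helper names are illustrative assumptions, not values taken from slime.

```python
import shlex

RAY_PORT = 6379  # Ray head port, as referenced under Troubleshooting


def head_command(master_addr: str, gpus_per_node: int = 8) -> str:
    """Build the command to run on the head node (Node 0)."""
    return shlex.join([
        "ray", "start", "--head",
        f"--node-ip-address={master_addr}",
        f"--port={RAY_PORT}",
        f"--num-gpus={gpus_per_node}",
    ])


def worker_command(master_addr: str, gpus_per_node: int = 8) -> str:
    """Build the command to run on each worker node (Node 1, 2, ...)."""
    return shlex.join([
        "ray", "start",
        f"--address={master_addr}:{RAY_PORT}",
        f"--num-gpus={gpus_per_node}",
    ])


if __name__ == "__main__":
    master_addr = "10.0.0.1"  # hypothetical head-node IP (MASTER_ADDR)
    print(head_command(master_addr))
    print(worker_command(master_addr))
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere; on a real cluster the printed lines would be run on the respective nodes.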
Step 2: Verify Cluster
Check cluster status by running ray status; every node should be listed.
Step 3: Submit Multi-Node Job
Colocated Mode
Colocated mode runs training and inference on the same GPUs, which reduces the total number of GPUs required.
Configuration
Memory Management
When to Use Colocated Mode
Use Colocated
- Limited GPU resources
- Small to medium models
- Training throughput is the priority
Use Disaggregated
- Large GPU clusters available
- Maximum inference throughput needed
- Independent scaling of training/inference
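To make the trade-off concrete, here is a minimal sketch of how the GPU footprint differs between the two modes. It follows the resource allocation rules stated earlier: training uses nodes × GPUs per node, and the inference GPU count is ignored under --colocate. The function and parameter names are hypothetical and are not slime's own API.

```python
def total_gpus(actor_nodes: int, actor_gpus_per_node: int,
               rollout_gpus: int, colocate: bool) -> int:
    """Number of GPUs a job occupies under the stated allocation rules."""
    training_gpus = actor_nodes * actor_gpus_per_node
    if colocate:
        # Inference shares the training GPUs; the rollout count is ignored.
        return training_gpus
    # Disaggregated: training and inference run on separate GPUs.
    return training_gpus + rollout_gpus


if __name__ == "__main__":
    # One node, 4 training GPUs plus 4 inference GPUs (disaggregated).
    print(total_gpus(1, 4, 4, colocate=False))
    # One node, all 8 GPUs shared by training and inference (colocated).
    print(total_gpus(1, 8, 8, colocate=True))
```

Both example configurations occupy the same number of GPUs, but the colocated one gives every GPU to both workloads in turn, while the disaggregated one splits them.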
Network Configuration
Environment Variables
For multi-node setups, you may need to select the network interface by setting GLOO_SOCKET_IFNAME and NCCL_SOCKET_IFNAME.
SLURM + Enroot Example
For SLURM clusters with enroot containers:
Large-Scale MOE Models
slime is optimized for training massive Mixture of Experts models.
Example: GLM-4.5 (355B total, 32B active parameters)
Training configuration for 64 H100 GPUs (8 nodes × 8 GPUs):
Parallelism Strategy
For large MOE models, carefully tune parallelism:
Advanced NCCL Tuning
For optimal multi-node performance:
Monitoring
Ray Dashboard
Access the Ray dashboard at http://<head-node-ip>:8265 to:
- View cluster resources
- Monitor job status
- Check GPU utilization
- Inspect logs
Weights & Biases Integration
Checkpointing
For large models, use async checkpointing:
Troubleshooting
Common Issues
Ray workers not connecting
Symptoms: Workers don’t appear in ray status
Solutions:
- Verify firewall allows port 6379
- Check MASTER_ADDR is correct and reachable
- Ensure same Ray version on all nodes
- Try ray stop --force on all nodes and restart
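The first two checks (firewall and reachability of the head node) can be approximated from a worker node with a plain TCP probe. This is a minimal sketch using only the Python standard library. It assumes the head address is available in the MASTER_ADDR environment variable and falls back to 127.0.0.1 otherwise; the helper name is hypothetical.

```python
import os
import socket


def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    head = os.environ.get("MASTER_ADDR", "127.0.0.1")
    print(f"{head}:6379 reachable: {port_reachable(head, 6379)}")
```

A False result points to a firewall rule, a wrong address, or a head node that is not running. A True result only shows that the port accepts connections, not that the Ray versions match.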
NCCL timeout errors
Symptoms: NCCL error: unhandled system error
Solutions:
- Increase timeout: --distributed-timeout-minutes 20
- Check network interface: Set GLOO_SOCKET_IFNAME and NCCL_SOCKET_IFNAME
- Verify InfiniBand configuration if using IB
- Reduce --global-batch-size to decrease communication
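For the network interface check, one way to find candidate values is to list the interfaces on a node and drop the ones that cannot carry inter-node traffic. The sketch below does this with the standard library and prints export lines for the two variables. The prefix filter and the choice of the first remaining interface are assumptions for illustration; the right interface depends on the cluster's network layout.

```python
import socket


def candidate_interfaces(names, skip_prefixes=("lo", "docker", "veth")):
    """Drop loopback and container-only interfaces, matched by name prefix."""
    return [n for n in names if not n.startswith(skip_prefixes)]


def socket_ifname_env(ifname: str) -> dict:
    """Environment variables that pin Gloo and NCCL sockets to one interface."""
    return {"GLOO_SOCKET_IFNAME": ifname, "NCCL_SOCKET_IFNAME": ifname}


if __name__ == "__main__":
    # socket.if_nameindex() returns (index, name) pairs on Unix-like systems.
    names = [name for _, name in socket.if_nameindex()]
    usable = candidate_interfaces(names)
    print("interfaces found:", names)
    if usable:
        # Taking the first candidate is a placeholder policy, not a rule.
        for key, value in socket_ifname_env(usable[0]).items():
            print(f"export {key}={value}")
```

The variables need to be set in the environment before the Ray processes start on each node, which is why the sketch prints export lines instead of modifying its own environment.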
Out of memory in colocated mode
Symptoms: CUDA OOM during rollout
Solutions:
- Reduce --sglang-mem-fraction-static to 0.7 or lower
- Decrease --max-tokens-per-gpu
- Enable --recompute-granularity full
- Use --optimizer-cpu-offload
Slow data loading
Symptoms: Long wait times between rollouts
Solutions:
- Use shared filesystem (NFS, Lustre) for datasets
- Enable --balance-data for better load distribution
- Increase --rollout-batch-size to amortize overhead
- Check network bandwidth between nodes
Debugging Commands
Performance Optimization
Prefetching and Pipelining
Dynamic Sampling with Buffer
Load Balancing
Resource Planning
Memory Estimation
Approximate GPU memory requirements:
Scaling Guidelines
| Model Size | Recommended GPUs | Parallelism Strategy |
|---|---|---|
| 7-13B | 8 GPUs | TP=2, PP=1 |
| 30-70B | 16-32 GPUs | TP=4, PP=2 |
| 100-200B | 32-64 GPUs | TP=8, PP=4 |
| 300B+ MOE | 64+ GPUs | TP=8, PP=4, EP=16 |
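A quick sanity check for any row of the table is that the tensor-parallel (TP) and pipeline-parallel (PP) sizes multiply to a divisor of the GPU count; the quotient is the number of data-parallel replicas. The sketch below applies that check to the table rows, using the lower bound of each GPU range. It assumes no context parallelism and does not model expert parallelism (EP), whose constraints depend on the training backend; the function name is hypothetical.

```python
def data_parallel_size(num_gpus: int, tp: int, pp: int) -> int:
    """Data-parallel replicas left after tensor and pipeline parallelism."""
    model_parallel = tp * pp
    if num_gpus % model_parallel != 0:
        raise ValueError(
            f"{num_gpus} GPUs cannot be split evenly into TP={tp} x PP={pp} groups"
        )
    return num_gpus // model_parallel


if __name__ == "__main__":
    # (label, GPUs, TP, PP) taken from the table, lower bound of each range.
    rows = [
        ("7-13B", 8, 2, 1),
        ("30-70B", 16, 4, 2),
        ("100-200B", 32, 8, 4),
        ("300B+ MOE", 64, 8, 4),
    ]
    for label, gpus, tp, pp in rows:
        print(f"{label}: DP={data_parallel_size(gpus, tp, pp)}")
```

A data-parallel size of 1 means every GPU is needed just to hold one model replica, so scaling throughput requires adding GPUs in multiples of TP × PP.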
Next Steps
Configuration Guide
Review all configuration parameters
Multi-Turn Agents
Train agents with tool calling