Overview
This example demonstrates training Qwen3-30B-A3B, a Mixture-of-Experts (MoE) model, on 8×H100 GPUs. The configuration uses CPU Adam to reduce GPU memory pressure and supports BF16 training with FP8 inference for higher rollout throughput.
Model Specifications
- Model: Qwen3-30B-A3B
- Architecture: Mixture-of-Experts (MoE)
- Hardware: 8×H100 or H800 GPUs
- Optimization: CPU Adam for memory efficiency
- Inference: Optional FP8 quantization
Dataset
Training uses the dapo-math-17k dataset, with aime-2024 for evaluation; both are downloaded in the steps below.
Environment Setup
Environment setup, model download, and checkpoint conversion follow the same steps as Qwen3-4B, replacing “Qwen3-4B” with “Qwen3-30B-A3B”.
Initialize environment
cd /root/
git clone https://github.com/THUDM/slime.git
cd slime/
pip install -e . --no-deps
Download model and data
# Model checkpoint
hf download Qwen/Qwen3-30B-A3B --local-dir /root/Qwen3-30B-A3B
# Training data
hf download --repo-type dataset zhuzilin/dapo-math-17k \
--local-dir /root/dapo-math-17k
# Evaluation data
hf download --repo-type dataset zhuzilin/aime-2024 \
--local-dir /root/aime-2024
Convert checkpoint
cd slime/
pip install -e . --no-deps
source scripts/models/qwen3-30B-A3B.sh
PYTHONPATH=/root/Megatron-LM/ torchrun --nproc-per-node 8 \
tools/convert_hf_to_torch_dist.py \
${MODEL_ARGS[@]} \
--hf-checkpoint /root/Qwen3-30B-A3B/ \
--save /root/Qwen3-30B-A3B_torch_dist/
Training Configuration
MoE Parallelism Settings
PERF_ARGS=(
--tensor-model-parallel-size 4
--sequence-parallel
--pipeline-model-parallel-size 1
--context-parallel-size 1
--expert-model-parallel-size 8
--expert-tensor-parallel-size 1
--recompute-granularity full
--recompute-method uniform
--recompute-num-layers 1
--use-dynamic-batch-size
--max-tokens-per-gpu 8192
)
TP=4 and EP=8 for MoE training: the experts are sharded across all 8 GPUs, while tensor parallelism splits the dense (non-expert) layers across groups of 4 GPUs.
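The GPU layout implied by these settings can be sketched as follows. This assumes Megatron's usual rule that the data-parallel size is the world size divided by the product of tensor, pipeline, and context parallelism; expert parallelism reuses the same ranks, so EP can exceed the dense DP size.

```python
# Sketch: how the 8 GPUs are carved up by the parallelism settings above.
def data_parallel_size(world_size, tp, pp, cp):
    assert world_size % (tp * pp * cp) == 0, "parallel sizes must divide world size"
    return world_size // (tp * pp * cp)

# TP=4, PP=1, CP=1 on 8 GPUs -> 2 data-parallel replicas of the dense layers,
# while EP=8 spreads the experts over all 8 GPUs.
dp = data_parallel_size(world_size=8, tp=4, pp=1, cp=1)
print(dp)  # 2
```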
CPU Adam Optimization
OPTIMIZER_ARGS=(
--optimizer adam
--lr 1e-6
--lr-decay-style constant
--weight-decay 0.1
--adam-beta1 0.9
--adam-beta2 0.98
--optimizer-cpu-offload
--overlap-cpu-optimizer-d2h-h2d
--use-precision-aware-optimizer
)
CPU Adam offloads optimizer states to CPU memory, reducing GPU memory requirements for large MoE models.
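A back-of-envelope estimate shows why this matters at 30B scale. The sketch below assumes FP32 Adam moments (4 bytes each for the first and second moment), a common Megatron default; exact numbers depend on the precision-aware optimizer settings.

```python
# Rough size of Adam optimizer state for a ~30.5B-parameter model.
def adam_state_gb(num_params, bytes_per_moment=4, moments=2):
    # Two FP32 moment tensors (m and v) per parameter.
    return num_params * bytes_per_moment * moments / 1e9

print(adam_state_gb(30.5e9))  # ~244 GB held in CPU RAM instead of GPU HBM
```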
SGLang MoE Configuration
SGLANG_ARGS=(
--rollout-num-gpus-per-engine 8
--sglang-mem-fraction-static 0.7
--sglang-ep-size 8
--sglang-cuda-graph-bs 1 2 4 8 $(seq 16 8 256)
)
Optionally enable DP attention for better MoE inference:
SGLANG_ARGS=(
# ... other args ...
--sglang-enable-dp-attention
--sglang-dp-size 8
)
Run Training
cd /root/slime
bash scripts/run-qwen3-30B-A3B.sh
BF16 Training with FP8 Inference
slime supports training in BF16 precision while using FP8 quantization for inference, improving throughput without significantly impacting quality.
Download FP8 checkpoint
hf download Qwen/Qwen3-30B-A3B-FP8 --local-dir /root/Qwen3-30B-A3B-FP8
Update checkpoint path
Replace the --hf-checkpoint argument:
CKPT_ARGS=(
#--hf-checkpoint /root/Qwen3-30B-A3B
--hf-checkpoint /root/Qwen3-30B-A3B-FP8
--ref-load /root/Qwen3-30B-A3B_torch_dist
--load /root/Qwen3-30B-A3B_slime/
--save /root/Qwen3-30B-A3B_slime/
--save-interval 20
)
The Megatron checkpoint for training must still be converted from the BF16 Hugging Face model. Only the inference checkpoint uses FP8.
Currently, slime directly casts BF16 weights to FP8. Future versions will support more sophisticated quantization schemes with less precision impact.
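A simplified illustration of what a direct per-tensor cast involves. Real FP8 casting rounds onto the E4M3 value grid; the sketch below only shows the scale-and-clip step, which is where dynamic range is lost, and is not slime's actual implementation.

```python
E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def per_tensor_scale(weights):
    # Dequantization scale: fp8_value * scale ~= original value.
    amax = max(abs(w) for w in weights)
    return amax / E4M3_MAX

def fake_fp8_clip(weights, scale):
    # Scale into the E4M3 range and clip; real casts also round to the grid.
    return [max(-E4M3_MAX, min(E4M3_MAX, w / scale)) for w in weights]

w = [0.01, -2.5, 448.0]
s = per_tensor_scale(w)
q = fake_fp8_clip(w, s)
```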
Multi-Node Training
For multi-node deployments:
Shared storage
Place the model checkpoints and training data on shared storage accessible from all nodes.
Configure MASTER_ADDR
Set MASTER_ADDR to an address accessible by all nodes.
Remove CPU Adam
The distributed optimizer shards optimizer states across data-parallel ranks, reducing per-GPU memory usage and eliminating the need for CPU Adam:
OPTIMIZER_ARGS=(
--optimizer adam
--lr 1e-6
--lr-decay-style constant
--weight-decay 0.1
--adam-beta1 0.9
--adam-beta2 0.98
# CPU Adam configurations removed
)
Handling Non-Divisible Expert Counts
When the expert count is not evenly divisible by the total GPU count, use redundant experts. Example for 24 GPUs:
SGLANG_ARGS=(
--rollout-num-gpus-per-engine 24
--sglang-mem-fraction-static 0.7
--sglang-ep-size 24
--sglang-enable-dp-attention
--sglang-dp-size 3
--sglang-moe-dense-tp-size 1
--sglang-enable-dp-lm-head
--sglang-ep-num-redundant-experts 16
)
--sglang-ep-num-redundant-experts 16 adds redundant expert replicas to balance load across 24 GPUs.
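The arithmetic behind the 16 can be sketched as follows, assuming Qwen3-30B-A3B's 128 routed experts: pad the expert count up to the next multiple of the EP size.

```python
import math

# Redundant experts needed so every EP rank holds the same number of experts.
def redundant_experts(num_experts, ep_size):
    return math.ceil(num_experts / ep_size) * ep_size - num_experts

# 128 experts on 24 GPUs: pad to 144 total, i.e. 6 experts per GPU.
print(redundant_experts(128, 24))  # 16
```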
Rollout Configuration
ROLLOUT_ARGS=(
--prompt-data /root/dapo-math-17k/dapo-math-17k.jsonl
--input-key prompt
--label-key label
--apply-chat-template
--rollout-shuffle
--rm-type deepscaler
--num-rollout 3000
--rollout-batch-size 32
--n-samples-per-prompt 8
--rollout-max-response-len 8192
--rollout-temperature 1
--global-batch-size 256
--balance-data
)
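The rollout sizes above fit together as follows: each rollout step samples 32 prompts with 8 responses each, which matches the training global batch size. The one-optimizer-step-per-rollout relationship below is inferred from these numbers, not stated by slime's docs.

```python
# How the rollout configuration's sizes relate to each other.
rollout_batch_size = 32      # prompts sampled per rollout step
n_samples_per_prompt = 8     # responses generated per prompt
global_batch_size = 256      # samples consumed per optimizer step

samples_per_rollout = rollout_batch_size * n_samples_per_prompt
steps_per_rollout = samples_per_rollout // global_batch_size
print(samples_per_rollout, steps_per_rollout)  # 256 1
```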
GRPO Configuration
GRPO_ARGS=(
--advantage-estimator grpo
--use-kl-loss
--kl-loss-coef 0.00
--kl-loss-type low_var_kl
--entropy-coef 0.00
--eps-clip 0.2
--eps-clip-high 0.28
)
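A minimal sketch of what these flags mean: GRPO normalizes each prompt's rewards within its sample group to get advantages, and the two eps-clip values give an asymmetric PPO-style clip range. This is a simplification; real implementations also handle zero-variance groups and token-level masking.

```python
# Group-relative advantages: normalize rewards within one prompt's samples.
def grpo_advantages(rewards):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against identical rewards in a group
    return [(r - mean) / std for r in rewards]

# Asymmetric clip range from --eps-clip 0.2 and --eps-clip-high 0.28.
def clipped_ratio(ratio, eps_low=0.2, eps_high=0.28):
    return max(1 - eps_low, min(1 + eps_high, ratio))

adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])  # one prompt's 4 sampled rewards
```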
Evaluation
EVAL_ARGS=(
--eval-interval 20
--eval-prompt-data aime /root/aime-2024/aime-2024.jsonl
--n-samples-per-eval-prompt 16
--eval-max-response-len 16384
--eval-top-p 1
)
Miscellaneous Settings
MISC_ARGS=(
--attention-dropout 0.0
--hidden-dropout 0.0
--accumulate-allreduce-grads-in-fp32
--attention-softmax-in-fp32
--attention-backend flash
)