
Overview

This example demonstrates training Qwen3-30B-A3B, a Mixture-of-Experts (MoE) model, on 8×H100 GPUs. The configuration uses CPU Adam to reduce GPU memory pressure and supports BF16 training with FP8 inference for higher rollout throughput.

Model Specifications

  • Model: Qwen3-30B-A3B
  • Architecture: Mixture-of-Experts (MoE)
  • Hardware: 8×H100 or H800 GPUs
  • Optimization: CPU Adam for memory efficiency
  • Inference: Optional FP8 quantization

Dataset

Training uses the zhuzilin/dapo-math-17k math dataset; evaluation uses zhuzilin/aime-2024. Both are downloaded during setup below.
Environment Setup

Environment setup, model download, and checkpoint conversion follow the same steps as Qwen3-4B, replacing “Qwen3-4B” with “Qwen3-30B-A3B”.
1. Initialize environment

cd /root/
git clone https://github.com/THUDM/slime.git
cd slime/
pip install -e . --no-deps
2. Download model and data

# Model checkpoint
hf download Qwen/Qwen3-30B-A3B --local-dir /root/Qwen3-30B-A3B

# Training data
hf download --repo-type dataset zhuzilin/dapo-math-17k \
  --local-dir /root/dapo-math-17k

# Evaluation data
hf download --repo-type dataset zhuzilin/aime-2024 \
  --local-dir /root/aime-2024
3. Convert checkpoint

cd slime/
pip install -e . --no-deps
source scripts/models/qwen3-30B-A3B.sh
PYTHONPATH=/root/Megatron-LM/ torchrun --nproc-per-node 8 \
   tools/convert_hf_to_torch_dist.py \
   ${MODEL_ARGS[@]} \
   --hf-checkpoint /root/Qwen3-30B-A3B/ \
   --save /root/Qwen3-30B-A3B_torch_dist/

Training Configuration

MoE Parallelism Settings

PERF_ARGS=(
   --tensor-model-parallel-size 4
   --sequence-parallel
   --pipeline-model-parallel-size 1
   --context-parallel-size 1
   --expert-model-parallel-size 8
   --expert-tensor-parallel-size 1

   --recompute-granularity full
   --recompute-method uniform
   --recompute-num-layers 1

   --use-dynamic-batch-size
   --max-tokens-per-gpu 8192
)
The run uses TP=4 and EP=8 for MoE training: the experts are sharded across all 8 GPUs, while tensor parallelism splits the dense layers across groups of 4 GPUs.
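As a rough sketch of how these settings carve up the 8 GPUs (the 128 routed experts is an assumption about Qwen3-30B-A3B's architecture; check the model config if in doubt):

```python
# Sketch of how the parallel dimensions divide the 8 GPUs.
# num_experts = 128 is assumed for Qwen3-30B-A3B.
world_size = 8
tp = 4            # --tensor-model-parallel-size
ep = 8            # --expert-model-parallel-size
num_experts = 128

dp = world_size // tp              # data-parallel degree for dense layers
experts_per_gpu = num_experts // ep

print(f"dense layers: TP={tp} x DP={dp}")
print(f"MoE layers:   {experts_per_gpu} experts per GPU (EP={ep})")
```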

CPU Adam Optimization

OPTIMIZER_ARGS=(
   --optimizer adam
   --lr 1e-6
   --lr-decay-style constant
   --weight-decay 0.1
   --adam-beta1 0.9
   --adam-beta2 0.98

   --optimizer-cpu-offload
   --overlap-cpu-optimizer-d2h-h2d
   --use-precision-aware-optimizer
)
CPU Adam offloads optimizer states to CPU memory, reducing GPU memory requirements for large MoE models.
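A back-of-envelope estimate of why this matters for a ~30B-parameter model (the 12 bytes/param figure assumes fp32 master weights plus two fp32 Adam moments; the parameter count is approximate):

```python
# Rough estimate of Adam optimizer-state memory for a ~30B model:
# fp32 master weights + momentum + variance = 12 bytes per parameter.
params = 30.5e9               # approximate parameter count (assumption)
bytes_per_param = 4 + 4 + 4   # master copy, momentum, variance (all fp32)

total_gb = params * bytes_per_param / 1e9
per_gpu_gb = total_gb / 8     # even sharded across all 8 GPUs

print(f"optimizer states: ~{total_gb:.0f} GB total, ~{per_gpu_gb:.0f} GB per GPU")
```

Even sharded eight ways, that state alone would consume most of an 80 GB H100, which is why it is offloaded to host memory here.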

SGLang MoE Configuration

SGLANG_ARGS=(
   --rollout-num-gpus-per-engine 8
   --sglang-mem-fraction-static 0.7
   --sglang-ep-size 8
   --sglang-cuda-graph-bs 1 2 4 8 $(seq 16 8 256)
)
Optionally enable DP attention for better MoE inference:
SGLANG_ARGS=(
   # ... other args ...
   --sglang-enable-dp-attention
   --sglang-dp-size 8
)

Run Training

cd /root/slime
bash scripts/run-qwen3-30B-A3B.sh

BF16 Training with FP8 Inference

slime supports training in BF16 precision while using FP8 quantization for inference, improving throughput without significantly impacting quality.
1. Download FP8 checkpoint

hf download Qwen/Qwen3-30B-A3B-FP8 --local-dir /root/Qwen3-30B-A3B-FP8
2. Update checkpoint path

Replace the --hf-checkpoint argument:
CKPT_ARGS=(
   #--hf-checkpoint /root/Qwen3-30B-A3B
   --hf-checkpoint /root/Qwen3-30B-A3B-FP8
   --ref-load /root/Qwen3-30B-A3B_torch_dist
   --load /root/Qwen3-30B-A3B_slime/
   --save /root/Qwen3-30B-A3B_slime/
   --save-interval 20
)
The Megatron checkpoint for training must still be converted from the BF16 Hugging Face model. Only the inference checkpoint uses FP8.
Currently, slime directly casts BF16 weights to FP8. Future versions will support more sophisticated quantization schemes with less precision impact.
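To build intuition for what a direct cast loses, here is a simplified pure-Python model of round-to-nearest FP8 E4M3 quantization. It is not slime's code, and it ignores the per-tensor scaling factors that real FP8 checkpoints carry:

```python
import math

def quantize_e4m3(x: float) -> float:
    """Round x to the nearest FP8 E4M3 value (saturating, simplified).

    E4M3: 3 mantissa bits, normal exponents down to 2**-6, max value 448.
    Real FP8 checkpoints also use per-tensor scales, omitted here.
    """
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    a = abs(x)
    if a >= 448.0:
        return sign * 448.0          # saturate instead of overflowing
    e = max(math.floor(math.log2(a)), -6)
    step = 2.0 ** (e - 3)            # spacing between representable values
    return sign * round(a / step) * step

# Direct casting collapses nearby weights onto the same grid point:
print(quantize_e4m3(0.3))    # 0.3125
print(quantize_e4m3(0.31))   # 0.3125
```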

Multi-Node Training

For multi-node deployments:
1. Shared storage

Place the model checkpoints and training data on storage accessible by all nodes.
2. Configure MASTER_ADDR

Set MASTER_ADDR to an address accessible by all nodes.
3. Remove CPU Adam

Across multiple nodes, the distributed optimizer shards optimizer states over more data-parallel ranks, reducing per-GPU memory usage enough to eliminate the need for CPU Adam:
OPTIMIZER_ARGS=(
   --optimizer adam
   --lr 1e-6
   --lr-decay-style constant
   --weight-decay 0.1
   --adam-beta1 0.9
   --adam-beta2 0.98
   # CPU Adam configurations removed
)
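Roughly why the offload becomes unnecessary (illustrative numbers only: the parameter count is an assumption, and expert-parallel sharding of expert weights reduces real per-GPU usage well below this):

```python
# With Megatron's distributed optimizer, the ~12 bytes/param of fp32 Adam
# state is sharded across data-parallel ranks, so the per-GPU share
# shrinks as nodes are added (8 GPUs per node, TP=4 as configured above).
params = 30.5e9                  # approximate parameter count (assumption)
state_gb = params * 12 / 1e9     # total fp32 optimizer state

for nodes in (1, 2, 4):
    dp = nodes * 8 // 4          # data-parallel size at TP=4
    print(f"{nodes} node(s): dp={dp}, ~{state_gb / dp:.0f} GB state per GPU")
```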

Handling Non-Divisible Expert Counts

When the expert count is not evenly divisible by the total number of GPUs, use redundant experts. Example for 24 GPUs:
SGLANG_ARGS=(
   --rollout-num-gpus-per-engine 24
   --sglang-mem-fraction-static 0.7
   --sglang-ep-size 24
   --sglang-enable-dp-attention
   --sglang-dp-size 3

   --sglang-moe-dense-tp-size 1
   --sglang-enable-dp-lm-head
   --sglang-ep-num-redundant-experts 16
)
--sglang-ep-num-redundant-experts 16 adds 16 redundant expert replicas, so the total number of expert slots divides evenly across the 24 GPUs and the load stays balanced.
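A quick sanity check of the arithmetic, assuming 128 routed experts for Qwen3-30B-A3B:

```python
# 128 experts do not divide evenly across 24 GPUs; padding with 16
# redundant replicas yields 144 slots, or exactly 6 per GPU.
num_experts = 128   # assumed routed-expert count for Qwen3-30B-A3B
num_gpus = 24
redundant = 16

total_slots = num_experts + redundant
assert total_slots % num_gpus == 0
print(f"{total_slots} expert slots -> {total_slots // num_gpus} per GPU")
```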

Rollout Configuration

ROLLOUT_ARGS=(
   --prompt-data /root/dapo-math-17k/dapo-math-17k.jsonl
   --input-key prompt
   --label-key label
   --apply-chat-template
   --rollout-shuffle
   --rm-type deepscaler
   --num-rollout 3000
   --rollout-batch-size 32
   --n-samples-per-prompt 8
   --rollout-max-response-len 8192
   --rollout-temperature 1

   --global-batch-size 256
   --balance-data
)
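The batch sizes above are chosen to line up; a quick consistency check:

```python
# Each rollout step generates rollout_batch_size prompts, each sampled
# n_samples_per_prompt times; the result should be a multiple of (here,
# exactly) the training global batch size.
rollout_batch_size = 32
n_samples_per_prompt = 8
global_batch_size = 256

samples_per_rollout = rollout_batch_size * n_samples_per_prompt
print(samples_per_rollout)                       # 256
assert samples_per_rollout % global_batch_size == 0
```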

GRPO Configuration

GRPO_ARGS=(
   --advantage-estimator grpo
   --use-kl-loss
   --kl-loss-coef 0.00
   --kl-loss-type low_var_kl
   --entropy-coef 0.00
   --eps-clip 0.2
   --eps-clip-high 0.28
)
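The two clip values define an asymmetric trust region on the importance ratio. A minimal sketch of that clipping (this is not slime's actual loss code; the KL term and batching are omitted):

```python
def clipped_policy_objective(ratio: float, advantage: float,
                             eps_clip: float = 0.2,
                             eps_clip_high: float = 0.28) -> float:
    """PPO-style surrogate with an asymmetric ("clip-higher") range.

    With --eps-clip 0.2 and --eps-clip-high 0.28 the ratio is clipped to
    [0.8, 1.28]: the wider upper bound lets positive-advantage tokens
    raise their probability further, as popularized by DAPO.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_clip), 1.0 + eps_clip_high)
    # Pessimistic min over the clipped and unclipped surrogate, as in PPO.
    return min(ratio * advantage, clipped_ratio * advantage)

print(clipped_policy_objective(1.5, 1.0))    # capped at the 1.28 ceiling
print(clipped_policy_objective(0.5, -1.0))   # floored at the 0.8 floor
```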

Evaluation

EVAL_ARGS=(
   --eval-interval 20
   --eval-prompt-data aime /root/aime-2024/aime-2024.jsonl
   --n-samples-per-eval-prompt 16
   --eval-max-response-len 16384
   --eval-top-p 1
)

Miscellaneous Settings

MISC_ARGS=(
   --attention-dropout 0.0
   --hidden-dropout 0.0
   --accumulate-allreduce-grads-in-fp32
   --attention-softmax-in-fp32
   --attention-backend flash
)
