
Overview

This example demonstrates training Qwen3-30B-A3B, a Mixture-of-Experts (MoE) model, on 8×H100 GPUs. The configuration uses CPU Adam to reduce GPU memory pressure and supports BF16 training with FP8 inference for higher rollout throughput.

Model Specifications

  • Model: Qwen3-30B-A3B
  • Architecture: Mixture-of-Experts (MoE)
  • Hardware: 8×H100 or H800 GPUs
  • Optimization: CPU Adam for memory efficiency
  • Inference: Optional FP8 quantization

Dataset

Training uses the zhuzilin/dapo-math-17k math dataset; evaluation uses zhuzilin/aime-2024. Both are downloaded during setup below.
Environment Setup

Environment setup, model download, and checkpoint conversion follow the same steps as Qwen3-4B, replacing “Qwen3-4B” with “Qwen3-30B-A3B”.
1. Initialize environment

cd /root/
git clone https://github.com/THUDM/slime.git
cd slime/
pip install -e . --no-deps
2. Download model and data

# Model checkpoint
hf download Qwen/Qwen3-30B-A3B --local-dir /root/Qwen3-30B-A3B

# Training data
hf download --repo-type dataset zhuzilin/dapo-math-17k \
  --local-dir /root/dapo-math-17k

# Evaluation data
hf download --repo-type dataset zhuzilin/aime-2024 \
  --local-dir /root/aime-2024
3. Convert checkpoint

cd slime/
pip install -e . --no-deps
source scripts/models/qwen3-30B-A3B.sh
PYTHONPATH=/root/Megatron-LM/ torchrun --nproc-per-node 8 \
   tools/convert_hf_to_torch_dist.py \
   ${MODEL_ARGS[@]} \
   --hf-checkpoint /root/Qwen3-30B-A3B/ \
   --save /root/Qwen3-30B-A3B_torch_dist/

Training Configuration

MoE Parallelism Settings

PERF_ARGS=(
   --tensor-model-parallel-size 4
   --sequence-parallel
   --pipeline-model-parallel-size 1
   --context-parallel-size 1
   --expert-model-parallel-size 8
   --expert-tensor-parallel-size 1

   --recompute-granularity full
   --recompute-method uniform
   --recompute-num-layers 1

   --use-dynamic-batch-size
   --max-tokens-per-gpu 8192
)
The run uses TP=4 and EP=8 for MoE training: the experts are sharded across all 8 GPUs, while tensor parallelism splits the dense layers across groups of 4 GPUs.
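As a rough sketch of how these settings carve up the 8 GPUs (the 128 routed experts is an assumption about Qwen3-30B-A3B's architecture; check the model config if in doubt):

```python
# Sketch of how the parallel dimensions divide the 8 GPUs.
# num_experts = 128 is assumed for Qwen3-30B-A3B.
world_size = 8
tp = 4            # --tensor-model-parallel-size
ep = 8            # --expert-model-parallel-size
num_experts = 128

dp = world_size // tp              # data-parallel degree for dense layers
experts_per_gpu = num_experts // ep

print(f"dense layers: TP={tp} x DP={dp}")
print(f"MoE layers:   {experts_per_gpu} experts per GPU (EP={ep})")
```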

CPU Adam Optimization

OPTIMIZER_ARGS=(
   --optimizer adam
   --lr 1e-6
   --lr-decay-style constant
   --weight-decay 0.1
   --adam-beta1 0.9
   --adam-beta2 0.98

   --optimizer-cpu-offload
   --overlap-cpu-optimizer-d2h-h2d
   --use-precision-aware-optimizer
)
CPU Adam offloads optimizer states to CPU memory, reducing GPU memory requirements for large MoE models.
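A back-of-envelope estimate of why this matters for a ~30B-parameter model (the 12 bytes/param figure assumes fp32 master weights plus two fp32 Adam moments; the parameter count is approximate):

```python
# Rough estimate of Adam optimizer-state memory for a ~30B model:
# fp32 master weights + momentum + variance = 12 bytes per parameter.
params = 30.5e9               # approximate parameter count (assumption)
bytes_per_param = 4 + 4 + 4   # master copy, momentum, variance (all fp32)

total_gb = params * bytes_per_param / 1e9
per_gpu_gb = total_gb / 8     # even sharded across all 8 GPUs

print(f"optimizer states: ~{total_gb:.0f} GB total, ~{per_gpu_gb:.0f} GB per GPU")
```

Even sharded eight ways, that state alone would consume most of an 80 GB H100, which is why it is offloaded to host memory here.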

SGLang MoE Configuration

SGLANG_ARGS=(
   --rollout-num-gpus-per-engine 8
   --sglang-mem-fraction-static 0.7
   --sglang-ep-size 8
   --sglang-cuda-graph-bs 1 2 4 8 $(seq 16 8 256)
)
Optionally enable DP attention for better MoE inference:
SGLANG_ARGS=(
   # ... other args ...
   --sglang-enable-dp-attention
   --sglang-dp-size 8
)

Run Training

cd /root/slime
bash scripts/run-qwen3-30B-A3B.sh

BF16 Training with FP8 Inference

slime supports training in BF16 precision while using FP8 quantization for inference, improving throughput without significantly impacting quality.
1. Download FP8 checkpoint

hf download Qwen/Qwen3-30B-A3B-FP8 --local-dir /root/Qwen3-30B-A3B-FP8
2. Update checkpoint path

Replace the --hf-checkpoint argument:
CKPT_ARGS=(
   #--hf-checkpoint /root/Qwen3-30B-A3B
   --hf-checkpoint /root/Qwen3-30B-A3B-FP8
   --ref-load /root/Qwen3-30B-A3B_torch_dist
   --load /root/Qwen3-30B-A3B_slime/
   --save /root/Qwen3-30B-A3B_slime/
   --save-interval 20
)
The Megatron checkpoint for training must still be converted from the BF16 Hugging Face model. Only the inference checkpoint uses FP8.
Currently, slime directly casts BF16 weights to FP8. Future versions will support more sophisticated quantization schemes with less precision impact.
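To build intuition for what a direct cast loses, here is a simplified pure-Python model of round-to-nearest FP8 E4M3 quantization. It is not slime's code, and it ignores the per-tensor scaling factors that real FP8 checkpoints carry:

```python
import math

def quantize_e4m3(x: float) -> float:
    """Round x to the nearest FP8 E4M3 value (saturating, simplified).

    E4M3: 3 mantissa bits, normal exponents down to 2**-6, max value 448.
    Real FP8 checkpoints also use per-tensor scales, omitted here.
    """
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    a = abs(x)
    if a >= 448.0:
        return sign * 448.0          # saturate instead of overflowing
    e = max(math.floor(math.log2(a)), -6)
    step = 2.0 ** (e - 3)            # spacing between representable values
    return sign * round(a / step) * step

# Direct casting collapses nearby weights onto the same grid point:
print(quantize_e4m3(0.3))    # 0.3125
print(quantize_e4m3(0.31))   # 0.3125
```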

Multi-Node Training

For multi-node deployments:
1. Shared storage

Place the model checkpoints and training data on storage accessible by all nodes.
2. Configure MASTER_ADDR

Set MASTER_ADDR to an address accessible by all nodes.
3. Remove CPU Adam

Across multiple nodes, the distributed optimizer shards optimizer states over more data-parallel ranks, reducing per-GPU memory usage enough to eliminate the need for CPU Adam:
OPTIMIZER_ARGS=(
   --optimizer adam
   --lr 1e-6
   --lr-decay-style constant
   --weight-decay 0.1
   --adam-beta1 0.9
   --adam-beta2 0.98
   # CPU Adam configurations removed
)
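Roughly why the offload becomes unnecessary (illustrative numbers only: the parameter count is an assumption, and expert-parallel sharding of expert weights reduces real per-GPU usage well below this):

```python
# With Megatron's distributed optimizer, the ~12 bytes/param of fp32 Adam
# state is sharded across data-parallel ranks, so the per-GPU share
# shrinks as nodes are added (8 GPUs per node, TP=4 as configured above).
params = 30.5e9                  # approximate parameter count (assumption)
state_gb = params * 12 / 1e9     # total fp32 optimizer state

for nodes in (1, 2, 4):
    dp = nodes * 8 // 4          # data-parallel size at TP=4
    print(f"{nodes} node(s): dp={dp}, ~{state_gb / dp:.0f} GB state per GPU")
```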

Handling Non-Divisible Expert Counts

When the expert count is not evenly divisible by the total number of GPUs, use redundant experts. Example for 24 GPUs:
SGLANG_ARGS=(
   --rollout-num-gpus-per-engine 24
   --sglang-mem-fraction-static 0.7
   --sglang-ep-size 24
   --sglang-enable-dp-attention
   --sglang-dp-size 3

   --sglang-moe-dense-tp-size 1
   --sglang-enable-dp-lm-head
   --sglang-ep-num-redundant-experts 16
)
--sglang-ep-num-redundant-experts 16 adds 16 redundant expert replicas, so the total number of expert slots divides evenly across the 24 GPUs and the load stays balanced.
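A quick sanity check of the arithmetic, assuming 128 routed experts for Qwen3-30B-A3B:

```python
# 128 experts do not divide evenly across 24 GPUs; padding with 16
# redundant replicas yields 144 slots, or exactly 6 per GPU.
num_experts = 128   # assumed routed-expert count for Qwen3-30B-A3B
num_gpus = 24
redundant = 16

total_slots = num_experts + redundant
assert total_slots % num_gpus == 0
print(f"{total_slots} expert slots -> {total_slots // num_gpus} per GPU")
```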

Rollout Configuration

ROLLOUT_ARGS=(
   --prompt-data /root/dapo-math-17k/dapo-math-17k.jsonl
   --input-key prompt
   --label-key label
   --apply-chat-template
   --rollout-shuffle
   --rm-type deepscaler
   --num-rollout 3000
   --rollout-batch-size 32
   --n-samples-per-prompt 8
   --rollout-max-response-len 8192
   --rollout-temperature 1

   --global-batch-size 256
   --balance-data
)
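The batch sizes above are chosen to line up; a quick consistency check:

```python
# Each rollout step generates rollout_batch_size prompts, each sampled
# n_samples_per_prompt times; the result should be a multiple of (here,
# exactly) the training global batch size.
rollout_batch_size = 32
n_samples_per_prompt = 8
global_batch_size = 256

samples_per_rollout = rollout_batch_size * n_samples_per_prompt
print(samples_per_rollout)                       # 256
assert samples_per_rollout % global_batch_size == 0
```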

GRPO Configuration

GRPO_ARGS=(
   --advantage-estimator grpo
   --use-kl-loss
   --kl-loss-coef 0.00
   --kl-loss-type low_var_kl
   --entropy-coef 0.00
   --eps-clip 0.2
   --eps-clip-high 0.28
)
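The two clip values define an asymmetric trust region on the importance ratio. A minimal sketch of that clipping (this is not slime's actual loss code; the KL term and batching are omitted):

```python
def clipped_policy_objective(ratio: float, advantage: float,
                             eps_clip: float = 0.2,
                             eps_clip_high: float = 0.28) -> float:
    """PPO-style surrogate with an asymmetric ("clip-higher") range.

    With --eps-clip 0.2 and --eps-clip-high 0.28 the ratio is clipped to
    [0.8, 1.28]: the wider upper bound lets positive-advantage tokens
    raise their probability further, as popularized by DAPO.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_clip), 1.0 + eps_clip_high)
    # Pessimistic min over the clipped and unclipped surrogate, as in PPO.
    return min(ratio * advantage, clipped_ratio * advantage)

print(clipped_policy_objective(1.5, 1.0))    # capped at the 1.28 ceiling
print(clipped_policy_objective(0.5, -1.0))   # floored at the 0.8 floor
```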

Evaluation

EVAL_ARGS=(
   --eval-interval 20
   --eval-prompt-data aime /root/aime-2024/aime-2024.jsonl
   --n-samples-per-eval-prompt 16
   --eval-max-response-len 16384
   --eval-top-p 1
)

Miscellaneous Settings

MISC_ARGS=(
   --attention-dropout 0.0
   --hidden-dropout 0.0
   --accumulate-allreduce-grads-in-fp32
   --attention-softmax-in-fp32
   --attention-backend flash
)
