Overview
This example demonstrates training GLM-4.7-Flash, a 30B Mixture-of-Experts (MoE) model with 64 routed experts and 1 shared expert, on 8×H100 GPUs. The configuration uses CPU Adam to fit the model in limited GPU memory and optionally enables MTP speculative decoding for faster inference.
Model Specifications
- Model: GLM-4.7-Flash (THUDM/GLM-4.7-Flash)
- Architecture: Mixture-of-Experts (MoE)
- 64 routed experts (top-4 activation)
- 1 shared expert
- 47 layers: 1 dense + 46 MoE layers
- 1 MTP (Multi-Token Prediction) layer for speculative decoding
- Parameters: ~30 billion (4.7B active per token)
- Hardware: 8×H100 GPUs (single node)
- Optimization: CPU Adam for memory efficiency
Environment Setup
Initialize environment
Set up the slime environment (see the Qwen3-4B example for detailed instructions):
cd /root/
git clone https://github.com/THUDM/slime.git
cd slime/
pip install -e . --no-deps
Download model
hf download THUDM/GLM-4.7-Flash --local-dir /root/GLM-4.7-Flash
Download datasets
# Training data
hf download --repo-type dataset zhuzilin/dapo-math-17k \
--local-dir /root/dapo-math-17k
# Evaluation data
hf download --repo-type dataset zhuzilin/aime-2024 \
--local-dir /root/aime-2024
Convert checkpoint
Convert the Hugging Face checkpoint to torch_dist format:
cd /root/slime
pip install -e . --no-deps
source scripts/models/glm4.7-30B-A3B.sh
PYTHONPATH=/root/Megatron-LM/ torchrun --nproc-per-node 8 \
tools/convert_hf_to_torch_dist.py \
${MODEL_ARGS[@]} \
--hf-checkpoint /root/GLM-4.7-Flash/ \
--save /root/GLM-4.7-Flash_torch_dist/
Training Configuration
MoE Parallelism Settings
PERF_ARGS=(
--tensor-model-parallel-size 1
--pipeline-model-parallel-size 1
--context-parallel-size 1
--expert-model-parallel-size 8
--expert-tensor-parallel-size 1
--recompute-granularity full
--recompute-method uniform
--recompute-num-layers 1
--use-dynamic-batch-size
--max-tokens-per-gpu 8192
)
TP=1, EP=8 for single-node MoE training. Each GPU handles 8 experts (64 experts ÷ 8 GPUs).
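The expert layout only works when the routed expert count divides evenly across the expert-parallel ranks. A quick sketch of that arithmetic, using the values from this example:

```shell
# Sanity check: routed experts must divide evenly across expert-parallel ranks.
# Values from this example: 64 routed experts, expert-model-parallel-size 8.
NUM_EXPERTS=64
EP_SIZE=8
if (( NUM_EXPERTS % EP_SIZE != 0 )); then
  echo "error: EP size $EP_SIZE does not divide $NUM_EXPERTS experts" >&2
  exit 1
fi
echo "experts per GPU: $(( NUM_EXPERTS / EP_SIZE ))"
```

When the counts do not divide evenly, see the redundant-experts workaround later in this document.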
CPU Adam Optimization
To fit the model on 8×H100, enable CPU Adam to offload optimizer states:
OPTIMIZER_ARGS=(
--optimizer adam
--lr 1e-6
--lr-decay-style constant
--weight-decay 0.1
--adam-beta1 0.9
--adam-beta2 0.98
--optimizer-cpu-offload
--overlap-cpu-optimizer-d2h-h2d
--use-precision-aware-optimizer
)
CPU Adam significantly reduces GPU memory usage but requires substantial host memory. In multi-node setups, the distributed optimizer shards optimizer states across more GPUs, so CPU Adam can be disabled.
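As a back-of-envelope sketch of the host-memory requirement (assuming fp32 master weights plus two fp32 Adam moments, roughly 12 bytes per parameter; the actual footprint with --use-precision-aware-optimizer can be lower):

```shell
# Rough host-RAM estimate for CPU Adam offload (assumption: fp32 master
# weights + two fp32 Adam moments ~= 12 bytes per parameter; real usage
# varies with optimizer precision settings).
PARAMS_B=30        # total parameters, in billions
BYTES_PER_PARAM=12
echo "approx. optimizer host memory: $(( PARAMS_B * BYTES_PER_PARAM )) GB"
```

On a single node this means provisioning on the order of hundreds of GB of host RAM in addition to the GPU memory.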
SGLang MoE Configuration
SGLANG_ARGS=(
--rollout-num-gpus-per-engine 8
--sglang-mem-fraction-static 0.7
--sglang-enable-dp-attention
--sglang-dp-size 8
--sglang-enable-dp-lm-head
--sglang-moe-dense-tp-size 1
--sglang-cuda-graph-max-bs 16
--sglang-max-running-requests 64
)
SGLang uses DP attention with EP=8 for efficient MoE inference across all 8 GPUs.
MTP Speculative Decoding (Optional)
GLM-4.7-Flash includes an MTP layer for EAGLE-style speculative decoding to accelerate inference:
SGLANG_ARGS=(
# ... other args ...
# MTP speculative decoding (EAGLE)
--sglang-speculative-algorithm EAGLE
--sglang-speculative-num-steps 2
--sglang-speculative-eagle-topk 1
--sglang-speculative-num-draft-tokens 3
)
The MTP layer predicts multiple future tokens, which SGLang verifies in parallel for faster generation.
Speculative decoding requires additional GPU memory. If you encounter OOM errors, reduce --sglang-mem-fraction-static or disable speculative decoding.
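The three speculative parameters above are related. One consistency check, under the assumption that with speculative-eagle-topk 1 the draft tree degenerates to a chain so the number of draft tokens is the number of steps plus one (the root token plus one token per step):

```shell
# Consistency check for the EAGLE settings above (assumption: with
# speculative-eagle-topk 1, num_draft_tokens = num_steps + 1; SGLang's
# exact constraint may differ for larger top-k values).
STEPS=2
TOPK=1
DRAFT_TOKENS=3
if (( TOPK == 1 && DRAFT_TOKENS == STEPS * TOPK + 1 )); then
  echo "speculative settings are consistent"
else
  echo "warning: draft token count does not match steps/topk" >&2
fi
```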
Run Training
For single-node 8×H100:
cd /root/slime
bash scripts/run-glm4.7-30B-A3B-8gpus.sh
The script executes:
ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json="${RUNTIME_ENV_JSON}" \
-- python3 train.py \
--actor-num-nodes 1 \
--actor-num-gpus-per-node 8 \
--colocate \
${MODEL_ARGS[@]} \
${CKPT_ARGS[@]} \
${ROLLOUT_ARGS[@]} \
${OPTIMIZER_ARGS[@]} \
${GRPO_ARGS[@]} \
${PERF_ARGS[@]} \
${EVAL_ARGS[@]} \
${SGLANG_ARGS[@]} \
${MISC_ARGS[@]} \
${SPEC_ARGS[@]}
Multi-Node Training
For multi-node setups (e.g., 2×8 H100):
cd /root/slime
export BASE_DIR=/shared/path # accessible by all nodes
bash scripts/run-glm4.7-30B-A3B.sh
Place data on shared storage
Ensure model and data are on a path accessible by all nodes.
Set MASTER_ADDR
Set MASTER_ADDR to an address reachable by all nodes.
Remove CPU Adam
Remove the CPU Adam configurations:
OPTIMIZER_ARGS=(
--optimizer adam
--lr 1e-6
--lr-decay-style constant
--weight-decay 0.1
--adam-beta1 0.9
--adam-beta2 0.98
# CPU Adam removed - distributed optimizer reduces memory per GPU
)
Adjust parallelism
Example for 16 GPUs: TP=4, PP=2, EP=8, CP=2
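A sketch of how those numbers multiply out, assuming the usual decomposition world_size = TP × PP × CP × DP (EP then shards the 64 routed experts 8-ways within the expert-parallel groups):

```shell
# Arithmetic for the 16-GPU example (assumption: data-parallel size is
# world_size / (TP * PP * CP); exact group layout is Megatron-internal).
TP=4; PP=2; CP=2; GPUS=16
DP=$(( GPUS / (TP * PP * CP) ))
echo "data-parallel size: $DP"
```

Any combination you pick must multiply to the total GPU count, or the launcher will reject it.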
Handling Non-Divisible Expert Counts
When the 64 experts are not evenly divisible by the GPU count (e.g., 24 GPUs):
SGLANG_ARGS=(
--rollout-num-gpus-per-engine 24
--sglang-mem-fraction-static 0.7
--sglang-ep-size 24
--sglang-enable-dp-attention
--sglang-dp-size 3
--sglang-moe-dense-tp-size 1
--sglang-enable-dp-lm-head
--sglang-ep-num-redundant-experts 16
)
--sglang-ep-num-redundant-experts 16 adds redundant expert copies to distribute workload evenly across 24 GPUs.
MTP Training (Advanced)
slime supports training MTP layers jointly with the main model:
# Add MTP layer count to model config
MODEL_ARGS+=(--mtp-num-layers 1)
# Enable MTP training
SPEC_ARGS=(
--enable-mtp-training
--mtp-loss-scaling-factor 0.2
)
--mtp-num-layers 1: Loads the MTP layer from checkpoint
--enable-mtp-training: Enables gradient computation for MTP layers
--mtp-loss-scaling-factor 0.2: Weight of MTP loss relative to policy loss
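A sketch of how the scaling factor folds the MTP loss into the training objective (assumption: the total loss is policy_loss + factor × mtp_loss; the example loss values below are hypothetical, and slime's exact reduction may differ):

```shell
# Hypothetical loss values, only to illustrate the weighting
# (assumption: total = policy_loss + mtp_loss_scaling_factor * mtp_loss).
awk 'BEGIN {
  policy_loss = 1.00   # hypothetical policy-gradient loss
  mtp_loss    = 0.50   # hypothetical MTP prediction loss
  factor      = 0.2    # --mtp-loss-scaling-factor
  printf "total loss: %.2f\n", policy_loss + factor * mtp_loss
}'
```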
MTP training for GLM-4.7-Flash is not yet supported because the deepseek_v3 checkpoint bridge doesn't include MTP weight conversion (# TODO: mtp). You can still use MTP for speculative decoding during inference — SGLang handles MTP layers internally. For models with full MTP support (e.g., MiMo), see scripts/run-mimo-7B-rl-eagle.sh.
Rollout Configuration
ROLLOUT_ARGS=(
--prompt-data /root/dapo-math-17k/dapo-math-17k.jsonl
--input-key prompt
--label-key label
--apply-chat-template
--rollout-shuffle
--rm-type deepscaler
--num-rollout 3000
--rollout-batch-size 32
--n-samples-per-prompt 8
--rollout-max-response-len 8192
--rollout-temperature 1
--global-batch-size 256
--balance-data
)
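The batch sizes above are chosen so that each rollout step produces exactly one global training batch. A sketch of that arithmetic (assuming samples per rollout = rollout-batch-size × n-samples-per-prompt, which should be a multiple of global-batch-size):

```shell
# Consistency check for the rollout configuration above (assumption:
# each rollout step should fill a whole number of optimizer steps).
ROLLOUT_BATCH_SIZE=32
N_SAMPLES_PER_PROMPT=8
GLOBAL_BATCH_SIZE=256
SAMPLES=$(( ROLLOUT_BATCH_SIZE * N_SAMPLES_PER_PROMPT ))
echo "samples per rollout step: $SAMPLES"
if (( SAMPLES % GLOBAL_BATCH_SIZE == 0 )); then
  echo "each rollout fills $(( SAMPLES / GLOBAL_BATCH_SIZE )) optimizer step(s)"
fi
```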
GRPO Configuration
GRPO_ARGS=(
--advantage-estimator grpo
--use-kl-loss
--kl-loss-coef 0.00
--kl-loss-type low_var_kl
--entropy-coef 0.00
--eps-clip 0.2
--eps-clip-high 0.28
)
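The grpo advantage estimator normalizes rewards within each prompt's group of samples. A sketch of the standard GRPO normalization, (r − mean) / std over the 8 samples of one prompt (slime's exact implementation, e.g. epsilon handling in the denominator, may differ):

```shell
# Group-relative advantage sketch: 8 binary rewards for one prompt
# (n-samples-per-prompt 8), normalized by the group mean and std.
echo "1 0 1 1 0 1 1 1" | awk '{
  n = NF; sum = 0
  for (i = 1; i <= n; i++) sum += $i
  mean = sum / n
  var = 0
  for (i = 1; i <= n; i++) var += ($i - mean) ^ 2
  std = sqrt(var / n)
  for (i = 1; i <= n; i++) printf "%.3f ", ($i - mean) / std
  print ""
}'
```

Correct samples in the group get a positive advantage and incorrect ones a negative advantage, with no learned value model required.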
Evaluation
EVAL_ARGS=(
--eval-interval 20
--eval-prompt-data aime /root/aime-2024/aime-2024.jsonl
--n-samples-per-eval-prompt 16
--eval-max-response-len 16384
--eval-top-p 1
)
Miscellaneous Settings
MISC_ARGS=(
--attention-dropout 0.0
--hidden-dropout 0.0
--accumulate-allreduce-grads-in-fp32
--attention-softmax-in-fp32
--attention-backend flash
--moe-token-dispatcher-type alltoall
)
Reference