
Overview

This example demonstrates training GLM-4.7-Flash, a 30B Mixture-of-Experts (MoE) model with 64 routed experts and 1 shared expert, on 8×H100 GPUs. The configuration uses CPU Adam to fit the model in limited GPU memory and optionally enables MTP speculative decoding for faster inference.

Model Specifications

  • Model: GLM-4.7-Flash (THUDM/GLM-4.7-Flash)
  • Architecture: Mixture-of-Experts (MoE)
    • 64 routed experts (top-4 activation)
    • 1 shared expert
    • 47 layers: 1 dense + 46 MoE layers
    • 1 MTP (Multi-Token Prediction) layer for speculative decoding
  • Parameters: ~30 billion (4.7B active per token)
  • Hardware: 8×H100 GPUs (single node)
  • Optimization: CPU Adam for memory efficiency

Dataset

Training prompts come from the zhuzilin/dapo-math-17k math dataset; evaluation uses zhuzilin/aime-2024 (AIME 2024 problems). Both are downloaded in the setup steps below.

Environment Setup

Step 1: Initialize environment

Set up the slime environment (see Qwen3-4B example for detailed instructions):
cd /root/
git clone https://github.com/THUDM/slime.git
cd slime/
pip install -e . --no-deps

Step 2: Download model

hf download THUDM/GLM-4.7-Flash --local-dir /root/GLM-4.7-Flash

Step 3: Download datasets

# Training data
hf download --repo-type dataset zhuzilin/dapo-math-17k \
  --local-dir /root/dapo-math-17k

# Evaluation data
hf download --repo-type dataset zhuzilin/aime-2024 \
  --local-dir /root/aime-2024

Step 4: Convert checkpoint

Convert the Hugging Face checkpoint to torch_dist format:
cd /root/slime
pip install -e . --no-deps
source scripts/models/glm4.7-30B-A3B.sh
PYTHONPATH=/root/Megatron-LM/ torchrun --nproc-per-node 8 \
   tools/convert_hf_to_torch_dist.py \
   ${MODEL_ARGS[@]} \
   --hf-checkpoint /root/GLM-4.7-Flash/ \
   --save /root/GLM-4.7-Flash_torch_dist/

Training Configuration

MoE Parallelism Settings

PERF_ARGS=(
   --tensor-model-parallel-size 1
   --pipeline-model-parallel-size 1
   --context-parallel-size 1
   --expert-model-parallel-size 8
   --expert-tensor-parallel-size 1

   --recompute-granularity full
   --recompute-method uniform
   --recompute-num-layers 1

   --use-dynamic-batch-size
   --max-tokens-per-gpu 8192
)
The configuration uses TP=1 and EP=8 for single-node MoE training; each GPU hosts 8 of the routed experts (64 experts ÷ 8 GPUs).
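
As a rough illustration of how EP=8 partitions the routed experts (a sketch only; Megatron-LM's actual expert placement is internal to the framework), contiguous blocks of experts can be assigned per expert-parallel rank:

```python
NUM_EXPERTS = 64
EP_SIZE = 8  # --expert-model-parallel-size

experts_per_rank = NUM_EXPERTS // EP_SIZE  # 8 experts on each GPU

# Contiguous-block assignment: rank r owns experts [r*8, (r+1)*8)
placement = {
    rank: list(range(rank * experts_per_rank, (rank + 1) * experts_per_rank))
    for rank in range(EP_SIZE)
}

# Every rank holds exactly 8 experts, covering all 64 with no overlap
assert sorted(e for ranks in placement.values() for e in ranks) == list(range(64))
```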

CPU Adam Optimization

To fit the model on 8×H100, enable CPU Adam to offload optimizer states:
OPTIMIZER_ARGS=(
   --optimizer adam
   --lr 1e-6
   --lr-decay-style constant
   --weight-decay 0.1
   --adam-beta1 0.9
   --adam-beta2 0.98

   --optimizer-cpu-offload
   --overlap-cpu-optimizer-d2h-h2d
   --use-precision-aware-optimizer
)
CPU Adam significantly reduces GPU memory but requires more host memory. For multi-node setups with distributed optimizer, CPU Adam can be disabled.
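
A back-of-envelope estimate shows why the optimizer states are offloaded. Assuming plain fp32 Adam states (fp32 master weights plus two fp32 moments per parameter; --use-precision-aware-optimizer can reduce this), a ~30B-parameter model needs on the order of hundreds of GiB for optimizer state alone:

```python
params = 30e9            # ~30B parameters
bytes_master = 4         # fp32 master weights
bytes_moments = 4 + 4    # fp32 exp_avg + fp32 exp_avg_sq (Adam moments)

optimizer_bytes = params * (bytes_master + bytes_moments)
optimizer_gb = optimizer_bytes / 1024**3
print(f"~{optimizer_gb:.0f} GiB of optimizer state")  # ~335 GiB
```

This far exceeds the combined 640 GB of 8×H100, once weights, gradients, and activations are also accounted for, which is why the states go to host memory on a single node.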

SGLang MoE Configuration

SGLANG_ARGS=(
   --rollout-num-gpus-per-engine 8
   --sglang-mem-fraction-static 0.7
   --sglang-enable-dp-attention
   --sglang-dp-size 8
   --sglang-enable-dp-lm-head
   --sglang-moe-dense-tp-size 1

   --sglang-cuda-graph-max-bs 16
   --sglang-max-running-requests 64
)
SGLang uses DP attention with EP=8 for efficient MoE inference across all 8 GPUs.

MTP Speculative Decoding (Optional)

GLM-4.7-Flash includes an MTP layer for EAGLE-style speculative decoding to accelerate inference:
SGLANG_ARGS=(
   # ... other args ...
   
   # MTP speculative decoding (EAGLE)
   --sglang-speculative-algorithm EAGLE
   --sglang-speculative-num-steps 2
   --sglang-speculative-eagle-topk 1
   --sglang-speculative-num-draft-tokens 3
)
The MTP layer predicts multiple future tokens, which SGLang verifies in parallel for faster generation.
Speculative decoding requires additional GPU memory. If you encounter OOM errors, reduce --sglang-mem-fraction-static or disable speculative decoding.
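
To see why these settings can accelerate decoding: with 2 draft steps the verifier scores 3 tokens per forward pass (2 drafted plus 1 bonus token, matching --sglang-speculative-num-draft-tokens 3). Under a simplified chain-draft model with per-token acceptance probability α (a standard speculative-decoding estimate, not a measurement on this model), the expected tokens emitted per target forward pass is 1 + α + α²:

```python
def expected_tokens_per_pass(alpha: float, num_steps: int) -> float:
    """Expected tokens emitted per target-model forward pass, assuming
    each drafted token is accepted independently with probability alpha
    (chain draft, eagle-topk = 1)."""
    # 1 guaranteed token from verification, plus alpha^k per draft step k
    return 1.0 + sum(alpha ** k for k in range(1, num_steps + 1))

# With 2 draft steps and an 80% acceptance rate: 1 + 0.8 + 0.64 = 2.44
speedup_proxy = expected_tokens_per_pass(0.8, 2)
```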

Run Training

For single-node 8×H100:
cd /root/slime
bash scripts/run-glm4.7-30B-A3B-8gpus.sh
The script executes:
ray job submit --address="http://127.0.0.1:8265" \
   --runtime-env-json="${RUNTIME_ENV_JSON}" \
   -- python3 train.py \
   --actor-num-nodes 1 \
   --actor-num-gpus-per-node 8 \
   --colocate \
   ${MODEL_ARGS[@]} \
   ${CKPT_ARGS[@]} \
   ${ROLLOUT_ARGS[@]} \
   ${OPTIMIZER_ARGS[@]} \
   ${GRPO_ARGS[@]} \
   ${PERF_ARGS[@]} \
   ${EVAL_ARGS[@]} \
   ${SGLANG_ARGS[@]} \
   ${MISC_ARGS[@]} \
   ${SPEC_ARGS[@]}

Multi-Node Training

For multi-node setups (e.g., 2×8 H100):
cd /root/slime
export BASE_DIR=/shared/path  # accessible by all nodes
bash scripts/run-glm4.7-30B-A3B.sh

Step 1: Place data on shared storage

Ensure model and data are on a path accessible by all nodes.

Step 2: Set MASTER_ADDR

Set MASTER_ADDR to an address reachable by all nodes.

Step 3: Remove CPU Adam

Remove CPU Adam configurations:
OPTIMIZER_ARGS=(
   --optimizer adam
   --lr 1e-6
   --lr-decay-style constant
   --weight-decay 0.1
   --adam-beta1 0.9
   --adam-beta2 0.98
   # CPU Adam removed - distributed optimizer reduces memory per GPU
)

Step 4: Adjust parallelism

Example for 16 GPUs: TP=4, PP=2, EP=8, CP=2
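
A quick sanity check of the 16-GPU layout (a simplified view: Megatron-LM derives the data-parallel size as world ÷ (TP × PP × CP), and enforces additional constraints on expert parallelism beyond the divisibility check shown here):

```python
def check_layout(world, tp, pp, cp, ep, num_experts=64):
    """Simplified parallelism sanity check; Megatron-LM applies
    further internal constraints not modeled here."""
    assert world % (tp * pp * cp) == 0, "TP*PP*CP must divide the world size"
    dp = world // (tp * pp * cp)
    assert num_experts % ep == 0, "EP must divide the routed expert count"
    return dp

# The 16-GPU example: TP=4, PP=2, CP=2 leaves DP=1; EP=8 divides 64
dp = check_layout(world=16, tp=4, pp=2, cp=2, ep=8)
```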

Handling Non-Divisible Expert Counts

When the number of routed experts (64) is not evenly divisible by the GPU count (e.g., 24 GPUs):
SGLANG_ARGS=(
   --rollout-num-gpus-per-engine 24
   --sglang-mem-fraction-static 0.7
   --sglang-ep-size 24
   --sglang-enable-dp-attention
   --sglang-dp-size 3
   --sglang-moe-dense-tp-size 1
   --sglang-enable-dp-lm-head
   --sglang-ep-num-redundant-experts 16
)
--sglang-ep-num-redundant-experts 16 adds redundant expert copies to distribute workload evenly across 24 GPUs.
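
The arithmetic behind the problem (an illustration only; SGLang's redundant-expert placement and the choice of 16 copies here are engine-internal, and redundant copies also serve load balancing, not just divisibility):

```python
NUM_ROUTED_EXPERTS = 64
EP_SIZE = 24  # --sglang-ep-size

# 64 routed experts cannot be split evenly across 24 expert-parallel ranks:
remainder = NUM_ROUTED_EXPERTS % EP_SIZE        # 16 experts left over

# The script adds 16 redundant copies, giving 80 physical expert slots
# for SGLang to distribute across the 24 ranks:
total_physical_experts = NUM_ROUTED_EXPERTS + 16
```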

MTP Training (Advanced)

slime supports training MTP layers jointly with the main model:
# Add MTP layer count to model config
MODEL_ARGS+=(--mtp-num-layers 1)

# Enable MTP training
SPEC_ARGS=(
   --enable-mtp-training
   --mtp-loss-scaling-factor 0.2
)
  • --mtp-num-layers 1: Loads the MTP layer from checkpoint
  • --enable-mtp-training: Enables gradient computation for MTP layers
  • --mtp-loss-scaling-factor 0.2: Weight of MTP loss relative to policy loss
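
Assuming the scaling factor combines the two losses additively (a sketch of the usual pattern; the exact combination is defined inside slime's trainer):

```python
def total_loss(policy_loss: float, mtp_loss: float,
               mtp_loss_scaling_factor: float = 0.2) -> float:
    # The MTP loss is down-weighted so it guides the MTP head
    # without dominating the policy objective
    return policy_loss + mtp_loss_scaling_factor * mtp_loss

# Example: policy loss 1.0, MTP loss 0.5 -> 1.0 + 0.2 * 0.5 = 1.1
loss = total_loss(policy_loss=1.0, mtp_loss=0.5)
```
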
MTP training for GLM-4.7-Flash is not yet supported because the deepseek_v3 checkpoint bridge does not include MTP weight conversion (# TODO: mtp). You can still use MTP for speculative decoding during inference, since SGLang handles MTP layers internally. For models with full MTP support (e.g., MiMo), see scripts/run-mimo-7B-rl-eagle.sh.

Rollout Configuration

ROLLOUT_ARGS=(
   --prompt-data /root/dapo-math-17k/dapo-math-17k.jsonl
   --input-key prompt
   --label-key label
   --apply-chat-template
   --rollout-shuffle
   --rm-type deepscaler
   --num-rollout 3000
   --rollout-batch-size 32
   --n-samples-per-prompt 8
   --rollout-max-response-len 8192
   --rollout-temperature 1

   --global-batch-size 256
   --balance-data
)
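
These batch settings line up: each rollout generates 32 prompts × 8 samples = 256 sequences, exactly one global batch (suggesting one optimizer step per rollout, assuming slime consumes a full rollout per training step):

```python
rollout_batch_size = 32     # --rollout-batch-size (prompts per rollout)
n_samples_per_prompt = 8    # --n-samples-per-prompt
global_batch_size = 256     # --global-batch-size

sequences_per_rollout = rollout_batch_size * n_samples_per_prompt
assert sequences_per_rollout == global_batch_size
```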

GRPO Configuration

GRPO_ARGS=(
   --advantage-estimator grpo
   --use-kl-loss
   --kl-loss-coef 0.00
   --kl-loss-type low_var_kl
   --entropy-coef 0.00
   --eps-clip 0.2
   --eps-clip-high 0.28
)
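
GRPO estimates advantages by normalizing rewards within each group of samples drawn for the same prompt. A minimal sketch of the estimator named by --advantage-estimator grpo (slime's actual implementation lives in the trainer and may differ in details such as the normalization epsilon):

```python
def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: z-score the rewards within one
    prompt's group of n-samples-per-prompt completions."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# 8 samples for one prompt with binary math-verifier rewards:
# correct answers get positive advantage, wrong ones negative
adv = grpo_advantages([1, 1, 0, 0, 0, 1, 0, 0])
```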

Evaluation

EVAL_ARGS=(
   --eval-interval 20
   --eval-prompt-data aime /root/aime-2024/aime-2024.jsonl
   --n-samples-per-eval-prompt 16
   --eval-max-response-len 16384
   --eval-top-p 1
)

Miscellaneous Settings

MISC_ARGS=(
   --attention-dropout 0.0
   --hidden-dropout 0.0
   --accumulate-allreduce-grads-in-fp32
   --attention-softmax-in-fp32
   --attention-backend flash

   --moe-token-dispatcher-type alltoall
)
