
Overview

This example demonstrates training GLM-4.7-Flash, a 30B Mixture-of-Experts (MoE) model with 64 routed experts and 1 shared expert, on 8×H100 GPUs. The configuration uses CPU Adam to fit the model in limited GPU memory and optionally enables MTP speculative decoding for faster inference.

Model Specifications

  • Model: GLM-4.7-Flash (THUDM/GLM-4.7-Flash)
  • Architecture: Mixture-of-Experts (MoE)
    • 64 routed experts (top-4 activation)
    • 1 shared expert
    • 47 layers: 1 dense + 46 MoE layers
    • 1 MTP (Multi-Token Prediction) layer for speculative decoding
  • Parameters: ~30 billion (4.7B active per token)
  • Hardware: 8×H100 GPUs (single node)
  • Optimization: CPU Adam for memory efficiency

Dataset

Training prompts come from the zhuzilin/dapo-math-17k math dataset; evaluation uses zhuzilin/aime-2024 (AIME 2024 problems). Both are downloaded in the setup steps below.

Environment Setup

Step 1: Initialize environment

Set up the slime environment (see Qwen3-4B example for detailed instructions):
cd /root/
git clone https://github.com/THUDM/slime.git
cd slime/
pip install -e . --no-deps

Step 2: Download model

hf download THUDM/GLM-4.7-Flash --local-dir /root/GLM-4.7-Flash

Step 3: Download datasets

# Training data
hf download --repo-type dataset zhuzilin/dapo-math-17k \
  --local-dir /root/dapo-math-17k

# Evaluation data
hf download --repo-type dataset zhuzilin/aime-2024 \
  --local-dir /root/aime-2024

Step 4: Convert checkpoint

Convert the Hugging Face checkpoint to torch_dist format:
cd /root/slime
pip install -e . --no-deps
source scripts/models/glm4.7-30B-A3B.sh
PYTHONPATH=/root/Megatron-LM/ torchrun --nproc-per-node 8 \
   tools/convert_hf_to_torch_dist.py \
   ${MODEL_ARGS[@]} \
   --hf-checkpoint /root/GLM-4.7-Flash/ \
   --save /root/GLM-4.7-Flash_torch_dist/

Training Configuration

MoE Parallelism Settings

PERF_ARGS=(
   --tensor-model-parallel-size 1
   --pipeline-model-parallel-size 1
   --context-parallel-size 1
   --expert-model-parallel-size 8
   --expert-tensor-parallel-size 1

   --recompute-granularity full
   --recompute-method uniform
   --recompute-num-layers 1

   --use-dynamic-batch-size
   --max-tokens-per-gpu 8192
)
The configuration uses TP=1 and EP=8 for single-node MoE training; each GPU hosts 8 of the routed experts (64 experts ÷ 8 GPUs).
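
As a rough illustration of how EP=8 partitions the routed experts (a sketch only; Megatron-LM's actual expert placement is internal to the framework), contiguous blocks of experts can be assigned per expert-parallel rank:

```python
NUM_EXPERTS = 64
EP_SIZE = 8  # --expert-model-parallel-size

experts_per_rank = NUM_EXPERTS // EP_SIZE  # 8 experts on each GPU

# Contiguous-block assignment: rank r owns experts [r*8, (r+1)*8)
placement = {
    rank: list(range(rank * experts_per_rank, (rank + 1) * experts_per_rank))
    for rank in range(EP_SIZE)
}

# Every rank holds exactly 8 experts, covering all 64 with no overlap
assert sorted(e for ranks in placement.values() for e in ranks) == list(range(64))
```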

CPU Adam Optimization

To fit the model on 8×H100, enable CPU Adam to offload optimizer states:
OPTIMIZER_ARGS=(
   --optimizer adam
   --lr 1e-6
   --lr-decay-style constant
   --weight-decay 0.1
   --adam-beta1 0.9
   --adam-beta2 0.98

   --optimizer-cpu-offload
   --overlap-cpu-optimizer-d2h-h2d
   --use-precision-aware-optimizer
)
CPU Adam significantly reduces GPU memory but requires more host memory. For multi-node setups with distributed optimizer, CPU Adam can be disabled.
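
A back-of-envelope estimate shows why the optimizer states are offloaded. Assuming plain fp32 Adam states (fp32 master weights plus two fp32 moments per parameter; --use-precision-aware-optimizer can reduce this), a ~30B-parameter model needs on the order of hundreds of GiB for optimizer state alone:

```python
params = 30e9            # ~30B parameters
bytes_master = 4         # fp32 master weights
bytes_moments = 4 + 4    # fp32 exp_avg + fp32 exp_avg_sq (Adam moments)

optimizer_bytes = params * (bytes_master + bytes_moments)
optimizer_gb = optimizer_bytes / 1024**3
print(f"~{optimizer_gb:.0f} GiB of optimizer state")  # ~335 GiB
```

This far exceeds the combined 640 GB of 8×H100, once weights, gradients, and activations are also accounted for, which is why the states go to host memory on a single node.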

SGLang MoE Configuration

SGLANG_ARGS=(
   --rollout-num-gpus-per-engine 8
   --sglang-mem-fraction-static 0.7
   --sglang-enable-dp-attention
   --sglang-dp-size 8
   --sglang-enable-dp-lm-head
   --sglang-moe-dense-tp-size 1

   --sglang-cuda-graph-max-bs 16
   --sglang-max-running-requests 64
)
SGLang uses DP attention with EP=8 for efficient MoE inference across all 8 GPUs.

MTP Speculative Decoding (Optional)

GLM-4.7-Flash includes an MTP layer for EAGLE-style speculative decoding to accelerate inference:
SGLANG_ARGS=(
   # ... other args ...
   
   # MTP speculative decoding (EAGLE)
   --sglang-speculative-algorithm EAGLE
   --sglang-speculative-num-steps 2
   --sglang-speculative-eagle-topk 1
   --sglang-speculative-num-draft-tokens 3
)
The MTP layer predicts multiple future tokens, which SGLang verifies in parallel for faster generation.
Speculative decoding requires additional GPU memory. If you encounter OOM errors, reduce --sglang-mem-fraction-static or disable speculative decoding.
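
To see why these settings can accelerate decoding: with 2 draft steps the verifier scores 3 tokens per forward pass (2 drafted plus 1 bonus token, matching --sglang-speculative-num-draft-tokens 3). Under a simplified chain-draft model with per-token acceptance probability α (a standard speculative-decoding estimate, not a measurement on this model), the expected tokens emitted per target forward pass is 1 + α + α²:

```python
def expected_tokens_per_pass(alpha: float, num_steps: int) -> float:
    """Expected tokens emitted per target-model forward pass, assuming
    each drafted token is accepted independently with probability alpha
    (chain draft, eagle-topk = 1)."""
    # 1 guaranteed token from verification, plus alpha^k per draft step k
    return 1.0 + sum(alpha ** k for k in range(1, num_steps + 1))

# With 2 draft steps and an 80% acceptance rate: 1 + 0.8 + 0.64 = 2.44
speedup_proxy = expected_tokens_per_pass(0.8, 2)
```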

Run Training

For single-node 8×H100:
cd /root/slime
bash scripts/run-glm4.7-30B-A3B-8gpus.sh
The script executes:
ray job submit --address="http://127.0.0.1:8265" \
   --runtime-env-json="${RUNTIME_ENV_JSON}" \
   -- python3 train.py \
   --actor-num-nodes 1 \
   --actor-num-gpus-per-node 8 \
   --colocate \
   ${MODEL_ARGS[@]} \
   ${CKPT_ARGS[@]} \
   ${ROLLOUT_ARGS[@]} \
   ${OPTIMIZER_ARGS[@]} \
   ${GRPO_ARGS[@]} \
   ${PERF_ARGS[@]} \
   ${EVAL_ARGS[@]} \
   ${SGLANG_ARGS[@]} \
   ${MISC_ARGS[@]} \
   ${SPEC_ARGS[@]}

Multi-Node Training

For multi-node setups (e.g., 2×8 H100):
cd /root/slime
export BASE_DIR=/shared/path  # accessible by all nodes
bash scripts/run-glm4.7-30B-A3B.sh

Step 1: Place data on shared storage

Ensure model and data are on a path accessible by all nodes.

Step 2: Set MASTER_ADDR

Set MASTER_ADDR to an address reachable by all nodes.

Step 3: Remove CPU Adam

Remove CPU Adam configurations:
OPTIMIZER_ARGS=(
   --optimizer adam
   --lr 1e-6
   --lr-decay-style constant
   --weight-decay 0.1
   --adam-beta1 0.9
   --adam-beta2 0.98
   # CPU Adam removed - distributed optimizer reduces memory per GPU
)

Step 4: Adjust parallelism

Example for 16 GPUs: TP=4, PP=2, EP=8, CP=2
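
A quick sanity check of the 16-GPU layout (a simplified view: Megatron-LM derives the data-parallel size as world ÷ (TP × PP × CP), and enforces additional constraints on expert parallelism beyond the divisibility check shown here):

```python
def check_layout(world, tp, pp, cp, ep, num_experts=64):
    """Simplified parallelism sanity check; Megatron-LM applies
    further internal constraints not modeled here."""
    assert world % (tp * pp * cp) == 0, "TP*PP*CP must divide the world size"
    dp = world // (tp * pp * cp)
    assert num_experts % ep == 0, "EP must divide the routed expert count"
    return dp

# The 16-GPU example: TP=4, PP=2, CP=2 leaves DP=1; EP=8 divides 64
dp = check_layout(world=16, tp=4, pp=2, cp=2, ep=8)
```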

Handling Non-Divisible Expert Counts

When the number of routed experts (64) is not evenly divisible by the GPU count (e.g., 24 GPUs):
SGLANG_ARGS=(
   --rollout-num-gpus-per-engine 24
   --sglang-mem-fraction-static 0.7
   --sglang-ep-size 24
   --sglang-enable-dp-attention
   --sglang-dp-size 3
   --sglang-moe-dense-tp-size 1
   --sglang-enable-dp-lm-head
   --sglang-ep-num-redundant-experts 16
)
--sglang-ep-num-redundant-experts 16 adds redundant expert copies to distribute workload evenly across 24 GPUs.
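
The arithmetic behind the problem (an illustration only; SGLang's redundant-expert placement and the choice of 16 copies here are engine-internal, and redundant copies also serve load balancing, not just divisibility):

```python
NUM_ROUTED_EXPERTS = 64
EP_SIZE = 24  # --sglang-ep-size

# 64 routed experts cannot be split evenly across 24 expert-parallel ranks:
remainder = NUM_ROUTED_EXPERTS % EP_SIZE        # 16 experts left over

# The script adds 16 redundant copies, giving 80 physical expert slots
# for SGLang to distribute across the 24 ranks:
total_physical_experts = NUM_ROUTED_EXPERTS + 16
```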

MTP Training (Advanced)

slime supports training MTP layers jointly with the main model:
# Add MTP layer count to model config
MODEL_ARGS+=(--mtp-num-layers 1)

# Enable MTP training
SPEC_ARGS=(
   --enable-mtp-training
   --mtp-loss-scaling-factor 0.2
)
  • --mtp-num-layers 1: Loads the MTP layer from checkpoint
  • --enable-mtp-training: Enables gradient computation for MTP layers
  • --mtp-loss-scaling-factor 0.2: Weight of MTP loss relative to policy loss
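
Assuming the scaling factor combines the two losses additively (a sketch of the usual pattern; the exact combination is defined inside slime's trainer):

```python
def total_loss(policy_loss: float, mtp_loss: float,
               mtp_loss_scaling_factor: float = 0.2) -> float:
    # The MTP loss is down-weighted so it guides the MTP head
    # without dominating the policy objective
    return policy_loss + mtp_loss_scaling_factor * mtp_loss

# Example: policy loss 1.0, MTP loss 0.5 -> 1.0 + 0.2 * 0.5 = 1.1
loss = total_loss(policy_loss=1.0, mtp_loss=0.5)
```
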
MTP training for GLM-4.7-Flash is not yet supported because the deepseek_v3 checkpoint bridge does not include MTP weight conversion (# TODO: mtp). You can still use MTP for speculative decoding during inference, since SGLang handles MTP layers internally. For models with full MTP support (e.g., MiMo), see scripts/run-mimo-7B-rl-eagle.sh.

Rollout Configuration

ROLLOUT_ARGS=(
   --prompt-data /root/dapo-math-17k/dapo-math-17k.jsonl
   --input-key prompt
   --label-key label
   --apply-chat-template
   --rollout-shuffle
   --rm-type deepscaler
   --num-rollout 3000
   --rollout-batch-size 32
   --n-samples-per-prompt 8
   --rollout-max-response-len 8192
   --rollout-temperature 1

   --global-batch-size 256
   --balance-data
)
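
These batch settings line up: each rollout generates 32 prompts × 8 samples = 256 sequences, exactly one global batch (suggesting one optimizer step per rollout, assuming slime consumes a full rollout per training step):

```python
rollout_batch_size = 32     # --rollout-batch-size (prompts per rollout)
n_samples_per_prompt = 8    # --n-samples-per-prompt
global_batch_size = 256     # --global-batch-size

sequences_per_rollout = rollout_batch_size * n_samples_per_prompt
assert sequences_per_rollout == global_batch_size
```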

GRPO Configuration

GRPO_ARGS=(
   --advantage-estimator grpo
   --use-kl-loss
   --kl-loss-coef 0.00
   --kl-loss-type low_var_kl
   --entropy-coef 0.00
   --eps-clip 0.2
   --eps-clip-high 0.28
)
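
GRPO estimates advantages by normalizing rewards within each group of samples drawn for the same prompt. A minimal sketch of the estimator named by --advantage-estimator grpo (slime's actual implementation lives in the trainer and may differ in details such as the normalization epsilon):

```python
def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: z-score the rewards within one
    prompt's group of n-samples-per-prompt completions."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# 8 samples for one prompt with binary math-verifier rewards:
# correct answers get positive advantage, wrong ones negative
adv = grpo_advantages([1, 1, 0, 0, 0, 1, 0, 0])
```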

Evaluation

EVAL_ARGS=(
   --eval-interval 20
   --eval-prompt-data aime /root/aime-2024/aime-2024.jsonl
   --n-samples-per-eval-prompt 16
   --eval-max-response-len 16384
   --eval-top-p 1
)

Miscellaneous Settings

MISC_ARGS=(
   --attention-dropout 0.0
   --hidden-dropout 0.0
   --accumulate-allreduce-grads-in-fp32
   --attention-softmax-in-fp32
   --attention-backend flash

   --moe-token-dispatcher-type alltoall
)
