GLM4-9B with 8×H100

Overview

This example demonstrates training GLM-Z1-9B-0414 with GRPO (Group Relative Policy Optimization) on 8×H100 GPUs. Training and inference are decoupled: 4 GPUs run training and the other 4 run rollout generation.

Model Specifications

Model: GLM-Z1-9B-0414
Parameters: 9 billion
Hardware: 8×H100 GPUs
Training GPUs: 4
Inference GPUs: 4
Memory: Standard configuration (no CPU Adam required)

Dataset

Training Data: dapo-math-17k - 17K mathematical reasoning problems
Evaluation Data: AIME 2024 - American Invitational Mathematics Examination problems

Environment Setup

Initialize the environment

After pulling the slimerl/slime:latest image, set up the environment:

cd /root/
git clone https://github.com/THUDM/slime.git
cd slime/
pip install -e . --no-deps

Download model and data

Download the model checkpoint and datasets:

# Model checkpoint
hf download zai-org/GLM-Z1-9B-0414 --local-dir /root/GLM-Z1-9B-0414

# Training data
hf download --repo-type dataset zhuzilin/dapo-math-17k \
  --local-dir /root/dapo-math-17k

# Evaluation data
hf download --repo-type dataset zhuzilin/aime-2024 \
  --local-dir /root/aime-2024
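
Before converting the checkpoint, it can help to confirm that the downloads landed where the later arguments expect them. The following is a minimal sketch: `check_paths` is a hypothetical helper, and the two `.jsonl` file names are taken from the `--prompt-data` and `--eval-prompt-data` arguments used later in this example.

```shell
# Report any expected path that is missing; returns non-zero if one is absent.
# check_paths is a hypothetical helper, not part of slime.
check_paths() {
    missing=0
    for p in "$@"; do
        if [ ! -e "$p" ]; then
            echo "missing: $p" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Paths come from the download commands above and the arguments used later.
check_paths \
    /root/GLM-Z1-9B-0414 \
    /root/dapo-math-17k/dapo-math-17k.jsonl \
    /root/aime-2024/aime-2024.jsonl \
    || echo "re-run the corresponding hf download command"
```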

Convert checkpoint

Convert the Hugging Face checkpoint to Megatron-loadable format:

cd /root/slime
source scripts/models/glm4-9B.sh
PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
    ${MODEL_ARGS[@]} \
    --hf-checkpoint /root/GLM-Z1-9B-0414 \
    --save /root/GLM-Z1-9B-0414_torch_dist

Training Configuration

Parallelism Settings

PERF_ARGS=(
   --tensor-model-parallel-size 2
   --sequence-parallel
   --pipeline-model-parallel-size 1
   --context-parallel-size 2
   --expert-model-parallel-size 1
   --expert-tensor-parallel-size 1

   --recompute-granularity full
   --recompute-method uniform
   --recompute-num-layers 1

   --use-dynamic-batch-size
   --max-tokens-per-gpu 4608
)

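Training uses tensor parallelism (TP=2) and context parallelism (CP=2), which together span the 4 training GPUs. With dynamic batch sizing enabled, each GPU processes up to 4,608 tokens per micro-batch.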
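
As a quick check of the layout, the model-parallel group size is the product of the TP, PP, and CP sizes, and the data-parallel size is whatever remains of the training GPUs. The sketch below only restates the values from PERF_ARGS; the variable names are illustrative.

```shell
# Values copied from PERF_ARGS and the 4-GPU training allocation.
TP=2
PP=1
CP=2
TRAIN_GPUS=4

# One model replica spans TP x PP x CP GPUs.
MODEL_PARALLEL=$((TP * PP * CP))
# The remaining factor is the data-parallel size.
DP=$((TRAIN_GPUS / MODEL_PARALLEL))

echo "model-parallel group: ${MODEL_PARALLEL} GPUs, data-parallel size: ${DP}"
```

With these settings a single model replica occupies all 4 training GPUs, so there is no data parallelism on the training side.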

Rollout Configuration

ROLLOUT_ARGS=(
   --prompt-data /root/dapo-math-17k/dapo-math-17k.jsonl
   --input-key prompt
   --label-key label
   --apply-chat-template
   --rollout-shuffle

   --rm-type deepscaler

   --num-rollout 3000
   --rollout-batch-size 32
   --n-samples-per-prompt 8
   --rollout-max-response-len 8192
   --rollout-temperature 1

   --global-batch-size 256
   --balance-data
)
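
The batch settings above are related by simple arithmetic, sketched below with illustrative variable names. The assumption, not stated in this example, is that each optimizer step consumes one global batch, so the number of steps per rollout is the ratio of samples generated to the global batch size.

```shell
# Values copied from ROLLOUT_ARGS.
ROLLOUT_BATCH_SIZE=32   # prompts per rollout
N_SAMPLES=8             # responses per prompt
GLOBAL_BATCH=256        # samples per optimizer step
NUM_ROLLOUT=3000        # total rollouts

SAMPLES_PER_ROLLOUT=$((ROLLOUT_BATCH_SIZE * N_SAMPLES))
STEPS_PER_ROLLOUT=$((SAMPLES_PER_ROLLOUT / GLOBAL_BATCH))
TOTAL_SAMPLES=$((NUM_ROLLOUT * SAMPLES_PER_ROLLOUT))

echo "samples per rollout: ${SAMPLES_PER_ROLLOUT}"
echo "optimizer steps per rollout: ${STEPS_PER_ROLLOUT}"
echo "total samples over training: ${TOTAL_SAMPLES}"
```

If you change `--rollout-batch-size` or `--n-samples-per-prompt`, keeping their product a multiple of `--global-batch-size` preserves a whole number of steps per rollout.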

GRPO Hyperparameters

GRPO_ARGS=(
   --advantage-estimator grpo
   --use-kl-loss
   --kl-loss-coef 0.00
   --kl-loss-type low_var_kl
   --entropy-coef 0.00
   --eps-clip 0.2
   --eps-clip-high 0.28
)
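
For orientation, the clipping parameters can be read against the standard GRPO surrogate with an asymmetric ("clip-higher") range. The formulation below is the textbook one, not a transcription of slime's code. It assumes that `--eps-clip` sets the lower bound and `--eps-clip-high` the upper bound, and that advantages are normalized by the group standard deviation; how per-token terms are averaged is left out.

```latex
\begin{aligned}
r_{i,t}(\theta) &= \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_\text{old}}(o_{i,t} \mid q, o_{i,<t})}, \\
\hat{A}_i &= \frac{R_i - \operatorname{mean}(R_1, \dots, R_G)}{\operatorname{std}(R_1, \dots, R_G)}, \qquad G = 8, \\
\mathcal{L}_{i,t}(\theta) &= -\min\!\Big( r_{i,t}(\theta)\,\hat{A}_i,\; \operatorname{clip}\big(r_{i,t}(\theta),\, 1 - \varepsilon_\text{low},\, 1 + \varepsilon_\text{high}\big)\,\hat{A}_i \Big), \\
\varepsilon_\text{low} &= 0.2, \qquad \varepsilon_\text{high} = 0.28 .
\end{aligned}
```

Here G is the group size set by `--n-samples-per-prompt`. Because `--kl-loss-coef` and `--entropy-coef` are both 0.00, the KL and entropy terms contribute nothing to the loss in this configuration.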

Optimizer Configuration

OPTIMIZER_ARGS=(
   --optimizer adam
   --lr 1e-6
   --lr-decay-style constant
   --weight-decay 0.1
   --adam-beta1 0.9
   --adam-beta2 0.98
)

SGLang Settings

SGLANG_ARGS=(
   --rollout-num-gpus-per-engine 2
)

Each SGLang engine spans 2 GPUs (TP=2). With --rollout-num-gpus 4, rollout generation therefore runs on 2 engines across the 4 inference GPUs.
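
The GPU budget for the decoupled setup can be checked with the same kind of arithmetic. The values come from `--actor-num-gpus-per-node`, `--rollout-num-gpus`, and `--rollout-num-gpus-per-engine`; the variable names are illustrative.

```shell
ACTOR_GPUS=4        # --actor-num-gpus-per-node
ROLLOUT_GPUS=4      # --rollout-num-gpus
GPUS_PER_ENGINE=2   # --rollout-num-gpus-per-engine

ENGINES=$((ROLLOUT_GPUS / GPUS_PER_ENGINE))
TOTAL_GPUS=$((ACTOR_GPUS + ROLLOUT_GPUS))

echo "SGLang engines: ${ENGINES}, total GPUs used: ${TOTAL_GPUS}"
```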

Run Training

Execute the training script:

cd /root/slime
bash scripts/run-glm4-9B.sh

The script launches a Ray cluster and submits the training job:

ray job submit --address="http://127.0.0.1:8265" \
   --runtime-env-json="${RUNTIME_ENV_JSON}" \
   -- python3 train.py \
   --actor-num-nodes 1 \
   --actor-num-gpus-per-node 4 \
   --rollout-num-gpus 4 \
   ${MODEL_ARGS[@]} \
   ${CKPT_ARGS[@]} \
   ${ROLLOUT_ARGS[@]} \
   ${OPTIMIZER_ARGS[@]} \
   ${GRPO_ARGS[@]} \
   ${PERF_ARGS[@]} \
   ${EVAL_ARGS[@]} \
   ${SGLANG_ARGS[@]} \
   ${MISC_ARGS[@]}

Advanced Features

Co-located Training and Inference

To run training and inference on the same GPUs:

ray job submit ... \
   -- python3 train.py \
   --actor-num-nodes 1 \
   --actor-num-gpus-per-node 8 \
   --colocate \
   ...

When using co-located mode, adjust --sglang-mem-fraction-static to reduce SGLang’s memory usage, as Megatron will always occupy some GPU memory.
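
As a sketch of where the flag goes, it can be added to SGLANG_ARGS alongside the engine size. The value 0.7 is only a placeholder for illustration and is not taken from this example; tune it against the memory Megatron actually holds on your GPUs.

```shell
# Illustrative only: 0.7 is a placeholder starting point, not a recommended value.
SGLANG_ARGS=(
   --rollout-num-gpus-per-engine 2
   --sglang-mem-fraction-static 0.7
)
```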

Dynamic Sampling

Enable DAPO-style dynamic sampling to discard prompt groups that provide no training signal:

--over-sampling-batch-size 64 \
--dynamic-sampling-filter-path \
  slime.rollout.filter_hub.dynamic_sampling_filters.check_reward_nonzero_std

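With --rollout-batch-size 32 and --over-sampling-batch-size 64, the system samples 64 prompts but keeps only the prompt groups whose rewards have non-zero standard deviation. Groups in which every response is correct, or every response is incorrect, are filtered out because identical rewards yield zero advantage under GRPO.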
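
The filter criterion itself is easy to illustrate. The sketch below is not slime's implementation (that is the Python function named in `--dynamic-sampling-filter-path`); `keep_group` is a hypothetical helper that applies the same non-zero-standard-deviation test to one group of rewards.

```shell
# Print "keep" if the rewards in a group are not all equal (non-zero variance),
# otherwise "drop". Hypothetical helper for illustration only.
keep_group() {
    printf '%s\n' "$@" | awk '
        { sum += $1; sumsq += $1 * $1; n++ }
        END {
            mean = sum / n
            var = sumsq / n - mean * mean
            if (var > 1e-12) print "keep"; else print "drop"
        }'
}

keep_group 1 1 1 1 1 1 1 1   # all correct   -> drop
keep_group 0 0 0 0 0 0 0 0   # all incorrect -> drop
keep_group 1 0 0 1 0 0 0 1   # mixed rewards -> keep
```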

Partial Rollout

Save partially generated requests during dynamic sampling:

--partial-rollout

This allows aborted requests to be resumed in the next rollout, improving efficiency.

Key Configuration Parameters

MODEL_ARGS

Reads model configuration from scripts/models/glm4-9B.sh. These are Megatron parameters that define the model architecture.

Ensure --rotary-base and other architecture parameters match your model exactly. You can override parameters after loading:

source "${SCRIPT_DIR}/models/glm4-9B.sh"
MODEL_ARGS+=( --rotary-base 10000 )

CKPT_ARGS

CKPT_ARGS=(
   --hf-checkpoint /root/GLM-Z1-9B-0414
   --ref-load /root/GLM-Z1-9B-0414_torch_dist
   --load /root/GLM-Z1-9B-0414_slime/
   --save /root/GLM-Z1-9B-0414_slime/
   --save-interval 20
)

--hf-checkpoint: HF checkpoint for SGLang and tokenizer
--ref-load: Reference model checkpoint (frozen)
--load: Actor model checkpoint (if empty, loads from ref-load)
--save: Where to save training checkpoints
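
The fallback from --load to --ref-load can be pictured as follows. This is a sketch under an assumption: it treats a directory as holding a checkpoint when it contains Megatron's `latest_checkpointed_iteration.txt` tracker file, which may not be exactly how slime decides. `resolve_start_ckpt` is a hypothetical helper.

```shell
# Echo the directory a run would effectively start from: the --load directory
# if it already holds a checkpoint, otherwise the --ref-load directory.
# Assumption: a checkpoint directory is recognised by its tracker file.
resolve_start_ckpt() {
    load_dir=$1
    ref_dir=$2
    if [ -f "$load_dir/latest_checkpointed_iteration.txt" ]; then
        echo "$load_dir"
    else
        echo "$ref_dir"
    fi
}

resolve_start_ckpt /root/GLM-Z1-9B-0414_slime /root/GLM-Z1-9B-0414_torch_dist
```

Since --load and --save point to the same directory here, a restarted job would pick up from the most recent saved checkpoint rather than from the reference weights.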

Dynamic Batch Sizing

slime uses data packing with strict per-sample/per-token loss guarantees. Dynamic batch sizing optimizes memory usage without affecting loss calculation. It’s recommended to enable it.

When --use-dynamic-batch-size is enabled:

--max-tokens-per-gpu specifies the maximum tokens per GPU
Traditional --micro-batch-size is ignored
With CP enabled, the GPUs in a CP group share a total budget of CP × max_tokens_per_gpu tokens
Single samples exceeding the limit form their own batch (no truncation)
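
The rules above can be illustrated with a simple sequential packer. This is a sketch only: `pack_lengths` is a hypothetical helper, the sample lengths are made up, and slime's actual partitioning may group samples differently (for example to balance load).

```shell
# Pack sample lengths, in order, into micro-batches whose token total stays
# within the limit. A sample longer than the limit gets a micro-batch to itself.
# Each output line is one micro-batch.
pack_lengths() {
    limit=$1
    shift
    printf '%s\n' "$@" | awk -v limit="$limit" '
        {
            len = $1
            # Close the current batch if this sample would overflow it.
            if (count > 0 && total + len > limit) {
                print batch
                batch = ""; total = 0; count = 0
            }
            batch = (count ? batch " " : "") len
            total += len; count++
        }
        END { if (count > 0) print batch }'
}

MAX_TOKENS_PER_GPU=4608
CP=2
LIMIT=$((CP * MAX_TOKENS_PER_GPU))   # 9216 tokens shared by one CP group

pack_lengths "$LIMIT" 3000 5000 2000 10000 4000 4000
```

In this run the 10,000-token sample exceeds the 9,216-token budget and is emitted as a micro-batch of its own, untruncated, while the shorter samples are grouped until the next one would overflow the budget.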

Evaluation

EVAL_ARGS=(
   --eval-interval 20
   --eval-prompt-data aime /root/aime-2024/aime-2024.jsonl
   --n-samples-per-eval-prompt 16
   --eval-max-response-len 16384
   --eval-top-p 1
)

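Evaluation runs every 20 rollouts on AIME 2024, generating 16 samples per prompt with responses of up to 16,384 tokens. Setting --eval-top-p 1 disables nucleus (top-p) truncation; it does not select greedy decoding.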

Reference

scripts/run-glm4-9B.sh - Full training script
scripts/models/glm4-9B.sh - Model configuration
