Overview
This example demonstrates training GLM-Z1-9B-0414 with GRPO (Group Relative Policy Optimization) on 8×H100 GPUs. Training and inference are decoupled: 4 GPUs train the model and 4 GPUs generate rollouts.
Model Specifications
- Model: GLM-Z1-9B-0414
- Parameters: 9 billion
- Hardware: 8×H100 GPUs
- Training GPUs: 4
- Inference GPUs: 4
- Memory: Standard configuration (no CPU Adam required)
Dataset
- Training Data: dapo-math-17k - 17K mathematical reasoning problems
- Evaluation Data: AIME 2024 - American Invitational Mathematics Examination problems
Environment Setup
Initialize the environment
After pulling the slimerl/slime:latest image, set up the environment:
cd /root/
git clone https://github.com/THUDM/slime.git
cd slime/
pip install -e . --no-deps
Download model and data
Download the model checkpoint and datasets:
# Model checkpoint
hf download zai-org/GLM-Z1-9B-0414 --local-dir /root/GLM-Z1-9B-0414
# Training data
hf download --repo-type dataset zhuzilin/dapo-math-17k \
--local-dir /root/dapo-math-17k
# Evaluation data
hf download --repo-type dataset zhuzilin/aime-2024 \
--local-dir /root/aime-2024
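Before moving on, it can help to confirm that the downloaded JSONL files carry the fields the rollout configuration reads. The sketch below assumes each non-empty line is a JSON object and that the training file uses the prompt and label keys passed to --input-key and --label-key in this guide. check_jsonl_keys is a hypothetical helper written for illustration, not part of slime.

```python
import json
from pathlib import Path


def check_jsonl_keys(path, required_keys=("prompt", "label"), max_lines=5):
    """Return (line_number, missing_keys) pairs for the first few records."""
    problems = []
    with Path(path).open(encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if lineno > max_lines:
                break
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [key for key in required_keys if key not in record]
            if missing:
                problems.append((lineno, missing))
    return problems


# Path taken from the rollout configuration in this guide; skipped if absent.
train_file = Path("/root/dapo-math-17k/dapo-math-17k.jsonl")
if train_file.exists():
    print(check_jsonl_keys(train_file) or "first records look fine")
```

An empty result means the sampled records contain every required key. The evaluation file can be checked the same way once its key names are known.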
Convert checkpoint
Convert the Hugging Face checkpoint to a Megatron-loadable format:
cd /root/slime
source scripts/models/glm4-9B.sh
PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
${MODEL_ARGS[@]} \
--hf-checkpoint /root/GLM-Z1-9B-0414 \
--save /root/GLM-Z1-9B-0414_torch_dist
Training Configuration
Parallelism Settings
PERF_ARGS=(
--tensor-model-parallel-size 2
--sequence-parallel
--pipeline-model-parallel-size 1
--context-parallel-size 2
--expert-model-parallel-size 1
--expert-tensor-parallel-size 1
--recompute-granularity full
--recompute-method uniform
--recompute-num-layers 1
--use-dynamic-batch-size
--max-tokens-per-gpu 4608
)
Training uses TP=2 and CP=2, which together span all 4 training GPUs (data-parallel size 1). Dynamic batch sizing is enabled, and each GPU processes up to 4,608 tokens.
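The relationship between the parallelism settings above and the GPU count can be checked with a little arithmetic. This is a minimal sketch assuming the usual Megatron decomposition, in which the number of training GPUs equals TP × PP × CP × DP. The helper name data_parallel_size is hypothetical and exists only for this illustration.

```python
def data_parallel_size(num_gpus: int, tp: int, pp: int, cp: int) -> int:
    """Return the data-parallel size left after TP, PP and CP are assigned."""
    model_parallel = tp * pp * cp
    if num_gpus % model_parallel != 0:
        raise ValueError(
            f"{num_gpus} GPUs cannot be split into groups of {model_parallel}"
        )
    return num_gpus // model_parallel


# Values from PERF_ARGS and the 4 training GPUs used in this example.
print(data_parallel_size(num_gpus=4, tp=2, pp=1, cp=2))  # 1
```

A result of 1 means the four training GPUs form a single model replica, so any change to TP or CP has to keep the product a divisor of the training GPU count.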
Rollout Configuration
ROLLOUT_ARGS=(
--prompt-data /root/dapo-math-17k/dapo-math-17k.jsonl
--input-key prompt
--label-key label
--apply-chat-template
--rollout-shuffle
--rm-type deepscaler
--num-rollout 3000
--rollout-batch-size 32
--n-samples-per-prompt 8
--rollout-max-response-len 8192
--rollout-temperature 1
--global-batch-size 256
--balance-data
)
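The batch-related values in ROLLOUT_ARGS are linked: each rollout produces rollout-batch-size × n-samples-per-prompt responses, and those are consumed in chunks of global-batch-size. The sketch below works through the numbers, assuming every generated sample is used for training and each optimizer step consumes exactly one global batch. rollout_batch_plan is a hypothetical helper, not a slime function.

```python
def rollout_batch_plan(rollout_batch_size, n_samples_per_prompt, global_batch_size):
    """Return (samples per rollout, optimizer steps per rollout)."""
    samples = rollout_batch_size * n_samples_per_prompt
    if samples % global_batch_size != 0:
        raise ValueError("samples per rollout must be a multiple of the global batch")
    return samples, samples // global_batch_size


# Values from ROLLOUT_ARGS above.
samples, steps = rollout_batch_plan(32, 8, 256)
print(samples, steps)  # 256 1
```

With these settings, each rollout yields 256 responses and drives a single optimizer step.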
GRPO Hyperparameters
GRPO_ARGS=(
--advantage-estimator grpo
--use-kl-loss
--kl-loss-coef 0.00
--kl-loss-type low_var_kl
--entropy-coef 0.00
--eps-clip 0.2
--eps-clip-high 0.28
)
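To make the role of these flags concrete, here is a minimal pure-Python sketch of the textbook GRPO advantage and an asymmetrically clipped surrogate. It assumes rewards are normalized by the group mean and standard deviation and that the probability ratio is clipped to [1 - eps-clip, 1 + eps-clip-high]. slime's actual implementation may differ in detail, and both function names are hypothetical. The KL and entropy terms are omitted because their coefficients are 0 in this configuration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own prompt group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style objective for one token with separate lower and upper clips."""
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)


# One prompt with 8 sampled responses, two of which were rewarded.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
print([round(a, 3) for a in advantages])
print(clipped_surrogate(ratio=1.5, advantage=1.0))   # capped at the upper clip
print(clipped_surrogate(ratio=0.5, advantage=-1.0))  # capped at the lower clip
```

The wider upper bound (0.28 versus 0.2) lets the policy raise the probability of rewarded tokens a little more than it may lower others, which is the "clip-higher" idea from DAPO.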
Optimizer Configuration
OPTIMIZER_ARGS=(
--optimizer adam
--lr 1e-6
--lr-decay-style constant
--weight-decay 0.1
--adam-beta1 0.9
--adam-beta2 0.98
)
SGLang Settings
SGLANG_ARGS=(
--rollout-num-gpus-per-engine 2
)
Each SGLang engine runs with TP=2. With --rollout-num-gpus 4, this yields two engines across the 4 GPUs used for rollout generation.
Run Training
Execute the training script:
cd /root/slime
bash scripts/run-glm4-9B.sh
The script launches a Ray cluster and submits the training job:
ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json="${RUNTIME_ENV_JSON}" \
-- python3 train.py \
--actor-num-nodes 1 \
--actor-num-gpus-per-node 4 \
--rollout-num-gpus 4 \
${MODEL_ARGS[@]} \
${CKPT_ARGS[@]} \
${ROLLOUT_ARGS[@]} \
${OPTIMIZER_ARGS[@]} \
${GRPO_ARGS[@]} \
${PERF_ARGS[@]} \
${EVAL_ARGS[@]} \
${SGLANG_ARGS[@]} \
${MISC_ARGS[@]}
Advanced Features
Co-located Training and Inference
To run training and inference on the same GPUs:
ray job submit ... \
-- python3 train.py \
--actor-num-nodes 1 \
--actor-num-gpus-per-node 8 \
--colocate \
...
When using co-located mode, adjust --sglang-mem-fraction-static to reduce SGLang’s memory usage, as Megatron will always occupy some GPU memory.
Dynamic Sampling
Enable DAPO-style dynamic sampling to filter out prompt groups that carry no learning signal:
--over-sampling-batch-size 64 \
--dynamic-sampling-filter-path \
slime.rollout.filter_hub.dynamic_sampling_filters.check_reward_nonzero_std
With --rollout-batch-size 32 and --over-sampling-batch-size 64, the system samples 64 prompts but only keeps data where rewards have non-zero standard deviation (filtering out all-correct or all-incorrect responses).
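The filtering rule described above can be sketched in a few lines. This is an illustration of the idea only: the real filter lives at the path given to --dynamic-sampling-filter-path, and its signature is not shown here, so the function below takes a plain list of rewards and its name is hypothetical.

```python
import statistics


def has_nonzero_reward_std(rewards, tol=1e-8):
    """Keep a prompt group only if its rewards are not all identical."""
    return statistics.pstdev(rewards) > tol


# Three prompt groups with 8 sampled responses each.
groups = {
    "all_correct": [1.0] * 8,
    "all_incorrect": [0.0] * 8,
    "mixed": [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0],
}
kept = [name for name, rewards in groups.items() if has_nonzero_reward_std(rewards)]
print(kept)  # ['mixed']
```

Groups with identical rewards produce zero group-relative advantage, so dropping them removes samples that would not move the policy.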
Partial Rollout
Partial rollout saves partially generated requests during dynamic sampling. This allows aborted requests to be resumed in the next rollout, improving efficiency.
Key Configuration Parameters
MODEL_ARGS
Reads model configuration from scripts/models/glm4-9B.sh. These are Megatron parameters that define the model architecture.
Ensure --rotary-base and other architecture parameters match your model exactly. You can override parameters after loading:
source "${SCRIPT_DIR}/models/glm4-9B.sh"
MODEL_ARGS+=( --rotary-base 10000 )
CKPT_ARGS
CKPT_ARGS=(
--hf-checkpoint /root/GLM-Z1-9B-0414
--ref-load /root/GLM-Z1-9B-0414_torch_dist
--load /root/GLM-Z1-9B-0414_slime/
--save /root/GLM-Z1-9B-0414_slime/
--save-interval 20
)
- --hf-checkpoint: Hugging Face checkpoint used by SGLang and the tokenizer
- --ref-load: Reference model checkpoint (frozen)
- --load: Actor model checkpoint (if empty, loads from --ref-load)
- --save: Where to save training checkpoints
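The fallback from --load to --ref-load can be pictured as follows. This sketch assumes "empty" means the directory is missing or contains no entries. slime may instead look for a specific checkpoint marker file, so treat resolve_actor_checkpoint as a hypothetical stand-in rather than the framework's logic.

```python
from pathlib import Path


def resolve_actor_checkpoint(load_dir, ref_load_dir):
    """Pick the actor checkpoint, falling back to the reference checkpoint."""
    load_path = Path(load_dir)
    if load_path.is_dir() and any(load_path.iterdir()):
        return str(load_path)
    return str(Path(ref_load_dir))


# Paths from CKPT_ARGS above; the result depends on what exists on disk.
print(resolve_actor_checkpoint(
    "/root/GLM-Z1-9B-0414_slime/",
    "/root/GLM-Z1-9B-0414_torch_dist",
))
```

On a first run the actor directory does not exist yet, so training starts from the converted reference checkpoint. Later runs resume from the most recent save.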
Dynamic Batch Sizing
slime uses data packing with strict per-sample/per-token loss guarantees. Dynamic batch sizing optimizes memory usage without affecting loss calculation. It’s recommended to enable it.
When --use-dynamic-batch-size is enabled:
- --max-tokens-per-gpu specifies the maximum number of tokens per GPU
- The traditional --micro-batch-size is ignored
- With CP enabled, GPUs share CP × max_tokens_per_gpu tokens in total
- Single samples exceeding the limit form their own batch (no truncation)
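The rules above can be illustrated with a simple greedy packer. slime's actual packing algorithm is not described in this guide, so the sketch below is only one plausible way to satisfy the stated constraints: a token budget of CP × max_tokens_per_gpu per micro-batch, and oversized samples placed alone without truncation. The function name is hypothetical.

```python
def pack_by_token_budget(sample_lengths, max_tokens_per_gpu, cp_size=1):
    """Greedily group samples so each batch stays within the token budget."""
    budget = max_tokens_per_gpu * cp_size
    batches, current, current_tokens = [], [], 0
    for length in sample_lengths:
        # Start a new batch when adding this sample would exceed the budget.
        if current and current_tokens + length > budget:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(length)
        current_tokens += length
    if current:
        batches.append(current)
    return batches


# Budget is 2 x 4608 = 9216 tokens; the 12000-token sample ends up alone.
print(pack_by_token_budget([4000, 3000, 2500, 12000, 500], 4608, cp_size=2))
# [[4000, 3000], [2500], [12000], [500]]
```

Because batches are formed by token count rather than sample count, short responses are packed densely while long ones do not cause out-of-memory spikes.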
Evaluation
EVAL_ARGS=(
--eval-interval 20
--eval-prompt-data aime /root/aime-2024/aime-2024.jsonl
--n-samples-per-eval-prompt 16
--eval-max-response-len 16384
--eval-top-p 1
)
Evaluation runs every 20 rollouts on AIME 2024, drawing 16 samples per prompt with top-p=1 (no nucleus truncation) and responses of up to 16,384 tokens.
Reference