Overview

This example demonstrates training Qwen3-4B with GRPO (Group Relative Policy Optimization) on 8×H100 GPUs, using a co-located setup in which all 8 GPUs are shared between training and rollout generation.

Model Specifications

  • Model: Qwen3-4B
  • Parameters: 4 billion
  • Hardware: 8×H100 GPUs
  • Mode: Co-located training and inference
  • Memory: Standard configuration (no CPU Adam required)

Dataset

  • Training Data: dapo-math-17k - 17K mathematical reasoning problems
  • Evaluation Data: AIME 2024 - American Invitational Mathematics Examination problems

Environment Setup

Step 1: Initialize the environment

After pulling the slimerl/slime:latest image, set up the environment:
cd /root/
git clone https://github.com/THUDM/slime.git
cd slime/
pip install -e . --no-deps

Step 2: Download model and data

Download the model checkpoint and datasets:
# Model checkpoint
hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B

# Training data
hf download --repo-type dataset zhuzilin/dapo-math-17k \
  --local-dir /root/dapo-math-17k

# Evaluation data
hf download --repo-type dataset zhuzilin/aime-2024 \
  --local-dir /root/aime-2024

Step 3: Convert checkpoint

Convert the Hugging Face checkpoint to Megatron-loadable format:
cd /root/slime
source scripts/models/qwen3-4B.sh
PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
    ${MODEL_ARGS[@]} \
    --hf-checkpoint /root/Qwen3-4B \
    --save /root/Qwen3-4B_torch_dist

Training Configuration

Parallelism Settings

PERF_ARGS=(
   --tensor-model-parallel-size 2
   --sequence-parallel
   --pipeline-model-parallel-size 1
   --context-parallel-size 1
   --expert-model-parallel-size 1
   --expert-tensor-parallel-size 1

   --recompute-granularity full
   --recompute-method uniform
   --recompute-num-layers 1

   --use-dynamic-batch-size
   --max-tokens-per-gpu 9216
)
Training uses TP=2 with sequence parallelism (CP=1, PP=1) and full activation recomputation. With dynamic batch sizing enabled, each GPU packs up to 9,216 tokens per micro-batch, double the budget used in the GLM4-9B configuration.

Rollout Configuration

ROLLOUT_ARGS=(
   --prompt-data /root/dapo-math-17k/dapo-math-17k.jsonl
   --input-key prompt
   --label-key label
   --apply-chat-template
   --rollout-shuffle
   --rm-type deepscaler
   --num-rollout 3000
   --rollout-batch-size 32
   --n-samples-per-prompt 8
   --rollout-max-response-len 8192
   --rollout-temperature 1

   --global-batch-size 256
   --balance-data
)
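A quick sanity check on these numbers (a sketch; the ~17K dataset size comes from the Dataset section above): with 32 prompts per rollout and 8 samples per prompt, each rollout produces exactly one global batch of 256 training samples.

```python
# Batch accounting for the ROLLOUT_ARGS above.
rollout_batch_size = 32      # prompts sampled per rollout
n_samples_per_prompt = 8     # GRPO group size
global_batch_size = 256      # training samples per optimizer step
num_rollout = 3000
dataset_size = 17_000        # dapo-math-17k

samples_per_rollout = rollout_batch_size * n_samples_per_prompt
assert samples_per_rollout == global_batch_size  # one optimizer step per rollout

# 3000 rollouts x 32 prompts ~= 5.6 passes over the 17K-problem dataset.
epochs = num_rollout * rollout_batch_size / dataset_size
```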

GRPO Hyperparameters

GRPO_ARGS=(
   --advantage-estimator grpo
   --use-kl-loss
   --kl-loss-coef 0.00
   --kl-loss-type low_var_kl
   --entropy-coef 0.00
   --eps-clip 0.2
   --eps-clip-high 0.28
)
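Note that --kl-loss-coef 0.00 zeroes out the KL penalty's contribution even though --use-kl-loss is set. The group-relative advantage that --advantage-estimator grpo refers to can be sketched as follows (a simplified illustration; slime's internal implementation may differ in details):

```python
from statistics import mean, stdev

def grpo_advantages(rewards, eps=1e-6):
    # Group-relative advantages for one prompt's group of rollouts:
    # each reward is normalized against its own group's mean and std
    # (a sketch of the GRPO estimator, not slime's exact code).
    m, s = mean(rewards), stdev(rewards)
    return [(r - m) / (s + eps) for r in rewards]

# One group of 8 samples for a single prompt (0/1 correctness rewards):
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
# Correct answers get positive advantages, wrong ones negative; the
# asymmetric clip (--eps-clip 0.2 / --eps-clip-high 0.28) then bounds the
# policy ratio to [1 - 0.2, 1 + 0.28] in the PPO-style surrogate loss.
```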

Optimizer Configuration

OPTIMIZER_ARGS=(
   --optimizer adam
   --lr 1e-6
   --lr-decay-style constant
   --weight-decay 0.1
   --adam-beta1 0.9
   --adam-beta2 0.98
)

SGLang Settings

SGLANG_ARGS=(
   --rollout-num-gpus-per-engine 2
   --sglang-mem-fraction-static 0.7
)
SGLang runs inference with TP=2 (four engines across the 8 GPUs). mem-fraction-static=0.7 caps SGLang's static memory pool (weights plus KV cache) at 70% of GPU memory, leaving headroom for Megatron training in the co-located setup.

Run Training

Execute the training script:
cd /root/slime
bash scripts/run-qwen3-4B.sh
The script launches a Ray cluster with co-located training and inference:
ray job submit --address="http://127.0.0.1:8265" \
   --runtime-env-json="${RUNTIME_ENV_JSON}" \
   -- python3 train.py \
   --actor-num-nodes 1 \
   --actor-num-gpus-per-node 8 \
   --colocate \
   ${MODEL_ARGS[@]} \
   ${CKPT_ARGS[@]} \
   ${ROLLOUT_ARGS[@]} \
   ${OPTIMIZER_ARGS[@]} \
   ${GRPO_ARGS[@]} \
   ${PERF_ARGS[@]} \
   ${EVAL_ARGS[@]} \
   ${SGLANG_ARGS[@]} \
   ${MISC_ARGS[@]}

Advanced Features

Decoupled Training and Inference

To separate training and inference GPUs:
ray job submit ... \
   -- python3 train.py \
   --actor-num-nodes 1 \
   --actor-num-gpus-per-node 2 \
   --rollout-num-gpus 6 \
   ...
This allocates 2 GPUs for training and 6 GPUs for inference.
If request concurrency exceeds the largest captured CUDA graph batch size (160 by default), either cap server concurrency or extend the list of captured batch sizes:
--sglang-server-concurrency 160  # Limit concurrent requests
# OR
--sglang-cuda-graph-bs 1 2 4 8 $(seq 16 8 256)  # Increase CUDA graphs
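The `$(seq 16 8 256)` expansion covers batch sizes from 16 to 256 in steps of 8; you can inspect it directly:

```shell
# Expands to 31 values: 16 24 32 ... 256
seq 16 8 256 | tr '\n' ' '
```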

Asynchronous Training

With decoupled training/inference, enable asynchronous training to overlap data generation and training:
ray job submit ... \
   -- python3 train_async.py \
   ...
train_async.py generates data for rollout N+1 while training on rollout N, eliminating GPU idle time.

Dynamic Sampling

Enable DAPO-style dynamic sampling:
--rollout-batch-size 32 \
--n-samples-per-prompt 8 \
--over-sampling-batch-size 64 \
--dynamic-sampling-filter-path \
  slime.rollout.filter_hub.dynamic_sampling_filters.check_reward_nonzero_std
The filter function checks if rewards have non-zero standard deviation:
import torch

def check_reward_nonzero_std(args, samples: list[Sample], **kwargs):
    # Keep the group only if its rewards are not all identical, i.e. the
    # group yields a non-zero GRPO advantage signal.
    rewards = [sample.reward for sample in samples]
    return torch.tensor(rewards, dtype=torch.float).std() > 0.0
Sampling stops when 32 prompts (256 samples) pass the filter. If too many prompts are discarded, another batch of 64 prompts is sampled.
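The overall loop can be sketched as follows (a simplification with hypothetical helpers, not slime's actual rollout code: `sample_group()` rolls out one group of n-samples-per-prompt responses, and `passes_filter` is e.g. check_reward_nonzero_std):

```python
def dynamic_sampling_round(sample_group, passes_filter,
                           target_prompts=32, over_sampling_batch_size=64):
    # Over-sample groups of rollouts in batches of `over_sampling_batch_size`
    # prompts until `target_prompts` groups pass the filter, discarding
    # groups whose rewards carry no learning signal.
    kept = []
    while len(kept) < target_prompts:
        batch = [sample_group() for _ in range(over_sampling_batch_size)]
        kept.extend(group for group in batch if passes_filter(group))
    return kept[:target_prompts]
```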

Partial Rollout

Save partially generated requests during dynamic sampling:
--partial-rollout \
--buffer-filter-path slime.rollout.filter_hub.buffer_filters.pop_first
The default buffer filter retrieves the first N prompts from the buffer:
def pop_first(args, rollout_id, buffer: list[list[Sample]], num_samples: int):
    num_to_pop = min(len(buffer), num_samples)
    samples = buffer[:num_to_pop]
    del buffer[:num_to_pop]
    return samples
Each partial rollout sample stores its original rollout ID in sample.metadata, useful for filtering.
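For instance, a custom buffer filter could drop partial rollouts that are more than one rollout old (a hypothetical variant, not part of slime; it assumes sample.metadata["rollout_id"] is populated as described above):

```python
def pop_recent(args, rollout_id, buffer, num_samples):
    # Hypothetical variant of pop_first: only reuse partial rollouts
    # generated in the previous rollout, discarding anything staler.
    fresh = [group for group in buffer
             if group[0].metadata.get("rollout_id", -1) >= rollout_id - 1]
    num_to_pop = min(len(fresh), num_samples)
    samples = fresh[:num_to_pop]
    # Keep only the fresh, un-returned groups in the buffer.
    buffer[:] = fresh[num_to_pop:]
    return samples
```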

Key Configuration Parameters

MODEL_ARGS

Reads model configuration from scripts/models/qwen3-4B.sh:
SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
source "${SCRIPT_DIR}/models/qwen3-4B.sh"
Ensure --rotary-base and other architecture parameters match your model. Override if needed:
source "${SCRIPT_DIR}/models/qwen3-4B.sh"
MODEL_ARGS+=( --rotary-base 10000 )

CKPT_ARGS

CKPT_ARGS=(
   --hf-checkpoint /root/Qwen3-4B
   --ref-load /root/Qwen3-4B_torch_dist
   --load /root/Qwen3-4B_slime/
   --save /root/Qwen3-4B_slime/
   --save-interval 20
)
  • --hf-checkpoint: HF checkpoint for SGLang and tokenizer
  • --ref-load: Reference model checkpoint (frozen)
  • --load: Actor model checkpoint (if empty, loads from ref-load)
  • --save: Where to save training checkpoints every 20 rollouts

Dynamic Batch Sizing

slime packs variable-length samples together while preserving exact per-sample and per-token loss semantics, so dynamic batch sizing improves memory utilization without changing the computed gradients.
When --use-dynamic-batch-size is enabled:
  • --max-tokens-per-gpu sets the token limit per GPU (9,216 for Qwen3-4B)
  • Traditional --micro-batch-size is ignored
  • Samples exceeding the limit form their own batch without truncation
  • With CP enabled, GPUs share CP × max_tokens_per_gpu tokens
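The packing logic can be sketched as a greedy pass over sample lengths (a simplification; slime's actual scheduler also balances work across data-parallel ranks):

```python
def pack_by_token_budget(sample_lengths, max_tokens_per_gpu=9216):
    # Greedily pack samples into micro-batches so each batch stays within
    # the token budget; an over-long sample still gets its own batch
    # rather than being truncated.
    batches, current, current_tokens = [], [], 0
    for length in sample_lengths:
        if current and current_tokens + length > max_tokens_per_gpu:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(length)
        current_tokens += length
    if current:
        batches.append(current)
    return batches
```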

Evaluation

EVAL_ARGS=(
   --eval-interval 20
   --eval-prompt-data aime /root/aime-2024/aime-2024.jsonl
   --n-samples-per-eval-prompt 16
   --eval-max-response-len 16384
   --eval-top-p 1
)
Evaluation runs every 20 rollouts on AIME 2024 with:
  • 16 samples per prompt
  • Maximum response length of 16,384 tokens
  • Full-distribution sampling (top-p = 1, i.e. no nucleus truncation)
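With 16 samples per prompt, the reported score is typically the mean accuracy over samples, averaged over prompts (often written avg@16). A sketch of that aggregation (an assumption; the exact metric is determined by slime's reward model):

```python
def avg_at_k(correct_flags_per_prompt):
    # Mean pass rate over the k samples of each prompt, averaged over prompts.
    per_prompt = [sum(flags) / len(flags) for flags in correct_flags_per_prompt]
    return sum(per_prompt) / len(per_prompt)

# Two AIME problems, 4 samples each (1 = correct):
score = avg_at_k([[1, 0, 1, 1], [0, 0, 1, 0]])  # (0.75 + 0.25) / 2 = 0.5
```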
