Overview
This example demonstrates training Qwen3-4B with GRPO (Group Relative Policy Optimization) on 8×H100 GPUs. Training runs in co-located mode: all 8 GPUs are shared between training and rollout generation.
Model Specifications
- Model: Qwen3-4B
- Parameters: 4 billion
- Hardware: 8×H100 GPUs
- Mode: Co-located training and inference
- Memory: Standard configuration (no CPU Adam required)
Dataset
- Training Data: dapo-math-17k - 17K mathematical reasoning problems
- Evaluation Data: AIME 2024 - American Invitational Mathematics Examination problems
Environment Setup
Initialize the environment
After pulling the slimerl/slime:latest image, set up the environment:

```bash
cd /root/
git clone https://github.com/THUDM/slime.git
cd slime/
pip install -e . --no-deps
```
Download model and data
Download the model checkpoint and datasets:

```bash
# Model checkpoint
hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B

# Training data
hf download --repo-type dataset zhuzilin/dapo-math-17k \
   --local-dir /root/dapo-math-17k

# Evaluation data
hf download --repo-type dataset zhuzilin/aime-2024 \
   --local-dir /root/aime-2024
```
Convert checkpoint
Convert the Hugging Face checkpoint to Megatron-loadable format:

```bash
cd /root/slime
source scripts/models/qwen3-4B.sh
PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
   ${MODEL_ARGS[@]} \
   --hf-checkpoint /root/Qwen3-4B \
   --save /root/Qwen3-4B_torch_dist
```
Training Configuration
Parallelism Settings
```bash
PERF_ARGS=(
   --tensor-model-parallel-size 2
   --sequence-parallel
   --pipeline-model-parallel-size 1
   --context-parallel-size 1
   --expert-model-parallel-size 1
   --expert-tensor-parallel-size 1
   --recompute-granularity full
   --recompute-method uniform
   --recompute-num-layers 1
   --use-dynamic-batch-size
   --max-tokens-per-gpu 9216
)
```
Training uses TP=2 with PP=1 and CP=1, plus dynamic batch sizing. Each GPU processes up to 9,216 tokens per micro-batch, double the GLM4-9B configuration.
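As a sanity check, the remaining data-parallel degree follows from the Megatron-style decomposition DP = world_size / (TP × PP × CP). A minimal sketch (not slime code; the function name is illustrative):

```python
def data_parallel_size(world_size: int, tp: int, pp: int, cp: int) -> int:
    # Megatron-style decomposition: DP = world_size / (TP * PP * CP)
    denom = tp * pp * cp
    assert world_size % denom == 0, "world size must divide evenly"
    return world_size // denom

# 8×H100 with TP=2, PP=1, CP=1 leaves 4 data-parallel replicas
print(data_parallel_size(8, tp=2, pp=1, cp=1))  # → 4
```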
Rollout Configuration
```bash
ROLLOUT_ARGS=(
   --prompt-data /root/dapo-math-17k/dapo-math-17k.jsonl
   --input-key prompt
   --label-key label
   --apply-chat-template
   --rollout-shuffle
   --rm-type deepscaler
   --num-rollout 3000
   --rollout-batch-size 32
   --n-samples-per-prompt 8
   --rollout-max-response-len 8192
   --rollout-temperature 1
   --global-batch-size 256
   --balance-data
)
```
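The batch flags above are coupled: 32 prompts × 8 samples per prompt yields 256 samples per rollout, matching --global-batch-size 256, i.e. one optimizer step per rollout (assuming every sample is kept). A quick check:

```python
rollout_batch_size = 32     # prompts sampled per rollout
n_samples_per_prompt = 8    # responses generated per prompt
global_batch_size = 256     # samples consumed per optimizer step

samples_per_rollout = rollout_batch_size * n_samples_per_prompt
steps_per_rollout = samples_per_rollout // global_batch_size
print(samples_per_rollout, steps_per_rollout)  # → 256 1
```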
GRPO Hyperparameters
```bash
GRPO_ARGS=(
   --advantage-estimator grpo
   --use-kl-loss
   --kl-loss-coef 0.00
   --kl-loss-type low_var_kl
   --entropy-coef 0.00
   --eps-clip 0.2
   --eps-clip-high 0.28
)
```
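For context, GRPO replaces a learned value baseline with a group-relative one: each sample's reward is normalized against the other samples for the same prompt. A minimal sketch of one common formulation (pure Python; not slime's implementation):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    # Advantage = (reward - group mean) / (group std + eps),
    # computed over the n samples sharing a single prompt.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One group of 8 samples with binary correctness rewards:
advs = grpo_advantages([1, 0, 0, 1, 1, 0, 0, 0])
# Correct samples get positive advantage, incorrect ones negative;
# a group with identical rewards would give ~zero advantage everywhere.
```

The asymmetric --eps-clip 0.2 / --eps-clip-high 0.28 pair is the DAPO-style "clip-higher" setting: the PPO ratio is clipped to [0.8, 1.28] rather than the symmetric [0.8, 1.2].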
Optimizer Configuration
```bash
OPTIMIZER_ARGS=(
   --optimizer adam
   --lr 1e-6
   --lr-decay-style constant
   --weight-decay 0.1
   --adam-beta1 0.9
   --adam-beta2 0.98
)
```
SGLang Settings
```bash
SGLANG_ARGS=(
   --rollout-num-gpus-per-engine 2
   --sglang-mem-fraction-static 0.7
)
```
SGLang uses TP=2 for inference with mem-fraction-static=0.7 to leave memory for Megatron training.
Run Training
Execute the training script:
```bash
cd /root/slime
bash scripts/run-qwen3-4B.sh
```
The script launches a Ray cluster with co-located training and inference:
```bash
ray job submit --address="http://127.0.0.1:8265" \
   --runtime-env-json="${RUNTIME_ENV_JSON}" \
   -- python3 train.py \
   --actor-num-nodes 1 \
   --actor-num-gpus-per-node 8 \
   --colocate \
   ${MODEL_ARGS[@]} \
   ${CKPT_ARGS[@]} \
   ${ROLLOUT_ARGS[@]} \
   ${OPTIMIZER_ARGS[@]} \
   ${GRPO_ARGS[@]} \
   ${PERF_ARGS[@]} \
   ${EVAL_ARGS[@]} \
   ${SGLANG_ARGS[@]} \
   ${MISC_ARGS[@]}
```
Advanced Features
Decoupled Training and Inference
To separate training and inference GPUs:
```bash
ray job submit ... \
   -- python3 train.py \
   --actor-num-nodes 1 \
   --actor-num-gpus-per-node 2 \
   --rollout-num-gpus 6 \
   ...
```
This allocates 2 GPUs for training and 6 GPUs for inference.
If SGLang concurrency exceeds the default CUDA graph limit (160), adjust using:

```bash
--sglang-server-concurrency 160  # Limit concurrent requests
# OR
--sglang-cuda-graph-bs 1 2 4 8 $(seq 16 8 256)  # Increase CUDA graphs
```
Asynchronous Training
With decoupled training/inference, enable asynchronous training to overlap data generation and training:
```bash
ray job submit ... \
   -- python3 train_async.py \
   ...
```
train_async.py generates data for rollout N+1 while training on rollout N, eliminating GPU idle time.
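The overlap can be sketched as a two-stage producer/consumer pipeline: the inference engines produce rollout N+1 while the trainer consumes rollout N. An illustrative sketch (generate_rollout and train_step are stand-ins, not slime APIs):

```python
import queue
import threading

def run_async(num_rollouts: int, generate_rollout, train_step):
    # Stage 1 (producer thread): inference engines generate rollouts.
    # Stage 2 (consumer): trainer consumes rollout N while N+1 is generated.
    buf: queue.Queue = queue.Queue(maxsize=1)  # at most one rollout in flight

    def producer():
        for n in range(num_rollouts):
            buf.put(generate_rollout(n))

    t = threading.Thread(target=producer)
    t.start()
    for _ in range(num_rollouts):
        train_step(buf.get())
    t.join()

log = []
run_async(3, lambda n: f"rollout-{n}", lambda data: log.append(data))
print(log)  # → ['rollout-0', 'rollout-1', 'rollout-2']
```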
Dynamic Sampling
Enable DAPO-style dynamic sampling:
```bash
--rollout-batch-size 32 \
--n-samples-per-prompt 8 \
--over-sampling-batch-size 64 \
--dynamic-sampling-filter-path \
   slime.rollout.filter_hub.dynamic_sampling_filters.check_reward_nonzero_std
```
The filter function checks whether the rewards within a sample group have non-zero standard deviation, i.e. whether the group carries any learning signal for GRPO:

```python
import torch

# Sample is slime's rollout sample type.
def check_reward_nonzero_std(args, samples: list[Sample], **kwargs):
    rewards = [sample.reward for sample in samples]
    return torch.tensor(rewards, dtype=torch.float).std() > 0.0
```
Sampling stops when 32 prompts (256 samples) pass the filter. If too many prompts are discarded, another batch of 64 prompts is sampled.
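Putting the pieces together, dynamic sampling is an oversample/filter/refill loop. An illustrative sketch with toy stand-ins for slime's prompt sampler, generator, and filter:

```python
import itertools

def dynamic_sampling(target_groups, oversample_size, sample_prompts, generate, keep_group):
    # Oversample prompts, keep only groups whose rewards pass the filter,
    # and draw another oversampling batch until enough groups are collected.
    kept = []
    while len(kept) < target_groups:
        for prompt in sample_prompts(oversample_size):
            group = generate(prompt)
            if keep_group(group):
                kept.append(group)
                if len(kept) == target_groups:
                    return kept
    return kept

# Toy stand-ins: even-numbered prompts yield groups with reward variance.
ids = itertools.count()
sample_prompts = lambda k: [next(ids) for _ in range(k)]
generate = lambda p: [0, 1, 0, 1] if p % 2 == 0 else [0, 0, 0, 0]
keep_group = lambda g: len(set(g)) > 1  # nonzero reward std

groups = dynamic_sampling(target_groups=2, oversample_size=4,
                          sample_prompts=sample_prompts,
                          generate=generate, keep_group=keep_group)
print(len(groups))  # → 2
```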
Partial Rollout
Save partially generated requests during dynamic sampling:
```bash
--partial-rollout \
--buffer-filter-path slime.rollout.filter_hub.buffer_filters.pop_first
```
The default buffer filter pops the first num_samples sample groups from the buffer:

```python
def pop_first(args, rollout_id, buffer: list[list[Sample]], num_samples: int):
    # Take up to num_samples groups from the front of the buffer
    # and remove them so they are not reused.
    num_to_pop = min(len(buffer), num_samples)
    samples = buffer[:num_to_pop]
    del buffer[:num_to_pop]
    return samples
```
Each partial rollout sample stores its original rollout ID in sample.metadata, useful for filtering.
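To make the buffer semantics concrete, here is a toy run of pop_first, with sample groups shown as plain lists rather than Sample objects:

```python
def pop_first(args, rollout_id, buffer, num_samples):
    num_to_pop = min(len(buffer), num_samples)
    samples = buffer[:num_to_pop]
    del buffer[:num_to_pop]
    return samples

# Buffer holds three leftover sample groups from earlier rollouts.
buffer = [["g0"], ["g1"], ["g2"]]
out = pop_first(None, rollout_id=7, buffer=buffer, num_samples=2)
print(out)     # → [['g0'], ['g1']]
print(buffer)  # → [['g2']]
```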
Key Configuration Parameters
MODEL_ARGS
Reads model configuration from scripts/models/qwen3-4B.sh:
```bash
SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
source "${SCRIPT_DIR}/models/qwen3-4B.sh"
```

Ensure --rotary-base and other architecture parameters match your model. Override if needed:

```bash
source "${SCRIPT_DIR}/models/qwen3-4B.sh"
MODEL_ARGS+=( --rotary-base 10000 )
```
CKPT_ARGS
```bash
CKPT_ARGS=(
   --hf-checkpoint /root/Qwen3-4B
   --ref-load /root/Qwen3-4B_torch_dist
   --load /root/Qwen3-4B_slime/
   --save /root/Qwen3-4B_slime/
   --save-interval 20
)
```
- --hf-checkpoint: HF checkpoint used by SGLang and for the tokenizer
- --ref-load: Reference model checkpoint (frozen)
- --load: Actor model checkpoint (if empty, loads from --ref-load)
- --save: Where training checkpoints are saved every 20 rollouts
Dynamic Batch Sizing
slime packs samples into micro-batches while preserving exact per-sample and per-token loss computation, so dynamic batch sizing improves memory utilization without changing training results.
When --use-dynamic-batch-size is enabled:
- --max-tokens-per-gpu sets the token limit per GPU (9,216 for Qwen3-4B)
- The traditional --micro-batch-size is ignored
- Samples exceeding the limit form their own batch without truncation
- With CP enabled, the CP GPUs together share CP × max_tokens_per_gpu tokens
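The packing behavior can be sketched as a greedy fill under the per-GPU token cap. This is illustrative only; slime's real packer also balances load across data-parallel ranks:

```python
def pack_samples(sample_lens: list[int], max_tokens_per_gpu: int) -> list[list[int]]:
    # Greedily pack sample lengths into micro-batches; a sample longer than
    # the cap still gets its own micro-batch (no truncation).
    batches, current, used = [], [], 0
    for n in sample_lens:
        if current and used + n > max_tokens_per_gpu:
            batches.append(current)
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        batches.append(current)
    return batches

print(pack_samples([4000, 3000, 5000, 12000, 100], max_tokens_per_gpu=9216))
# → [[4000, 3000], [5000], [12000], [100]]
```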
Evaluation
```bash
EVAL_ARGS=(
   --eval-interval 20
   --eval-prompt-data aime /root/aime-2024/aime-2024.jsonl
   --n-samples-per-eval-prompt 16
   --eval-max-response-len 16384
   --eval-top-p 1
)
```
Evaluation runs every 20 rollouts on AIME 2024 with:
- 16 samples per prompt
- Maximum response length of 16,384 tokens
- Full-distribution sampling (top-p = 1, i.e. no nucleus truncation)
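With 16 samples per prompt, the score is commonly reported as the mean accuracy over all samples (often written avg@16). A hedged sketch of that aggregation, not necessarily slime's exact metric:

```python
def avg_at_k(results: dict[str, list[int]]) -> float:
    # results maps each problem to its per-sample correctness (0/1)
    per_problem = [sum(v) / len(v) for v in results.values()]
    return sum(per_problem) / len(per_problem)

scores = {"p1": [1] * 12 + [0] * 4, "p2": [0] * 16}  # 12/16 and 0/16 correct
print(avg_at_k(scores))  # → 0.375
```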