Overview

This example demonstrates training Qwen3-4B with GRPO (Group Relative Policy Optimization) on 8×H100 GPUs, using a co-located setup in which all 8 GPUs are shared between training and rollout generation.

Model Specifications

  • Model: Qwen3-4B
  • Parameters: 4 billion
  • Hardware: 8×H100 GPUs
  • Mode: Co-located training and inference
  • Memory: Standard configuration (no CPU Adam required)

Dataset

  • Training Data: dapo-math-17k - 17K mathematical reasoning problems
  • Evaluation Data: AIME 2024 - American Invitational Mathematics Examination problems

Environment Setup

Step 1: Initialize the environment

After pulling the slimerl/slime:latest image, set up the environment:
cd /root/
git clone https://github.com/THUDM/slime.git
cd slime/
pip install -e . --no-deps

Step 2: Download model and data

Download the model checkpoint and datasets:
# Model checkpoint
hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B

# Training data
hf download --repo-type dataset zhuzilin/dapo-math-17k \
  --local-dir /root/dapo-math-17k

# Evaluation data
hf download --repo-type dataset zhuzilin/aime-2024 \
  --local-dir /root/aime-2024

Step 3: Convert checkpoint

Convert the Hugging Face checkpoint to Megatron-loadable format:
cd /root/slime
source scripts/models/qwen3-4B.sh
PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
    ${MODEL_ARGS[@]} \
    --hf-checkpoint /root/Qwen3-4B \
    --save /root/Qwen3-4B_torch_dist

Training Configuration

Parallelism Settings

PERF_ARGS=(
   --tensor-model-parallel-size 2
   --sequence-parallel
   --pipeline-model-parallel-size 1
   --context-parallel-size 1
   --expert-model-parallel-size 1
   --expert-tensor-parallel-size 1

   --recompute-granularity full
   --recompute-method uniform
   --recompute-num-layers 1

   --use-dynamic-batch-size
   --max-tokens-per-gpu 9216
)
Training uses TP=2 with sequence parallelism (CP=1, PP=1) and full activation recomputation. With dynamic batch sizing enabled, each GPU packs up to 9,216 tokens per micro-batch, double the budget used in the GLM4-9B configuration.

Rollout Configuration

ROLLOUT_ARGS=(
   --prompt-data /root/dapo-math-17k/dapo-math-17k.jsonl
   --input-key prompt
   --label-key label
   --apply-chat-template
   --rollout-shuffle
   --rm-type deepscaler
   --num-rollout 3000
   --rollout-batch-size 32
   --n-samples-per-prompt 8
   --rollout-max-response-len 8192
   --rollout-temperature 1

   --global-batch-size 256
   --balance-data
)
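A quick sanity check on these numbers (a sketch; the ~17K dataset size comes from the Dataset section above): with 32 prompts per rollout and 8 samples per prompt, each rollout produces exactly one global batch of 256 training samples.

```python
# Batch accounting for the ROLLOUT_ARGS above.
rollout_batch_size = 32      # prompts sampled per rollout
n_samples_per_prompt = 8     # GRPO group size
global_batch_size = 256      # training samples per optimizer step
num_rollout = 3000
dataset_size = 17_000        # dapo-math-17k

samples_per_rollout = rollout_batch_size * n_samples_per_prompt
assert samples_per_rollout == global_batch_size  # one optimizer step per rollout

# 3000 rollouts x 32 prompts ~= 5.6 passes over the 17K-problem dataset.
epochs = num_rollout * rollout_batch_size / dataset_size
```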

GRPO Hyperparameters

GRPO_ARGS=(
   --advantage-estimator grpo
   --use-kl-loss
   --kl-loss-coef 0.00
   --kl-loss-type low_var_kl
   --entropy-coef 0.00
   --eps-clip 0.2
   --eps-clip-high 0.28
)
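Note that --kl-loss-coef 0.00 zeroes out the KL penalty's contribution even though --use-kl-loss is set. The group-relative advantage that --advantage-estimator grpo refers to can be sketched as follows (a simplified illustration; slime's internal implementation may differ in details):

```python
from statistics import mean, stdev

def grpo_advantages(rewards, eps=1e-6):
    # Group-relative advantages for one prompt's group of rollouts:
    # each reward is normalized against its own group's mean and std
    # (a sketch of the GRPO estimator, not slime's exact code).
    m, s = mean(rewards), stdev(rewards)
    return [(r - m) / (s + eps) for r in rewards]

# One group of 8 samples for a single prompt (0/1 correctness rewards):
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
# Correct answers get positive advantages, wrong ones negative; the
# asymmetric clip (--eps-clip 0.2 / --eps-clip-high 0.28) then bounds the
# policy ratio to [1 - 0.2, 1 + 0.28] in the PPO-style surrogate loss.
```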

Optimizer Configuration

OPTIMIZER_ARGS=(
   --optimizer adam
   --lr 1e-6
   --lr-decay-style constant
   --weight-decay 0.1
   --adam-beta1 0.9
   --adam-beta2 0.98
)

SGLang Settings

SGLANG_ARGS=(
   --rollout-num-gpus-per-engine 2
   --sglang-mem-fraction-static 0.7
)
SGLang runs inference with TP=2 (four engines across the 8 GPUs). mem-fraction-static=0.7 caps SGLang's static memory pool (weights plus KV cache) at 70% of GPU memory, leaving headroom for Megatron training in the co-located setup.

Run Training

Execute the training script:
cd /root/slime
bash scripts/run-qwen3-4B.sh
The script launches a Ray cluster with co-located training and inference:
ray job submit --address="http://127.0.0.1:8265" \
   --runtime-env-json="${RUNTIME_ENV_JSON}" \
   -- python3 train.py \
   --actor-num-nodes 1 \
   --actor-num-gpus-per-node 8 \
   --colocate \
   ${MODEL_ARGS[@]} \
   ${CKPT_ARGS[@]} \
   ${ROLLOUT_ARGS[@]} \
   ${OPTIMIZER_ARGS[@]} \
   ${GRPO_ARGS[@]} \
   ${PERF_ARGS[@]} \
   ${EVAL_ARGS[@]} \
   ${SGLANG_ARGS[@]} \
   ${MISC_ARGS[@]}

Advanced Features

Decoupled Training and Inference

To separate training and inference GPUs:
ray job submit ... \
   -- python3 train.py \
   --actor-num-nodes 1 \
   --actor-num-gpus-per-node 2 \
   --rollout-num-gpus 6 \
   ...
This allocates 2 GPUs for training and 6 GPUs for inference.
If request concurrency exceeds the largest captured CUDA graph batch size (160 by default), either cap server concurrency or extend the list of captured batch sizes:
--sglang-server-concurrency 160  # Limit concurrent requests
# OR
--sglang-cuda-graph-bs 1 2 4 8 $(seq 16 8 256)  # Increase CUDA graphs
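The `$(seq 16 8 256)` expansion covers batch sizes from 16 to 256 in steps of 8; you can inspect it directly:

```shell
# Expands to 31 values: 16 24 32 ... 256
seq 16 8 256 | tr '\n' ' '
```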

Asynchronous Training

With decoupled training/inference, enable asynchronous training to overlap data generation and training:
ray job submit ... \
   -- python3 train_async.py \
   ...
train_async.py generates data for rollout N+1 while training on rollout N, eliminating GPU idle time.

Dynamic Sampling

Enable DAPO-style dynamic sampling:
--rollout-batch-size 32 \
--n-samples-per-prompt 8 \
--over-sampling-batch-size 64 \
--dynamic-sampling-filter-path \
  slime.rollout.filter_hub.dynamic_sampling_filters.check_reward_nonzero_std
The filter function checks if rewards have non-zero standard deviation:
import torch

def check_reward_nonzero_std(args, samples: list[Sample], **kwargs):
    # Keep the group only if its rewards are not all identical, i.e. the
    # group yields a non-zero GRPO advantage signal.
    rewards = [sample.reward for sample in samples]
    return torch.tensor(rewards, dtype=torch.float).std() > 0.0
Sampling stops when 32 prompts (256 samples) pass the filter. If too many prompts are discarded, another batch of 64 prompts is sampled.
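The overall loop can be sketched as follows (a simplification with hypothetical helpers, not slime's actual rollout code: `sample_group()` rolls out one group of n-samples-per-prompt responses, and `passes_filter` is e.g. check_reward_nonzero_std):

```python
def dynamic_sampling_round(sample_group, passes_filter,
                           target_prompts=32, over_sampling_batch_size=64):
    # Over-sample groups of rollouts in batches of `over_sampling_batch_size`
    # prompts until `target_prompts` groups pass the filter, discarding
    # groups whose rewards carry no learning signal.
    kept = []
    while len(kept) < target_prompts:
        batch = [sample_group() for _ in range(over_sampling_batch_size)]
        kept.extend(group for group in batch if passes_filter(group))
    return kept[:target_prompts]
```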

Partial Rollout

Save partially generated requests during dynamic sampling:
--partial-rollout \
--buffer-filter-path slime.rollout.filter_hub.buffer_filters.pop_first
The default buffer filter retrieves the first N prompts from the buffer:
def pop_first(args, rollout_id, buffer: list[list[Sample]], num_samples: int):
    num_to_pop = min(len(buffer), num_samples)
    samples = buffer[:num_to_pop]
    del buffer[:num_to_pop]
    return samples
Each partial rollout sample stores its original rollout ID in sample.metadata, useful for filtering.
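For instance, a custom buffer filter could drop partial rollouts that are more than one rollout old (a hypothetical variant, not part of slime; it assumes sample.metadata["rollout_id"] is populated as described above):

```python
def pop_recent(args, rollout_id, buffer, num_samples):
    # Hypothetical variant of pop_first: only reuse partial rollouts
    # generated in the previous rollout, discarding anything staler.
    fresh = [group for group in buffer
             if group[0].metadata.get("rollout_id", -1) >= rollout_id - 1]
    num_to_pop = min(len(fresh), num_samples)
    samples = fresh[:num_to_pop]
    # Keep only the fresh, un-returned groups in the buffer.
    buffer[:] = fresh[num_to_pop:]
    return samples
```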

Key Configuration Parameters

MODEL_ARGS

Reads model configuration from scripts/models/qwen3-4B.sh:
SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
source "${SCRIPT_DIR}/models/qwen3-4B.sh"
Ensure --rotary-base and other architecture parameters match your model. Override if needed:
source "${SCRIPT_DIR}/models/qwen3-4B.sh"
MODEL_ARGS+=( --rotary-base 10000 )

CKPT_ARGS

CKPT_ARGS=(
   --hf-checkpoint /root/Qwen3-4B
   --ref-load /root/Qwen3-4B_torch_dist
   --load /root/Qwen3-4B_slime/
   --save /root/Qwen3-4B_slime/
   --save-interval 20
)
  • --hf-checkpoint: HF checkpoint for SGLang and tokenizer
  • --ref-load: Reference model checkpoint (frozen)
  • --load: Actor model checkpoint (if empty, loads from ref-load)
  • --save: Where to save training checkpoints every 20 rollouts

Dynamic Batch Sizing

slime packs variable-length samples together while preserving exact per-sample and per-token loss semantics, so dynamic batch sizing improves memory utilization without changing the computed gradients.
When --use-dynamic-batch-size is enabled:
  • --max-tokens-per-gpu sets the token limit per GPU (9,216 for Qwen3-4B)
  • Traditional --micro-batch-size is ignored
  • Samples exceeding the limit form their own batch without truncation
  • With CP enabled, GPUs share CP × max_tokens_per_gpu tokens
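The packing logic can be sketched as a greedy pass over sample lengths (a simplification; slime's actual scheduler also balances work across data-parallel ranks):

```python
def pack_by_token_budget(sample_lengths, max_tokens_per_gpu=9216):
    # Greedily pack samples into micro-batches so each batch stays within
    # the token budget; an over-long sample still gets its own batch
    # rather than being truncated.
    batches, current, current_tokens = [], [], 0
    for length in sample_lengths:
        if current and current_tokens + length > max_tokens_per_gpu:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(length)
        current_tokens += length
    if current:
        batches.append(current)
    return batches
```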

Evaluation

EVAL_ARGS=(
   --eval-interval 20
   --eval-prompt-data aime /root/aime-2024/aime-2024.jsonl
   --n-samples-per-eval-prompt 16
   --eval-max-response-len 16384
   --eval-top-p 1
)
Evaluation runs every 20 rollouts on AIME 2024 with:
  • 16 samples per prompt
  • Maximum response length of 16,384 tokens
  • Full-distribution sampling (top-p = 1, i.e. no nucleus truncation)
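With 16 samples per prompt, the reported score is typically the mean accuracy over samples, averaged over prompts (often written avg@16). A sketch of that aggregation (an assumption; the exact metric is determined by slime's reward model):

```python
def avg_at_k(correct_flags_per_prompt):
    # Mean pass rate over the k samples of each prompt, averaged over prompts.
    per_prompt = [sum(flags) / len(flags) for flags in correct_flags_per_prompt]
    return sum(per_prompt) / len(per_prompt)

# Two AIME problems, 4 samples each (1 = correct):
score = avg_at_k([[1, 0, 1, 1], [0, 0, 1, 0]])  # (0.75 + 0.25) / 2 = 0.5
```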
