Overview

This example demonstrates training DeepSeek-R1, a 671B Mixture-of-Experts model, using 128×H100 GPUs across 16 nodes. The configuration uses BF16 for training, FP8 blockwise quantization for inference, advanced 4D parallelism (TP8, PP4, CP4, EP32), and dynamic sampling for data filtering.

Model Specifications

  • Model: DeepSeek-R1 (deepseek-ai/DeepSeek-R1)
  • Architecture: Mixture-of-Experts (MoE)
  • Parameters: 671 billion
  • Hardware: 128×H100 GPUs (16 nodes × 8 GPUs)
  • Training Precision: BF16
  • Inference Precision: FP8 (128×128 blockwise quantization)
  • Parallelism: TP8, PP4, CP4, EP32 (Megatron), EP64 (SGLang)
  • Memory: CPU Adam with ~1.4-1.5TB host memory per node
  • Max Response Length: 32K tokens

Dataset

Training prompts come from dapo-math-17k (/root/dapo-math-17k/dapo-math-17k.jsonl) and evaluation prompts from AIME 2024 (/root/aime-2024/aime-2024.jsonl), matching the paths used in the rollout and evaluation configurations below.

Environment Setup

For basic environment setup and data download, refer to the Qwen3-4B example.

1. Download model to shared storage

Download DeepSeek-R1 to a directory accessible by all machines ($BASE_DIR):
hf download deepseek-ai/DeepSeek-R1 --local-dir $BASE_DIR/DeepSeek-R1

2. Convert FP8 to BF16

The HF checkpoint is in FP8 format. Convert it to BF16 for training:
cd slime/
python tools/fp8_cast_bf16.py \
  --input-fp8-hf-path $BASE_DIR/DeepSeek-R1 \
  --output-bf16-hf-path $BASE_DIR/DeepSeek-R1-bf16/

3. Convert checkpoint to torch_dist (4 nodes)

Convert the BF16 checkpoint to Megatron-loadable format using 4 nodes (32 GPUs):
cd slime/
source scripts/models/deepseek-v3.sh
PYTHONPATH=/root/Megatron-LM/ torchrun \
   --nproc-per-node 8 \
   --master-addr ${MASTER_ADDR} --master-port 12345 \
   --nnodes=4 --node-rank ${NODE_RANK} \
   tools/convert_hf_to_torch_dist.py \
   ${MODEL_ARGS[@]} \
   --tensor-model-parallel-size 1 \
   --pipeline-model-parallel-size 8 \
   --expert-tensor-parallel-size 1 \
   --expert-model-parallel-size 4 \
   --decoder-first-pipeline-num-layers 7 \
   --decoder-last-pipeline-num-layers 6 \
   --hf-checkpoint $BASE_DIR/DeepSeek-R1-bf16/ \
   --save $BASE_DIR/DeepSeek-R1_torch_dist/
  • MASTER_ADDR: IP address of node 0
  • NODE_RANK: Node index (0-3)
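
If the cluster already has an MPI hostfile (as used below to automate Ray workers), MASTER_ADDR and NODE_RANK can be derived rather than set by hand. A minimal sketch, assuming one `ip slot=8` entry per line with node 0 listed first; the helper names are illustrative:

```shell
# Hypothetical helpers: derive MASTER_ADDR and NODE_RANK from a hostfile.
master_from_hostfile() {  # usage: master_from_hostfile <hostfile>
  awk 'NR == 1 { print $1 }' "$1"   # node 0 is the first entry
}
rank_from_hostfile() {    # usage: rank_from_hostfile <hostfile> <my_ip>
  awk -v ip="$2" '$1 == ip { print NR - 1 }' "$1"
}

# On each node (MY_IP could come from: hostname -I | awk '{print $1}'):
# MASTER_ADDR=$(master_from_hostfile $BASE_DIR/mpi_hostfile)
# NODE_RANK=$(rank_from_hostfile $BASE_DIR/mpi_hostfile $MY_IP)
```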

Training Execution

1. Start Ray on node 0

On node 0, run:
cd slime/
bash scripts/run-deepseek-r1.sh

2. Join Ray cluster on worker nodes

On each worker node, join the Ray cluster:
ray start --address=${MASTER_ADDR}:6379 --num-gpus 8 \
  --node-ip-address ${WORKER_IP} --disable-usage-stats

Automated Worker Setup

If you have an MPI hostfile (each line: ip slot=8), add this to scripts/run-deepseek-r1.sh after ray start --head to automate worker setup:
for WORKER_IP in $(awk '{print $1}' $BASE_DIR/mpi_hostfile); do
  if [[ "$WORKER_IP" == "$MASTER_ADDR" ]]; then
    continue
  fi
  echo "Starting Ray worker on ${WORKER_IP}"
  ssh root@"${WORKER_IP}" \
    "pkill -9 sglang ; ray stop --force ; pkill -9 python ; \
     ray start --address=${MASTER_ADDR}:6379 --num-gpus 8 \
       --node-ip-address ${WORKER_IP} --disable-usage-stats" &
done
wait

Training Configuration

MODEL_ARGS

Load model configuration from scripts/models/deepseek-v3.sh:
SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
source "${SCRIPT_DIR}/models/deepseek-v3.sh"
DeepSeek-R1 uses the DeepSeek-v3 architecture configuration.

Checkpoint Configuration (CKPT_ARGS)

CKPT_ARGS=(
   # HF checkpoint for SGLang (FP8 format)
   --hf-checkpoint $BASE_DIR/DeepSeek-R1/
   #--hf-checkpoint $BASE_DIR/DeepSeek-R1-bf16/
   # Reference model checkpoint (BF16 format)
   --ref-load $BASE_DIR/DeepSeek-R1_torch_dist/
   # Actor model checkpoint
   --load $BASE_DIR/DeepSeek-R1_slime/
   --save $BASE_DIR/DeepSeek-R1_slime/
   --save-interval 20
)
slime performs online quantization during training. The FP8 checkpoint enables blockwise quantization of parameters before passing them to SGLang, improving inference throughput.
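
As an illustration, 128×128 blockwise quantization keeps one scale per 128×128 tile of each weight matrix, so the quantization error is bounded per tile rather than per tensor. A simplified NumPy sketch of the idea (not slime's actual FP8 code path, which casts to e4m3 on GPU):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # dynamic range of the e4m3 format

def blockwise_quantize(w: np.ndarray, block: int = 128):
    """Return per-block-scaled weights plus a (rows/block, cols/block) scale grid."""
    rows, cols = w.shape
    q = np.empty_like(w, dtype=np.float32)
    scales = np.empty((rows // block, cols // block), dtype=np.float32)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            blk = w[i:i + block, j:j + block]
            # One scale per tile, chosen so the tile fits the fp8 range.
            s = max(np.abs(blk).max() / FP8_E4M3_MAX, 1e-12)
            scales[i // block, j // block] = s
            q[i:i + block, j:j + block] = blk / s  # real code casts this to fp8
    return q, scales
```

Dequantization multiplies each tile by its stored scale, so one outlier weight only degrades precision within its own 128×128 block.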

Advanced Parallelism (PERF_ARGS)

PERF_ARGS=(
   --tensor-model-parallel-size 8
   --sequence-parallel
   --pipeline-model-parallel-size 4
   --context-parallel-size 4
   --expert-model-parallel-size 32
   --expert-tensor-parallel-size 1
   --decoder-last-pipeline-num-layers 13

   --recompute-granularity full
   --recompute-method uniform
   --recompute-num-layers 1

   --use-dynamic-batch-size
   --max-tokens-per-gpu 16384
)
4D Parallelism: TP=8, PP=4, CP=4, EP=32 for 128 GPUs. DeepSeek-R1 has 61 layers, which doesn't divide evenly by PP=4; the last pipeline stage takes 13 layers (configured via --decoder-last-pipeline-num-layers 13).

GRPO Configuration

GRPO_ARGS=(
   --advantage-estimator grpo
   --use-kl-loss
   --kl-loss-coef 0.00
   --kl-loss-type low_var_kl
   --entropy-coef 0.00
   --eps-clip 0.2
   --eps-clip-high 0.28
)
To train without a reference model, remove --use-kl-loss and ensure --kl-loss-coef 0.00.
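
The asymmetric bounds (--eps-clip 0.2 vs --eps-clip-high 0.28) give the importance ratio more headroom on the upside, so low-probability tokens with positive advantage can still be upweighted before clipping kicks in. A toy per-token sketch of this clipping (the function name is illustrative, not slime's API):

```python
def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate with separate lower/upper clip bounds."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic bound: take the worse of the unclipped/clipped surrogates.
    return min(ratio * advantage, clipped * advantage)

# With positive advantage the ratio may grow to 1.28 before clipping:
#   clipped_objective(1.5, 1.0)  ≈ 1.28
# With negative advantage the usual lower bound of 1 - 0.2 applies:
#   clipped_objective(0.5, -1.0) ≈ -0.8
```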

CPU Adam Optimization

OPTIMIZER_ARGS=(
   --optimizer adam
   --lr 1e-6
   --lr-decay-style constant
   --weight-decay 0.1
   --adam-beta1 0.9
   --adam-beta2 0.98

   --optimizer-cpu-offload
   --overlap-cpu-optimizer-d2h-h2d
   --use-precision-aware-optimizer
)
CPU Adam saves GPU memory but requires 1.4-1.5TB host memory per node (8×H100). If a single machine lacks sufficient host memory, expand parallelism by adding more GPUs.

SGLang Configuration with deepep

SGLANG_ARGS=(
   --rollout-num-gpus-per-engine 64
   --sglang-mem-fraction-static 0.7
   --sglang-ep-size 64

   # DP attention
   --sglang-enable-dp-attention
   --sglang-dp-size 8
   --sglang-moe-dense-tp-size 1
   --sglang-enable-dp-lm-head

   # Enable deepep for SGLang
   --sglang-moe-a2a-backend deepep
   --sglang-deepep-mode auto

   # Allow higher concurrency per DP rank
   --sglang-server-concurrency 1024
)
EP=64, DP=8 for SGLang with deepep for efficient large-scale MoE inference. --sglang-server-concurrency 1024 allows 128 concurrent requests per DP rank (8 DP ranks × 128 = 1024 total).

Miscellaneous Configuration (MISC_ARGS)

MISC_ARGS=(
   --attention-dropout 0.0
   --hidden-dropout 0.0
   --accumulate-allreduce-grads-in-fp32
   --attention-softmax-in-fp32
   --attention-backend flash

   # Use deepep for Megatron
   --moe-enable-deepep
   --moe-token-dispatcher-type flex
)
Megatron’s deepep is configured for efficient MoE communication.

Rollout Configuration

ROLLOUT_ARGS=(
   --prompt-data /root/dapo-math-17k/dapo-math-17k.jsonl
   --input-key prompt
   --label-key label
   --apply-chat-template
   --rollout-shuffle
   --rm-type deepscaler
   --num-rollout 3000
   --rollout-batch-size 32
   --n-samples-per-prompt 8
   --rollout-max-response-len 32768
   --rollout-temperature 1

   --global-batch-size 256
   --balance-data

   # Dynamic sampling for data filtering
   --over-sampling-batch-size 64
   --dynamic-sampling-filter-path \
     slime.rollout.filter_hub.dynamic_sampling_filters.check_reward_nonzero_std
)
Maximum response length: 32K tokens for complex reasoning tasks. Dynamic sampling filters out low-quality data where all responses are correct or all incorrect (zero reward variance).
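
The filter itself is simple. A minimal sketch of what check_reward_nonzero_std does (illustrative only; the real implementation lives in slime.rollout.filter_hub.dynamic_sampling_filters):

```python
def check_reward_nonzero_std(rewards: list[float]) -> bool:
    """Keep a prompt group only if its sampled responses disagree.

    Under GRPO, if all n responses to a prompt receive the same reward
    (all correct or all incorrect), the group-relative advantage is zero
    and the prompt contributes no gradient signal.
    """
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return variance > 0.0

print(check_reward_nonzero_std([1, 1, 1, 1, 1, 1, 1, 1]))  # False
print(check_reward_nonzero_std([1, 0, 1, 0, 1, 1, 0, 1]))  # True
```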

Evaluation

EVAL_ARGS=(
   --eval-interval 20
   --eval-prompt-data aime /root/aime-2024/aime-2024.jsonl
   --n-samples-per-eval-prompt 16
   --eval-max-response-len 16384
   --eval-top-p 1
)

Key Features

BF16 Training with FP8 Inference

DeepSeek-R1 uses:
  • BF16 training: Full precision training on reference and actor models
  • FP8 inference: 128×128 blockwise quantization for faster rollout generation
slime automatically handles quantization when --hf-checkpoint points to an FP8 model:
CKPT_ARGS=(
   --hf-checkpoint $BASE_DIR/DeepSeek-R1/  # FP8 format
   --ref-load $BASE_DIR/DeepSeek-R1_torch_dist/  # BF16 format
   ...
)

Dynamic Sampling

Dynamic sampling filters low-quality data during rollout generation:
--over-sampling-batch-size 64 \
--dynamic-sampling-filter-path \
  slime.rollout.filter_hub.dynamic_sampling_filters.check_reward_nonzero_std
With --rollout-batch-size 32 and --over-sampling-batch-size 64:
  1. Sample 64 prompts with 8 responses each (512 samples)
  2. Filter samples using check_reward_nonzero_std (keeps only prompts with varying rewards)
  3. Stop when 32 prompts (256 samples) pass the filter
  4. If too many are discarded, sample another batch of 64 prompts
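
The steps above can be sketched end-to-end (the sampler and filter here are toy stand-ins, not slime's APIs):

```python
import random

def sample_prompt_group(n_samples: int) -> list[int]:
    """Toy stand-in for rollout + reward: n binary rewards per prompt."""
    return [random.randint(0, 1) for _ in range(n_samples)]

def has_reward_variance(rewards: list[int]) -> bool:
    return len(set(rewards)) > 1  # nonzero std iff rewards are not all equal

def dynamic_sampling(target=32, over_batch=64, n_samples=8):
    kept = []
    while len(kept) < target:  # step 4: sample another over-batch if short
        batch = [sample_prompt_group(n_samples) for _ in range(over_batch)]
        kept += [g for g in batch if has_reward_variance(g)]  # step 2
    return kept[:target]       # step 3: stop once the target count passes

groups = dynamic_sampling()
print(len(groups), len(groups[0]))  # 32 8
```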

Dynamic Batch Sizing

--max-tokens-per-gpu 16384 with --context-parallel-size 4 means each CP group shares 65,536 tokens total. This is essential for handling 32K response lengths efficiently.
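
Dynamic batch sizing can be pictured as greedy token-budget packing: variable-length samples are grouped so no micro-batch exceeds the per-CP-group budget. A simplified sketch (illustrative, not slime's scheduler):

```python
def pack_by_token_budget(lengths, max_tokens_per_gpu=16384, cp_size=4):
    """Greedily pack sample lengths into batches under a token budget."""
    budget = max_tokens_per_gpu * cp_size   # 65,536 tokens per CP group
    batches, cur, cur_tokens = [], [], 0
    for n in lengths:
        if cur and cur_tokens + n > budget:
            batches.append(cur)             # flush the full micro-batch
            cur, cur_tokens = [], 0
        cur.append(n)
        cur_tokens += n
    if cur:
        batches.append(cur)
    return batches

# Two 32K responses fill one micro-batch; shorter ones share the next.
print(pack_by_token_budget([32768, 32768, 20000, 12000, 1000]))
# → [[32768, 32768], [20000, 12000, 1000]]
```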

Pipeline Parallelism for Non-Divisible Layers

DeepSeek-R1 has 61 layers, which doesn't divide evenly by PP=4:
  • First 3 pipeline stages: 16 layers each (48 total)
  • Last pipeline stage: 13 layers (configured via --decoder-last-pipeline-num-layers 13)
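
The split can be verified in a few lines (assuming, per the flags above, that the first PP-1 stages share the remaining layers equally):

```python
# Layer partition implied by PP=4 and --decoder-last-pipeline-num-layers 13.
total_layers, pp, last = 61, 4, 13
per_stage = (total_layers - last) // (pp - 1)   # 48 // 3 = 16
stages = [per_stage] * (pp - 1) + [last]
print(stages)                                   # [16, 16, 16, 13]
assert sum(stages) == total_layers
```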

Memory Requirements

Host Memory: Each node (8×H100) requires 1.4-1.5TB host memory for CPU Adam optimizer states. If insufficient host memory is available, add more GPUs to expand parallelism and reduce per-GPU optimizer state size.
