If you encounter problems on AMD Instinct, feel free to reach out to Yusheng Su.

Introduction

This tutorial explains how to set up the development environment for running slime on AMD Instinct GPUs (MI300 and MI325). It covers Docker setup, ROCm dependencies, and a complete Qwen3-4B training example. Note that the current ROCm Docker image supports only AMD's MI300 and MI325 GPUs.

Docker Setup

Using Pre-built Image

You can download the prebuilt image from DockerHub:
docker pull rlsys/slime:latest

Building From Source

Alternatively, build the image using the provided Dockerfile:
cd docker
docker build -f Dockerfile.rocm -t rlsys/slime:latest .
Acknowledgement: Thanks to Yang Wang for the patch that enables virtual memory management on MI300X in this ROCm base Docker image.

Quick Start

Step 1: Launch Docker Container

Start the container with the necessary device access and configurations:
docker run --rm -it \
  --device /dev/dri \
  --device /dev/kfd \
  -p 8265:8265 \
  --group-add video \
  --cap-add SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --privileged \
  -v $HOME/.ssh:/root/.ssh \
  -v $HOME:$HOME \
  --shm-size 128G \
  --name slime_dev \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -w $PWD \
  rlsys/slime:latest \
  /bin/bash
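Once the container is up, a quick sanity check is to confirm that the device passthrough worked, i.e. that the ROCm device nodes requested with the `--device` flags above are visible inside the container. A minimal sketch (the helper `missing_rocm_devices` is not part of slime, just an illustration):

```python
import os

def missing_rocm_devices(paths=("/dev/kfd", "/dev/dri")):
    """Return the device nodes from `paths` that are not visible.

    /dev/kfd and /dev/dri are the nodes passed via --device in the
    docker run command; inside a correctly launched container this
    list should be empty.
    """
    return [p for p in paths if not os.path.exists(p)]

print(missing_rocm_devices())
```

If the printed list is non-empty, re-check the `--device` flags and verify that the host itself exposes `/dev/kfd`.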
Step 2: Install slime

Clone the repository and install slime:
git clone https://github.com/THUDM/slime.git
cd slime
pip install -e . --no-deps
If you encounter an issue where slime cannot be found later, run pip install -e . --no-deps again in the slime directory.
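A quick way to check whether the editable install took effect is to test that slime is importable from the current environment. A small sketch (`is_importable` is a hypothetical helper, not part of slime):

```python
import importlib.util

def is_importable(name: str) -> bool:
    """Return True if the module `name` can be found on sys.path."""
    return importlib.util.find_spec(name) is not None

# After `pip install -e . --no-deps` succeeds, this should print True:
print(is_importable("slime"))
```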
Step 3: Download Model and Data

Download the required model checkpoint and datasets:
# HF checkpoint
hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B

# Training data
hf download --repo-type dataset zhuzilin/dapo-math-17k \
  --local-dir /root/dapo-math-17k

# Evaluation data
hf download --repo-type dataset zhuzilin/aime-2024 \
  --local-dir /root/aime-2024
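The training script later reads prompts with `--input-key prompt` and `--label-key label`, so each line of the downloaded JSONL is expected to carry those two fields. A hedged sketch of parsing one such line (the sample record below is made up for illustration, not taken from the dataset):

```python
import json

# Hypothetical record in the shape the training flags assume
# (--input-key prompt, --label-key label):
sample_line = '{"prompt": "What is 2 + 2?", "label": "4"}'

record = json.loads(sample_line)
prompt, label = record["prompt"], record["label"]
print(prompt, "->", label)
```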
Step 4: Convert Checkpoint Format

Since slime uses Megatron, which doesn’t support loading HuggingFace checkpoints directly, convert the model to the torch_dist format:
cd slime/
source scripts/models/qwen3-4B.sh
MEGATRON_LM_PATH=$(pip list | grep megatron-core | awk '{print $NF}')
PYTHONPATH=${MEGATRON_LM_PATH} python tools/convert_hf_to_torch_dist.py \
    ${MODEL_ARGS[@]} \
    --no-gradient-accumulation-fusion \
    --hf-checkpoint /root/Qwen3-4B \
    --save /root/Qwen3-4B_torch_dist
A dedicated AMD conversion script forces a CPU-only conversion workflow using the Gloo backend to bypass hardware-specific issues. A GPU-based script for ROCm is currently in development.
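The `MEGATRON_LM_PATH` line above takes the last column of the matching `pip list` row, which for an editable/source install of megatron-core is the package's location on disk. A sketch of that extraction against a sample row (the row itself is illustrative; the real command pipes live `pip list` output):

```shell
# Sample `pip list` row for an editable megatron-core install (illustrative):
row="megatron-core        0.13.0      /workspace/Megatron-LM"

# awk '{print $NF}' keeps only the last whitespace-separated field:
MEGATRON_LM_PATH=$(echo "$row" | awk '{print $NF}')
echo "$MEGATRON_LM_PATH"
```

Note that this only yields a path when megatron-core is installed in editable mode; for a plain wheel install `pip list` has no location column, and `$NF` would be the version string instead.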

Running Training: Qwen3-4B Example

Run the provided training script for Qwen3-4B:
SLIME_DIR=/root \
MODEL_DIR=/root \
DATA_DIR=/root \
bash scripts/run-qwen3-4B-amd.sh

AMD-Specific Configuration

The main differences between ROCm and NVIDIA training scripts:
# For AMD GPU - set these variables for Ray to function properly
export RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES="1"  # Must set to 1
export HIP_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"  # Choose which GPUs to use
Note: ROCm does not currently support apex, so gradient accumulation fusion must be disabled by adding the --no-gradient-accumulation-fusion flag to the training script. We are still investigating how to re-enable it.
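HIP_VISIBLE_DEVICES also determines how many GPUs the training script hands to Ray: the launcher simply counts the comma-separated device IDs. A minimal sketch of that counting logic:

```shell
# Count the comma-separated device IDs (the same logic the launch
# script uses to compute --num-gpus for `ray start`):
HIP_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
NUM_GPUS=$(echo ${HIP_VISIBLE_DEVICES} | tr ',' '\n' | wc -l)
echo ${NUM_GPUS}
```

So restricting HIP_VISIBLE_DEVICES to, say, "0,1,2,3" automatically registers only four GPUs with Ray.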

Complete Training Script

Here’s the complete training script adapted for AMD GPUs:
#!/bin/bash

# Cleanup previous runs
pkill -9 sglang
sleep 3
ray stop --force
pkill -9 ray
pkill -9 python
sleep 3

set -euxo pipefail

### AMD Support ###
SLIME_DIR="${SLIME_DIR:-/home/yushensu/projects/slime}"
export SLIME_DIR

MODEL_DIR="${MODEL_DIR:-/home/yushensu/projects/model}"
export MODEL_DIR

DATA_DIR="${DATA_DIR:-/home/yushensu/projects/data}"
export DATA_DIR

# For AMD GPU
export RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES="1"  # Must set to 1
export HIP_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"  # Choose which GPUs to use
####################

export PYTHONUNBUFFERED=1

SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
source "${SCRIPT_DIR}/models/qwen3-4B.sh"

CKPT_ARGS=(
   --hf-checkpoint ${MODEL_DIR}/Qwen3-4B
   --ref-load ${MODEL_DIR}/Qwen3-4B_torch_dist
   --load ${MODEL_DIR}/Qwen3-4B_slime/
   --save ${MODEL_DIR}/Qwen3-4B_slime/
   --save-interval 20
)

ROLLOUT_ARGS=(
   --prompt-data ${DATA_DIR}/dapo-math-17k/dapo-math-17k.jsonl
   --input-key prompt
   --label-key label
   --apply-chat-template
   --rollout-shuffle
   --rm-type deepscaler
   --num-rollout 3000
   --rollout-batch-size 32
   --n-samples-per-prompt 8
   --rollout-max-response-len 8192
   --rollout-temperature 1
   --global-batch-size 256
   --balance-data
)

PERF_ARGS=(
   --tensor-model-parallel-size 2
   --sequence-parallel
   --pipeline-model-parallel-size 1
   --context-parallel-size 1
   --recompute-granularity full
   --recompute-method uniform
   --recompute-num-layers 1
   --use-dynamic-batch-size
   --max-tokens-per-gpu 9216
)

MISC_ARGS=(
   --attention-dropout 0.0
   --hidden-dropout 0.0
   --accumulate-allreduce-grads-in-fp32
   --attention-softmax-in-fp32
   --attention-backend flash
   --no-gradient-accumulation-fusion  # AMD: Need to add apex to enable this
)

# Launch Ray head node
export MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
NUM_GPUS=$(echo ${HIP_VISIBLE_DEVICES} | tr ',' '\n' | wc -l)
ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus ${NUM_GPUS} \
  --disable-usage-stats --dashboard-host=0.0.0.0 --dashboard-port=8265

MEGATRON_LM_PATH=$(pip list | grep megatron-core | awk '{print $NF}')

ray job submit --address="http://127.0.0.1:8265" \
   --runtime-env-json='{
     "env_vars": {
        "PYTHONPATH": "/workspace/Megatron-LM/",
        "CUDA_DEVICE_MAX_CONNECTIONS": "1"
     }
   }' \
   -- python3 train.py \
   --actor-num-nodes 1 \
   --actor-num-gpus-per-node 8 \
   --colocate \
   ${MODEL_ARGS[@]} \
   ${CKPT_ARGS[@]} \
   ${ROLLOUT_ARGS[@]} \
   ${PERF_ARGS[@]} \
   ${MISC_ARGS[@]}

# Cleanup after training
pkill -9 sglang
sleep 3
ray stop --force
pkill -9 ray
pkill -9 python
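The rollout flags in the script above are mutually consistent, assuming each rollout step's generations form one optimization batch: 32 prompts per rollout times 8 samples per prompt yields 256 samples, matching --global-batch-size. A quick arithmetic check:

```python
rollout_batch_size = 32    # --rollout-batch-size (prompts per rollout)
n_samples_per_prompt = 8   # --n-samples-per-prompt
global_batch_size = 256    # --global-batch-size

samples_per_rollout = rollout_batch_size * n_samples_per_prompt
assert samples_per_rollout == global_batch_size
print(samples_per_rollout)
```

If you change any one of these three values, adjust the others to keep the product relationship intact.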

Key Differences from NVIDIA Setup

| Aspect            | AMD (ROCm)                                              | NVIDIA (CUDA)        |
|-------------------|---------------------------------------------------------|----------------------|
| Device visibility | HIP_VISIBLE_DEVICES                                     | CUDA_VISIBLE_DEVICES |
| Ray configuration | RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES=1            | Not required         |
| Gradient fusion   | --no-gradient-accumulation-fusion (apex not supported)  | Enabled by default   |
| Docker devices    | --device /dev/dri --device /dev/kfd                     | --gpus all           |
| Base image        | ROCm 6.3.4 with virtual memory patch                    | Standard CUDA images |

Troubleshooting

slime cannot be found: run the installation command again:
cd slime
pip install -e . --no-deps

Ray does not see the GPUs: ensure you've set the required environment variables:
export RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES="1"
export HIP_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"

Gradient accumulation fusion errors: make sure you've added the --no-gradient-accumulation-fusion flag to your training arguments. ROCm doesn't currently support apex.

Supported Hardware

  • AMD Instinct MI300 series
  • AMD Instinct MI325 series
