If you encounter problems on AMD Instinct, feel free to reach out to Yusheng Su.
Introduction
This tutorial explains how to set up the development environment for running slime on AMD Instinct GPUs (MI300 and MI325). The guide covers Docker setup, ROCm dependencies, and provides a complete example for training.
The current ROCm Docker image only supports AMD’s MI300 and MI325 GPUs.
Docker Setup
Using Pre-built Image
You can download the prebuilt image from DockerHub:
docker pull rlsys/slime:latest
Building From Source
Alternatively, build the image using the provided Dockerfile:
cd docker
docker build -f Dockerfile.rocm -t rlsys/slime:latest .
Quick Start
Launch Docker Container
Start the container with the necessary device access and configurations:

docker run --rm -it \
--device /dev/dri \
--device /dev/kfd \
-p 8265:8265 \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v $HOME/.ssh:/root/.ssh \
-v $HOME:$HOME \
--shm-size 128G \
--name slime_dev \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-w $PWD \
rlsys/slime:latest \
/bin/bash
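Once inside the container, it can be worth confirming that the GPUs were actually passed through before going further. This is an illustrative sanity check, not part of the official instructions; it assumes rocm-smi is on the container's PATH, as it is in the ROCm base images:

```shell
# Sanity check: list the GPUs visible inside the container.
# Falls back to a hint if ROCm tooling is missing.
if command -v rocm-smi >/dev/null 2>&1; then
    rocm-smi --showproductname   # should list the MI300/MI325 devices
else
    echo "rocm-smi not found - are you inside the ROCm container?"
fi
```

If no devices are listed, re-check the --device /dev/dri and --device /dev/kfd flags on the docker run command above.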
Install slime
Clone the repository and install slime:

git clone https://github.com/THUDM/slime.git
cd slime
pip install -e . --no-deps
If you encounter an issue where slime cannot be found later, run pip install -e . --no-deps again in the slime directory.
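A quick way to confirm the editable install worked is to try importing the package. This is an illustrative check, not part of the official instructions:

```shell
# Verify the editable install: print the install location on success,
# or a hint to re-run the install command on failure.
python3 -c "import slime; print(slime.__file__)" 2>/dev/null \
    || echo "slime not found - re-run 'pip install -e . --no-deps' in the slime directory"
```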
Download Model and Data
Download the required model checkpoint and datasets:

# HF checkpoint
hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B
# Training data
hf download --repo-type dataset zhuzilin/dapo-math-17k \
--local-dir /root/dapo-math-17k
# Evaluation data
hf download --repo-type dataset zhuzilin/aime-2024 \
--local-dir /root/aime-2024
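The later steps assume these exact paths, so a small check that the downloads landed where expected can save a confusing failure later. This loop is an illustrative sketch using the three paths from the commands above:

```shell
# Check that the checkpoint and datasets exist at the paths used later on.
for p in /root/Qwen3-4B /root/dapo-math-17k /root/aime-2024; do
    if [ -d "$p" ]; then
        echo "ok: $p"
    else
        echo "missing: $p - re-run the corresponding hf download command"
    fi
done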
Convert Checkpoint Format
Since slime uses Megatron, which doesn’t support loading HuggingFace checkpoints directly, convert the model to the torch_dist format:

cd slime/
source scripts/models/qwen3-4B.sh
MEGATRON_LM_PATH=$(pip list | grep megatron-core | awk '{print $NF}')
PYTHONPATH=${MEGATRON_LM_PATH} python tools/convert_hf_to_torch_dist.py \
${MODEL_ARGS[@]} \
--no-gradient-accumulation-fusion \
--hf-checkpoint /root/Qwen3-4B \
--save /root/Qwen3-4B_torch_dist
A dedicated AMD conversion script forces a CPU-only conversion workflow using the Gloo backend to bypass hardware-specific issues. A GPU-based script for ROCm is currently in development.
Running Training: Qwen3-4B Example
Run the provided training script for Qwen3-4B:
SLIME_DIR=/root \
MODEL_DIR=/root \
DATA_DIR=/root \
bash scripts/run-qwen3-4B-amd.sh
AMD-Specific Configuration
The main differences between the ROCm and NVIDIA training scripts:

Environment Variables

# For AMD GPU - set these variables for Ray to function properly
export RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES="1" # Must be set to 1
export HIP_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" # Choose which GPUs to use

GPU Count Detection

The number of GPUs passed to Ray is derived by counting the entries in HIP_VISIBLE_DEVICES (see the NUM_GPUS line in the full script below) rather than relying on CUDA device enumeration.

Disable Gradient Fusion

TODO: ROCm currently doesn’t support apex, so you need to disable gradient accumulation fusion by adding the --no-gradient-accumulation-fusion flag in the training script. We will continue investigating how to enable this.
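The GPU-count detection is plain string handling: count the comma-separated device IDs in HIP_VISIBLE_DEVICES. A minimal, self-contained sketch of the same logic that appears in the full script below:

```shell
# Derive the Ray --num-gpus value by counting the
# comma-separated device IDs in HIP_VISIBLE_DEVICES.
HIP_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
NUM_GPUS=$(echo ${HIP_VISIBLE_DEVICES} | tr ',' '\n' | wc -l)
echo "Using ${NUM_GPUS} GPUs"   # prints: Using 8 GPUs
```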
Complete Training Script
Here’s the complete training script adapted for AMD GPUs:
#!/bin/bash
# Cleanup previous runs
pkill -9 sglang
sleep 3
ray stop --force
pkill -9 ray
pkill -9 python
sleep 3
set -euxo pipefail
### AMD Support ###
SLIME_DIR="${SLIME_DIR:-/home/yushensu/projects/slime}"
export SLIME_DIR
MODEL_DIR="${MODEL_DIR:-/home/yushensu/projects/model}"
export MODEL_DIR
DATA_DIR="${DATA_DIR:-/home/yushensu/projects/data}"
export DATA_DIR
# For AMD GPU
export RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES="1" # Must set to 1
export HIP_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" # Choose which GPUs to use
####################
export PYTHONBUFFERED=16
SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &> /dev/null && pwd)"
source "${SCRIPT_DIR}/models/qwen3-4B.sh"
CKPT_ARGS=(
--hf-checkpoint ${MODEL_DIR}/Qwen3-4B
--ref-load ${MODEL_DIR}/Qwen3-4B_torch_dist
--load ${MODEL_DIR}/Qwen3-4B_slime/
--save ${MODEL_DIR}/Qwen3-4B_slime/
--save-interval 20
)
ROLLOUT_ARGS=(
--prompt-data ${DATA_DIR}/dapo-math-17k/dapo-math-17k.jsonl
--input-key prompt
--label-key label
--apply-chat-template
--rollout-shuffle
--rm-type deepscaler
--num-rollout 3000
--rollout-batch-size 32
--n-samples-per-prompt 8
--rollout-max-response-len 8192
--rollout-temperature 1
--global-batch-size 256
--balance-data
)
PERF_ARGS=(
--tensor-model-parallel-size 2
--sequence-parallel
--pipeline-model-parallel-size 1
--context-parallel-size 1
--recompute-granularity full
--recompute-method uniform
--recompute-num-layers 1
--use-dynamic-batch-size
--max-tokens-per-gpu 9216
)
MISC_ARGS=(
--attention-dropout 0.0
--hidden-dropout 0.0
--accumulate-allreduce-grads-in-fp32
--attention-softmax-in-fp32
--attention-backend flash
--no-gradient-accumulation-fusion # AMD: Need to add apex to enable this
)
# Launch Ray head node
export MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
NUM_GPUS=$(echo ${HIP_VISIBLE_DEVICES} | tr ',' '\n' | wc -l)
ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus ${NUM_GPUS} \
--disable-usage-stats --dashboard-host=0.0.0.0 --dashboard-port=8265
MEGATRON_LM_PATH=$(pip list | grep megatron-core | awk '{print $NF}')
ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json='{
"env_vars": {
"PYTHONPATH": "/workspace/Megatron-LM/",
"CUDA_DEVICE_MAX_CONNECTIONS": "1"
}
}' \
-- python3 train.py \
--actor-num-nodes 1 \
--actor-num-gpus-per-node 8 \
--colocate \
${MODEL_ARGS[@]} \
${CKPT_ARGS[@]} \
${ROLLOUT_ARGS[@]} \
${PERF_ARGS[@]} \
${MISC_ARGS[@]}
# Cleanup after training
pkill -9 sglang
sleep 3
ray stop --force
pkill -9 ray
pkill -9 python
Key Differences from NVIDIA Setup
| Aspect | AMD (ROCm) | NVIDIA (CUDA) |
| --- | --- | --- |
| Device visibility | HIP_VISIBLE_DEVICES | CUDA_VISIBLE_DEVICES |
| Ray configuration | RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES=1 | Not required |
| Gradient fusion | --no-gradient-accumulation-fusion (apex not supported) | Enabled by default |
| Docker devices | --device /dev/dri --device /dev/kfd | --gpus all |
| Base image | ROCm 6.3.4 with virtual memory patch | Standard CUDA images |
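The device-visibility difference can be bridged with a small vendor check when a script needs to run on both stacks. This is an illustrative pattern, not part of the official scripts; probing for rocm-smi is an assumption about what the container image provides:

```shell
# Set the right device-visibility variable for the detected GPU vendor.
GPU_LIST="0,1,2,3,4,5,6,7"
if command -v rocm-smi >/dev/null 2>&1; then
    # AMD / ROCm path: Ray also needs the NOSET override.
    export HIP_VISIBLE_DEVICES="$GPU_LIST"
    export RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES="1"
else
    # NVIDIA / CUDA path.
    export CUDA_VISIBLE_DEVICES="$GPU_LIST"
fi
```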
Troubleshooting
slime module not found

Run the installation command again:

cd slime
pip install -e . --no-deps

Ray does not detect the AMD GPUs

Ensure you’ve set the required environment variables:

export RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES="1"
export HIP_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
Gradient accumulation fusion errors
Make sure you’ve added the --no-gradient-accumulation-fusion flag to your training arguments. ROCm doesn’t currently support apex.
Supported Hardware
AMD Instinct MI300 series
AMD Instinct MI325 series
Additional Resources