If you encounter problems on AMD Instinct, feel free to reach out to Yusheng Su.
Introduction
This tutorial explains how to set up the development environment for running slime on AMD Instinct GPUs (MI300 and MI325). The guide covers Docker setup, ROCm dependencies, and provides a complete example for training.
The current ROCm Docker image only supports AMD’s MI300 and MI325 GPUs.
Docker Setup
Using Pre-built Image
You can download the prebuilt image from DockerHub:
docker pull rlsys/slime:latest
Building From Source
Alternatively, build the image using the provided Dockerfile:
cd docker
docker build -f Dockerfile.rocm -t rlsys/slime:latest .
Quick Start
Launch Docker Container
Start the container with the necessary device access and configurations:

docker run --rm -it \
--device /dev/dri \
--device /dev/kfd \
-p 8265:8265 \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v $HOME/.ssh:/root/.ssh \
-v $HOME:$HOME \
--shm-size 128G \
--name slime_dev \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-w $PWD \
rlsys/slime:latest \
/bin/bash
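Once inside the container, it can be worth confirming that the GPUs were actually passed through before going further. This is an illustrative sanity check, not part of the official instructions; it assumes rocm-smi is on the container's PATH, as it is in the ROCm base images:

```shell
# Sanity check: list the GPUs visible inside the container.
# Falls back to a hint if ROCm tooling is missing.
if command -v rocm-smi >/dev/null 2>&1; then
    rocm-smi --showproductname   # should list the MI300/MI325 devices
else
    echo "rocm-smi not found - are you inside the ROCm container?"
fi
```

If no devices are listed, re-check the --device /dev/dri and --device /dev/kfd flags on the docker run command above.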
Install slime
Clone the repository and install slime:

git clone https://github.com/THUDM/slime.git
cd slime
pip install -e . --no-deps
If you encounter an issue where slime cannot be found later, run pip install -e . --no-deps again in the slime directory.
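A quick way to confirm the editable install worked is to try importing the package. This is an illustrative check, not part of the official instructions:

```shell
# Verify the editable install: print the install location on success,
# or a hint to re-run the install command on failure.
python3 -c "import slime; print(slime.__file__)" 2>/dev/null \
    || echo "slime not found - re-run 'pip install -e . --no-deps' in the slime directory"
```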
Download Model and Data
Download the required model checkpoint and datasets:

# HF checkpoint
hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B
# Training data
hf download --repo-type dataset zhuzilin/dapo-math-17k \
--local-dir /root/dapo-math-17k
# Evaluation data
hf download --repo-type dataset zhuzilin/aime-2024 \
--local-dir /root/aime-2024
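The later steps assume these exact paths, so a small check that the downloads landed where expected can save a confusing failure later. This loop is an illustrative sketch using the three paths from the commands above:

```shell
# Check that the checkpoint and datasets exist at the paths used later on.
for p in /root/Qwen3-4B /root/dapo-math-17k /root/aime-2024; do
    if [ -d "$p" ]; then
        echo "ok: $p"
    else
        echo "missing: $p - re-run the corresponding hf download command"
    fi
done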
Convert Checkpoint Format
Since slime uses Megatron, which doesn’t support loading HuggingFace checkpoints directly, convert the model to the torch_dist format:

cd slime/
source scripts/models/qwen3-4B.sh
MEGATRON_LM_PATH=$(pip list | grep megatron-core | awk '{print $NF}')
PYTHONPATH=${MEGATRON_LM_PATH} python tools/convert_hf_to_torch_dist.py \
${MODEL_ARGS[@]} \
--no-gradient-accumulation-fusion \
--hf-checkpoint /root/Qwen3-4B \
--save /root/Qwen3-4B_torch_dist
A dedicated AMD conversion script forces a CPU-only conversion workflow using the Gloo backend to bypass hardware-specific issues. A GPU-based script for ROCm is currently in development.
Running Training: Qwen3-4B Example
Run the provided training script for Qwen3-4B:
SLIME_DIR=/root \
MODEL_DIR=/root \
DATA_DIR=/root \
bash scripts/run-qwen3-4B-amd.sh
AMD-Specific Configuration
The main differences between the ROCm and NVIDIA training scripts:

Environment Variables

# For AMD GPU - set these variables for Ray to function properly
export RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES="1" # Must be set to 1
export HIP_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" # Choose which GPUs to use

GPU Count Detection

The number of GPUs passed to Ray is derived by counting the entries in HIP_VISIBLE_DEVICES (see the NUM_GPUS line in the full script below) rather than relying on CUDA device enumeration.

Disable Gradient Fusion

TODO: ROCm currently doesn’t support apex, so you need to disable gradient accumulation fusion by adding the --no-gradient-accumulation-fusion flag in the training script. We will continue investigating how to enable this.
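The GPU-count detection is plain string handling: count the comma-separated device IDs in HIP_VISIBLE_DEVICES. A minimal, self-contained sketch of the same logic that appears in the full script below:

```shell
# Derive the Ray --num-gpus value by counting the
# comma-separated device IDs in HIP_VISIBLE_DEVICES.
HIP_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
NUM_GPUS=$(echo ${HIP_VISIBLE_DEVICES} | tr ',' '\n' | wc -l)
echo "Using ${NUM_GPUS} GPUs"   # prints: Using 8 GPUs
```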
Complete Training Script
Here’s the complete training script adapted for AMD GPUs:
#!/bin/bash
# Cleanup previous runs
pkill -9 sglang
sleep 3
ray stop --force
pkill -9 ray
pkill -9 python
sleep 3
set -euxo pipefail
### AMD Support ###
SLIME_DIR="${SLIME_DIR:-/home/yushensu/projects/slime}"
export SLIME_DIR
MODEL_DIR="${MODEL_DIR:-/home/yushensu/projects/model}"
export MODEL_DIR
DATA_DIR="${DATA_DIR:-/home/yushensu/projects/data}"
export DATA_DIR
# For AMD GPU
export RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES="1" # Must set to 1
export HIP_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" # Choose which GPUs to use
####################
export PYTHONBUFFERED=16
SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &> /dev/null && pwd)"
source "${SCRIPT_DIR}/models/qwen3-4B.sh"
CKPT_ARGS=(
--hf-checkpoint ${MODEL_DIR}/Qwen3-4B
--ref-load ${MODEL_DIR}/Qwen3-4B_torch_dist
--load ${MODEL_DIR}/Qwen3-4B_slime/
--save ${MODEL_DIR}/Qwen3-4B_slime/
--save-interval 20
)
ROLLOUT_ARGS=(
--prompt-data ${DATA_DIR}/dapo-math-17k/dapo-math-17k.jsonl
--input-key prompt
--label-key label
--apply-chat-template
--rollout-shuffle
--rm-type deepscaler
--num-rollout 3000
--rollout-batch-size 32
--n-samples-per-prompt 8
--rollout-max-response-len 8192
--rollout-temperature 1
--global-batch-size 256
--balance-data
)
PERF_ARGS=(
--tensor-model-parallel-size 2
--sequence-parallel
--pipeline-model-parallel-size 1
--context-parallel-size 1
--recompute-granularity full
--recompute-method uniform
--recompute-num-layers 1
--use-dynamic-batch-size
--max-tokens-per-gpu 9216
)
MISC_ARGS=(
--attention-dropout 0.0
--hidden-dropout 0.0
--accumulate-allreduce-grads-in-fp32
--attention-softmax-in-fp32
--attention-backend flash
--no-gradient-accumulation-fusion # AMD: Need to add apex to enable this
)
# Launch Ray head node
export MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
NUM_GPUS=$(echo ${HIP_VISIBLE_DEVICES} | tr ',' '\n' | wc -l)
ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus ${NUM_GPUS} \
--disable-usage-stats --dashboard-host=0.0.0.0 --dashboard-port=8265
MEGATRON_LM_PATH=$(pip list | grep megatron-core | awk '{print $NF}')
ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json='{
"env_vars": {
"PYTHONPATH": "/workspace/Megatron-LM/",
"CUDA_DEVICE_MAX_CONNECTIONS": "1"
}
}' \
-- python3 train.py \
--actor-num-nodes 1 \
--actor-num-gpus-per-node 8 \
--colocate \
${MODEL_ARGS[@]} \
${CKPT_ARGS[@]} \
${ROLLOUT_ARGS[@]} \
${PERF_ARGS[@]} \
${MISC_ARGS[@]}
# Cleanup after training
pkill -9 sglang
sleep 3
ray stop --force
pkill -9 ray
pkill -9 python
Key Differences from NVIDIA Setup
| Aspect | AMD (ROCm) | NVIDIA (CUDA) |
| --- | --- | --- |
| Device visibility | HIP_VISIBLE_DEVICES | CUDA_VISIBLE_DEVICES |
| Ray configuration | RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES=1 | Not required |
| Gradient fusion | --no-gradient-accumulation-fusion (apex not supported) | Enabled by default |
| Docker devices | --device /dev/dri --device /dev/kfd | --gpus all |
| Base image | ROCm 6.3.4 with virtual memory patch | Standard CUDA images |
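The device-visibility difference can be bridged with a small vendor check when a script needs to run on both stacks. This is an illustrative pattern, not part of the official scripts; probing for rocm-smi is an assumption about what the container image provides:

```shell
# Set the right device-visibility variable for the detected GPU vendor.
GPU_LIST="0,1,2,3,4,5,6,7"
if command -v rocm-smi >/dev/null 2>&1; then
    # AMD / ROCm path: Ray also needs the NOSET override.
    export HIP_VISIBLE_DEVICES="$GPU_LIST"
    export RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES="1"
else
    # NVIDIA / CUDA path.
    export CUDA_VISIBLE_DEVICES="$GPU_LIST"
fi
```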
Troubleshooting
slime module not found

Run the installation command again:

cd slime
pip install -e . --no-deps

Ray does not detect the AMD GPUs

Ensure you’ve set the required environment variables:

export RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES="1"
export HIP_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
Gradient accumulation fusion errors
Make sure you’ve added the --no-gradient-accumulation-fusion flag to your training arguments. ROCm doesn’t currently support apex.
Supported Hardware
AMD Instinct MI300 series
AMD Instinct MI325 series
Additional Resources