verl Hardware Support: NVIDIA, AMD ROCm, and Ascend NPU

verl is designed to run on multiple hardware platforms through a unified plugin architecture. The primary development target is NVIDIA CUDA, with production-quality support for AMD ROCm and community-maintained support for Huawei Ascend NPUs. Additional platforms (Intel XPU, Cambricon MLU, MetaX) are supported via the external verl-hardware-plugin package as reference implementations.


NVIDIA GPUs (Primary Platform)

NVIDIA is verl’s primary development and testing platform. All features, backends, and algorithms are fully supported.

Supported hardware: any NVIDIA GPU with a compute capability supported by CUDA ≥ 12.8.

Backend	Status
FSDP	✅ Full support
FSDP2	✅ Full support
Megatron-LM	✅ Full support
vLLM rollout	✅ Full support
SGLang rollout	✅ Full support
Multi-turn / Agentic	✅ Full support
LoRA (PEFT)	✅ Full support
Expert parallelism (MoE)	✅ Full support

Quick Start

Follow the Installation guide for the standard installation. All example scripts in examples/ run on NVIDIA hardware without modification.

Large-Scale Multi-Node Training

verl supports multi-node training via Ray clusters for models up to 671B parameters (DeepSeek-671B, Qwen3-235B) using expert parallelism and pipeline parallelism. Coordinate nodes by initializing a Ray head node and joining worker nodes before launching the training script.
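The head-then-workers ordering described above can be sketched as a small helper that assembles the `ray start` commands for each role. This is only an illustration: the IP address, port, and the helper name `ray_start_commands` are assumptions, not part of verl.

```python
import shlex


def ray_start_commands(head_ip: str, port: int = 6379, num_workers: int = 1) -> dict:
    """Build the shell commands for one Ray head node and its workers.

    The head must be started first; each worker then joins it by address.
    """
    head = ["ray", "start", "--head", f"--node-ip-address={head_ip}", f"--port={port}"]
    worker = ["ray", "start", f"--address={head_ip}:{port}"]
    return {
        "head": shlex.join(head),
        "workers": [shlex.join(worker) for _ in range(num_workers)],
    }


if __name__ == "__main__":
    # 10.0.0.1 is a placeholder address for the head node.
    commands = ray_start_commands("10.0.0.1", num_workers=2)
    print(commands["head"])
    for command in commands["workers"]:
        print(command)
```

Run the head command on the first node, the worker command on every other node, and launch the training script only after all workers have joined.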

AMD ROCm (MI300X / MI325X / MI355X)

AMD ROCm support is production-ready for the colocate and fully async runtime modes. The recommended workflow is container-based.

Validated hardware:

MI300X / MI325X (gfx942)
MI355X (gfx950)

Backend	Status
FSDP	✅ Supported
FSDP2	✅ Supported
Megatron-LM	✅ Supported
vLLM rollout	✅ Validated
SGLang rollout	🔄 In progress

SGLang rollout support on AMD ROCm is actively being developed. Use vLLM as the rollout backend until SGLang integration is completed.

Software Baseline

Use the prebuilt container for the validated software stack:

amdagi/verl-dev:rocm7.0.2_56_te2.10_vllm0.20_py312

Docker build recipe: docker/rocm/Dockerfile.rocm

Host Prerequisites

Before launching the container, verify:

AMD ROCm 7.0.2 host driver stack is installed and healthy
Docker has access to /dev/kfd and /dev/dri
Dataset and model storage paths are mounted
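The device-node and storage checks in this list can be automated with a short preflight script. The sketch below only tests that paths exist; it does not verify driver health, and the helper name `missing_paths` is hypothetical.

```python
from pathlib import Path

# Device nodes the container needs access to, as listed above.
ROCM_DEVICE_NODES = ["/dev/kfd", "/dev/dri"]


def missing_paths(required):
    """Return the subset of `required` paths that do not exist on this host."""
    return [str(path) for path in map(Path, required) if not path.exists()]


if __name__ == "__main__":
    # Append your own dataset and model mount points to this list.
    missing = missing_paths(ROCM_DEVICE_NODES)
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("all required paths are present")
```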

Launch Container

NAME=verl_release
DOCKER=amdagi/verl-dev:rocm7.0.2_56_te2.10_vllm0.20_py312

docker pull $DOCKER

docker run -it --name $NAME --device /dev/kfd --device /dev/dri \
  --privileged --network=host \
  --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
  --shm-size=2048g \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -w /workspace \
  $DOCKER \
  /bin/bash

Environment Verification

Inside the container, confirm GPU detection and PyTorch ROCm integration:

# Verify GPU targets
rocminfo | grep -E "gfx942|gfx950" || true

# PyTorch + ROCm sanity check
python - <<'PY'
import torch
print("torch:", torch.__version__)
print("rocm :", torch.version.hip)
print("cuda_available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("gpu_count:", torch.cuda.device_count())
    print("device_0:", torch.cuda.get_device_name(0))
PY
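If the check needs to run from a script rather than by eye, the `rocminfo` output can be parsed for the validated targets. The sketch below is a rough approach that assumes target names appear in the output as plain `gfx...` tokens; the sample text is illustrative, not captured from a real machine.

```python
import re

# Targets listed as validated in this section.
VALIDATED_TARGETS = {"gfx942", "gfx950"}


def detected_targets(rocminfo_output: str) -> set:
    """Collect every gfx target token that appears in the rocminfo text."""
    return set(re.findall(r"\bgfx[0-9a-f]+\b", rocminfo_output))


def validated_targets(rocminfo_output: str) -> set:
    """Return only the detected targets that this page lists as validated."""
    return detected_targets(rocminfo_output) & VALIDATED_TARGETS


if __name__ == "__main__":
    # Illustrative sample; pipe real `rocminfo` output in practice.
    sample = "Name: gfx942\nName: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-\n"
    print(sorted(validated_targets(sample)))
```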

Training Examples

Colocate mode + FSDP (GRPO, Qwen3-8B)

For Qwen3-8B on ROCm, enable parameter and optimizer offload to avoid OOM:

# Recommended overrides for ROCm:
# actor_rollout_ref.actor.fsdp_config.param_offload=True
# actor_rollout_ref.actor.fsdp_config.optimizer_offload=True

bash examples/grpo_trainer/run_qwen3_8b_fsdp.sh
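One way to apply the two overrides without editing the script is to append them to the command line. The sketch below assumes the example script forwards extra arguments to the trainer (for example via `"$@"`); check the script before relying on this. The helper name `with_overrides` is hypothetical.

```python
import shlex

# The two offload overrides recommended above for ROCm.
ROCM_OFFLOAD_OVERRIDES = {
    "actor_rollout_ref.actor.fsdp_config.param_offload": True,
    "actor_rollout_ref.actor.fsdp_config.optimizer_offload": True,
}


def with_overrides(script: str, overrides: dict) -> str:
    """Render `bash <script> key=value ...` as a single shell command."""
    args = [f"{key}={value}" for key, value in overrides.items()]
    return shlex.join(["bash", script, *args])


if __name__ == "__main__":
    print(with_overrides("examples/grpo_trainer/run_qwen3_8b_fsdp.sh", ROCM_OFFLOAD_OVERRIDES))
```

If the script does not forward arguments, set the same two keys inside the script instead.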

Colocate mode + Megatron (GRPO, Qwen3.5-35B)

bash examples/grpo_trainer/run_qwen3_5-35b-megatron.sh

Fully Async mode (DAPO, Qwen2.5-Math-7B)

RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES and RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES are no longer required when using the recommended container (amdagi/verl-dev:rocm7.0.2_56_te2.10_vllm0.20_py312).

# Note: update max_position_embeddings to 32768 in config.json after download
bash verl/experimental/fully_async_policy/shell/dapo_7b_math_fsdp2_4_4.sh
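The `config.json` edit mentioned in the note can be scripted so it is not forgotten after each download. This is a minimal sketch; the function name is hypothetical and the path to the downloaded model directory is left to the caller.

```python
import json
from pathlib import Path


def set_max_position_embeddings(config_path, value: int = 32768):
    """Rewrite max_position_embeddings in a model config.json.

    Returns the previous value (or None if the key was absent) so the
    change can be logged or reverted.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    previous = config.get("max_position_embeddings")
    config["max_position_embeddings"] = value
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return previous


# Example call (path is a placeholder for your local model directory):
# set_max_position_embeddings("/path/to/model/config.json")
```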

Multi-Node Training with SLURM

For multi-node AMD ROCm training on SLURM clusters, use Docker or Podman containers with Ray:

# Set environment variables before ray start
export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# For Ray >= 2.45.0
export RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES=1

# For Ray < 2.45.0
# export RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1

Submit the full SLURM job script with sbatch slurm_script.sh. See docs/amd_tutorial/amd_build_dockerfile_page.rst for the complete multi-node SLURM script.
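The version-dependent choice between the two Ray variables can be encoded once instead of being commented in and out by hand. The sketch below follows the 2.45.0 threshold stated above; the helper names are hypothetical and the version parsing is deliberately simple (it ignores pre-release suffixes).

```python
import re


def version_tuple(version: str) -> tuple:
    """Parse 'major.minor.patch' into a tuple of ints, ignoring suffixes."""
    numbers = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def noset_visible_devices_var(ray_version: str) -> str:
    """Pick the Ray environment variable that matches the installed Ray."""
    if version_tuple(ray_version) >= (2, 45, 0):
        return "RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES"
    return "RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES"


def slurm_ray_env(ray_version: str, num_gpus: int = 8) -> dict:
    """Environment to export before `ray start` on each node."""
    return {
        "HIP_VISIBLE_DEVICES": ",".join(str(i) for i in range(num_gpus)),
        noset_visible_devices_var(ray_version): "1",
    }


if __name__ == "__main__":
    for key, value in slurm_ray_env("2.45.0").items():
        print(f"export {key}={value}")
```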

Ascend NPUs

Ascend NPU support is community-maintained by the Huawei Ascend team. For issues specific to Ascend hardware, open a GitHub issue or consult the Ascend developer community.

Validated hardware:

Atlas 200T A2 Box16
Atlas 900 A2 PODc
Atlas 800T A3

verl automatically detects NPU hardware through the platform plugin system (verl.plugin.platform). In recent releases, GPU-targeted scripts generally run on Ascend without explicitly setting trainer.device=npu.

Supported Backends

Backend	Status
FSDP	✅ Supported (via torch_npu)
FSDP2	✅ Supported
Megatron-LM	✅ Supported (via MindSpeed)
VeOmni	✅ Supported (Ascend-specific, optimized for MoE)
vLLM rollout	✅ Supported (via vllm-ascend plugin)
SGLang rollout	✅ Supported

Installation

Follow the Ascend installation guidance for environment setup, or use the Docker build guidance to get a prebuilt container.

Activate the CANN environment before running any scripts:

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

Quick Start (Qwen3-0.6B GSM8K GRPO)

Prepare data and weights:

# Download Qwen3-0.6B to ~/models/Qwen/Qwen3-0.6B
# Download GSM8K dataset, then:
python3 examples/data_preprocess/gsm8k.py --local_dataset_path /path/to/gsm8k/

Choose a backend combination and run the corresponding quick start script:

Combination	Training Backend	Rollout Backend	Script
vLLM + FSDP2	FSDP2	vLLM-Ascend	`tests/special_npu/quick_start/run_qwen3_0_6b_fsdp2_vllm_ascend.sh`
vLLM + Megatron	Megatron	vLLM-Ascend	`tests/special_npu/quick_start/run_qwen3_0_6b_megatron_vllm_ascend.sh`
SGLang + FSDP2	FSDP2	SGLang	`tests/special_npu/quick_start/run_qwen3_0_6b_fsdp2_sglang_ascend.sh`
SGLang + Megatron	Megatron	SGLang	`tests/special_npu/quick_start/run_qwen3_0_6b_megatron_sglang_ascend.sh`

SGLang Backend Notes

To convert a vLLM-based script to SGLang on Ascend, add or modify these parameters:

# Required
actor_rollout_ref.rollout.name=sglang \
+actor_rollout_ref.rollout.engine_kwargs.sglang.attention_backend="ascend" \

# Optional: expert parallelism for MoE models
++actor_rollout_ref.rollout.engine_kwargs.sglang.deepep_mode="auto" \
++actor_rollout_ref.rollout.engine_kwargs.sglang.moe_a2a_backend="deepep" \

# Required for MoE multi-DP
+actor_rollout_ref.rollout.engine_kwargs.sglang.enable_dp_attention=False \

# Chunked prefill disabled by default
+actor_rollout_ref.rollout.engine_kwargs.sglang.chunked_prefill_size=-1
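The "add or modify" step for the two required parameters can be expressed as a small transformation over a list of override strings. This is a sketch only: the function name is hypothetical, and the `ascend` value is written without quotes because the shell strips them in the snippet above.

```python
# Required overrides from the list above, in launch order.
SGLANG_REQUIRED = [
    "actor_rollout_ref.rollout.name=sglang",
    "+actor_rollout_ref.rollout.engine_kwargs.sglang.attention_backend=ascend",
]


def _key(override: str) -> str:
    """Strip Hydra's leading '+' markers and return the key before '='."""
    return override.lstrip("+").split("=", 1)[0]


def to_sglang_overrides(overrides: list) -> list:
    """Replace any existing values for the required keys, then append them."""
    required_keys = {_key(item) for item in SGLANG_REQUIRED}
    kept = [item for item in overrides if _key(item) not in required_keys]
    return kept + SGLANG_REQUIRED


if __name__ == "__main__":
    print(to_sglang_overrides(["actor_rollout_ref.rollout.name=vllm"]))
```

Applying the function twice gives the same result as applying it once, so it is safe to run on a script that has already been converted.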

VeOmni Backend (Ascend-Optimized)

VeOmni is a unified training backend built on FSDP, particularly optimized for large MoE models on Ascend:

actor_rollout_ref.actor.strategy=veomni \
actor_rollout_ref.actor.veomni.fsdp_size=16 \
actor_rollout_ref.actor.veomni.expert_parallel_size=1 \
actor_rollout_ref.actor.veomni.attn_implementation=veomni_flash_attention_2_with_sp \
actor_rollout_ref.actor.veomni.moe_implementation=fused \
actor_rollout_ref.actor.veomni.param_offload=True \
actor_rollout_ref.actor.veomni.optimizer_offload=True

Additional Resources

Multi-Chip Plugin Architecture

verl uses a two-layer plugin system to abstract hardware differences.

Platform Plugin System (verl.plugin.platform), the hardware abstraction layer with auto-detection:

PlatformRegistry
  ├─ "nvidia"    → PlatformCUDA      (built-in)
  ├─ "huawei"    → PlatformNPU       (built-in)
  ├─ "intel"     → PlatformXPU       (verl-hardware-plugin)
  ├─ "cambricon" → PlatformMLU       (verl-hardware-plugin)
  └─ "metax"     → PlatformMetaX     (verl-hardware-plugin)

Engine Plugin System (verl.workers.engine.base) — chip-specific training engines:

EngineRegistry  (device, vendor) → Engine class
  ├─ ("cuda", None)      → FSDPEngineWithLMHead
  ├─ ("npu", None)       → FSDPNPUEngineWithLMHead
  ├─ ("cuda", "metax")   → FSDPMetaXEngineWithLMHead
  ├─ ("xpu", "intel")    → FSDPXPUEngineWithLMHead
  └─ ("mlu", "cambricon")→ FSDPMLUEngineWithLMHead

Auto-Detection and Override

Platform is auto-detected by probing is_available() on each registered platform. Override manually:

export VERL_PLATFORM=nvidia  # or "huawei", "intel", "cambricon", "metax"
export VERL_ENGINE_DEVICE=cuda
export VERL_ENGINE_VENDOR=metax
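The probe-then-override behavior can be pictured with a toy registry. The class below is an illustrative stand-in, not verl's `PlatformRegistry`; in particular, probing in registration order and raising on an unknown override are assumptions made for the sketch.

```python
import os
from typing import Callable, Dict, Mapping, Optional


class ToyPlatformRegistry:
    """Illustrative registry: maps a platform name to an availability probe."""

    def __init__(self) -> None:
        self._probes: Dict[str, Callable[[], bool]] = {}

    def register(self, name: str, is_available: Callable[[], bool]) -> None:
        self._probes[name] = is_available

    def detect(self, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
        """Honor VERL_PLATFORM if set; otherwise return the first available platform."""
        env = os.environ if env is None else env
        forced = env.get("VERL_PLATFORM")
        if forced:
            if forced not in self._probes:
                raise KeyError(f"unknown platform: {forced}")
            return forced
        for name, probe in self._probes.items():
            if probe():
                return name
        return None


if __name__ == "__main__":
    registry = ToyPlatformRegistry()
    registry.register("nvidia", lambda: False)  # pretend no CUDA device is present
    registry.register("huawei", lambda: True)   # pretend an NPU is present
    print(registry.detect(env={}))
    print(registry.detect(env={"VERL_PLATFORM": "nvidia"}))
```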

Loading Plugins

Plugins are discovered through two mechanisms:

# Option 1: setuptools entry_points (after pip install)
# Auto-discovered from "verl.plugins" entry_points group

# Option 2: environment variable for development
export VERL_USE_EXTERNAL_MODULES=verl_hardware_plugin

Adding a New Hardware Platform

# Step 1: register the platform under a vendor key.
from verl.plugin.platform import PlatformRegistry, PlatformBase

@PlatformRegistry.register(platform="my_vendor")
class PlatformMyDevice(PlatformBase):
    @property
    def device_name(self) -> str:
        return "my_device"  # torch device type string

    @property
    def vendor_name(self) -> str:
        return "my_vendor"

# Step 2: register a training engine for the same device and vendor.
from verl.workers.engine.base import EngineRegistry
from verl.workers.engine.fsdp_engine import FSDPEngineWithLMHead

@EngineRegistry.register(
    model_type="language_model",
    backend=["fsdp", "fsdp2"],
    device="my_device",
    vendor="my_vendor",
)
class FSDPMyVendorEngineWithLMHead(FSDPEngineWithLMHead):
    def initialize(self):
        super().initialize()
        # vendor-specific initialization

For a complete step-by-step guide, see the verl-hardware-plugin development guide.
