verl is designed to run on multiple hardware platforms through a unified plugin architecture. The primary development target is NVIDIA CUDA, with production-quality support for AMD ROCm and community-maintained support for Huawei Ascend NPUs. Additional platforms (Intel XPU, Cambricon MLU, MetaX) are supported via the external verl-hardware-plugin package as reference implementations.
NVIDIA GPUs
NVIDIA is verl’s primary development and testing platform. All features, backends, and algorithms are fully supported.
Supported hardware: any NVIDIA GPU whose compute capability is supported by CUDA ≥ 12.8.
| Backend | Status |
|---|---|
| FSDP | ✅ Full support |
| FSDP2 | ✅ Full support |
| Megatron-LM | ✅ Full support |
| vLLM rollout | ✅ Full support |
| SGLang rollout | ✅ Full support |
| Multi-turn / Agentic | ✅ Full support |
| LoRA (PEFT) | ✅ Full support |
| Expert parallelism (MoE) | ✅ Full support |
Quick Start
Follow the Installation guide for the standard installation. All example scripts in examples/ run on NVIDIA hardware without modification.
Large-Scale Multi-Node Training
verl supports multi-node training via Ray clusters for models up to 671B parameters (DeepSeek-671B, Qwen3-235B) using expert parallelism and pipeline parallelism. Coordinate nodes by initializing a Ray head node and joining worker nodes before launching the training script.
AMD ROCm (MI300X / MI325X / MI355X)
AMD ROCm support is production-ready for the colocate and fully async runtime modes. The recommended workflow is container-based.
Validated hardware:
- MI300X / MI325X (gfx942)
- MI355X (gfx950)
| Backend | Status |
|---|---|
| FSDP | ✅ Supported |
| FSDP2 | ✅ Supported |
| Megatron-LM | ✅ Supported |
| vLLM rollout | ✅ Validated |
| SGLang rollout | 🔄 In progress |
SGLang rollout support on AMD ROCm is actively being developed. Use vLLM as the rollout backend until SGLang integration is completed.
Software Baseline
Use the prebuilt container for the validated software stack:
amdagi/verl-dev:rocm7.0.2_56_te2.10_vllm0.20_py312
Docker build recipe: docker/rocm/Dockerfile.rocm
Host Prerequisites
Before launching the container, verify:
- AMD ROCm 7.0.2 host driver stack is installed and healthy
- Docker has access to /dev/kfd and /dev/dri
- Dataset and model storage paths are mounted
Launch Container
NAME=verl_release
DOCKER=amdagi/verl-dev:rocm7.0.2_56_te2.10_vllm0.20_py312
docker pull $DOCKER
docker run -it --name $NAME --device /dev/kfd --device /dev/dri \
--privileged --network=host \
--group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
--shm-size=2048g \
--ulimit memlock=-1 --ulimit stack=67108864 \
-w /workspace \
$DOCKER \
/bin/bash
Environment Verification
Inside the container, confirm GPU detection and PyTorch ROCm integration:
# Verify GPU targets
rocminfo | grep -E "gfx942|gfx950" || true
# PyTorch + ROCm sanity check
python - <<'PY'
import torch
print("torch:", torch.__version__)
print("rocm :", torch.version.hip)
print("cuda_available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("gpu_count:", torch.cuda.device_count())
    print("device_0:", torch.cuda.get_device_name(0))
PY
Training Examples
Colocate mode + FSDP (GRPO, Qwen3-8B)
For Qwen3-8B on ROCm, enable parameter and optimizer offload to avoid out-of-memory (OOM) errors:
# Recommended overrides for ROCm:
# actor_rollout_ref.actor.fsdp_config.param_offload=True
# actor_rollout_ref.actor.fsdp_config.optimizer_offload=True
bash examples/grpo_trainer/run_qwen3_8b_fsdp.sh
Colocate mode + Megatron (GRPO, Qwen3.5-35B)
bash examples/grpo_trainer/run_qwen3_5-35b-megatron.sh
Fully Async mode (DAPO, Qwen2.5-Math-7B)
RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES and RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES are no longer required when using the recommended container (amdagi/verl-dev:rocm7.0.2_56_te2.10_vllm0.20_py312).
# Note: update max_position_embeddings to 32768 in config.json after download
bash verl/experimental/fully_async_policy/shell/dapo_7b_math_fsdp2_4_4.sh
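The `max_position_embeddings` note above can be applied with a few lines of Python instead of editing the file by hand. This is a minimal sketch, assuming the downloaded checkpoint keeps a Hugging Face style `config.json` at the top level of the model directory; the function name `set_max_position_embeddings` and the model path in the usage note are hypothetical.

```python
import json
from pathlib import Path


def set_max_position_embeddings(model_dir: str, value: int = 32768) -> int:
    """Rewrite max_position_embeddings in <model_dir>/config.json.

    Returns the previous value, or -1 if the key was absent.
    (Hypothetical helper; not part of verl.)
    """
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))

    previous = config.get("max_position_embeddings", -1)
    config["max_position_embeddings"] = value

    # Write back with indentation so the file stays human-readable.
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return previous
```

For example, `set_max_position_embeddings("/path/to/Qwen2.5-Math-7B")` would update the file in place and leave all other keys untouched.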
Multi-Node Training with SLURM
For multi-node AMD ROCm training on SLURM clusters, use Docker or Podman containers with Ray:
# Set environment variables before ray start
export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
# For Ray >= 2.45.0
export RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES=1
# For Ray < 2.45.0
# export RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1
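The version-dependent choice of variable above can also be made programmatically, for instance from a launcher script. The following is a sketch only: the helper name `ray_noset_env_var` is hypothetical, and the 2.45.0 threshold is taken directly from the comments above.

```python
import re


def ray_noset_env_var(ray_version: str) -> str:
    """Return the env var name to set for the given Ray version string.

    Hypothetical helper; the 2.45.0 cutoff mirrors the comments above.
    """
    match = re.match(r"(\d+)\.(\d+)", ray_version)
    if match is None:
        raise ValueError(f"unrecognized Ray version: {ray_version!r}")

    # Ray >= 2.45.0 is equivalent to (major, minor) >= (2, 45).
    major_minor = (int(match.group(1)), int(match.group(2)))
    if major_minor >= (2, 45):
        return "RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES"
    return "RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES"
```

A launcher could then run `os.environ[ray_noset_env_var(ray.__version__)] = "1"` before starting Ray.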
Submit the full SLURM job script with sbatch slurm_script.sh. See docs/amd_tutorial/amd_build_dockerfile_page.rst for the complete multi-node SLURM script.
Ascend NPUs
Ascend NPU support is community-maintained by the Huawei Ascend team. For issues specific to Ascend hardware, open a GitHub issue or consult the Ascend developer community.
Validated hardware:
- Atlas 200T A2 Box16
- Atlas 900 A2 PODc
- Atlas 800T A3
verl automatically detects NPU hardware through the platform plugin system (verl.plugin.platform). In recent releases, GPU-targeted scripts generally run on Ascend without explicitly setting trainer.device=npu.
Supported Backends
| Backend | Status |
|---|---|
| FSDP | ✅ Supported (via torch_npu) |
| FSDP2 | ✅ Supported |
| Megatron-LM | ✅ Supported (via MindSpeed) |
| VeOmni | ✅ Supported (Ascend-specific, optimized for MoE) |
| vLLM rollout | ✅ Supported (via vllm-ascend plugin) |
| SGLang rollout | ✅ Supported |
Installation
Follow the Ascend installation guidance for environment setup, or use the Docker build guidance to get a prebuilt container.
Activate the CANN environment before running any scripts:
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
Quick Start (Qwen3-0.6B GSM8K GRPO)
Prepare data and weights:
# Download Qwen3-0.6B to ~/models/Qwen/Qwen3-0.6B
# Download GSM8K dataset, then:
python3 examples/data_preprocess/gsm8k.py --local_dataset_path /path/to/gsm8k/
Choose a backend combination and run the corresponding quick start script:
| Combination | Training Backend | Rollout Backend | Script |
|---|---|---|---|
| vLLM + FSDP2 | FSDP2 | vLLM-Ascend | tests/special_npu/quick_start/run_qwen3_0_6b_fsdp2_vllm_ascend.sh |
| vLLM + Megatron | Megatron | vLLM-Ascend | tests/special_npu/quick_start/run_qwen3_0_6b_megatron_vllm_ascend.sh |
| SGLang + FSDP2 | FSDP2 | SGLang | tests/special_npu/quick_start/run_qwen3_0_6b_fsdp2_sglang_ascend.sh |
| SGLang + Megatron | Megatron | SGLang | tests/special_npu/quick_start/run_qwen3_0_6b_megatron_sglang_ascend.sh |
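The table above follows a regular naming pattern, so the script path can be derived from the chosen backends. This is an illustrative sketch only; `quick_start_script` is a hypothetical helper, and the paths it returns are exactly the four listed in the table.

```python
QUICK_START_DIR = "tests/special_npu/quick_start"
TRAINING_BACKENDS = ("fsdp2", "megatron")
ROLLOUT_BACKENDS = ("vllm", "sglang")


def quick_start_script(training: str, rollout: str) -> str:
    """Map a (training, rollout) backend pair to its quick start script path."""
    training, rollout = training.lower(), rollout.lower()
    if training not in TRAINING_BACKENDS:
        raise ValueError(f"unknown training backend: {training!r}")
    if rollout not in ROLLOUT_BACKENDS:
        raise ValueError(f"unknown rollout backend: {rollout!r}")
    # Script names follow run_qwen3_0_6b_<training>_<rollout>_ascend.sh
    return f"{QUICK_START_DIR}/run_qwen3_0_6b_{training}_{rollout}_ascend.sh"
```

The returned path can be passed to `bash` from the repository root.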
SGLang Backend Notes
To convert a vLLM-based script to SGLang on Ascend, add or modify these parameters:
# Required
actor_rollout_ref.rollout.name=sglang \
+actor_rollout_ref.rollout.engine_kwargs.sglang.attention_backend="ascend" \
# Optional: expert parallelism for MoE models
++actor_rollout_ref.rollout.engine_kwargs.sglang.deepep_mode="auto" \
++actor_rollout_ref.rollout.engine_kwargs.sglang.moe_a2a_backend="deepep" \
# Required for MoE multi-DP
+actor_rollout_ref.rollout.engine_kwargs.sglang.enable_dp_attention=False \
# Chunked prefill disabled by default
+actor_rollout_ref.rollout.engine_kwargs.sglang.chunked_prefill_size=-1
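When the launch command is assembled in Python rather than in a shell script, the same overrides can be collected in a list. The sketch below assumes the MoE-related options are only wanted for MoE models; the function name `sglang_ascend_overrides` and its `moe` flag are hypothetical, while the override keys and values are the ones listed above.

```python
from typing import List


def sglang_ascend_overrides(moe: bool = False) -> List[str]:
    """Build the override arguments for SGLang rollout on Ascend.

    Values are unquoted because no shell is involved; a shell would strip
    the quotes around "ascend", "auto", and "deepep" before passing them on.
    """
    rollout = "actor_rollout_ref.rollout"
    kwargs = f"{rollout}.engine_kwargs.sglang"

    overrides = [
        f"{rollout}.name=sglang",
        f"+{kwargs}.attention_backend=ascend",
        f"+{kwargs}.chunked_prefill_size=-1",  # chunked prefill disabled
    ]
    if moe:
        # Assumption: these are grouped together as the MoE-specific options.
        overrides += [
            f"++{kwargs}.deepep_mode=auto",
            f"++{kwargs}.moe_a2a_backend=deepep",
            f"+{kwargs}.enable_dp_attention=False",
        ]
    return overrides
```

The resulting list can be appended to the argument list of an existing training command.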
VeOmni Backend (Ascend-Optimized)
VeOmni is a unified training backend built on FSDP, particularly optimized for large MoE models on Ascend:
actor_rollout_ref.actor.strategy=veomni \
actor_rollout_ref.actor.veomni.fsdp_size=16 \
actor_rollout_ref.actor.veomni.expert_parallel_size=1 \
actor_rollout_ref.actor.veomni.attn_implementation=veomni_flash_attention_2_with_sp \
actor_rollout_ref.actor.veomni.moe_implementation=fused \
actor_rollout_ref.actor.veomni.param_offload=True \
actor_rollout_ref.actor.veomni.optimizer_offload=True
Multi-Chip Plugin Architecture
verl uses a two-layer plugin system to abstract hardware differences:
Platform Plugin System (verl.plugin.platform) — hardware abstraction with auto-detection:
PlatformRegistry
├─ "nvidia" → PlatformCUDA (built-in)
├─ "huawei" → PlatformNPU (built-in)
├─ "intel" → PlatformXPU (verl-hardware-plugin)
├─ "cambricon" → PlatformMLU (verl-hardware-plugin)
└─ "metax" → PlatformMetaX (verl-hardware-plugin)
Engine Plugin System (verl.workers.engine.base) — chip-specific training engines:
EngineRegistry (device, vendor) → Engine class
├─ ("cuda", None) → FSDPEngineWithLMHead
├─ ("npu", None) → FSDPNPUEngineWithLMHead
├─ ("cuda", "metax") → FSDPMetaXEngineWithLMHead
├─ ("xpu", "intel") → FSDPXPUEngineWithLMHead
└─ ("mlu", "cambricon") → FSDPMLUEngineWithLMHead
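The (device, vendor) keying shown above can be illustrated with a toy registry. This is not verl's implementation: the names `register_engine`, `resolve_engine`, and the two engine classes are hypothetical, and the fallback from a vendor-specific key to the generic `(device, None)` entry is an assumption made for the sketch.

```python
from typing import Callable, Dict, Optional, Tuple, Type

# Toy registry keyed by (device, vendor); vendor=None marks the generic engine.
_ENGINES: Dict[Tuple[str, Optional[str]], Type] = {}


def register_engine(device: str, vendor: Optional[str] = None) -> Callable[[Type], Type]:
    """Class decorator that records an engine under (device, vendor)."""
    def decorator(cls: Type) -> Type:
        _ENGINES[(device, vendor)] = cls
        return cls
    return decorator


def resolve_engine(device: str, vendor: Optional[str] = None) -> Type:
    """Prefer an exact (device, vendor) match, then the generic device entry."""
    for key in ((device, vendor), (device, None)):
        if key in _ENGINES:
            return _ENGINES[key]
    raise KeyError(f"no engine registered for device={device!r}, vendor={vendor!r}")


@register_engine("cuda")
class GenericCudaEngine:
    pass


@register_engine("cuda", vendor="metax")
class MetaXCudaEngine(GenericCudaEngine):
    pass
```

With this layout, a vendor whose hardware presents itself as a `cuda` device can still supply its own engine class, which matches the `("cuda", "metax")` row in the tree above.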
Auto-Detection and Override
Platform is auto-detected by probing is_available() on each registered platform. Override manually:
export VERL_PLATFORM=nvidia # or "huawei", "intel", "cambricon", "metax"
export VERL_ENGINE_DEVICE=cuda
export VERL_ENGINE_VENDOR=metax
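The interplay between probing and the `VERL_PLATFORM` override can be sketched as follows. This is a simplified illustration, not verl's code: `detect_platform` is a hypothetical function, the probes are passed in as plain callables, and the assumption that platforms are probed in registration order is ours.

```python
import os
from typing import Callable, Mapping, Optional


def detect_platform(
    probes: Mapping[str, Callable[[], bool]],
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the forced platform if VERL_PLATFORM is set, else the first available one."""
    env = os.environ if env is None else env

    forced = env.get("VERL_PLATFORM")
    if forced:
        if forced not in probes:
            raise ValueError(f"VERL_PLATFORM={forced!r} is not a registered platform")
        return forced

    # Assumed probing order: the order in which platforms were registered.
    for name, is_available in probes.items():
        if is_available():
            return name
    return None
```

For example, with probes for `"nvidia"` and `"huawei"` where only the second reports availability, the function returns `"huawei"` unless `VERL_PLATFORM=nvidia` is set.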
Loading Plugins
Plugins are discovered through two mechanisms:
# Option 1: setuptools entry_points (after pip install)
# Auto-discovered from "verl.plugins" entry_points group
# Option 2: environment variable for development
export VERL_USE_EXTERNAL_MODULES=verl_hardware_plugin
Register a platform class with the @PlatformRegistry.register decorator:
from verl.plugin.platform import PlatformRegistry, PlatformBase
@PlatformRegistry.register(platform="my_vendor")
class PlatformMyDevice(PlatformBase):
    @property
    def device_name(self) -> str:
        return "my_device"  # torch device type string
    @property
    def vendor_name(self) -> str:
        return "my_vendor"
Register a corresponding engine:
from verl.workers.engine.base import EngineRegistry
from verl.workers.engine.fsdp_engine import FSDPEngineWithLMHead
@EngineRegistry.register(
    model_type="language_model",
    backend=["fsdp", "fsdp2"],
    device="my_device",
    vendor="my_vendor",
)
class FSDPMyVendorEngineWithLMHead(FSDPEngineWithLMHead):
    def initialize(self):
        super().initialize()
        # vendor-specific initialization
For a complete step-by-step guide, see the verl-hardware-plugin development guide.