verl uses Ray as its distributed execution backend, which makes scaling from a single node to a full cluster a matter of pointing the trainer at an existing Ray cluster and setting the trainer.nnodes and trainer.n_gpus_per_node config values. This page covers four ways to bring up a multi-node cluster: manual setup, SkyPilot on cloud or Kubernetes, Slurm, and dstack.
All nodes in the cluster must be able to reach each other over the network. Placing them on the same subnet (or using a high-bandwidth interconnect such as InfiniBand) is strongly recommended. Ports 6379 (Ray GCS) and 8265 (Ray dashboard) must be open between nodes.
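The port requirement above can be checked from a worker machine once the head node is running. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and not part of verl or Ray, and a successful TCP connection only shows that the port is open, not that Ray is healthy.

```python
import socket
from typing import Dict, Iterable


def port_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ray_ports(head_ip: str, ports: Iterable[int] = (6379, 8265)) -> Dict[int, bool]:
    """Map each port (Ray GCS and dashboard by default) to its reachability."""
    return {port: port_is_reachable(head_ip, port) for port in ports}


# Example usage (10.0.0.1 is a placeholder head-node IP):
# print(check_ray_ports("10.0.0.1"))
```

Run it on each worker before `ray start --address=...`; a `False` entry usually points at a firewall or security-group rule.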
Option 1: Launch Manually
This is the simplest approach when you have direct SSH access to a fixed set of machines.
Start the head node
On the machine you designate as the head node, start Ray with the dashboard exposed:
ray start --head --dashboard-host=0.0.0.0
Ray will print two addresses:
- GCS address (e.g., 10.0.0.1:6379): worker nodes use this to join the cluster
- Dashboard address (e.g., 10.0.0.1:8265): use this to submit jobs and inspect the cluster in a browser
Join worker nodes
On each additional machine, connect to the head node using the GCS address printed above:
ray start --address='<head-node-ip>:6379'
After all workers have joined, verify the cluster from the head node:
ray status
You should see all nodes listed with their GPU and CPU resources.
Submit a training job
Submit the verl training job to the cluster using the Ray Jobs API. Pass the dashboard address with --address:
ray job submit --address="http://127.0.0.1:8265" \
--runtime-env=verl/trainer/runtime_env.yaml \
--no-wait \
-- \
python3 -m verl.trainer.main_ppo \
data.train_files=/path/to/train.parquet \
data.val_files=/path/to/val.parquet \
data.train_batch_size=1024 \
data.max_prompt_length=512 \
data.max_response_length=512 \
actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
critic.optim.lr=1e-5 \
critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.n_gpus_per_node=8 \
trainer.nnodes=2 \
trainer.total_epochs=15
Key multi-node config fields:
| Config key | Description |
|---|---|
| trainer.nnodes | Total number of nodes in the cluster |
| trainer.n_gpus_per_node | Number of GPUs per node |
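As a quick sanity check before submitting, the two values can be multiplied and compared with the GPU count that the cluster reports. The sketch below assumes the trainer requests trainer.nnodes × trainer.n_gpus_per_node GPUs in total; the helper names are hypothetical and the available GPU count is something you read off the cluster yourself.

```python
def total_gpus(nnodes: int, n_gpus_per_node: int) -> int:
    """GPUs requested by a run: nodes multiplied by GPUs per node."""
    if nnodes < 1 or n_gpus_per_node < 1:
        raise ValueError("nnodes and n_gpus_per_node must both be >= 1")
    return nnodes * n_gpus_per_node


def cluster_can_fit(nnodes: int, n_gpus_per_node: int, available_gpus: int) -> bool:
    """True if the cluster exposes at least as many GPUs as the run requests."""
    return total_gpus(nnodes, n_gpus_per_node) <= available_gpus


# Example: the manual-launch command above uses 2 nodes with 8 GPUs each.
# total_gpus(2, 8) gives 16, so the cluster must report at least 16 GPUs.
```

If the requested total exceeds what the cluster provides, expect the job to wait for resources rather than start training.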
Monitor and manage the job
After submission you receive a Submission ID. Use these commands to manage and inspect the job:
# List all submitted jobs
ray job list
# Stream logs continuously
ray job logs <Submission ID> --follow
# Check job status
ray job status <Submission ID>
# Stop a running job
ray job stop <Submission ID>
Driver and task logs are also written to /tmp/ray/session_latest/logs/. The driver log is named job-driver-raysubmit_<Submission ID>.log.
The Ray dashboard at <head-node-ip>:8265 provides the most structured view of job status, actor placement, GPU utilization, and timelines. Use it for multi-node debugging, where it is far more convenient than raw log files.
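The log naming rule above can be turned into a small path helper. This is a sketch, not a Ray API: `driver_log_path` is a hypothetical name, and it assumes the submission ID may be given either with or without a leading `raysubmit_` prefix, so the prefix is never duplicated in the file name.

```python
from pathlib import PurePosixPath

# Log directory as stated above.
RAY_LOG_DIR = PurePosixPath("/tmp/ray/session_latest/logs")
ID_PREFIX = "raysubmit_"


def driver_log_path(submission_id: str) -> PurePosixPath:
    """Build the driver log path for a submission ID.

    Accepts the ID with or without the 'raysubmit_' prefix (an assumption
    made here to avoid producing a doubled prefix).
    """
    suffix = submission_id
    if suffix.startswith(ID_PREFIX):
        suffix = suffix[len(ID_PREFIX):]
    if not suffix:
        raise ValueError("submission_id is empty")
    return RAY_LOG_DIR / f"job-driver-{ID_PREFIX}{suffix}.log"


# Example usage with a made-up ID:
# print(driver_log_path("raysubmit_abc123"))
```

The resulting path is only valid on the head node, since that is where the driver log is written.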
Option 2: Launch via SkyPilot (Cloud or Kubernetes)
SkyPilot abstracts cloud provisioning and cluster management. Ready-to-use configurations for verl are in examples/tutorial/skypilot/:
verl-ppo.yaml — PPO training with GSM8K
verl-grpo.yaml — GRPO training with the MATH dataset
verl-multiturn-tools.yaml — Multi-turn tool usage training
Install SkyPilot and configure your cloud
The example below uses GCP. Adapt the pip install and credential steps for your cloud:
conda create -y -n sky python=3.10
conda activate sky
pip install "skypilot[gcp]"
conda install -c conda-forge google-cloud-sdk
gcloud init
gcloud auth application-default login
# Verify credentials
sky check gcp
Prepare the dataset locally
git clone https://github.com/verl-project/verl.git
cd verl/examples/data_preprocess
python3 gsm8k.py --local_save_dir ~/data/gsm8k
Create a SkyPilot YAML and launch
Create verl-cluster.yml:
resources:
  accelerators: L4:1 # one L4 GPU per node
  image_id: docker:verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.0-fa2.7.4
  memory: 64+
  ports: 8265 # expose Ray dashboard
num_nodes: 2
workdir: . # sync local directory to remote cluster
secrets:
  WANDB_API_KEY: null
file_mounts:
  data/gsm8k: ~/data/gsm8k
setup: |
  rm -rf verl
  git clone https://github.com/verl-project/verl.git
  cd verl
  pip3 install -v -e .[vllm]
run: |
  head_ip=`echo "$SKYPILOT_NODE_IPS" | head -n1`
  num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
  python3 -c "import wandb; wandb.login(relogin=True, key='$WANDB_API_KEY')"
  if [ "$SKYPILOT_NODE_RANK" == "0" ]; then
    ray start --head --disable-usage-stats \
      --port=6379 \
      --dashboard-host=0.0.0.0 \
      --dashboard-port=8265
    while [ $(ray nodes | grep NODE_ID | wc -l) -lt $num_nodes ]; do
      echo "Waiting for all nodes... ($(ray nodes | grep NODE_ID | wc -l)/$num_nodes)"
      sleep 5
    done
    python3 -m verl.trainer.main_ppo \
      data.train_files=data/gsm8k/train.parquet \
      data.val_files=data/gsm8k/test.parquet \
      data.train_batch_size=256 \
      data.max_prompt_length=512 \
      data.max_response_length=256 \
      actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
      actor_rollout_ref.actor.optim.lr=1e-6 \
      actor_rollout_ref.actor.ppo_mini_batch_size=64 \
      actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
      actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
      actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
      actor_rollout_ref.rollout.name=vllm \
      actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
      actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
      critic.optim.lr=1e-5 \
      critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
      critic.ppo_micro_batch_size_per_gpu=4 \
      algorithm.kl_ctrl.kl_coef=0.001 \
      trainer.logger=['console','wandb'] \
      trainer.val_before_train=False \
      trainer.default_hdfs_dir=null \
      trainer.n_gpus_per_node=1 \
      trainer.nnodes=2 \
      trainer.save_freq=20 \
      trainer.test_freq=20 \
      trainer.total_epochs=2 \
      trainer.project_name=verl_examples \
      trainer.experiment_name=experiment_name_gsm8k
  else
    sleep 10
    ray start --address $head_ip:6379 --disable-usage-stats
    sleep 5
  fi
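The run script above derives the head IP and the node count from SKYPILOT_NODE_IPS, which holds one IP per line. The same logic can be sketched in Python to make the parsing explicit; `parse_node_ips` is a hypothetical helper, not part of SkyPilot or verl, and it ignores blank lines, which is slightly more forgiving than `wc -l`.

```python
from typing import Tuple


def parse_node_ips(node_ips: str) -> Tuple[str, int]:
    """Return (head_ip, num_nodes) from a newline-separated list of IPs.

    Mirrors `head -n1` (first line is the head node) and `wc -l`
    (one line per node) from the shell script.
    """
    ips = [line.strip() for line in node_ips.splitlines() if line.strip()]
    if not ips:
        raise ValueError("no node IPs found")
    return ips[0], len(ips)


# Example usage inside a SkyPilot task:
# import os
# head_ip, num_nodes = parse_node_ips(os.environ["SKYPILOT_NODE_IPS"])
```

The node count is what the head node's wait loop compares against before starting training, so every worker must appear in the list.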
Launch the cluster:
export WANDB_API_KEY=<your-wandb-api-key>
sky launch -c verl --secret WANDB_API_KEY verl-cluster.yml
While sky launch is attached, the Ray dashboard is forwarded to localhost:8265.
To get the dashboard URL after the fact:
sky status --endpoint 8265 verl
To inspect checkpoints on the head node:
ssh verl
ls ~/sky_workdir/checkpoints/verl_examples/
Option 3: Launch via Slurm
Ray provides an official Slurm user guide. verl has been verified on multi-node Slurm clusters. A ready-to-use job script is at examples/tutorial/slurm/ray_on_slurm.slurm.
(Optional) Convert the Docker image to Apptainer/Singularity
If your cluster supports Apptainer or Singularity:
apptainer pull /your/dest/dir/verl.sif \
docker://verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3
Alternatively, use the container runtime supported by your cluster (e.g., via Slurm’s OCI support).
Prepare dataset and model
Follow Step 1 and Step 2 of the Quickstart guide to produce the Parquet files and download your model checkpoint on a node with storage access.
Modify and submit the job script
Edit examples/tutorial/slurm/ray_on_slurm.slurm with your cluster’s node names, GPU counts, and paths. Then submit:
sbatch ray_on_slurm.slurm
The script handles Ray head and worker startup automatically using SLURM_JOB_NODELIST and SLURM_NODE_RANK environment variables.
Slurm cluster configurations vary significantly. If you encounter issues, refer to Ray’s Slurm community guide for common caveats around environment variables and resource specifications.
Option 4: Launch via dstack
dstack is an open-source container orchestrator that simplifies distributed training across cloud providers and on-premises environments without requiring Kubernetes or Slurm.
Install dstack and initialize a project
After installing dstack, initialize your working directory:
mkdir myproject && cd myproject
dstack init
Then create a fleet (a pool of reserved instances) through the dstack CLI or UI before submitting jobs.
Define a Ray cluster task
Create ray-cluster.dstack.yml:
type: task
name: ray-verl-cluster
nodes: 2
env:
  - WANDB_API_KEY
  - PYTHONUNBUFFERED=1
  - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
image: verlai/verl:app-verl0.6-transformers4.56.1-sglang0.5.2-mcore0.13.0-te2.2
commands:
  - git clone https://github.com/verl-project/verl
  - cd verl
  - pip install --no-deps -e .
  - |
    if [ $DSTACK_NODE_RANK = 0 ]; then
      python3 examples/data_preprocess/gsm8k.py --local_save_dir ~/data/gsm8k
      python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2.5-7B-Instruct')"
      ray start --head --port=6379
    else
      ray start --address=$DSTACK_MASTER_NODE_IP:6379
    fi
ports:
  - 8265
resources:
  gpu: 80GB:8
  shm_size: 128GB
volumes:
  - /checkpoints:/checkpoints
Apply the task; dstack provisions the nodes and forwards the Ray dashboard to localhost:8265:
dstack apply -f ray-cluster.dstack.yml
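The branch in the task's commands picks a `ray start` invocation from the node rank: rank 0 becomes the head, every other rank joins it. A minimal Python sketch of that selection follows; `ray_start_command` is a hypothetical helper, and it only builds the argument list using the flags shown in the task, without running anything.

```python
from typing import List, Optional


def ray_start_command(node_rank: int,
                      master_ip: Optional[str] = None,
                      port: int = 6379) -> List[str]:
    """Build the `ray start` argv for a node, given its rank.

    Rank 0 starts the head node; other ranks join the head at master_ip:port.
    """
    if node_rank < 0:
        raise ValueError("node_rank must be >= 0")
    if node_rank == 0:
        return ["ray", "start", "--head", f"--port={port}"]
    if not master_ip:
        raise ValueError("worker nodes need the head node IP")
    return ["ray", "start", f"--address={master_ip}:{port}"]


# Example usage inside a dstack task (values come from the environment):
# import os, subprocess
# argv = ray_start_command(int(os.environ["DSTACK_NODE_RANK"]),
#                          os.environ.get("DSTACK_MASTER_NODE_IP"))
# subprocess.run(argv, check=True)
```

Keeping the port in one place avoids the head and workers drifting apart if you change it from the default 6379.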
Submit a training job to the cluster
While dstack apply is attached, submit a Ray job to the forwarded dashboard port:
export RAY_ADDRESS=http://localhost:8265
ray job submit \
-- python3 -m verl.trainer.main_ppo \
data.train_files=/root/data/gsm8k/train.parquet \
data.val_files=/root/data/gsm8k/test.parquet \
data.train_batch_size=256 \
data.max_prompt_length=512 \
data.max_response_length=256 \
actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
critic.optim.lr=1e-5 \
critic.model.path=Qwen/Qwen2.5-7B-Instruct \
critic.ppo_micro_batch_size_per_gpu=4 \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.project_name=ppo_training \
trainer.experiment_name=qwen-2.5-7B \
trainer.val_before_train=False \
trainer.n_gpus_per_node=8 \
trainer.nnodes=2 \
trainer.default_local_dir=/checkpoints \
trainer.save_freq=10 \
trainer.test_freq=10 \
trainer.total_epochs=15 2>&1 | tee verl_demo.log
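Long override lists like the one above are easy to mistype. One option is to keep the settings in a dictionary and generate the `key=value` arguments from it. The sketch below is an illustration only: `to_overrides` and `ray_job_submit_argv` are hypothetical helpers, and the argv mirrors the `ray job submit -- python3 -m verl.trainer.main_ppo ...` form used in this page.

```python
from typing import Dict, List


def to_overrides(config: Dict[str, object]) -> List[str]:
    """Turn {'trainer.nnodes': 2} into ['trainer.nnodes=2'], keeping order."""
    return [f"{key}={value}" for key, value in config.items()]


def ray_job_submit_argv(config: Dict[str, object],
                        module: str = "verl.trainer.main_ppo") -> List[str]:
    """Build the argv for `ray job submit -- python3 -m <module> <overrides>`."""
    return ["ray", "job", "submit", "--", "python3", "-m", module,
            *to_overrides(config)]


# Example usage (RAY_ADDRESS must point at the forwarded dashboard):
# import os, subprocess
# env = {**os.environ, "RAY_ADDRESS": "http://localhost:8265"}
# subprocess.run(ray_job_submit_argv({"trainer.nnodes": 2,
#                                     "trainer.n_gpus_per_node": 8}),
#                env=env, check=True)
```

Passing the arguments as a list sidesteps shell quoting, which matters for values that contain brackets or quotes.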
Debugging Distributed Jobs
Ray Distributed Debugger (Recommended — Ray 2.39+)
Starting with Ray 2.39, Anyscale provides the Ray Distributed Debugger VSCode extension, which allows you to set breakpoints inside remote Ray tasks.
Prerequisites:
- Visual Studio Code
- ray[default] >= 2.9.1
- debugpy >= 1.8.0
Setup steps:
- Set the environment variable to enable post-mortem debugging:
  export RAY_DEBUG_POST_MORTEM=1
  Remove any legacy debug flags before starting Ray: RAY_DEBUG=legacy and --ray-debugger-external are incompatible with the new debugger.
- Add breakpoint() calls inside your @ray.remote functions. Breakpoints only work inside remote functions.
- Run your job directly from the command line (do not use a launch.json).
- When the process hits a breakpoint(), click the Ray Distributed Debugger icon in the VSCode sidebar to attach.
- For each subsequent breakpoint, disconnect the current session and click the icon again to attach to the next one.
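The environment rules in the first setup step can be checked mechanically before starting Ray. The following sketch uses only the standard library; `debugger_env_problems` is a hypothetical helper, and it checks just the three conditions named above.

```python
from typing import List, Mapping, Sequence


def debugger_env_problems(env: Mapping[str, str],
                          ray_start_args: Sequence[str] = ()) -> List[str]:
    """List settings that conflict with the Ray Distributed Debugger setup."""
    problems = []
    if env.get("RAY_DEBUG_POST_MORTEM") != "1":
        problems.append("RAY_DEBUG_POST_MORTEM is not set to 1")
    if env.get("RAY_DEBUG") == "legacy":
        problems.append("RAY_DEBUG=legacy is set (legacy debugger flag)")
    if "--ray-debugger-external" in ray_start_args:
        problems.append("--ray-debugger-external is passed to ray start")
    return problems


# Example usage before launching the cluster:
# import os
# for problem in debugger_env_problems(os.environ, ["--head"]):
#     print("fix before starting Ray:", problem)
```

An empty list means none of the known conflicts are present; it does not confirm that the VSCode extension itself is installed.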
Legacy Ray Debugger
For older Ray versions or environments where the VSCode extension is not available:
- Start the cluster with legacy debug flags:
  # Head node
  RAY_DEBUG=legacy ray start --head --dashboard-host=0.0.0.0 --ray-debugger-external
  # Worker nodes
  RAY_DEBUG=legacy ray start --address='<head-node-ip>:6379' --ray-debugger-external
- Add breakpoint() calls to your code, submit the job, then run:
  ray debug
  This command blocks until a breakpoint is hit, then drops you into a pdb-style REPL.