Multi-Node Training with verl and Ray

verl uses Ray as its distributed execution backend, which makes scaling from a single node to a full cluster a matter of pointing the trainer at an existing Ray cluster and setting the trainer.nnodes and trainer.n_gpus_per_node config values. This page covers four ways to bring up a multi-node cluster: manual setup, SkyPilot on cloud or Kubernetes, Slurm, and dstack.

All nodes in the cluster must be able to reach each other over the network. Placing them on the same subnet (or using a high-bandwidth interconnect such as InfiniBand) is strongly recommended. Port 6379 (Ray GCS) and 8265 (Ray dashboard) must be open between nodes.

Option 1: Launch Manually

This is the simplest approach when you have direct SSH access to a fixed set of machines.

Start the head node

On the machine you designate as the head node, start Ray with the dashboard exposed:

ray start --head --dashboard-host=0.0.0.0

Ray will print two addresses:

GCS address (e.g., 10.0.0.1:6379) — worker nodes use this to join the cluster
Dashboard address (e.g., 10.0.0.1:8265) — use this to submit jobs and inspect the cluster in a browser

Join worker nodes

On each additional machine, connect to the head node using the GCS address printed above:

ray start --address='<head-node-ip>:6379'

After all workers have joined, verify the cluster from the head node:

ray status

You should see all nodes listed with their GPU and CPU resources.

Submit a training job

Submit the verl training job to the cluster using the Ray Jobs API. Pass the dashboard address with --address:

ray job submit --address="http://127.0.0.1:8265" \
    --runtime-env=verl/trainer/runtime_env.yaml \
    --no-wait \
    -- \
    python3 -m verl.trainer.main_ppo \
        data.train_files=/path/to/train.parquet \
        data.val_files=/path/to/val.parquet \
        data.train_batch_size=1024 \
        data.max_prompt_length=512 \
        data.max_response_length=512 \
        actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
        actor_rollout_ref.actor.optim.lr=1e-6 \
        actor_rollout_ref.actor.ppo_mini_batch_size=256 \
        actor_rollout_ref.rollout.name=vllm \
        actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
        actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
        critic.optim.lr=1e-5 \
        critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
        algorithm.kl_ctrl.kl_coef=0.001 \
        trainer.n_gpus_per_node=8 \
        trainer.nnodes=2 \
        trainer.total_epochs=15

Key multi-node config fields:

Config key	Description
`trainer.nnodes`	Total number of nodes in the cluster
`trainer.n_gpus_per_node`	Number of GPUs per node

Monitor and manage the job

After submission you receive a Submission ID. Use these commands to manage and inspect the job:

# List all submitted jobs
ray job list

# Stream logs continuously
ray job logs <Submission ID> --follow

# Check job status
ray job status <Submission ID>

# Stop a running job
ray job stop <Submission ID>

Driver and task logs are also written to /tmp/ray/session_latest/logs/. The driver log is named job-driver-raysubmit_<Submission ID>.log.

The Ray dashboard at <head-node-ip>:8265 provides the most structured view of job status, actor placement, GPU utilization, and timelines. Use it for multi-node debugging — it is far more convenient than raw log files.

Option 2: Launch via SkyPilot (Cloud or Kubernetes)

SkyPilot abstracts cloud provisioning and cluster management. Ready-to-use configurations for verl are in examples/tutorial/skypilot/:

verl-ppo.yaml — PPO training with GSM8K
verl-grpo.yaml — GRPO training with the MATH dataset
verl-multiturn-tools.yaml — Multi-turn tool usage training

Install SkyPilot and configure your cloud

The example below uses GCP. Adapt the pip install and credential steps for your cloud:

conda create -y -n sky python=3.10
conda activate sky
pip install "skypilot[gcp]"

conda install -c conda-forge google-cloud-sdk
gcloud init
gcloud auth application-default login

# Verify credentials
sky check gcp

Prepare the dataset locally

git clone https://github.com/verl-project/verl.git
cd examples/data_preprocess
python3 gsm8k.py --local_save_dir ~/data/gsm8k

Create a SkyPilot YAML and launch

Create verl-cluster.yml:

resources:
  accelerators: L4:1       # one L4 GPU per node
  image_id: docker:verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.0-fa2.7.4
  memory: 64+
  ports: 8265              # expose Ray dashboard

num_nodes: 2

workdir: .                 # sync local directory to remote cluster

secrets:
  WANDB_API_KEY: null

file_mounts:
  data/gsm8k: ~/data/gsm8k

setup: |
  rm -rf verl
  git clone https://github.com/verl-project/verl.git
  cd verl
  pip3 install -v -e .[vllm]

run: |
  head_ip=`echo "$SKYPILOT_NODE_IPS" | head -n1`
  num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`

  python3 -c "import wandb; wandb.login(relogin=True, key='$WANDB_API_KEY')"

  if [ "$SKYPILOT_NODE_RANK" == "0" ]; then
    ray start --head --disable-usage-stats \
          --port=6379 \
          --dashboard-host=0.0.0.0 \
          --dashboard-port=8265

    while [ $(ray nodes | grep NODE_ID | wc -l) -lt $num_nodes ]; do
      echo "Waiting for all nodes... ($(ray nodes | grep NODE_ID | wc -l)/$num_nodes)"
      sleep 5
    done

    python3 -m verl.trainer.main_ppo \
      data.train_files=data/gsm8k/train.parquet \
      data.val_files=data/gsm8k/test.parquet \
      data.train_batch_size=256 \
      data.max_prompt_length=512 \
      data.max_response_length=256 \
      actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
      actor_rollout_ref.actor.optim.lr=1e-6 \
      actor_rollout_ref.actor.ppo_mini_batch_size=64 \
      actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
      actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
      actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
      actor_rollout_ref.rollout.name=vllm \
      actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
      actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
      critic.optim.lr=1e-5 \
      critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
      critic.ppo_micro_batch_size_per_gpu=4 \
      algorithm.kl_ctrl.kl_coef=0.001 \
      trainer.logger=['console','wandb'] \
      trainer.val_before_train=False \
      trainer.default_hdfs_dir=null \
      trainer.n_gpus_per_node=1 \
      trainer.nnodes=2 \
      trainer.save_freq=20 \
      trainer.test_freq=20 \
      trainer.total_epochs=2 \
      trainer.project_name=verl_examples \
      trainer.experiment_name=experiment_name_gsm8k
  else
    sleep 10
    ray start --address $head_ip:6379 --disable-usage-stats
    sleep 5
  fi

Launch the cluster:

export WANDB_API_KEY=<your-wandb-api-key>
sky launch -c verl --secret WANDB_API_KEY verl-cluster.yml

While sky launch is attached, the Ray dashboard is forwarded to localhost:8265.To get the dashboard URL after the fact:

sky status --endpoint 8265 verl

To inspect checkpoints on the head node:

ssh verl
ls ~/sky_workdir/checkpoints/verl_examples/

Option 3: Launch via Slurm

Ray provides an official Slurm user guide. verl has been verified on multi-node Slurm clusters. A ready-to-use job script is at examples/tutorial/slurm/ray_on_slurm.slurm.

(Optional) Convert the Docker image to Apptainer/Singularity

If your cluster supports Apptainer or Singularity:

apptainer pull /your/dest/dir/verl.sif \
    docker://verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3

Alternatively, use the container runtime supported by your cluster (e.g., via Slurm’s OCI support).

Prepare dataset and model

Follow Step 1 and Step 2 of the Quickstart guide to produce the Parquet files and download your model checkpoint on a node with storage access.

Modify and submit the job script

Edit examples/tutorial/slurm/ray_on_slurm.slurm with your cluster’s node names, GPU counts, and paths. Then submit:

sbatch ray_on_slurm.slurm

The script handles Ray head and worker startup automatically using SLURM_JOB_NODELIST and SLURM_NODE_RANK environment variables.

Slurm cluster configurations vary significantly. If you encounter issues, refer to Ray’s Slurm community guide for common caveats around environment variables and resource specifications.

Option 4: Launch via dstack

dstack is an open-source container orchestrator that simplifies distributed training across cloud providers and on-premises environments without requiring Kubernetes or Slurm.

Install dstack and initialize a project

After installing dstack, initialize your working directory:

mkdir myproject && cd myproject
dstack init

Then create a fleet (a pool of reserved instances) through the dstack CLI or UI before submitting jobs.

Define a Ray cluster task

Create ray-cluster.dstack.yml:

type: task
name: ray-verl-cluster

nodes: 2

env:
  - WANDB_API_KEY
  - PYTHONUNBUFFERED=1
  - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

image: verlai/verl:app-verl0.6-transformers4.56.1-sglang0.5.2-mcore0.13.0-te2.2
commands:
  - git clone https://github.com/verl-project/verl
  - cd verl
  - pip install --no-deps -e .
  - |
    if [ $DSTACK_NODE_RANK = 0 ]; then
      python3 examples/data_preprocess/gsm8k.py --local_save_dir ~/data/gsm8k
      python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2.5-7B-Instruct')"
      ray start --head --port=6379
    else
      ray start --address=$DSTACK_MASTER_NODE_IP:6379
    fi

ports:
  - 8265

resources:
  gpu: 80GB:8
  shm_size: 128GB

volumes:
  - /checkpoints:/checkpoints

Apply the task — dstack will provision nodes and forward the Ray dashboard to localhost:8265:

dstack apply -f ray-cluster.dstack.yml

Submit a training job to the cluster

While dstack apply is attached, submit a Ray job to the forwarded dashboard port:

RAY_ADDRESS=http://localhost:8265
ray job submit \
    -- python3 -m verl.trainer.main_ppo \
    data.train_files=/root/data/gsm8k/train.parquet \
    data.val_files=/root/data/gsm8k/test.parquet \
    data.train_batch_size=256 \
    data.max_prompt_length=512 \
    data.max_response_length=256 \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=64 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
    critic.optim.lr=1e-5 \
    critic.model.path=Qwen/Qwen2.5-7B-Instruct \
    critic.ppo_micro_batch_size_per_gpu=4 \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.project_name=ppo_training \
    trainer.experiment_name=qwen-2.5-7B \
    trainer.val_before_train=False \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=2 \
    trainer.default_local_dir=/checkpoints \
    trainer.save_freq=10 \
    trainer.test_freq=10 \
    trainer.total_epochs=15 2>&1 | tee verl_demo.log

Debugging Distributed Jobs

Ray Distributed Debugger (Recommended — Ray 2.39+)

Starting with Ray 2.39, Anyscale provides the Ray Distributed Debugger VSCode extension, which allows you to set breakpoints inside remote Ray tasks. Prerequisites:

Visual Studio Code
ray[default] >= 2.9.1
debugpy >= 1.8.0

Setup steps:

Set the environment variable to enable post-mortem debugging:
```
export RAY_DEBUG_POST_MORTEM=1
```
Remove any legacy debug flags before starting Ray: RAY_DEBUG=legacy and --ray-debugger-external are incompatible with the new debugger.
Add breakpoint() calls inside your @ray.remote functions. Breakpoints only work inside remote functions.
Run your job directly from the command line (do not use a launch.json):
```
python job.py
```
When the process hits a breakpoint(), click the Ray Distributed Debugger icon in the VSCode sidebar to attach.
For each subsequent breakpoint, disconnect the current session and click the icon again to attach to the next one.

Legacy Ray Debugger

For older Ray versions or environments where the VSCode extension is not available:

Start the cluster with legacy debug flags:

# Head node
RAY_DEBUG=legacy ray start --head --dashboard-host=0.0.0.0 --ray-debugger-external

# Worker nodes
RAY_DEBUG=legacy ray start --address='<head-node-ip>:6379' --ray-debugger-external

Add breakpoint() calls to your code, submit the job, then run:
```
ray debug
```
This command blocks until a breakpoint is hit, then drops you into a pdb-style REPL.

Get Started

Core Concepts

Algorithms

Workers & Engines

Advanced Usage

Configuration & Reference

Multi-Node Training with verl and Ray

Option 1: Launch Manually

Option 2: Launch via SkyPilot (Cloud or Kubernetes)

Option 3: Launch via Slurm

Option 4: Launch via dstack

Debugging Distributed Jobs

Ray Distributed Debugger (Recommended — Ray 2.39+)

Legacy Ray Debugger

Build docs developers (and LLMs) love

Get Started

Core Concepts

Algorithms

Workers & Engines

Advanced Usage

Configuration & Reference

Documentation Index

​Option 1: Launch Manually

​Option 2: Launch via SkyPilot (Cloud or Kubernetes)

​Option 3: Launch via Slurm

​Option 4: Launch via dstack

​Debugging Distributed Jobs

​Ray Distributed Debugger (Recommended — Ray 2.39+)

​Legacy Ray Debugger

Build docs developers (and LLMs) love

Option 1: Launch Manually

Option 2: Launch via SkyPilot (Cloud or Kubernetes)

Option 3: Launch via Slurm

Option 4: Launch via dstack

Debugging Distributed Jobs

Ray Distributed Debugger (Recommended — Ray 2.39+)

Legacy Ray Debugger