Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/verl-project/verl/llms.txt

Use this file to discover all available pages before exploring further.

verl uses Ray as its distributed execution backend, which makes scaling from a single node to a full cluster a matter of pointing the trainer at an existing Ray cluster and setting the trainer.nnodes and trainer.n_gpus_per_node config values. This page covers four ways to bring up a multi-node cluster: manual setup, SkyPilot on cloud or Kubernetes, Slurm, and dstack.
All nodes in the cluster must be able to reach each other over the network. Placing them on the same subnet (or using a high-bandwidth interconnect such as InfiniBand) is strongly recommended. Port 6379 (Ray GCS) and 8265 (Ray dashboard) must be open between nodes.

Option 1: Launch Manually

This is the simplest approach when you have direct SSH access to a fixed set of machines.
1

Start the head node

On the machine you designate as the head node, start Ray with the dashboard exposed:
ray start --head --dashboard-host=0.0.0.0
Ray will print two addresses:
  • GCS address (e.g., 10.0.0.1:6379) — worker nodes use this to join the cluster
  • Dashboard address (e.g., 10.0.0.1:8265) — use this to submit jobs and inspect the cluster in a browser
2

Join worker nodes

On each additional machine, connect to the head node using the GCS address printed above:
ray start --address='<head-node-ip>:6379'
After all workers have joined, verify the cluster from the head node:
ray status
You should see all nodes listed with their GPU and CPU resources.
3

Submit a training job

Submit the verl training job to the cluster using the Ray Jobs API. Pass the dashboard address with --address:
ray job submit --address="http://127.0.0.1:8265" \
    --runtime-env=verl/trainer/runtime_env.yaml \
    --no-wait \
    -- \
    python3 -m verl.trainer.main_ppo \
        data.train_files=/path/to/train.parquet \
        data.val_files=/path/to/val.parquet \
        data.train_batch_size=1024 \
        data.max_prompt_length=512 \
        data.max_response_length=512 \
        actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
        actor_rollout_ref.actor.optim.lr=1e-6 \
        actor_rollout_ref.actor.ppo_mini_batch_size=256 \
        actor_rollout_ref.rollout.name=vllm \
        actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
        actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
        critic.optim.lr=1e-5 \
        critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
        algorithm.kl_ctrl.kl_coef=0.001 \
        trainer.n_gpus_per_node=8 \
        trainer.nnodes=2 \
        trainer.total_epochs=15
Key multi-node config fields:
Config keyDescription
trainer.nnodesTotal number of nodes in the cluster
trainer.n_gpus_per_nodeNumber of GPUs per node
4

Monitor and manage the job

After submission you receive a Submission ID. Use these commands to manage and inspect the job:
# List all submitted jobs
ray job list

# Stream logs continuously
ray job logs <Submission ID> --follow

# Check job status
ray job status <Submission ID>

# Stop a running job
ray job stop <Submission ID>
Driver and task logs are also written to /tmp/ray/session_latest/logs/. The driver log is named job-driver-raysubmit_<Submission ID>.log.
The Ray dashboard at <head-node-ip>:8265 provides the most structured view of job status, actor placement, GPU utilization, and timelines. Use it for multi-node debugging — it is far more convenient than raw log files.

Option 2: Launch via SkyPilot (Cloud or Kubernetes)

SkyPilot abstracts cloud provisioning and cluster management. Ready-to-use configurations for verl are in examples/tutorial/skypilot/:
  • verl-ppo.yaml — PPO training with GSM8K
  • verl-grpo.yaml — GRPO training with the MATH dataset
  • verl-multiturn-tools.yaml — Multi-turn tool usage training
1

Install SkyPilot and configure your cloud

The example below uses GCP. Adapt the pip install and credential steps for your cloud:
conda create -y -n sky python=3.10
conda activate sky
pip install "skypilot[gcp]"

conda install -c conda-forge google-cloud-sdk
gcloud init
gcloud auth application-default login

# Verify credentials
sky check gcp
2

Prepare the dataset locally

git clone https://github.com/verl-project/verl.git
cd examples/data_preprocess
python3 gsm8k.py --local_save_dir ~/data/gsm8k
3

Create a SkyPilot YAML and launch

Create verl-cluster.yml:
resources:
  accelerators: L4:1       # one L4 GPU per node
  image_id: docker:verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.0-fa2.7.4
  memory: 64+
  ports: 8265              # expose Ray dashboard

num_nodes: 2

workdir: .                 # sync local directory to remote cluster

secrets:
  WANDB_API_KEY: null

file_mounts:
  data/gsm8k: ~/data/gsm8k

setup: |
  rm -rf verl
  git clone https://github.com/verl-project/verl.git
  cd verl
  pip3 install -v -e .[vllm]

run: |
  head_ip=`echo "$SKYPILOT_NODE_IPS" | head -n1`
  num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`

  python3 -c "import wandb; wandb.login(relogin=True, key='$WANDB_API_KEY')"

  if [ "$SKYPILOT_NODE_RANK" == "0" ]; then
    ray start --head --disable-usage-stats \
          --port=6379 \
          --dashboard-host=0.0.0.0 \
          --dashboard-port=8265

    while [ $(ray nodes | grep NODE_ID | wc -l) -lt $num_nodes ]; do
      echo "Waiting for all nodes... ($(ray nodes | grep NODE_ID | wc -l)/$num_nodes)"
      sleep 5
    done

    python3 -m verl.trainer.main_ppo \
      data.train_files=data/gsm8k/train.parquet \
      data.val_files=data/gsm8k/test.parquet \
      data.train_batch_size=256 \
      data.max_prompt_length=512 \
      data.max_response_length=256 \
      actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
      actor_rollout_ref.actor.optim.lr=1e-6 \
      actor_rollout_ref.actor.ppo_mini_batch_size=64 \
      actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
      actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
      actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
      actor_rollout_ref.rollout.name=vllm \
      actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
      actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
      critic.optim.lr=1e-5 \
      critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
      critic.ppo_micro_batch_size_per_gpu=4 \
      algorithm.kl_ctrl.kl_coef=0.001 \
      trainer.logger=['console','wandb'] \
      trainer.val_before_train=False \
      trainer.default_hdfs_dir=null \
      trainer.n_gpus_per_node=1 \
      trainer.nnodes=2 \
      trainer.save_freq=20 \
      trainer.test_freq=20 \
      trainer.total_epochs=2 \
      trainer.project_name=verl_examples \
      trainer.experiment_name=experiment_name_gsm8k
  else
    sleep 10
    ray start --address $head_ip:6379 --disable-usage-stats
    sleep 5
  fi
Launch the cluster:
export WANDB_API_KEY=<your-wandb-api-key>
sky launch -c verl --secret WANDB_API_KEY verl-cluster.yml
While sky launch is attached, the Ray dashboard is forwarded to localhost:8265.To get the dashboard URL after the fact:
sky status --endpoint 8265 verl
To inspect checkpoints on the head node:
ssh verl
ls ~/sky_workdir/checkpoints/verl_examples/

Option 3: Launch via Slurm

Ray provides an official Slurm user guide. verl has been verified on multi-node Slurm clusters. A ready-to-use job script is at examples/tutorial/slurm/ray_on_slurm.slurm.
1

(Optional) Convert the Docker image to Apptainer/Singularity

If your cluster supports Apptainer or Singularity:
apptainer pull /your/dest/dir/verl.sif \
    docker://verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3
Alternatively, use the container runtime supported by your cluster (e.g., via Slurm’s OCI support).
2

Prepare dataset and model

Follow Step 1 and Step 2 of the Quickstart guide to produce the Parquet files and download your model checkpoint on a node with storage access.
3

Modify and submit the job script

Edit examples/tutorial/slurm/ray_on_slurm.slurm with your cluster’s node names, GPU counts, and paths. Then submit:
sbatch ray_on_slurm.slurm
The script handles Ray head and worker startup automatically using SLURM_JOB_NODELIST and SLURM_NODE_RANK environment variables.
Slurm cluster configurations vary significantly. If you encounter issues, refer to Ray’s Slurm community guide for common caveats around environment variables and resource specifications.

Option 4: Launch via dstack

dstack is an open-source container orchestrator that simplifies distributed training across cloud providers and on-premises environments without requiring Kubernetes or Slurm.
1

Install dstack and initialize a project

After installing dstack, initialize your working directory:
mkdir myproject && cd myproject
dstack init
Then create a fleet (a pool of reserved instances) through the dstack CLI or UI before submitting jobs.
2

Define a Ray cluster task

Create ray-cluster.dstack.yml:
type: task
name: ray-verl-cluster

nodes: 2

env:
  - WANDB_API_KEY
  - PYTHONUNBUFFERED=1
  - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

image: verlai/verl:app-verl0.6-transformers4.56.1-sglang0.5.2-mcore0.13.0-te2.2
commands:
  - git clone https://github.com/verl-project/verl
  - cd verl
  - pip install --no-deps -e .
  - |
    if [ $DSTACK_NODE_RANK = 0 ]; then
      python3 examples/data_preprocess/gsm8k.py --local_save_dir ~/data/gsm8k
      python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2.5-7B-Instruct')"
      ray start --head --port=6379
    else
      ray start --address=$DSTACK_MASTER_NODE_IP:6379
    fi

ports:
  - 8265

resources:
  gpu: 80GB:8
  shm_size: 128GB

volumes:
  - /checkpoints:/checkpoints
Apply the task — dstack will provision nodes and forward the Ray dashboard to localhost:8265:
dstack apply -f ray-cluster.dstack.yml
3

Submit a training job to the cluster

While dstack apply is attached, submit a Ray job to the forwarded dashboard port:
RAY_ADDRESS=http://localhost:8265
ray job submit \
    -- python3 -m verl.trainer.main_ppo \
    data.train_files=/root/data/gsm8k/train.parquet \
    data.val_files=/root/data/gsm8k/test.parquet \
    data.train_batch_size=256 \
    data.max_prompt_length=512 \
    data.max_response_length=256 \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=64 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
    critic.optim.lr=1e-5 \
    critic.model.path=Qwen/Qwen2.5-7B-Instruct \
    critic.ppo_micro_batch_size_per_gpu=4 \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.project_name=ppo_training \
    trainer.experiment_name=qwen-2.5-7B \
    trainer.val_before_train=False \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=2 \
    trainer.default_local_dir=/checkpoints \
    trainer.save_freq=10 \
    trainer.test_freq=10 \
    trainer.total_epochs=15 2>&1 | tee verl_demo.log

Debugging Distributed Jobs

Starting with Ray 2.39, Anyscale provides the Ray Distributed Debugger VSCode extension, which allows you to set breakpoints inside remote Ray tasks. Prerequisites:
  • Visual Studio Code
  • ray[default] >= 2.9.1
  • debugpy >= 1.8.0
Setup steps:
  1. Set the environment variable to enable post-mortem debugging:
    export RAY_DEBUG_POST_MORTEM=1
    
    Remove any legacy debug flags before starting Ray: RAY_DEBUG=legacy and --ray-debugger-external are incompatible with the new debugger.
  2. Add breakpoint() calls inside your @ray.remote functions. Breakpoints only work inside remote functions.
  3. Run your job directly from the command line (do not use a launch.json):
    python job.py
    
  4. When the process hits a breakpoint(), click the Ray Distributed Debugger icon in the VSCode sidebar to attach.
  5. For each subsequent breakpoint, disconnect the current session and click the icon again to attach to the next one.

Legacy Ray Debugger

For older Ray versions or environments where the VSCode extension is not available:
  1. Start the cluster with legacy debug flags:
    # Head node
    RAY_DEBUG=legacy ray start --head --dashboard-host=0.0.0.0 --ray-debugger-external
    
    # Worker nodes
    RAY_DEBUG=legacy ray start --address='<head-node-ip>:6379' --ray-debugger-external
    
  2. Add breakpoint() calls to your code, submit the job, then run:
    ray debug
    
    This command blocks until a breakpoint is hit, then drops you into a pdb-style REPL.

Build docs developers (and LLMs) love