verl is organized as a layered system that separates concerns cleanly: algorithm logic lives in a single-process trainer, distributed computation runs inside worker groups, and the underlying model engines and rollout backends can be swapped out through configuration. Understanding how these layers fit together helps you know where to look when customizing a training run, debugging unexpected behavior, or adding support for a new model architecture.

Architectural Layers

The system stacks four layers from top to bottom:
┌─────────────────────────────────────────────────────┐
│  Trainer  (PPORayTrainer — single process, CPU)      │
│  Orchestrates RL loop, data loading, worker calls   │
├─────────────────────────────────────────────────────┤
│  WorkerGroups  (Ray remote actors — GPU nodes)       │
│  ActorRolloutRefWorker · TrainingWorker              │
├─────────────────────────────────────────────────────┤
│  Model Engines  (BaseEngine)                         │
│  FSDP · FSDP2 · Megatron-LM · TorchTitan           │
├─────────────────────────────────────────────────────┤
│  Rollout Backends  (BaseRollout)                     │
│  vLLM · SGLang · HuggingFace TGI                    │
└─────────────────────────────────────────────────────┘
Each layer communicates with the one immediately above or below it through well-defined interfaces, which is what makes individual components replaceable without touching the rest of the system.

Component Overview

PPORayTrainer

Single-process orchestrator. Manages the main RL training loop, data loading, and calls to worker group APIs. Runs entirely on CPU.

WorkerGroup

A group of Ray remote actors exposing a unified SPMD interface. Handles data dispatch and result collection on behalf of the trainer.

ActorRolloutRefWorker

Colocates the actor model, rollout engine, and reference policy on the same GPUs. Exposes generate_sequences, compute_log_prob, and compute_ref_log_prob.

TrainingWorker

Generic worker that pairs a training engine with an optimizer. Used for the critic, SFT training, reward model training, and similar roles.

PPORayTrainer

The trainer is the entry point for all algorithm logic. Its fit() method runs as a single process — no rank checks, no barrier synchronizations. It is responsible for:
  • Constructing WorkerGroup instances and binding them to resource pools.
  • Running the PPO main loop: rollout → reward → advantage → actor update → critic update.
  • Managing the dataloader and forwarding batches to workers.
  • Saving checkpoints at configurable intervals.
The trainer lives in verl/trainer/ppo/ray_trainer.py. The entry point that initializes Ray and launches the trainer is verl/trainer/main_ppo.py.
The main_task entry function (and the trainer’s fit()) must run as a single process. It is best not to schedule this process on the Ray head node because it holds all intermediate RL data in memory and can be memory-intensive.
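The loop order listed above can be sketched with stand-in worker groups. This is a minimal illustration, not verl code: `StubWorkerGroup`, `StubActorRolloutRef`, `StubCritic`, `run_ppo_step`, and the dict-based batch are hypothetical names invented here, and only the worker method names are taken from this page. The point is that the controller reads as ordinary sequential Python.

```python
class StubWorkerGroup:
    """Hypothetical stand-in for a WorkerGroup: logs each call, returns a copy of the batch."""

    def __init__(self, name, call_log):
        self.name = name
        self.call_log = call_log

    def _record(self, method, batch):
        self.call_log.append(f"{self.name}.{method}")
        return dict(batch)


class StubActorRolloutRef(StubWorkerGroup):
    def generate_sequences(self, batch):
        out = self._record("generate_sequences", batch)
        # Fake "generation": append a fixed suffix to every prompt.
        out["responses"] = [p + " -> answer" for p in out["prompts"]]
        return out

    def compute_log_prob(self, batch):
        return self._record("compute_log_prob", batch)

    def compute_ref_log_prob(self, batch):
        return self._record("compute_ref_log_prob", batch)

    def update_actor(self, batch):
        return self._record("update_actor", batch)


class StubCritic(StubWorkerGroup):
    def compute_values(self, batch):
        return self._record("compute_values", batch)

    def update_critic(self, batch):
        return self._record("update_critic", batch)


def run_ppo_step(prompts, actor_wg, critic_wg, reward_fn):
    """One iteration in the order this page lists; no rank checks, no barriers."""
    batch = actor_wg.generate_sequences({"prompts": prompts})
    batch = actor_wg.compute_log_prob(batch)
    batch = actor_wg.compute_ref_log_prob(batch)
    batch = critic_wg.compute_values(batch)
    batch["rewards"] = [float(reward_fn(r)) for r in batch["responses"]]
    # Placeholder advantage (mean-centred reward), NOT GAE. It only marks
    # where the controller-side computation would happen.
    mean = sum(batch["rewards"]) / len(batch["rewards"])
    batch["advantages"] = [r - mean for r in batch["rewards"]]
    actor_wg.update_actor(batch)
    critic_wg.update_critic(batch)
    return batch


call_log = []
result = run_ppo_step(
    ["2+2?", "Capital of France?"],
    StubActorRolloutRef("actor_rollout_ref_wg", call_log),
    StubCritic("critic_wg", call_log),
    reward_fn=len,  # toy rule-based reward: response length
)
print(call_log)
```

In a real run each stub call would be a remote dispatch to GPU workers; the controller code would keep the same shape.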

WorkerGroup and Worker Construction

Each WorkerGroup manages a list of Ray remote workers. During initialization, two things happen:
  1. Worker actors are created — each worker is a Ray remote class instantiated on a specific GPU within a RayResourcePool.
  2. Methods are bound — every method decorated with @register on the underlying worker class is dynamically bound to the WorkerGroup. Calling worker_group.generate_sequences(data) automatically dispatches, executes in parallel, and collects results.
Workers are defined with explicit dispatch semantics:
from verl import DataProto
from verl.single_controller.base import Worker
from verl.single_controller.base.decorator import register, Dispatch

class ActorRolloutRefWorker(Worker):
    @register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
    def generate_sequences(self, prompts: DataProto) -> DataProto:
        ...

    @register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
    def compute_log_prob(self, data: DataProto) -> DataProto:
        ...

    @register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
    def compute_ref_log_prob(self, data: DataProto) -> DataProto:
        ...

    @register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
    def update_actor(self, data: DataProto) -> DataProto:
        ...
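To make the bind-dispatch-collect idea concrete, here is a toy, in-process sketch. It is not verl's implementation: `toy_register`, `ToyWorker`, and `ToyWorkerGroup` are hypothetical names, the "dispatch mode" is just a string tag, and workers run sequentially rather than as Ray actors. It assumes the batch divides evenly across workers.

```python
from functools import partial


def toy_register(dispatch_mode):
    """Tag a method so the group knows to expose it (hypothetical analogue of @register)."""
    def decorator(fn):
        fn._dispatch_mode = dispatch_mode
        return fn
    return decorator


class ToyWorker:
    def __init__(self, rank):
        self.rank = rank

    @toy_register(dispatch_mode="dp_compute")
    def generate_sequences(self, prompts):
        return [f"{p} [rank {self.rank}]" for p in prompts]

    def helper(self):
        # Not decorated, so it is never bound to the group.
        return "internal"


class ToyWorkerGroup:
    def __init__(self, worker_cls, world_size):
        self.workers = [worker_cls(rank) for rank in range(world_size)]
        # Bind every tagged method of the worker class onto the group.
        for name, attr in vars(worker_cls).items():
            if callable(attr) and hasattr(attr, "_dispatch_mode"):
                setattr(self, name, partial(self._dispatch_and_collect, name))

    def _dispatch_and_collect(self, method_name, data):
        n = len(self.workers)
        if len(data) % n != 0:
            raise ValueError("batch size must be divisible by the number of workers")
        chunk = len(data) // n
        outputs = []
        for rank, worker in enumerate(self.workers):
            shard = data[rank * chunk:(rank + 1) * chunk]        # dispatch
            outputs.extend(getattr(worker, method_name)(shard))  # execute + collect
        return outputs


wg = ToyWorkerGroup(ToyWorker, world_size=2)
out = wg.generate_sequences(["a", "b", "c", "d"])
print(out)
```

The caller sees one method call on the group; the splitting and merging are hidden behind the bound method, which is the property the trainer relies on.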

WorkerGroup Roles in PPO

verl’s PPO implementation defines three worker groups by default:
1. ActorRolloutRef WorkerGroup

Colocates the actor model, rollout engine, and reference policy on the same set of GPUs. This colocation enables fast weight transfer between the training-format and inference-format representations using NCCL, avoiding expensive CPU round-trips. In LoRA setups, the reference policy is simply the frozen base model, so colocation is especially efficient.
2. Critic WorkerGroup

Manages the value function (critic) model. Exposes compute_values and update_critic. Can be assigned to the same GPU pool as the actor group or to a separate pool, controlled via resource pool configuration.
3. Reward WorkerGroup

Manages the reward model when model-based rewards are used. When only rule-based rewards are needed, this group may be omitted and the RewardManager runs directly on the controller process.

BaseEngine: Training Engine Interface

BaseEngine is the abstract interface that all training engine implementations satisfy. Concrete engines handle model construction, forward/backward passes, optimizer steps, gradient clipping, and checkpoint saving. Three production-grade engines ship with verl:
| Engine | Backend | Notes |
| --- | --- | --- |
| FSDP | PyTorch Fully Sharded Data Parallel | Default; works with any HuggingFace model |
| FSDP2 | PyTorch FSDP v2 | Improved memory efficiency and composability |
| Megatron-LM | NVIDIA Megatron tensor/pipeline parallelism | Required for very large models or pipeline stages |
Engine implementations live under verl/workers/engine/.

BaseRollout: Rollout Backend Interface

BaseRollout is the abstract interface for sequence generation backends. Swapping backends only requires a config change — the worker code is unaffected.
| Backend | Module | Notes |
| --- | --- | --- |
| vLLM | verl/workers/rollout/vllm/vllm_rollout.py | Default; continuous batching, paged KV cache |
| SGLang | verl/workers/rollout/sglang_rollout/ | Alternative high-throughput backend |
| HuggingFace TGI | verl/workers/rollout/hf_rollout.py | Reference/debugging backend |
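One common way to get config-only backend swapping is a name-to-class registry behind an abstract interface. The sketch below illustrates that pattern under stated assumptions; it does not reproduce verl's actual lookup. `ToyRollout`, `ROLLOUT_REGISTRY`, `register_rollout`, `build_rollout`, the `"name"` config key, and the two toy backends are all hypothetical.

```python
from abc import ABC, abstractmethod


class ToyRollout(ABC):
    """Hypothetical stand-in for an abstract rollout interface."""

    @abstractmethod
    def generate_sequences(self, prompts):
        ...


ROLLOUT_REGISTRY = {}


def register_rollout(name):
    """Class decorator that records a backend under a config-visible name."""
    def decorator(cls):
        ROLLOUT_REGISTRY[name] = cls
        return cls
    return decorator


@register_rollout("echo")
class EchoRollout(ToyRollout):
    def generate_sequences(self, prompts):
        return list(prompts)


@register_rollout("upper")
class UpperRollout(ToyRollout):
    def generate_sequences(self, prompts):
        return [p.upper() for p in prompts]


def build_rollout(config):
    """Pick the backend from config; callers never name a concrete class."""
    name = config["name"]
    if name not in ROLLOUT_REGISTRY:
        raise ValueError(f"unknown rollout backend: {name!r}")
    return ROLLOUT_REGISTRY[name]()


def worker_generate(config, prompts):
    # "Worker" code depends only on the abstract interface.
    return build_rollout(config).generate_sequences(prompts)


print(worker_generate({"name": "echo"}, ["hi"]))
print(worker_generate({"name": "upper"}, ["hi"]))
```

Because `worker_generate` only touches the interface, changing `{"name": ...}` is the whole migration.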

The 3D-HybridEngine: Resharding Between Training and Inference

One of the key efficiency mechanisms in verl is the 3D-HybridEngine, which manages the transition between the training-format and inference-format weight representations of the actor model within the same GPU pool.
During training, the actor model weights are sharded using FSDP (data parallelism) or Megatron (tensor + pipeline parallelism). During inference/rollout, the same weights need to be arranged in a tensor-parallel layout that vLLM or SGLang expects.
Rather than transferring weights to a separate GPU pool over the network, the 3D-HybridEngine reshards the weights in-place using NCCL collective operations, transforming from the training layout to the inference layout and back again between steps. This is why actor and rollout are colocated on the same GPUs in ActorRolloutRefWorker.
Training step (FSDP layout)
        ↓  reshard via NCCL  (3D-HybridEngine)
Rollout step (tensor-parallel layout for vLLM)
        ↓  reshard via NCCL  (3D-HybridEngine)
Training step (FSDP layout)
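The layout change can be illustrated with a single-process toy that uses plain Python lists in place of GPU tensors and NCCL collectives. All function names are hypothetical, and the specific layouts are assumptions for illustration: flat even shards for the training side and a column split for the tensor-parallel side. Real engines choose layouts per layer.

```python
def fsdp_shard(weight, world_size):
    """Training-style layout: flatten row-major, give each rank an equal flat slice."""
    flat = [x for row in weight for x in row]
    size = len(flat) // world_size
    return [flat[r * size:(r + 1) * size] for r in range(world_size)]


def gather_full(flat_shards, n_cols):
    """Stand-in for an all-gather: rebuild the full matrix from flat shards."""
    flat = [x for shard in flat_shards for x in shard]
    return [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]


def tp_shard(weight, world_size):
    """Inference-style layout: each rank keeps a contiguous block of columns."""
    width = len(weight[0]) // world_size
    return [[row[r * width:(r + 1) * width] for row in weight]
            for r in range(world_size)]


def tp_gather(col_shards):
    """Rebuild the full matrix by concatenating column blocks row by row."""
    n_rows = len(col_shards[0])
    return [[x for shard in col_shards for x in shard[i]] for i in range(n_rows)]


weight = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
world_size = 2

train_shards = fsdp_shard(weight, world_size)                   # training layout
rollout_shards = tp_shard(gather_full(train_shards, 4), world_size)  # reshard for rollout
back_to_train = fsdp_shard(tp_gather(rollout_shards), world_size)    # reshard back

print(train_shards[0])    # rank 0 holds the first two rows, flattened
print(rollout_shards[0])  # rank 0 holds the first two columns
```

The same values live on the same set of ranks in both layouts; only which slice each rank owns changes, which is why no second GPU pool is needed.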

Resource Placement: RayResourcePool

RayResourcePool maps a set of GPUs in the Ray cluster to a WorkerGroup. You can assign multiple worker groups to the same pool (colocated) or to separate pools.
from verl.single_controller.ray.base import RayResourcePool, RayWorkerGroup, RayClassWithInitArgs

# One pool, all worker groups share GPUs (colocated)
resource_pool = RayResourcePool(
    process_on_nodes=[8, 8],  # 8 GPUs on each of 2 nodes = 16 GPUs total
    max_colocate_count=3,     # up to 3 WorkerGroups share this pool
)
Colocated placement avoids inter-node data transfers between actor and critic. Separate placement allows the critic or reward model to use a different number of GPUs from the actor, which is useful when model sizes differ significantly. The mapping from roles to resource pools is declared in the trainer config under trainer.n_gpus_per_node, trainer.nnodes, and actor_rollout_ref.* / critic.* / reward.* settings.
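The bookkeeping behind colocated versus separate placement can be sketched without Ray. The class and function below are hypothetical (`ToyResourcePool`, `check_placement`), as are the pool and role names. How verl itself enforces `max_colocate_count` is not described on this page, so the check here is an assumption that only mirrors the meaning of the two parameters shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyResourcePool:
    """Hypothetical model of a pool: GPUs per node plus a colocation limit."""
    process_on_nodes: tuple
    max_colocate_count: int = 1

    @property
    def world_size(self):
        # Total processes (one per GPU) across all nodes in the pool.
        return sum(self.process_on_nodes)


def check_placement(role_to_pool, pools):
    """Group roles by pool; reject a pool hosting more roles than it allows."""
    placement = {name: [] for name in pools}
    for role, pool_name in role_to_pool.items():
        placement[pool_name].append(role)
    for name, roles in placement.items():
        if len(roles) > pools[name].max_colocate_count:
            raise ValueError(f"pool {name!r} hosts {len(roles)} roles, "
                             f"limit is {pools[name].max_colocate_count}")
    return placement


# Colocated: one 2-node pool shared by all three roles.
colocated_pools = {"global_pool": ToyResourcePool((8, 8), max_colocate_count=3)}
colocated = check_placement(
    {"actor_rollout_ref": "global_pool", "critic": "global_pool", "reward": "global_pool"},
    colocated_pools,
)

# Separate: the critic gets its own, smaller pool.
separate_pools = {
    "actor_pool": ToyResourcePool((8, 8)),
    "critic_pool": ToyResourcePool((4,)),
}
separate = check_placement(
    {"actor_rollout_ref": "actor_pool", "critic": "critic_pool"},
    separate_pools,
)

print(colocated_pools["global_pool"].world_size, colocated)
print(separate_pools["critic_pool"].world_size, separate)
```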

Data Flow During a PPO Step

The following steps trace data through the system for a single PPO iteration:
1. Prompt Sampling

The trainer samples a batch of prompts from the dataloader and wraps them in a DataProto.
2. Rollout

actor_rollout_ref_wg.generate_sequences(prompts) dispatches prompts to the rollout workers. vLLM (or SGLang) generates responses using the current actor weights. The response tokens, along with prompt tokens, are returned as a DataProto.
3. Log-Probability Computation

actor_rollout_ref_wg.compute_log_prob(output) and compute_ref_log_prob(output) run forward passes to compute token-level log-probabilities under the current actor and the frozen reference policy, respectively.
4. Value Estimation

critic_wg.compute_values(output) runs the value function forward pass to produce per-token value estimates.
5. Reward Computation

The RewardManager (or reward_wg.compute_scores(output) for model-based rewards) computes a scalar reward for each response.
6. Advantage Computation

compute_advantages(values, rewards) runs on the controller process (no GPU required) and computes per-token advantages using GAE.
7. Policy and Critic Update

actor_rollout_ref_wg.update_actor(output) and critic_wg.update_critic(output) run the backward pass and optimizer step on the respective workers.
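The advantage step is cheap enough to run on the controller because GAE is a single backward pass over each response. The sketch below is the textbook recurrence for one sequence, not verl's function: `compute_gae`, its signature, and the `gamma` / `lam` parameter names are assumptions made here. It also assumes the value after the final token is zero and skips masking and normalization.

```python
def compute_gae(rewards, values, gamma=1.0, lam=1.0):
    """Generalized Advantage Estimation for a single response.

    delta_t = r_t + gamma * V_{t+1} - V_t
    A_t     = delta_t + gamma * lam * A_{t+1}
    """
    advantages = [0.0] * len(rewards)
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        # Assumption: the episode ends after the last token, so V_{T} = 0.
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    # Returns are the regression targets a critic update would use.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# A 3-token response whose scalar reward is placed on the final token.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5]

adv_mc, ret_mc = compute_gae(rewards, values, gamma=1.0, lam=1.0)
adv_td, ret_td = compute_gae(rewards, values, gamma=1.0, lam=0.0)
print(adv_mc, ret_mc)  # [0.5, 0.5, 0.5] [1.0, 1.0, 1.0]
print(adv_td, ret_td)  # [0.0, 0.0, 0.5] [0.5, 0.5, 1.0]
```

With `lam=1.0` the final reward propagates to every token; with `lam=0.0` each token only sees its one-step TD error, so only the last token gets credit.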

Repository Organization

The key source files are organized as follows:
verl/
  trainer/
    main_ppo.py          # Entry point: initializes Ray, launches trainer
    ppo/
      ray_trainer.py     # PPORayTrainer: main RL loop
    sft_trainer.py       # SFT trainer (FSDP backend)
  config/
    ppo_trainer.yaml     # Full RL trainer configuration template
  workers/
    engine_workers.py    # ActorRolloutRefWorker, TrainingWorker
    engine/
      fsdp/              # FSDP / FSDP2 engine implementations
      megatron/          # Megatron-LM engine implementations
    rollout/
      vllm/              # vLLM rollout backend
      sglang_rollout/    # SGLang rollout backend
      hf_rollout.py      # HuggingFace TGI rollout backend
    reward_manager/      # NaiveRewardManager, BatchRewardManager, etc.
  single_controller/
    ray/
      base.py            # RayWorkerGroup, RayResourcePool
    base/
      decorator.py       # @register, Dispatch modes
      worker_group.py    # WorkerGroup base class
  utils/
    reward_score/        # Rule-based reward functions (GSM8K, MATH, …)
    dataset/             # Dataset utilities for SFT / RM / RL
