verl System Architecture and Component Overview

verl is organized as a layered system that separates concerns cleanly: algorithm logic lives in a single-process trainer, distributed computation runs inside worker groups, and the underlying model engines and rollout backends can be swapped out through configuration. Understanding how these layers fit together helps you know where to look when customizing a training run, debugging unexpected behavior, or adding support for a new model architecture.

Architectural Layers

The system stacks four layers from top to bottom:

┌─────────────────────────────────────────────────────┐
│  Trainer  (PPORayTrainer — single process, CPU)      │
│  Orchestrates RL loop, data loading, worker calls   │
├─────────────────────────────────────────────────────┤
│  WorkerGroups  (Ray remote actors — GPU nodes)       │
│  ActorRolloutRefWorker · TrainingWorker              │
├─────────────────────────────────────────────────────┤
│  Model Engines  (BaseEngine)                         │
│  FSDP · FSDP2 · Megatron-LM · TorchTitan           │
├─────────────────────────────────────────────────────┤
│  Rollout Backends  (BaseRollout)                     │
│  vLLM · SGLang · HuggingFace Transformers           │
└─────────────────────────────────────────────────────┘

Each layer communicates with the one immediately above or below it through well-defined interfaces, which is what makes individual components replaceable without touching the rest of the system.

Component Overview

PPORayTrainer

Single-process orchestrator. Manages the main RL training loop, data loading, and calls to worker group APIs. Runs entirely on CPU.

WorkerGroup

A group of Ray remote actors exposing a unified SPMD interface. Handles data dispatch and result collection on behalf of the trainer.

ActorRolloutRefWorker

Colocates the actor model, rollout engine, and reference policy on the same GPUs. Exposes generate_sequences, compute_log_prob, and compute_ref_log_prob.

TrainingWorker

Generic worker that pairs a training engine with an optimizer. Used for the critic, SFT training, reward model training, and similar roles.

PPORayTrainer

The trainer is the entry point for all algorithm logic. Its fit() method runs as a single process, so algorithm code needs no rank checks or barrier synchronizations. It is responsible for:

Constructing WorkerGroup instances and binding them to resource pools.
Running the PPO main loop: rollout → reward → advantage → actor update → critic update.
Managing the dataloader and forwarding batches to workers.
Saving checkpoints at configurable intervals.

The trainer lives in verl/trainer/ppo/ray_trainer.py. The entry point that initializes Ray and launches the trainer is verl/trainer/main_ppo.py.

The main_task entry function (and the trainer’s fit()) must run as a single process. It is best not to schedule this process on the Ray head node because it holds all intermediate RL data in memory and can be memory-intensive.

WorkerGroup and Worker Construction

Each WorkerGroup manages a list of Ray remote workers. During initialization, two things happen:

1. Worker actors are created: each worker is a Ray remote class instantiated on a specific GPU within a RayResourcePool.
2. Methods are bound: every method decorated with @register on the underlying worker class is dynamically bound to the WorkerGroup. Calling worker_group.generate_sequences(data) then dispatches the data, executes the method on all workers in parallel, and collects the results.

Workers are defined with explicit dispatch semantics:

from verl import DataProto
from verl.single_controller.base import Worker
from verl.single_controller.base.decorator import register, Dispatch

class ActorRolloutRefWorker(Worker):
    # Each @register call declares how the WorkerGroup splits the input
    # across workers and merges the per-worker outputs.
    @register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
    def generate_sequences(self, prompts: DataProto) -> DataProto:
        ...

    @register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
    def compute_log_prob(self, data: DataProto) -> DataProto:
        ...

    @register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
    def compute_ref_log_prob(self, data: DataProto) -> DataProto:
        ...

    @register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
    def update_actor(self, data: DataProto) -> DataProto:
        ...
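The bind-then-dispatch behavior described above can be illustrated with a toy model in plain Python. This is a minimal sketch, not verl's implementation: `ToyWorker`, `ToyWorkerGroup`, the string-valued dispatch mode, and the local `register` decorator are all hypothetical stand-ins, and the workers run sequentially in one process rather than as Ray actors.

```python
from typing import Callable, List


def register(dispatch_mode: str) -> Callable:
    """Toy decorator (hypothetical): tags a method with its dispatch mode."""
    def wrap(fn: Callable) -> Callable:
        fn._dispatch_mode = dispatch_mode  # marker the group looks for later
        return fn
    return wrap


class ToyWorker:
    """Hypothetical worker; a real one would be a Ray actor pinned to a GPU."""

    def __init__(self, rank: int) -> None:
        self.rank = rank

    @register(dispatch_mode="dp_compute")
    def generate_sequences(self, prompts: List[str]) -> List[str]:
        return [f"{p} -> response(rank={self.rank})" for p in prompts]


class ToyWorkerGroup:
    """Binds every registered worker method onto the group at construction."""

    def __init__(self, workers: List[ToyWorker]) -> None:
        self.workers = workers
        for name in dir(ToyWorker):
            method = getattr(ToyWorker, name)
            if callable(method) and hasattr(method, "_dispatch_mode"):
                setattr(self, name, self._bind(name))

    def _bind(self, name: str) -> Callable[[list], list]:
        def call(batch: list) -> list:
            n = len(self.workers)
            # Dispatch: split the batch into one contiguous chunk per worker.
            size = (len(batch) + n - 1) // n
            chunks = [batch[i * size:(i + 1) * size] for i in range(n)]
            # Execute: sequential here; real workers would run in parallel.
            outputs = [getattr(w, name)(c) for w, c in zip(self.workers, chunks)]
            # Collect: concatenate per-worker results in rank order.
            return [item for out in outputs for item in out]
        return call


group = ToyWorkerGroup([ToyWorker(0), ToyWorker(1)])
results = group.generate_sequences(["a", "b", "c", "d"])
```

The caller sees a single method call on the group, while the split and merge logic stays in one place and is selected by the tag on the worker method.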

WorkerGroup Roles in PPO

verl’s PPO implementation defines three worker groups by default:

ActorRolloutRef WorkerGroup

Colocates the actor model, rollout engine, and reference policy on the same set of GPUs. This colocation enables fast weight transfer between the training-format and inference-format representations using NCCL, avoiding expensive CPU round-trips. In LoRA setups, the reference policy is simply the frozen base model, so colocation is especially efficient.

Critic WorkerGroup

Manages the value function (critic) model. Exposes compute_values and update_critic. Can be assigned to the same GPU pool as the actor group or to a separate pool, controlled via resource pool configuration.

Reward WorkerGroup

Manages the reward model when model-based rewards are used. When only rule-based rewards are needed, this group may be omitted and the RewardManager runs directly on the controller process.

BaseEngine: Training Engine Interface

BaseEngine is the abstract interface that all training engine implementations satisfy. Concrete engines handle model construction, forward/backward passes, optimizer steps, gradient clipping, and checkpoint saving. The table below describes three production-grade engines that ship with verl:

| Engine | Backend | Notes |
| --- | --- | --- |
| FSDP | PyTorch Fully Sharded Data Parallel | Default; works with any HuggingFace model |
| FSDP2 | PyTorch FSDP v2 | Improved memory efficiency and composability |
| Megatron-LM | NVIDIA Megatron tensor/pipeline parallelism | Required for very large models or pipeline stages |

Engine implementations live under verl/workers/engine/.

BaseRollout: Rollout Backend Interface

BaseRollout is the abstract interface for sequence generation backends. Swapping backends only requires a config change — the worker code is unaffected.

| Backend | Module | Notes |
| --- | --- | --- |
| vLLM | `verl/workers/rollout/vllm/vllm_rollout.py` | Default; continuous batching, paged KV cache |
| SGLang | `verl/workers/rollout/sglang_rollout/` | Alternative high-throughput backend |
| HuggingFace Transformers | `verl/workers/rollout/hf_rollout.py` | Reference/debugging backend |
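One common way to get this kind of config-driven swapping is a registry keyed by backend name. The sketch below is an illustration of that pattern only: `ToyRollout`, `ROLLOUT_REGISTRY`, `register_rollout`, `build_rollout`, and the `rollout.name` config key are hypothetical names chosen for the example and are not taken from verl's code.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Type


class ToyRollout(ABC):
    """Hypothetical stand-in for an abstract rollout interface."""

    @abstractmethod
    def generate(self, prompts: List[str]) -> List[str]:
        ...


ROLLOUT_REGISTRY: Dict[str, Type[ToyRollout]] = {}


def register_rollout(name: str) -> Callable[[Type[ToyRollout]], Type[ToyRollout]]:
    """Record a backend class under a config-visible name."""
    def wrap(cls: Type[ToyRollout]) -> Type[ToyRollout]:
        ROLLOUT_REGISTRY[name] = cls
        return cls
    return wrap


@register_rollout("backend_a")
class BackendA(ToyRollout):
    def generate(self, prompts: List[str]) -> List[str]:
        return [p + " [A]" for p in prompts]


@register_rollout("backend_b")
class BackendB(ToyRollout):
    def generate(self, prompts: List[str]) -> List[str]:
        return [p + " [B]" for p in prompts]


def build_rollout(config: dict) -> ToyRollout:
    """Look up the backend named in the config and instantiate it."""
    name = config["rollout"]["name"]
    if name not in ROLLOUT_REGISTRY:
        raise ValueError(f"unknown rollout backend: {name!r}")
    return ROLLOUT_REGISTRY[name]()


def worker_step(config: dict, prompts: List[str]) -> List[str]:
    """Worker code depends only on the interface, never on a concrete backend."""
    return build_rollout(config).generate(prompts)


out_a = worker_step({"rollout": {"name": "backend_a"}}, ["hi"])
out_b = worker_step({"rollout": {"name": "backend_b"}}, ["hi"])
```

Because `worker_step` only touches the abstract interface, changing the configured name is the only change needed to switch backends in this toy setup.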

The 3D-HybridEngine: Resharding Between Training and Inference

One of the key efficiency mechanisms in verl is the 3D-HybridEngine, which manages the transition between the training-format and inference-format weight representations of the actor model within the same GPU pool. During training, the actor model weights are sharded using FSDP (data parallelism) or Megatron (tensor + pipeline parallelism). During inference/rollout, the same weights need to be arranged in the tensor-parallel layout that vLLM or SGLang expects.

Rather than transferring weights to a separate GPU pool over the network, the 3D-HybridEngine reshards the weights in place using NCCL collective operations, transforming from the training layout to the inference layout and back again between steps. This is why the actor and rollout are colocated on the same GPUs in ActorRolloutRefWorker.

Training step (FSDP layout)
        ↓  reshard via NCCL  (3D-HybridEngine)
Rollout step (tensor-parallel layout for vLLM)
        ↓  reshard via NCCL  (3D-HybridEngine)
Training step (FSDP layout)
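To make the layout change concrete, the following toy example reshards a small matrix between a flat, evenly split layout (loosely analogous to FSDP sharding) and a column-split layout (loosely analogous to tensor parallelism). It is a conceptual sketch under strong assumptions: everything runs in one process on Python lists, `all_gather` is a plain concatenation standing in for a collective, the shapes divide evenly, and all function names are hypothetical. Real resharding operates on GPU tensors and is considerably more involved.

```python
from typing import List

Matrix = List[List[float]]


def shard_flat(weight: Matrix, world: int) -> List[List[float]]:
    """Training-style layout: flatten row-major, give each rank an equal slice."""
    flat = [x for row in weight for x in row]
    size = len(flat) // world  # assumes len(flat) is divisible by world
    return [flat[r * size:(r + 1) * size] for r in range(world)]


def all_gather(shards: List[List[float]]) -> List[float]:
    """Stand-in for a collective all-gather: rebuild the full flat tensor."""
    return [x for shard in shards for x in shard]


def to_tensor_parallel(shards: List[List[float]], rows: int, cols: int,
                       world: int) -> List[Matrix]:
    """Inference-style layout: each rank keeps a contiguous block of columns."""
    flat = all_gather(shards)
    full = [flat[r * cols:(r + 1) * cols] for r in range(rows)]
    width = cols // world  # assumes cols is divisible by world
    return [[row[k * width:(k + 1) * width] for row in full] for k in range(world)]


def to_flat_shards(tp_shards: List[Matrix], world: int) -> List[List[float]]:
    """Inverse transform: stitch column blocks together, then re-split flat."""
    rows = len(tp_shards[0])
    full = [[x for block in tp_shards for x in block[r]] for r in range(rows)]
    return shard_flat(full, world)


weight = [[float(4 * r + c) for c in range(4)] for r in range(4)]
train_shards = shard_flat(weight, world=2)
infer_shards = to_tensor_parallel(train_shards, rows=4, cols=4, world=2)
round_trip = to_flat_shards(infer_shards, world=2)
```

The round trip returns the original training shards, which mirrors the training, rollout, training cycle shown in the diagram.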

Resource Placement: RayResourcePool

RayResourcePool maps a set of GPUs in the Ray cluster to a WorkerGroup. You can assign multiple worker groups to the same pool (colocated) or to separate pools.

from verl.single_controller.ray.base import RayResourcePool, RayWorkerGroup, RayClassWithInitArgs

# One pool, all worker groups share GPUs (colocated)
resource_pool = RayResourcePool(
    process_on_nodes=[8, 8],  # 8 GPUs on each of 2 nodes = 16 GPUs total
    max_colocate_count=3,     # up to 3 WorkerGroups share this pool
)

Colocated placement avoids inter-node data transfers between actor and critic. Separate placement allows the critic or reward model to use a different number of GPUs from the actor, which is useful when model sizes differ significantly. The mapping from roles to resource pools is declared in the trainer config under trainer.n_gpus_per_node, trainer.nnodes, and actor_rollout_ref.* / critic.* / reward.* settings.
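The difference between colocated and separate placement can be modeled with a few lines of plain Python. This is a toy model for illustration only: `ToyResourcePool` and `place` are hypothetical and do not reproduce verl's RayResourcePool; the only behavior borrowed from the text above is that a pool spans a list of per-node GPU counts and admits a bounded number of worker groups.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyResourcePool:
    """Hypothetical model of a GPU pool; not verl's RayResourcePool."""
    process_on_nodes: List[int]   # GPUs requested on each node
    max_colocate_count: int       # how many worker groups may share the pool
    roles: List[str] = field(default_factory=list)

    @property
    def world_size(self) -> int:
        return sum(self.process_on_nodes)

    def assign(self, role: str) -> None:
        if len(self.roles) >= self.max_colocate_count:
            raise ValueError(f"pool already hosts {len(self.roles)} worker groups")
        self.roles.append(role)


def place(mapping: Dict[str, str],
          pools: Dict[str, ToyResourcePool]) -> Dict[str, int]:
    """Assign each role to its pool and report how many GPUs the role sees."""
    gpus: Dict[str, int] = {}
    for role, pool_name in mapping.items():
        pools[pool_name].assign(role)
        gpus[role] = pools[pool_name].world_size
    return gpus


# Colocated: actor and critic share one 16-GPU pool.
colocated = place(
    {"actor_rollout_ref": "global", "critic": "global"},
    {"global": ToyResourcePool([8, 8], max_colocate_count=3)},
)

# Separate: the critic gets its own, smaller pool.
separate = place(
    {"actor_rollout_ref": "actor_pool", "critic": "critic_pool"},
    {
        "actor_pool": ToyResourcePool([8, 8], max_colocate_count=1),
        "critic_pool": ToyResourcePool([4], max_colocate_count=1),
    },
)
```

In the separate case the critic runs on 4 GPUs while the actor keeps 16, which is the flexibility that separate pools buy at the cost of moving data between pools.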

Data Flow During a PPO Step

The following steps trace data through the system for a single PPO iteration:

Prompt Sampling

The trainer samples a batch of prompts from the dataloader and wraps them in a DataProto.

Rollout

actor_rollout_ref_wg.generate_sequences(prompts) dispatches prompts to the rollout workers. vLLM (or SGLang) generates responses using the current actor weights. The response tokens, along with prompt tokens, are returned as a DataProto.

Log-Probability Computation

actor_rollout_ref_wg.compute_log_prob(output) and compute_ref_log_prob(output) run forward passes to compute token-level log-probabilities under the current actor and the frozen reference policy, respectively.
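A token-level log-probability is the log-softmax of the model's logits at each position, evaluated at the token that was actually generated. The sketch below shows that computation on plain Python lists; the function names are hypothetical, and a real worker would compute this in batches on GPU tensors.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_log_probs(logits_per_step: List[List[float]],
                    token_ids: List[int]) -> List[float]:
    """Pick out the log-probability of the generated token at each position."""
    return [
        log_softmax(logits)[tok]
        for logits, tok in zip(logits_per_step, token_ids)
    ]


# Two positions over a vocabulary of size 2.
logits = [[0.0, 0.0], [0.0, math.log(3.0)]]
generated = [0, 1]
log_probs = token_log_probs(logits, generated)
```

Running the same function on logits from the actor and from the reference policy gives the two per-token sequences that later steps compare.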

Value Estimation

critic_wg.compute_values(output) runs the value function forward pass to produce per-token value estimates.

Reward Computation

The RewardManager (or reward_wg.compute_scores(output) for model-based rewards) computes a scalar reward for each response.
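For rule-based rewards, the scalar is often just a correctness check on the final answer. The following is a minimal sketch of such a rule, assuming responses end with a GSM8K-style `#### <number>` marker; `extract_final_answer`, `rule_based_reward`, and `score_batch` are hypothetical helpers written for this example, not functions from verl's reward utilities.

```python
import re
from typing import List, Optional


def extract_final_answer(response: str) -> Optional[str]:
    """Return the number after the last '####' marker, or None if absent."""
    matches = re.findall(r"####\s*(-?[\d,]*\.?\d+)", response)
    if not matches:
        return None
    return matches[-1].replace(",", "")  # drop thousands separators


def rule_based_reward(response: str, ground_truth: str) -> float:
    """1.0 for an exact match on the extracted answer, 0.0 otherwise."""
    answer = extract_final_answer(response)
    return 1.0 if answer is not None and answer == ground_truth else 0.0


def score_batch(responses: List[str], ground_truths: List[str]) -> List[float]:
    """One scalar reward per response, computed on the controller process."""
    return [rule_based_reward(r, g) for r, g in zip(responses, ground_truths)]


scores = score_batch(
    ["6 * 7 = 42 #### 42", "#### 1,234", "the answer is 42", "#### 7"],
    ["42", "1234", "42", "8"],
)
```

A rule like this needs no GPU, which is why it can run directly on the controller when no reward model is involved.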

Advantage Computation

compute_advantages(values, rewards) runs on the controller process (no GPU required) and computes per-token advantages using GAE.
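Generalized Advantage Estimation computes a temporal-difference residual delta_t = r_t + gamma * V(t+1) - V(t) at each token and accumulates it backward as A_t = delta_t + gamma * lambda * A_(t+1). The function below is a minimal single-sequence sketch of that recursion; `compute_gae` and its default arguments are assumptions made for this example, and it omits the response masks and advantage normalization that production implementations typically add.

```python
from typing import List, Tuple


def compute_gae(rewards: List[float], values: List[float],
                gamma: float = 1.0, lam: float = 1.0
                ) -> Tuple[List[float], List[float]]:
    """Per-token GAE advantages and returns for one response."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # The value after the final token is taken to be zero.
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# A sparse reward of 1.0 on the last token, with a constant value estimate.
adv_full, ret_full = compute_gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5])
adv_td, _ = compute_gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], lam=0.0)
```

With lambda set to 1 the final reward propagates to every token, while lambda set to 0 reduces each advantage to its one-step residual.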

Policy and Critic Update

actor_rollout_ref_wg.update_actor(output) and critic_wg.update_critic(output) run the backward pass and optimizer step on the respective workers.
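Put together, the steps above amount to a short sequence of worker-group calls issued from the controller. The sketch below replays that sequence with stub objects that only log which method was called; `StubWorkerGroup` and `ppo_step` are hypothetical, the reward and advantage values are placeholders, and the calls follow the order in which the steps are listed here rather than any particular verl release.

```python
from typing import Callable, Dict, List


class StubWorkerGroup:
    """Hypothetical stand-in for a worker group: logs calls, echoes the batch."""

    def __init__(self, name: str, log: List[str]) -> None:
        self.name = name
        self.log = log

    def __getattr__(self, method: str) -> Callable[[Dict], Dict]:
        # Only reached for attributes that were not set in __init__.
        def call(batch: Dict) -> Dict:
            self.log.append(f"{self.name}.{method}")
            return dict(batch)
        return call


def ppo_step(prompts: List[str], actor_wg: StubWorkerGroup,
             critic_wg: StubWorkerGroup, log: List[str]) -> Dict:
    """One PPO iteration as seen from the single-process controller."""
    batch: Dict = {"prompts": prompts}
    batch = actor_wg.generate_sequences(batch)      # rollout
    batch = actor_wg.compute_log_prob(batch)        # actor log-probs
    batch = actor_wg.compute_ref_log_prob(batch)    # reference log-probs
    batch = critic_wg.compute_values(batch)         # value estimates
    log.append("controller.reward")                 # rule-based reward, no GPU
    batch["rewards"] = [1.0 for _ in prompts]       # placeholder scores
    log.append("controller.advantage")              # advantage on controller
    batch["advantages"] = list(batch["rewards"])    # placeholder advantages
    actor_wg.update_actor(batch)                    # policy update
    critic_wg.update_critic(batch)                  # critic update
    return batch


call_log: List[str] = []
actor = StubWorkerGroup("actor_rollout_ref", call_log)
critic = StubWorkerGroup("critic", call_log)
final_batch = ppo_step(["p0", "p1"], actor, critic, call_log)
```

Because the controller is a single process, the whole iteration reads as ordinary sequential code, and the distributed work hides behind each worker-group call.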

Repository Organization

The key source files are organized as follows:

verl/
  trainer/
    main_ppo.py          # Entry point: initializes Ray, launches trainer
    ppo/
      ray_trainer.py     # PPORayTrainer: main RL loop
    sft_trainer.py       # SFT trainer (FSDP backend)
  config/
    ppo_trainer.yaml     # Full RL trainer configuration template
  workers/
    engine_workers.py    # ActorRolloutRefWorker, TrainingWorker
    engine/
      fsdp/              # FSDP / FSDP2 engine implementations
      megatron/          # Megatron-LM engine implementations
    rollout/
      vllm/              # vLLM rollout backend
      sglang_rollout/    # SGLang rollout backend
      hf_rollout.py      # HuggingFace Transformers rollout backend
    reward_manager/      # NaiveRewardManager, BatchRewardManager, etc.
  single_controller/
    ray/
      base.py            # RayWorkerGroup, RayResourcePool
    base/
      decorator.py       # @register, Dispatch modes
      worker_group.py    # WorkerGroup base class
  utils/
    reward_score/        # Rule-based reward functions (GSM8K, MATH, …)
    dataset/             # Dataset utilities for SFT / RM / RL
