Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/verl-project/verl/llms.txt

Use this file to discover all available pages before exploring further.

verl is built on a programming model called HybridFlow, introduced in the paper HybridFlow: A Flexible and Efficient RLHF Framework. The core insight is that reinforcement learning for LLMs is a two-level dataflow problem: a high-level control flow that orchestrates the RL algorithm, and a low-level computation flow that executes the neural network math. Keeping these two levels properly separated is what makes verl both flexible and efficient.

RL as a Two-Level Dataflow Problem

Dataflow is a natural way to think about computation. Neural network training is one example — the computation graph describes how tensors flow through operators during forward and backward passes. Reinforcement learning adds a second, higher layer on top:
WorkloadNodeEdge
Neural Network TrainingOperators (+, -, matmul, softmax, …)Tensor movement
Reinforcement LearningHigh-level operators (rollout, model forward, …)Data movement
In the LLM era, each RL “node” (such as rollout or policy update) is itself a multi-process distributed computation involving thousands of GPU cores. This creates the two-level structure:
  • Control flow — defines when and in what order high-level operators run (e.g., rollout → advantage computation → policy update). This encodes the core logic of the RL algorithm.
  • Computation flow — defines how each operator executes across processes (e.g., FSDP sharding, Megatron tensor parallelism, vLLM paged attention).

Design Choices at LLM Scale

Before the LLM era, RL models were small enough to run in a single process, making it natural to embed the computation flow directly inside the control flow. At LLM scale, the computation flow must be multi-process, which forces a choice:
Convert the control flow into a multi-process program and colocate it with the computation flow.Advantage: Achieves optimal performance under a fixed computation flow — communication overhead is minimized.Disadvantage: Computation code becomes tightly coupled to specific controller code. Switching from FSDP to Megatron, for example, requires rewriting both the training engine and the algorithm logic. Dynamic RL algorithms with flexible control flows also become significantly harder to implement.
verl adopts the hybrid approach: the RL algorithm runs on a single controller process, which drives distributed workers that handle the heavy computation.

The HybridFlow Execution Model

The controller runs as a single Python process. Workers (actor, rollout, critic, reward model) run on GPU nodes managed by Ray. When the controller needs computation done, it dispatches data to the workers, waits for results, and proceeds to the next step.
Controller (single process, CPU)

    ├──► ActorRollout WorkerGroup (GPUs)  →  generate_sequences

    ├──► Critic WorkerGroup (GPUs)        →  compute_values

    └──► Reward WorkerGroup (GPUs)        →  compute_scores
Because the controller is single-process, the entire PPO algorithm can be expressed as a plain Python for loop — no multi-process boilerplate, no manual rank checks, no gradient tapes spanning multiple programs:
for prompt in dataloader:
    output = actor_rollout_ref_wg.generate_sequences(prompt)
    old_log_prob = actor_rollout_ref_wg.compute_log_prob(output)
    ref_log_prob = actor_rollout_ref_wg.compute_ref_log_prob(output)
    values = critic_wg.compute_values(output)
    rewards = reward_wg.compute_scores(output)
    advantages = compute_advantages(values, rewards)  # runs on controller
    output = output.union(old_log_prob).union(ref_log_prob)
    output = output.union(values).union(rewards).union(advantages)
    actor_rollout_ref_wg.update_actor(output)
    critic_wg.update_critic(output)
The calls to actor_rollout_ref_wg.generate_sequences(prompt) look like local function calls, but under the hood they fan out to all GPU workers, execute in parallel, and return the concatenated result — all transparently.

Key Abstraction: WorkerGroup

A WorkerGroup is a group of Ray remote actors that execute SPMD (Single Program Multiple Data) functions. The WorkerGroup acts as a proxy: it exposes the same method signatures as the underlying worker class, but automatically handles:
  1. Dispatch — splitting the input DataProto into data-parallel chunks, one per worker.
  2. Execute — invoking the method on every worker in parallel via Ray remote calls.
  3. Collect — gathering and concatenating the results back into a single DataProto.
This dispatch/collect behavior is declared via the @register decorator on the worker method:
from verl.single_controller.base.decorator import register, Dispatch

class ActorRolloutRefWorker(Worker):
    @register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
    def generate_sequences(self, prompts: DataProto) -> DataProto:
        prompts = prompts.to(torch.cuda.current_device())
        # ... actual generation logic
Dispatch.DP_COMPUTE_PROTO tells the WorkerGroup to split the batch along the data-parallel dimension before dispatching and to concatenate outputs after collecting. Other dispatch modes (e.g., ONE_TO_ALL) broadcast the same data to every worker. On the controller side, the call is identical regardless of how many GPUs are in the group:
# Single-process controller view
output = actor_rollout_ref_wg.generate_sequences(data)
To use a distributed WorkerGroup instead of a local class, only the instantiation changes:
# Single-process (toy example)
rollout = Rollout()
rollout.generate_sequences(batch)

# Multi-process (verl WorkerGroup)
rollout = RayWorkerGroup(resource_pool=RayResourcePool([4]), ray_cls_with_init=RayClassWithInitArgs(Rollout))
rollout.generate_sequences(batch)

DataProto: The Data Transfer Protocol

All data passed between the controller and workers — and between worker methods — is wrapped in a DataProto. DataProto provides a standard interface that supports both tensor and non-tensor fields, and integrates with the WorkerGroup dispatch/collect machinery.
from verl.protocol import DataProto

# Build from a dict of tensors and non-tensors
data = DataProto.from_dict(
    tensors={
        "input_ids": input_ids,          # torch.Tensor
        "attention_mask": attention_mask, # torch.Tensor
    },
    non_tensors={
        "data_source": np.array(data_sources, dtype=object),  # np.ndarray
        "ground_truth": np.array(ground_truths, dtype=object),
    },
)

# Select a subset of keys
actor_input = data.select(
    batch_keys=["input_ids", "attention_mask"],
    non_tensor_batch_keys=["data_source"],
)

# Merge two DataProtos (must have same batch size)
output = rollout_output.union(advantage_data)
The batch field of a DataProto is a TensorDict — a dictionary of torch.Tensor objects that all share the same leading batch dimension. The non_tensor_batch field is a plain Python dict where every value is a numpy.ndarray of dtype=object. Non-tensor fields carry strings, lists, or other Python objects that cannot be stored as tensors.

DataProto API Summary

MethodDescription
DataProto.from_dict(tensors, non_tensors)Construct from separate tensor and non-tensor dicts
data.select(batch_keys, non_tensor_batch_keys)Return a view with only the specified keys
data.union(other)Merge two DataProto objects with the same batch size
data.chunk(n)Split into n equal-sized DataProto objects along dim 0
DataProto.concat(list_of_protos)Concatenate a list of DataProto objects along dim 0
data.to(device)Move the tensor batch to a device

Comparison: Control Paradigms

PropertySingle-Controller (verl)Multi-Controller (unified)
Algorithm code styleSingle-process PythonMulti-process (rank-aware)
Engine swap (FSDP → Megatron)Config change onlyRewrite required
New algorithm implementationWrite a new for loopSignificant changes across ranks
Data transfer overheadPer-step controller↔worker round tripMinimized (colocated)
Debugging / inspectionStandard Python debuggerRequires distributed tooling
Flexible resource placementYes — remap WorkerGroup to ResourcePoolHarder to change
Because the controller is single-process, you can freely mix Python control flow (if, while, branching on reward thresholds) into the RL loop without any concern about multi-process synchronization. This makes implementing novel RL algorithms such as iterative self-play or multi-agent setups straightforward.

Build docs developers (and LLMs) love