HybridFlow: verl's RL Programming Model

verl is built on a programming model called HybridFlow, introduced in the paper HybridFlow: A Flexible and Efficient RLHF Framework. The core insight is that reinforcement learning for LLMs is a two-level dataflow problem: a high-level control flow that orchestrates the RL algorithm, and a low-level computation flow that executes the neural network math. Keeping these two levels properly separated is what makes verl both flexible and efficient.

RL as a Two-Level Dataflow Problem

Dataflow is a natural way to think about computation. Neural network training is one example — the computation graph describes how tensors flow through operators during forward and backward passes. Reinforcement learning adds a second, higher layer on top:

Workload	Node	Edge
Neural Network Training	Operators (`+`, `-`, `matmul`, `softmax`, …)	Tensor movement
Reinforcement Learning	High-level operators (rollout, model forward, …)	Data movement

In the LLM era, each RL “node” (such as rollout or policy update) is itself a multi-process distributed computation involving thousands of GPU cores. This creates the two-level structure:

Control flow — defines when and in what order high-level operators run (e.g., rollout → advantage computation → policy update). This encodes the core logic of the RL algorithm.
Computation flow — defines how each operator executes across processes (e.g., FSDP sharding, Megatron tensor parallelism, vLLM paged attention).

Design Choices at LLM Scale

Before the LLM era, RL models were small enough to run in a single process, making it natural to embed the computation flow directly inside the control flow. At LLM scale, the computation flow must be multi-process, which forces a choice:

Unified Multi-Controller
HybridFlow (verl)

Convert the control flow into a multi-process program and colocate it with the computation flow.Advantage: Achieves optimal performance under a fixed computation flow — communication overhead is minimized.Disadvantage: Computation code becomes tightly coupled to specific controller code. Switching from FSDP to Megatron, for example, requires rewriting both the training engine and the algorithm logic. Dynamic RL algorithms with flexible control flows also become significantly harder to implement.

verl adopts the hybrid approach: the RL algorithm runs on a single controller process, which drives distributed workers that handle the heavy computation.

The HybridFlow Execution Model

The controller runs as a single Python process. Workers (actor, rollout, critic, reward model) run on GPU nodes managed by Ray. When the controller needs computation done, it dispatches data to the workers, waits for results, and proceeds to the next step.

Controller (single process, CPU)
    │
    ├──► ActorRollout WorkerGroup (GPUs)  →  generate_sequences
    │
    ├──► Critic WorkerGroup (GPUs)        →  compute_values
    │
    └──► Reward WorkerGroup (GPUs)        →  compute_scores

Because the controller is single-process, the entire PPO algorithm can be expressed as a plain Python for loop — no multi-process boilerplate, no manual rank checks, no gradient tapes spanning multiple programs:

for prompt in dataloader:
    output = actor_rollout_ref_wg.generate_sequences(prompt)
    old_log_prob = actor_rollout_ref_wg.compute_log_prob(output)
    ref_log_prob = actor_rollout_ref_wg.compute_ref_log_prob(output)
    values = critic_wg.compute_values(output)
    rewards = reward_wg.compute_scores(output)
    advantages = compute_advantages(values, rewards)  # runs on controller
    output = output.union(old_log_prob).union(ref_log_prob)
    output = output.union(values).union(rewards).union(advantages)
    actor_rollout_ref_wg.update_actor(output)
    critic_wg.update_critic(output)

The calls to actor_rollout_ref_wg.generate_sequences(prompt) look like local function calls, but under the hood they fan out to all GPU workers, execute in parallel, and return the concatenated result — all transparently.

Key Abstraction: WorkerGroup

A WorkerGroup is a group of Ray remote actors that execute SPMD (Single Program Multiple Data) functions. The WorkerGroup acts as a proxy: it exposes the same method signatures as the underlying worker class, but automatically handles:

Dispatch — splitting the input DataProto into data-parallel chunks, one per worker.
Execute — invoking the method on every worker in parallel via Ray remote calls.
Collect — gathering and concatenating the results back into a single DataProto.

This dispatch/collect behavior is declared via the @register decorator on the worker method:

from verl.single_controller.base.decorator import register, Dispatch

class ActorRolloutRefWorker(Worker):
    @register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
    def generate_sequences(self, prompts: DataProto) -> DataProto:
        prompts = prompts.to(torch.cuda.current_device())
        # ... actual generation logic

Dispatch.DP_COMPUTE_PROTO tells the WorkerGroup to split the batch along the data-parallel dimension before dispatching and to concatenate outputs after collecting. Other dispatch modes (e.g., ONE_TO_ALL) broadcast the same data to every worker. On the controller side, the call is identical regardless of how many GPUs are in the group:

# Single-process controller view
output = actor_rollout_ref_wg.generate_sequences(data)

To use a distributed WorkerGroup instead of a local class, only the instantiation changes:

# Single-process (toy example)
rollout = Rollout()
rollout.generate_sequences(batch)

# Multi-process (verl WorkerGroup)
rollout = RayWorkerGroup(resource_pool=RayResourcePool([4]), ray_cls_with_init=RayClassWithInitArgs(Rollout))
rollout.generate_sequences(batch)

DataProto: The Data Transfer Protocol

All data passed between the controller and workers — and between worker methods — is wrapped in a DataProto. DataProto provides a standard interface that supports both tensor and non-tensor fields, and integrates with the WorkerGroup dispatch/collect machinery.

from verl.protocol import DataProto

# Build from a dict of tensors and non-tensors
data = DataProto.from_dict(
    tensors={
        "input_ids": input_ids,          # torch.Tensor
        "attention_mask": attention_mask, # torch.Tensor
    },
    non_tensors={
        "data_source": np.array(data_sources, dtype=object),  # np.ndarray
        "ground_truth": np.array(ground_truths, dtype=object),
    },
)

# Select a subset of keys
actor_input = data.select(
    batch_keys=["input_ids", "attention_mask"],
    non_tensor_batch_keys=["data_source"],
)

# Merge two DataProtos (must have same batch size)
output = rollout_output.union(advantage_data)

The batch field of a DataProto is a TensorDict — a dictionary of torch.Tensor objects that all share the same leading batch dimension. The non_tensor_batch field is a plain Python dict where every value is a numpy.ndarray of dtype=object. Non-tensor fields carry strings, lists, or other Python objects that cannot be stored as tensors.

DataProto API Summary

Method	Description
`DataProto.from_dict(tensors, non_tensors)`	Construct from separate tensor and non-tensor dicts
`data.select(batch_keys, non_tensor_batch_keys)`	Return a view with only the specified keys
`data.union(other)`	Merge two `DataProto` objects with the same batch size
`data.chunk(n)`	Split into `n` equal-sized `DataProto` objects along dim 0
`DataProto.concat(list_of_protos)`	Concatenate a list of `DataProto` objects along dim 0
`data.to(device)`	Move the tensor batch to a device

Comparison: Control Paradigms

Property	Single-Controller (verl)	Multi-Controller (unified)
Algorithm code style	Single-process Python	Multi-process (rank-aware)
Engine swap (FSDP → Megatron)	Config change only	Rewrite required
New algorithm implementation	Write a new `for` loop	Significant changes across ranks
Data transfer overhead	Per-step controller↔worker round trip	Minimized (colocated)
Debugging / inspection	Standard Python debugger	Requires distributed tooling
Flexible resource placement	Yes — remap `WorkerGroup` to `ResourcePool`	Harder to change

Because the controller is single-process, you can freely mix Python control flow (if, while, branching on reward thresholds) into the RL loop without any concern about multi-process synchronization. This makes implementing novel RL algorithms such as iterative self-play or multi-agent setups straightforward.

Get Started

Core Concepts

Algorithms

Workers & Engines

Advanced Usage

Configuration & Reference

HybridFlow: verl's RL Programming Model

RL as a Two-Level Dataflow Problem

Design Choices at LLM Scale

The HybridFlow Execution Model

Key Abstraction: WorkerGroup

DataProto: The Data Transfer Protocol

DataProto API Summary

Comparison: Control Paradigms

Build docs developers (and LLMs) love

Get Started

Core Concepts

Algorithms

Workers & Engines

Advanced Usage

Configuration & Reference

Documentation Index

​RL as a Two-Level Dataflow Problem

​Design Choices at LLM Scale

​The HybridFlow Execution Model

​Key Abstraction: WorkerGroup

​DataProto: The Data Transfer Protocol

​DataProto API Summary

​Comparison: Control Paradigms

Build docs developers (and LLMs) love

RL as a Two-Level Dataflow Problem

Design Choices at LLM Scale

The HybridFlow Execution Model

Key Abstraction: WorkerGroup

DataProto: The Data Transfer Protocol

DataProto API Summary

Comparison: Control Paradigms