verl is organized as a layered system that separates concerns cleanly: algorithm logic lives in a single-process trainer, distributed computation runs inside worker groups, and the underlying model engines and rollout backends can be swapped out through configuration. Understanding how these layers fit together helps you know where to look when customizing a training run, debugging unexpected behavior, or adding support for a new model architecture.
Architectural Layers
The system stacks four layers from top to bottom:
Component Overview
PPORayTrainer
Single-process orchestrator. Manages the main RL training loop, data loading, and calls to worker group APIs. Runs entirely on CPU.
WorkerGroup
A group of Ray remote actors exposing a unified SPMD interface. Handles data dispatch and result collection on behalf of the trainer.
ActorRolloutRefWorker
Colocates the actor model, rollout engine, and reference policy on the same GPUs. Exposes generate_sequences, compute_log_prob, and compute_ref_log_prob.
TrainingWorker
Generic worker that pairs a training engine with an optimizer. Used for the critic, SFT training, reward model training, and similar roles.
PPORayTrainer
The trainer is the entry point for all algorithm logic. Its fit() method runs as a single process, with no rank checks and no barrier synchronizations. It is responsible for:
- Constructing WorkerGroup instances and binding them to resource pools.
- Running the PPO main loop: rollout → reward → advantage → actor update → critic update.
- Managing the dataloader and forwarding batches to workers.
- Saving checkpoints at configurable intervals.
The trainer is implemented in verl/trainer/ppo/ray_trainer.py. The entry point that initializes Ray and launches the trainer is verl/trainer/main_ppo.py.
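The shape of the single-process loop can be sketched in a few lines. This is a toy illustration, not the code in ray_trainer.py: StubWorkerGroup, the trace list, and the save_freq parameter are hypothetical, and update_actor is an assumed method name. It shows only the ordering of calls (rollout → reward → advantage → actor update → critic update) and a periodic checkpoint.

```python
class StubWorkerGroup:
    """Hypothetical stand-in for a WorkerGroup: records each call in a shared trace."""

    def __init__(self, trace):
        self.trace = trace

    def _record(self, name, batch):
        self.trace.append(name)
        return batch

    def generate_sequences(self, batch):
        return self._record("rollout", batch)

    def update_actor(self, batch):  # assumed method name, for illustration only
        return self._record("actor_update", batch)

    def update_critic(self, batch):
        return self._record("critic_update", batch)


def fit(dataloader, actor_rollout_ref_wg, critic_wg, trace, save_freq=2):
    """Driver loop: plain sequential Python with no rank checks or barriers."""
    for step, prompts in enumerate(dataloader, start=1):
        batch = actor_rollout_ref_wg.generate_sequences(prompts)
        trace.append("reward")      # placeholder for reward computation
        trace.append("advantage")   # placeholder for advantage estimation
        actor_rollout_ref_wg.update_actor(batch)
        critic_wg.update_critic(batch)
        if step % save_freq == 0:   # checkpoint at a configurable interval
            trace.append(f"checkpoint@{step}")


trace = []
fit([["p1"], ["p2"]], StubWorkerGroup(trace), StubWorkerGroup(trace), trace)
print(trace)
```

Because the loop is ordinary sequential code, the algorithm can be read top to bottom; all parallelism is hidden behind the worker group calls.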
The main_task entry function (and the trainer’s fit()) must run as a single process. It is best not to schedule this process on the Ray head node because it holds all intermediate RL data in memory and can be memory-intensive.
WorkerGroup and Worker Construction
Each WorkerGroup manages a list of Ray remote workers. During initialization, two things happen:
- Worker actors are created: each worker is a Ray remote class instantiated on a specific GPU within a RayResourcePool.
- Methods are bound: every method decorated with @register on the underlying worker class is dynamically bound to the WorkerGroup. Calling worker_group.generate_sequences(data) automatically dispatches the data, executes the method in parallel, and collects the results.
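The binding step can be illustrated with a toy model in plain Python. Everything here is hypothetical (the bare register decorator, ToyWorker, ToyWorkerGroup, and the contiguous-chunk dispatch rule); verl's real decorator and dispatch modes are more elaborate, and real workers run in parallel as Ray actors rather than in a local loop.

```python
def register(func):
    """Toy marker decorator: flags a worker method for exposure on the group."""
    func._registered = True
    return func


class ToyWorker:
    def __init__(self, rank):
        self.rank = rank

    @register
    def generate_sequences(self, prompts):
        return [f"{p}->resp(rank{self.rank})" for p in prompts]


class ToyWorkerGroup:
    def __init__(self, worker_cls, world_size):
        self.workers = [worker_cls(rank) for rank in range(world_size)]
        # Bind every registered method of the worker class onto the group.
        for name, attr in vars(worker_cls).items():
            if getattr(attr, "_registered", False):
                setattr(self, name, self._make_dispatcher(name))

    def _make_dispatcher(self, name):
        def dispatch(data):
            n = len(self.workers)
            chunk = (len(data) + n - 1) // n
            # Dispatch: split the batch into one contiguous shard per worker.
            shards = [data[i * chunk:(i + 1) * chunk] for i in range(n)]
            # Execute: sequential here; a real group would run workers in parallel.
            outputs = [getattr(w, name)(s) for w, s in zip(self.workers, shards)]
            # Collect: concatenate per-worker results in rank order.
            return [item for out in outputs for item in out]
        return dispatch


group = ToyWorkerGroup(ToyWorker, world_size=2)
result = group.generate_sequences(["a", "b", "c", "d"])
print(result)
```

The caller sees a single method call on the group; sharding and result collection stay out of the algorithm code.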
WorkerGroup Roles in PPO
verl’s PPO implementation defines three worker groups by default:
ActorRolloutRef WorkerGroup
Colocates the actor model, rollout engine, and reference policy on the same set of GPUs. This colocation enables fast weight transfer between the training-format and inference-format representations using NCCL, avoiding expensive CPU round-trips. In LoRA setups, the reference policy is simply the frozen base model, so colocation is especially efficient.
Critic WorkerGroup
Manages the value function (critic) model. Exposes compute_values and update_critic. Can be assigned to the same GPU pool as the actor group or to a separate pool, controlled via resource pool configuration.
BaseEngine: Training Engine Interface
BaseEngine is the abstract interface that all training engine implementations satisfy. Concrete engines handle model construction, forward/backward passes, optimizer steps, gradient clipping, and checkpoint saving. Three production-grade engines ship with verl:
| Engine | Backend | Notes |
|---|---|---|
| FSDP | PyTorch Fully Sharded Data Parallel | Default; works with any HuggingFace model |
| FSDP2 | PyTorch FSDP v2 | Improved memory efficiency and composability |
| Megatron-LM | NVIDIA Megatron tensor/pipeline parallelism | Required for very large models or pipeline stages |
Engine implementations live under verl/workers/engine/.
BaseRollout: Rollout Backend Interface
BaseRollout is the abstract interface for sequence generation backends. Swapping backends requires only a config change; the worker code is unaffected.
| Backend | Module | Notes |
|---|---|---|
| vLLM | verl/workers/rollout/vllm/vllm_rollout.py | Default; continuous batching, paged KV cache |
| SGLang | verl/workers/rollout/sglang_rollout/ | Alternative high-throughput backend |
| HuggingFace Transformers | verl/workers/rollout/hf_rollout.py | Reference/debugging backend |
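One common way to get config-only backend swapping is a name-to-class registry behind an abstract interface. The sketch below assumes that pattern for illustration; ROLLOUT_REGISTRY, register_rollout, ToyRollout, the two toy backends, and the config layout are all hypothetical and do not reflect the actual BaseRollout signature.

```python
from abc import ABC, abstractmethod

ROLLOUT_REGISTRY = {}  # hypothetical: maps a config name to a backend class


def register_rollout(name):
    """Class decorator that records a backend under a config-visible name."""
    def decorator(cls):
        ROLLOUT_REGISTRY[name] = cls
        return cls
    return decorator


class ToyRollout(ABC):
    """Illustrative interface only; not verl's BaseRollout."""

    @abstractmethod
    def generate_sequences(self, prompts):
        ...


@register_rollout("echo")
class EchoRollout(ToyRollout):
    def generate_sequences(self, prompts):
        return [p + " " + p for p in prompts]


@register_rollout("upper")
class UpperRollout(ToyRollout):
    def generate_sequences(self, prompts):
        return [p.upper() for p in prompts]


def worker_step(config, prompts):
    # The worker code is identical for every backend; only the config differs.
    backend = ROLLOUT_REGISTRY[config["rollout"]["name"]]()
    return backend.generate_sequences(prompts)


echo_out = worker_step({"rollout": {"name": "echo"}}, ["hi"])
upper_out = worker_step({"rollout": {"name": "upper"}}, ["hi"])
print(echo_out, upper_out)
```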
The 3D-HybridEngine: Resharding Between Training and Inference
One of the key efficiency mechanisms in verl is the 3D-HybridEngine, which manages the transition between the training-format and inference-format weight representations of the actor model within the same GPU pool. During training, the actor model weights are sharded using FSDP (data parallelism) or Megatron (tensor + pipeline parallelism). During inference/rollout, the same weights need to be arranged in the tensor-parallel layout that vLLM or SGLang expects. Rather than transferring weights to a separate GPU pool over the network, the 3D-HybridEngine reshards the weights in place using NCCL collective operations, transforming from the training layout to the inference layout and back again between steps. This is why actor and rollout are colocated on the same GPUs in ActorRolloutRefWorker.
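To make the layout change concrete, the toy below moves one small weight matrix from a flat, evenly chunked layout (FSDP-like) to column slices (tensor-parallel-like) and back. It is a sketch under simplifying assumptions: plain Python lists stand in for GPU tensors, all_gather is a local stand-in for a collective, and every function name is hypothetical. The actual 3D-HybridEngine need not gather full tensors on every rank.

```python
def fsdp_shard(flat, world_size):
    """Split a flat parameter into equal contiguous chunks, one per rank."""
    size = len(flat) // world_size
    return [flat[r * size:(r + 1) * size] for r in range(world_size)]


def all_gather(shards):
    """Local stand-in for a collective all-gather: concatenate every rank's chunk."""
    return [x for shard in shards for x in shard]


def tp_column_shard(flat, rows, cols, tp_size):
    """Return each tensor-parallel rank's column slice of the rows x cols matrix."""
    width = cols // tp_size
    matrix = [flat[r * cols:(r + 1) * cols] for r in range(rows)]
    return [[row[t * width:(t + 1) * width] for row in matrix] for t in range(tp_size)]


def tp_to_flat(tp_shards):
    """Inverse of tp_column_shard: stitch column slices back into row-major order."""
    rows = len(tp_shards[0])
    return [x for r in range(rows) for shard in tp_shards for x in shard[r]]


weight = list(range(8))                 # a 2 x 4 matrix stored row-major
train_shards = fsdp_shard(weight, 4)    # training layout across 4 ranks
infer_shards = tp_column_shard(all_gather(train_shards), rows=2, cols=4, tp_size=2)
restored = fsdp_shard(tp_to_flat(infer_shards), 4)  # back to the training layout
print(train_shards, infer_shards, restored)
```

The same values are present in both layouts; only the ownership of each slice changes, which is why the transition can stay on the same set of GPUs.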
Resource Placement: RayResourcePool
RayResourcePool maps a set of GPUs in the Ray cluster to a WorkerGroup. You can assign multiple worker groups to the same pool (colocated) or to separate pools.
Placement is configured through trainer.n_gpus_per_node, trainer.nnodes, and the actor_rollout_ref.* / critic.* / reward.* settings.
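A minimal sketch of how pool sizing and role placement might be represented, assuming a pool is described by a list of per-node GPU counts and each role names the pool it runs in. The helper names, the pool ids, and the role strings are illustrative assumptions, not verl's configuration schema.

```python
def build_pool_spec(n_gpus_per_node, nnodes, pool_id="global_pool"):
    """Describe one pool as a list of GPU counts, one entry per node."""
    return {pool_id: [n_gpus_per_node] * nnodes}


def roles_by_pool(role_to_pool):
    """Group roles by the pool they are assigned to; shared pools mean colocation."""
    groups = {}
    for role, pool in role_to_pool.items():
        groups.setdefault(pool, []).append(role)
    return groups


pool_spec = build_pool_spec(n_gpus_per_node=8, nnodes=2)

# Colocated: both worker groups share the same GPUs.
colocated = roles_by_pool({"actor_rollout_ref": "global_pool", "critic": "global_pool"})

# Separate: the critic gets its own (hypothetical) pool.
separate = roles_by_pool({"actor_rollout_ref": "global_pool", "critic": "critic_pool"})

print(pool_spec, colocated, separate)
```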
Data Flow During a PPO Step
The following steps trace data through the system for a single PPO iteration:
Prompt Sampling
The trainer samples a batch of prompts from the dataloader and wraps them in a DataProto.
Rollout
actor_rollout_ref_wg.generate_sequences(prompts) dispatches prompts to the rollout workers. vLLM (or SGLang) generates responses using the current actor weights. The response tokens, along with prompt tokens, are returned as a DataProto.
Log-Probability Computation
actor_rollout_ref_wg.compute_log_prob(output) and compute_ref_log_prob(output) run forward passes to compute token-level log-probabilities under the current actor and the frozen reference policy, respectively.
Value Estimation
critic_wg.compute_values(output) runs the value function forward pass to produce per-token value estimates.
Reward Computation
The RewardManager (or reward_wg.compute_scores(output) for model-based rewards) computes a scalar reward for each response.
Advantage Computation
compute_advantages(values, rewards) runs on the controller process (no GPU required) and computes per-token advantages using GAE.
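For reference, GAE can be written as a short backward recursion over one response. This is a generic textbook sketch on plain Python lists, not verl's compute_advantages; the function name compute_gae, the gamma and lam defaults, and the zero bootstrap value after the final token are assumptions made for the example.

```python
def compute_gae(rewards, values, gamma=1.0, lam=1.0):
    """Generalized Advantage Estimation for a single sequence.

    rewards[t] and values[t] are per-token; the value after the last token
    is taken to be 0 (the episode ends with the response).
    """
    advantages = [0.0] * len(rewards)
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        last_adv = delta + gamma * lam * last_adv            # discounted sum of residuals
        advantages[t] = last_adv
    returns = [a + v for a, v in zip(advantages, values)]    # targets for the critic
    return advantages, returns


# A scalar reward of 1.0 placed on the final token; the critic predicts 0.5 everywhere.
advantages, returns = compute_gae(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.5, 0.5])
print(advantages, returns)
```

With gamma and lam both set to 1, each advantage reduces to the remaining reward minus the value estimate, so every token here gets 0.5. The computation uses only small per-token arrays, which is consistent with running it on the controller process.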