verl’s model engine layer provides a clean abstraction between the training logic and the underlying distributed parallelism strategy. Whether a worker uses FSDP, FSDP2, or Megatron-LM, the training code above it stays the same — it calls the same train_batch, infer_batch, save_checkpoint, and load_checkpoint APIs regardless of which backend is active. This engine-agnostic design is what lets verl support a wide range of models and scales from a single unified training loop.
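To illustrate the engine-agnostic idea, here is a minimal sketch of a training loop that depends only on an abstract interface. The four method names come from the paragraph above; the class names (EngineAPI, ToyEngine), the batch format, and all signatures are assumptions made for this example, not verl's actual definitions.

```python
from abc import ABC, abstractmethod


class EngineAPI(ABC):
    """Hypothetical interface; the real verl signatures may differ."""

    @abstractmethod
    def train_batch(self, batch: dict) -> dict: ...

    @abstractmethod
    def infer_batch(self, batch: dict) -> dict: ...

    @abstractmethod
    def save_checkpoint(self, path: str) -> None: ...

    @abstractmethod
    def load_checkpoint(self, path: str) -> None: ...


class ToyEngine(EngineAPI):
    """Toy backend that fits one scalar weight to the mean of each batch."""

    def __init__(self, lr: float = 0.5):
        self.weight = 0.0
        self.lr = lr
        self._checkpoints = {}  # in-memory stand-in for files on disk

    def train_batch(self, batch: dict) -> dict:
        target = sum(batch["x"]) / len(batch["x"])
        loss = (self.weight - target) ** 2
        self.weight -= self.lr * 2 * (self.weight - target)  # one SGD step
        return {"loss": loss}

    def infer_batch(self, batch: dict) -> dict:
        return {"pred": [self.weight for _ in batch["x"]]}

    def save_checkpoint(self, path: str) -> None:
        self._checkpoints[path] = self.weight

    def load_checkpoint(self, path: str) -> None:
        self.weight = self._checkpoints[path]


def run_training(engine: EngineAPI, batches: list) -> list:
    """Training loop that never looks at the concrete backend."""
    return [engine.train_batch(batch)["loss"] for batch in batches]
```

Swapping ToyEngine for another EngineAPI subclass would leave run_training unchanged, which is the property the engine layer is described as providing.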

Backend Support Matrix

Each backend makes different trade-offs between ease of use, model coverage, and raw scalability. The table below summarizes the current state:
Backend | Model Support | Scalability | Model Definition | Notes
FSDP + Ulysses | Any HuggingFace model from day one | Dense models: good. MoE models: poor | HuggingFace + monkey patch | Monkey patches can be impacted by transformers version upgrades
MCore (Megatron) | Limited set of models | Best: full 3D parallelism | GPTModel (one model definition for all) | Supporting new models requires significant effort
verl monkey-patches the attention function in HuggingFace models to add DeepSpeed Ulysses sequence parallelism support. VLMs also receive a monkey patch that enables FSDP to handle mixed batches containing both image and text-only samples.
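As a rough picture of what Ulysses-style sequence parallelism changes around attention, the sketch below tracks only the per-rank tensor shapes. It follows the general DeepSpeed Ulysses scheme (an all-to-all that trades a sequence shard for a head shard), not verl's patch code; the function name is hypothetical, and padding and grouped-query attention are ignored.

```python
def ulysses_shapes(seq_len: int, num_heads: int, sp_size: int) -> dict:
    """Per-rank (tokens, heads) outside and inside attention, shapes only."""
    if seq_len % sp_size or num_heads % sp_size:
        raise ValueError("seq_len and num_heads must be divisible by sp_size")
    return {
        # Outside attention: each rank holds a slice of the sequence, all heads.
        "outside_attention": (seq_len // sp_size, num_heads),
        # After the all-to-all: each rank holds the full sequence, a slice of heads.
        "inside_attention": (seq_len, num_heads // sp_size),
    }
```

For example, with an 8192-token sequence, 32 heads, and a sequence-parallel size of 4, each rank would hold 2048 tokens outside attention and 8 heads over all 8192 tokens inside it.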

Class Hierarchy

All workers and trainers in verl run in SPMD mode — every GPU rank executes the same code. The SFT/DPO/RM trainers are invoked directly via torchrun; the Actor/Critic workers are wrapped by a RayWorkerGroup and expose their APIs as RPCs to the single controller. The engine stack has three levels:
  1. Base Engine Level

     The base engine (BaseEngine) implements the foundational infrastructure that every backend shares: model initialization from a HuggingFace config, optimizer construction, learning-rate scheduler setup, weight sharding, and checkpoint management. The base engine does not implement a forward pass; that is left to the next level.

     The full API reference lives at verl/workers/engine/base.py.

  2. Full Engine Level

     Full engine classes (e.g., FSDPEngineWithLMHead, MegatronEngineWithLMHead) subclass the base engine and implement forward_step, the function that actually runs the model forward pass and computes the loss. There are two model types:

     Model type | Input | Output
     Language model | text / image / video / audio tokens | logits for next-token prediction
     Value model | text / image / video / audio tokens | scalar value estimate

  3. Worker / SPMD Trainer Level

     The worker layer (TrainingWorker, ActorRolloutRefWorker) is engine-agnostic. It implements training logic (PPO epochs, gradient accumulation, actor-rollout weight sync) using only the abstract engine APIs. The concrete backend is injected at construction time via EngineRegistry.
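The three levels can be pictured with a toy hierarchy. This is a sketch under stated assumptions: every class name here is hypothetical, the "model" is a short list of floats, the loss computation is omitted, and returning one value per token from the value head is an assumption (the source only says "scalar value estimate").

```python
class ToyBaseEngine:
    """Base level: shared setup only, no forward pass."""

    def __init__(self, weights: list):
        self.weights = list(weights)

    def forward_step(self, tokens: list):
        raise NotImplementedError("implemented by full engine classes")


class ToyEngineWithLMHead(ToyBaseEngine):
    """Full level, language model: one score per vocabulary entry per token."""

    def forward_step(self, tokens: list) -> list:
        return [[w * t for w in self.weights] for t in tokens]


class ToyEngineWithValueHead(ToyBaseEngine):
    """Full level, value model: one scalar per token (assumed)."""

    def forward_step(self, tokens: list) -> list:
        return [sum(self.weights) * t for t in tokens]


class ToyWorker:
    """Worker level: uses whichever engine it was constructed with."""

    def __init__(self, engine: ToyBaseEngine):
        self.engine = engine

    def run(self, tokens: list) -> list:
        return self.engine.forward_step(tokens)
```

The worker never checks which subclass it holds; only the output shape differs between the two model types.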

Backend Details

The FSDP backend wraps any HuggingFace model with PyTorch Fully Sharded Data Parallel, making it easy to add new models. FSDP2 is the next-generation implementation with improved memory management.
Key features:
  • Works out of the box with any transformers model
  • Sequence parallelism via DeepSpeed Ulysses (monkey-patched attention)
  • dtensor_weight_loader for efficient weight synchronization to the rollout engine (vLLM / SGLang) without redundant copies
  • Parameter and optimizer offload to CPU for memory-constrained setups
Configuration:
actor_rollout_ref:
  actor:
    strategy: fsdp      # or fsdp2
  ref:
    strategy: fsdp
critic:
  strategy: fsdp
Trade-offs:
  • Monkey patches are sensitive to transformers version changes
  • MoE models exhibit poor FSDP scaling due to load imbalance across expert shards

3D-HybridEngine and Weight Synchronization

When actor training and rollout generation are colocated on the same GPUs, verl must reshard model weights from the training layout (e.g., FSDP shards across all ranks) into the inference layout (e.g., vLLM tensor-parallel groups) between training steps. This resharding mechanism is the 3D-HybridEngine. Rather than writing the full weight tensor to host memory and re-reading it, the hybrid engine exports per-tensor parameters directly via engine.get_per_tensor_param() and passes them to the rollout engine’s update_weights() call in-process. For FSDP, this avoids any redundant copy; the parameters are gathered once and streamed directly into the vLLM/SGLang weight buffers.
# Inside ActorRolloutRefWorker.update_weights (naive/colocated sync)
per_tensor_param, peft_config = self.actor.engine.get_per_tensor_param(
    layered_summon=self.layered_summon,
    base_sync_done=True,
)
await self.rollout.update_weights(
    per_tensor_param,
    peft_config=peft_config,
    base_sync_done=True,
    global_steps=global_steps,
)
For disaggregated async training (trainer and rollout on separate node pools), weights are transferred via the CheckpointEngine.send_weights path instead.
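The per-tensor streaming idea in the colocated path can be sketched without any GPU code. In this toy version, plain lists stand in for tensors and the two function names are hypothetical; it only illustrates gathering one parameter at a time and copying it into a preallocated buffer, not how verl or the rollout engines implement it.

```python
from typing import Iterator


def iter_per_tensor_params(shards_by_rank: list) -> Iterator[tuple]:
    """Yield (name, full_tensor) lazily, gathering one tensor at a time."""
    for name in shards_by_rank[0]:
        full = []
        for rank_shards in shards_by_rank:
            full.extend(rank_shards[name])  # concatenate this tensor's shards
        yield name, full


def update_weights(buffers: dict, params: Iterator[tuple]) -> int:
    """Copy each streamed tensor into its existing buffer; return the count."""
    updated = 0
    for name, tensor in params:
        buffers[name][:] = tensor  # in-place write, buffer object is reused
        updated += 1
    return updated
```

Because the producer is a generator, only one gathered tensor exists at a time, so the full model is never materialized in one piece on the sending side.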

Checkpoint System

Each engine is responsible for saving and loading the complete training state — model weights, optimizer state, and LR scheduler state. The checkpointing flow has two layers:
  1. Intermediate sharded checkpoints — saved by the engine in its native sharded format (FSDP shards, Megatron tensor-parallel slices, etc.) at each save_freq step. These are fast to write and read.
  2. HuggingFace export — engines that use HuggingFace model definitions can merge shards back to HF format using transformers utilities. Engines based on custom model definitions (e.g., Megatron) must use a dedicated merge script (e.g., mbridge).
The engine constructs the model from a HuggingFace config, loads weights from a HuggingFace checkpoint, and then hands off to the sharding strategy.
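The two checkpoint layers can be sketched as follows. This is an illustration only: the JSON files, the rank_{r}.json naming, and the function names are assumptions for the example, and real engines write their own native sharded formats.

```python
import json
import tempfile
from pathlib import Path


def save_sharded(ckpt_dir: Path, rank: int, state: dict) -> Path:
    """Layer 1: each rank writes only its own shard of the training state."""
    path = ckpt_dir / f"rank_{rank}.json"
    path.write_text(json.dumps(state))
    return path


def load_sharded(ckpt_dir: Path, rank: int) -> dict:
    return json.loads((ckpt_dir / f"rank_{rank}.json").read_text())


def export_full_weights(ckpt_dir: Path, world_size: int) -> dict:
    """Layer 2: merge model shards from every rank into unsharded tensors."""
    merged = {}
    for rank in range(world_size):
        for name, shard in load_sharded(ckpt_dir, rank)["model"].items():
            merged.setdefault(name, []).extend(shard)
    return merged


def demo() -> dict:
    """Save two rank shards to a temporary directory, then merge them."""
    with tempfile.TemporaryDirectory() as tmp:
        ckpt_dir = Path(tmp)
        for rank, shard in enumerate([[1.0, 2.0], [3.0, 4.0]]):
            state = {
                "model": {"w": shard},
                "optimizer": {"step": 10},
                "lr_scheduler": {"last_lr": 1e-5},
            }
            save_sharded(ckpt_dir, rank, state)
        return export_full_weights(ckpt_dir, world_size=2)
```

Note that the export step reads only the model entries; optimizer and scheduler state stay in the sharded checkpoint, since they are needed for resuming training rather than for the merged weights.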

EngineRegistry Dispatch Table

The EngineRegistry selects the concrete engine class from the (model_type, backend, device) triple specified in your Hydra config:
model_type | backend | device | Engine class
language_model | fsdp / fsdp2 | cuda / npu | FSDPEngineWithLMHead
language_model | megatron | cuda | MegatronEngineWithLMHead
language_model | megatron | npu | MindspeedEngineWithLMHead
language_model | mindspeed_megatron | npu | MindSpeedMegatronEngineWithLMHead
language_model | automodel | cuda | AutomodelEngineWithLMHead
language_model | veomni | cuda / npu | VeOmniEngineWithLMHead
language_model | torchtitan | cuda / npu | TorchTitanEngineWithLMHead
value_model | fsdp / fsdp2 | cuda / npu | FSDPEngineWithValueHead
value_model | megatron | cuda | MegatronEngineWithValueHead
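A dispatch table like this is commonly built with a decorator-based registry. The sketch below is a toy version under assumptions: ToyEngineRegistry, get_engine_cls, the device argument, and the Toy engine classes are hypothetical names for illustration and do not reproduce verl's EngineRegistry.

```python
class ToyEngineRegistry:
    """Hypothetical registry keyed by (model_type, backend, device)."""

    _engines: dict = {}

    @classmethod
    def register(cls, model_type, backend, device="cuda"):
        # Accept a single name or a list of names for backend and device.
        backends = [backend] if isinstance(backend, str) else list(backend)
        devices = [device] if isinstance(device, str) else list(device)

        def decorator(engine_cls):
            for b in backends:
                for d in devices:
                    cls._engines[(model_type, b, d)] = engine_cls
            return engine_cls

        return decorator

    @classmethod
    def get_engine_cls(cls, model_type, backend, device):
        key = (model_type, backend, device)
        if key not in cls._engines:
            raise ValueError(f"no engine registered for {key}")
        return cls._engines[key]


@ToyEngineRegistry.register("language_model", ["fsdp", "fsdp2"], ["cuda", "npu"])
class ToyFSDPEngineWithLMHead:
    pass


@ToyEngineRegistry.register("value_model", "megatron", "cuda")
class ToyMegatronEngineWithValueHead:
    pass
```

One class can cover several rows of the table (here, four backend and device combinations), and an unregistered triple fails loudly instead of falling back silently.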

Extending the Engine

Adding a New Backend

  1. Create the engine folder

     Create a new directory under verl/workers/engine/<your_backend>/ and implement transformer_impl.py with a BaseEngine subclass. Register it with @EngineRegistry.register(model_type=..., backend=...).

  2. Add to the SFT test harness

     Add the engine config to the GSM8k SFT trainer script at tests/special_e2e/sft/run_sft_engine_gsm8k.sh.

  3. Run the correctness tests

     Invoke tests/special_e2e/sft/test_sft_engine_all.sh. This script runs all backends and configurations, comparing loss and gradient norm at step 1 to verify numerical correctness.
The worker layer (TrainingWorker / ActorRolloutRefWorker) is already engine-agnostic — once your backend is registered and engine_config.strategy is set to its name, verl will use it automatically without any changes to training logic.
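The step-1 comparison used by the correctness tests can be thought of as a tolerance check between a new backend's metrics and a reference run. The sketch below is an assumption about the shape of such a check; the function name, the metric keys, and the 1% relative tolerance are illustrative and are not taken from the test script.

```python
import math


def matches_reference(metrics: dict, reference: dict, rel_tol: float = 1e-2) -> bool:
    """True if step-1 loss and grad norm agree with the reference within rel_tol."""
    return all(
        math.isclose(metrics[key], reference[key], rel_tol=rel_tol)
        for key in ("loss", "grad_norm")
    )
```

Comparing at step 1 keeps the check insensitive to later divergence caused by accumulated floating-point differences, while still catching errors in sharding, loss scaling, or gradient reduction.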

Adding a New Model Type

Currently, language_model (logit output) and value_model (scalar output) cover all standard RL use cases. Adding a new model type — for example, a model with simultaneous text and audio output like Qwen3-Omni — requires changes across the engine abstraction. Please open a discussion before proceeding.

Engine Workers

See how TrainingWorker and ActorRolloutRefWorker wrap the engine and expose RPCs to the Ray trainer.

Ray Trainer

Understand how PPORayTrainer drives the full distributed training loop.
