verl’s model engine layer provides a clean abstraction between the training logic and the underlying distributed parallelism strategy. Whether a worker uses FSDP, FSDP2, or Megatron-LM, the training code above it stays the same: it calls the same train_batch, infer_batch, save_checkpoint, and load_checkpoint APIs regardless of which backend is active. This engine-agnostic design is what lets verl support a wide range of models and scales from a single unified training loop.
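To make the engine-agnostic idea concrete, here is a minimal sketch of a training loop that depends only on the four shared method names. The Engine protocol, ToyEngine, and run_training are hypothetical stand-ins written for illustration, and the signatures are assumptions; verl's real engines take richer batch and config objects.

```python
from typing import Protocol


class Engine(Protocol):
    """Hypothetical minimal interface; real verl signatures may differ."""

    def train_batch(self, batch: list[float]) -> float: ...
    def infer_batch(self, batch: list[float]) -> list[float]: ...
    def save_checkpoint(self, path: str) -> None: ...
    def load_checkpoint(self, path: str) -> None: ...


class ToyEngine:
    """Stand-in backend: learns one scalar weight w so that w * x ~= x."""

    def __init__(self, name: str, lr: float = 0.1) -> None:
        self.name = name
        self.lr = lr
        self.w = 0.0
        self._saved: dict[str, float] = {}  # in-memory "checkpoint store"

    def train_batch(self, batch: list[float]) -> float:
        # Mean squared error of w*x against target x, then one SGD step.
        loss = sum((self.w * x - x) ** 2 for x in batch) / len(batch)
        grad = sum(2 * (self.w * x - x) * x for x in batch) / len(batch)
        self.w -= self.lr * grad
        return loss

    def infer_batch(self, batch: list[float]) -> list[float]:
        return [self.w * x for x in batch]

    def save_checkpoint(self, path: str) -> None:
        self._saved[path] = self.w

    def load_checkpoint(self, path: str) -> None:
        self.w = self._saved[path]


def run_training(engine: Engine, batches: list[list[float]], ckpt_path: str) -> list[float]:
    """Backend-agnostic loop: only the shared methods are called."""
    losses = [engine.train_batch(b) for b in batches]
    engine.save_checkpoint(ckpt_path)
    return losses


batches = [[1.0, 2.0]] * 3
for backend in ("fsdp", "megatron"):  # names only; both use the same toy class
    engine = ToyEngine(backend)
    losses = run_training(engine, batches, ckpt_path=f"{backend}-step3")
    print(backend, [round(x, 4) for x in losses])
```

The loop never inspects which backend it was handed, which is the property the engine layer is designed to preserve.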
Backend Support Matrix
Each backend makes different trade-offs between ease of use, model coverage, and raw scalability. The table below summarizes the current state:

| Backend | Model Support | Scalability | Model Definition | Notes |
|---|---|---|---|---|
| FSDP + Ulysses | Any HuggingFace model from day one | Dense models: good. MoE models: poor | HuggingFace + monkey patch | Monkey patches can be impacted by transformers version upgrades |
| MCore (Megatron) | Limited set of models | Best — full 3D parallelism | GPTModel (one model definition for all) | Supporting new models requires significant effort |
verl monkey-patches the attention function in HuggingFace models to add DeepSpeed Ulysses sequence parallelism support. VLM models also receive a monkey patch that enables FSDP to handle mixed batches containing both image and text-only samples.
Class Hierarchy
All workers and trainers in verl run in SPMD mode: every GPU rank executes the same code. The SFT/DPO/RM trainers are invoked directly via torchrun; the Actor/Critic workers are wrapped by a RayWorkerGroup and expose their APIs as RPCs to the single controller.
The engine stack has three levels:
The base engine (BaseEngine) implements the foundational infrastructure that every backend shares: model initialization from a HuggingFace config, optimizer construction, learning-rate scheduler setup, weight sharding, and checkpoint management. The base engine does not implement a forward pass; that is left to the next level. The full API reference lives at verl/workers/engine/base.py.

Full engine classes (e.g., FSDPEngineWithLMHead, MegatronEngineWithLMHead) subclass the base engine and implement forward_step, the function that actually runs the model forward pass and computes the loss. There are two model types: language_model (logit output) and value_model (scalar output).

Backend Details
- FSDP / FSDP2
- Megatron-LM (MCore)
- Automodel
The FSDP backend wraps any HuggingFace model with PyTorch Fully Sharded Data Parallel, making it trivially easy to add new models. FSDP2 is the next-generation implementation with improved memory management.

Key features:
- Works out of the box with any transformers model
- Sequence parallelism via DeepSpeed Ulysses (monkey-patched attention)
- dtensor_weight_loader for efficient weight synchronization to the rollout engine (vLLM / SGLang) without redundant copies
- Parameter and optimizer offload to CPU for memory-constrained setups

Trade-offs:
- Monkey patches are sensitive to transformers version changes
- MoE models exhibit poor FSDP scaling due to load imbalance across expert shards
3D-HybridEngine and Weight Synchronization
When actor training and rollout generation are colocated on the same GPUs, verl must reshard model weights from the training layout (e.g., FSDP shards across all ranks) into the inference layout (e.g., vLLM tensor-parallel groups) between training steps. This is the 3D-HybridEngine. Rather than writing the full weight tensor to host memory and re-reading it, the hybrid engine exports per-tensor parameters directly via engine.get_per_tensor_param() and passes them to the rollout engine’s update_weights() call in-process. For FSDP, this avoids any redundant copy; the parameters are gathered once and streamed directly into the vLLM/SGLang weight buffers.
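The resharding flow can be sketched with plain Python lists standing in for tensors. The method names get_per_tensor_param and update_weights come from the text above, but their signatures here are assumptions, and ToyShardedEngine and ToyRollout are hypothetical classes. The point of the sketch is that a generator hands over one gathered parameter at a time, so the full model is never materialized in a single intermediate copy.

```python
from typing import Dict, Iterator, List, Tuple

Tensor = List[float]  # stand-in for a real tensor


class ToyShardedEngine:
    """Training side: each parameter is split into one shard per rank."""

    def __init__(self, shards: Dict[str, List[Tensor]]) -> None:
        self.shards = shards

    def get_per_tensor_param(self) -> Iterator[Tuple[str, Tensor]]:
        # Gather one parameter at a time; only a single full tensor
        # exists at any moment.
        for name, parts in self.shards.items():
            full = [v for part in parts for v in part]
            yield name, full


class ToyRollout:
    """Inference side: each parameter is kept as tp_size slices."""

    def __init__(self, tp_size: int) -> None:
        self.tp_size = tp_size
        self.weights: Dict[str, List[Tensor]] = {}

    def update_weights(self, params: Iterator[Tuple[str, Tensor]]) -> None:
        for name, full in params:
            step = len(full) // self.tp_size  # assumes even divisibility
            self.weights[name] = [
                full[i * step:(i + 1) * step] for i in range(self.tp_size)
            ]


# Four training ranks, resharded into a tensor-parallel group of two.
engine = ToyShardedEngine({"linear.weight": [[1.0], [2.0], [3.0], [4.0]]})
rollout = ToyRollout(tp_size=2)
rollout.update_weights(engine.get_per_tensor_param())
print(rollout.weights)
```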
CheckpointEngine.send_weights path instead.
Checkpoint System
Each engine is responsible for saving and loading the complete training state: model weights, optimizer state, and LR scheduler state. The checkpointing flow has two layers:
- Intermediate sharded checkpoints: saved by the engine in its native sharded format (FSDP shards, Megatron tensor-parallel slices, etc.) at each save_freq step. These are fast to write and read.
- HuggingFace export: engines that use HuggingFace model definitions can merge shards back to HF format using transformers utilities. Engines based on custom model definitions (e.g., Megatron) must use a dedicated merge script (e.g., mbridge).
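The two layers can be illustrated with a toy example in which each rank writes its own shard file and a separate merge step rebuilds the full tensors. This is a sketch under simplifying assumptions: the file layout, the global_step_10 directory name, and both helper functions are hypothetical, and only model weights are handled (a real checkpoint also carries optimizer and LR scheduler state).

```python
import json
import tempfile
from pathlib import Path


def save_sharded(ckpt_dir: Path, rank: int, shard: dict) -> Path:
    """Layer 1: each rank writes only its own slice of every tensor."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    path = ckpt_dir / f"rank_{rank}.json"
    path.write_text(json.dumps(shard))
    return path


def merge_shards(ckpt_dir: Path, world_size: int) -> dict:
    """Layer 2: concatenate the slices from all ranks, in rank order."""
    merged: dict = {}
    for rank in range(world_size):
        shard = json.loads((ckpt_dir / f"rank_{rank}.json").read_text())
        for name, values in shard.items():
            merged.setdefault(name, []).extend(values)
    return merged


full = {"embed.weight": [0.1, 0.2, 0.3, 0.4]}
world_size = 2
per_rank = len(full["embed.weight"]) // world_size

with tempfile.TemporaryDirectory() as tmp:
    step_dir = Path(tmp) / "global_step_10"  # hypothetical directory name
    for rank in range(world_size):
        shard = {
            name: values[rank * per_rank:(rank + 1) * per_rank]
            for name, values in full.items()
        }
        save_sharded(step_dir, rank, shard)
    merged = merge_shards(step_dir, world_size)

print(merged == full)
```

Writing shards is cheap because no rank ever holds the full tensor; the merge cost is paid only when an exported, unsharded model is actually needed.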
EngineRegistry Dispatch Table
The EngineRegistry selects the concrete engine class from the (model_type, backend, device) triple specified in your Hydra config:

| model_type | backend | device | Engine class |
|---|---|---|---|
| language_model | fsdp / fsdp2 | cuda / npu | FSDPEngineWithLMHead |
| language_model | megatron | cuda | MegatronEngineWithLMHead |
| language_model | megatron | npu | MindspeedEngineWithLMHead |
| language_model | mindspeed_megatron | npu | MindSpeedMegatronEngineWithLMHead |
| language_model | automodel | cuda | AutomodelEngineWithLMHead |
| language_model | veomni | cuda / npu | VeOmniEngineWithLMHead |
| language_model | torchtitan | cuda / npu | TorchTitanEngineWithLMHead |
| value_model | fsdp / fsdp2 | cuda / npu | FSDPEngineWithValueHead |
| value_model | megatron | cuda | MegatronEngineWithValueHead |
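A dispatch table like this is commonly implemented as a decorator-based registry. The sketch below shows one way that could look; ToyEngineRegistry and the Toy* engine classes are hypothetical, and the device argument and list-valued keys are assumptions rather than verl's actual register signature.

```python
from typing import Callable, Dict, Iterable, Tuple, Type, Union

Key = Tuple[str, str, str]  # (model_type, backend, device)


class ToyEngineRegistry:
    """Hypothetical registry keyed by (model_type, backend, device)."""

    _table: Dict[Key, Type] = {}

    @classmethod
    def register(
        cls,
        model_type: str,
        backend: Union[str, Iterable[str]],
        device: Union[str, Iterable[str]] = "cuda",
    ) -> Callable[[Type], Type]:
        backends = [backend] if isinstance(backend, str) else list(backend)
        devices = [device] if isinstance(device, str) else list(device)

        def decorator(engine_cls: Type) -> Type:
            # One table entry per (backend, device) combination.
            for b in backends:
                for d in devices:
                    cls._table[(model_type, b, d)] = engine_cls
            return engine_cls

        return decorator

    @classmethod
    def get(cls, model_type: str, backend: str, device: str) -> Type:
        key = (model_type, backend, device)
        try:
            return cls._table[key]
        except KeyError:
            raise ValueError(f"no engine registered for {key}") from None


@ToyEngineRegistry.register(
    "language_model", backend=["fsdp", "fsdp2"], device=["cuda", "npu"]
)
class ToyFSDPEngineWithLMHead:
    pass


@ToyEngineRegistry.register("language_model", backend="megatron", device="cuda")
class ToyMegatronEngineWithLMHead:
    pass


engine_cls = ToyEngineRegistry.get("language_model", "fsdp2", "npu")
print(engine_cls.__name__)
```

Keying on the full triple lets the same backend name resolve to different classes per device, as the megatron rows in the table do.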
Extending the Engine
Adding a New Backend
- Create a new directory under verl/workers/engine/<your_backend>/ and implement transformer_impl.py with a BaseEngine subclass. Register it with @EngineRegistry.register(model_type=..., backend=...).
- Add the engine config to the GSM8k SFT trainer script at tests/special_e2e/sft/run_sft_engine_gsm8k.sh.

The worker layer (TrainingWorker / ActorRolloutRefWorker) is already engine-agnostic: once your backend is registered and engine_config.strategy is set to its name, verl will use it automatically without any changes to training logic.
Adding a New Model Type
Currently, language_model (logit output) and value_model (scalar output) cover all standard RL use cases. Adding a new model type (for example, a model with simultaneous text and audio output like Qwen3-Omni) requires changes across the engine abstraction. Please open a discussion before proceeding.
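The difference between the two existing model types comes down to the shape of the head's output. The following sketch uses hypothetical helper functions and plain Python lists in place of tensors; it is not verl's forward_step, only an illustration of why a model type with a different output shape touches everything downstream of the head.

```python
import math
from typing import List


def lm_head_output(hidden: List[float], vocab_proj: List[List[float]]) -> List[float]:
    """language_model: one logit per vocabulary entry."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in vocab_proj]


def value_head_output(hidden: List[float], value_proj: List[float]) -> float:
    """value_model: a single scalar for the same hidden state."""
    return sum(h * w for h, w in zip(hidden, value_proj))


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-probabilities over the vocabulary."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


hidden = [1.0, -1.0]                               # one token's hidden state
vocab_proj = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]  # toy vocabulary of 3
value_proj = [2.0, 1.0]

logits = lm_head_output(hidden, vocab_proj)
log_probs = log_softmax(logits)
value = value_head_output(hidden, value_proj)

print(len(logits), value)
```

A model that emits, say, both text logits and audio frames would fit neither shape, which is why the text recommends a design discussion first.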
Engine Workers
See how TrainingWorker and ActorRolloutRefWorker wrap the engine and expose RPCs to the Ray trainer.

Ray Trainer
Understand how PPORayTrainer drives the full distributed training loop.