verl Model Engine: FSDP, FSDP2, and Megatron-LM Backends

verl’s model engine layer provides a clean abstraction between the training logic and the underlying distributed parallelism strategy. Whether a worker uses FSDP, FSDP2, or Megatron-LM, the training code above it stays the same — it calls the same train_batch, infer_batch, save_checkpoint, and load_checkpoint APIs regardless of which backend is active. This engine-agnostic design is what lets verl support a wide range of models and scales from a single unified training loop.
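
To make the engine-agnostic contract concrete, the sketch below models it as a Python `Protocol`. Only the four method names (`train_batch`, `infer_batch`, `save_checkpoint`, `load_checkpoint`) come from the text above. The signatures, the `ToyEngine` class, and the `run_training` helper are hypothetical and are not verl's actual API.

```python
from __future__ import annotations

from typing import Any, Protocol


class Engine(Protocol):
    """Hypothetical shape of the engine contract; signatures are assumptions."""

    def train_batch(self, batch: dict[str, Any]) -> dict[str, float]: ...
    def infer_batch(self, batch: dict[str, Any]) -> dict[str, Any]: ...
    def save_checkpoint(self, path: str) -> None: ...
    def load_checkpoint(self, path: str) -> None: ...


class ToyEngine:
    """Stand-in backend: 'trains' by counting steps, no real model involved."""

    def __init__(self) -> None:
        self.step = 0
        self.saved: dict[str, int] = {}

    def train_batch(self, batch):
        self.step += 1
        return {"loss": 1.0 / self.step}

    def infer_batch(self, batch):
        return {"num_tokens": len(batch["input_ids"])}

    def save_checkpoint(self, path):
        self.saved[path] = self.step

    def load_checkpoint(self, path):
        self.step = self.saved[path]


def run_training(engine: Engine, batches: list[dict[str, Any]]) -> list[float]:
    """Training loop that depends only on the abstract engine methods."""
    losses = [engine.train_batch(batch)["loss"] for batch in batches]
    engine.save_checkpoint("ckpt/final")
    return losses
```

Because `run_training` touches nothing but the protocol methods, swapping `ToyEngine` for another class with the same methods requires no change to the loop, which is the property the engine layer is designed around.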

Backend Support Matrix

Each backend makes different trade-offs between ease of use, model coverage, and raw scalability. The table below summarizes the current state:

Backend	Model Support	Scalability	Model Definition	Notes
FSDP + Ulysses	Any HuggingFace model from day one	Dense models: good. MoE models: poor	HuggingFace + monkey patch	Monkey patches can be impacted by `transformers` version upgrades
MCore (Megatron)	Limited set of models	Best — full 3D parallelism	`GPTModel` (one model definition for all)	Supporting new models requires significant effort

verl monkey-patches the attention function in HuggingFace models to add DeepSpeed Ulysses sequence parallelism support. VLM models also receive a monkey patch that enables FSDP to handle mixed batches containing both image and text-only samples.

Class Hierarchy

All workers and trainers in verl run in SPMD mode — every GPU rank executes the same code. The SFT/DPO/RM trainers are invoked directly via torchrun; the Actor/Critic workers are wrapped by a RayWorkerGroup and expose their APIs as RPCs to the single controller. The engine stack has three levels:

Base Engine Level

The base engine (BaseEngine) implements the foundational infrastructure that every backend shares: model initialization from a HuggingFace config, optimizer construction, learning-rate scheduler setup, weight sharding, and checkpoint management. The base engine does not implement a forward pass — that is left to the next level.

The full API reference lives at verl/workers/engine/base.py.

Full Engine Level

Full engine classes (e.g., FSDPEngineWithLMHead, MegatronEngineWithLMHead) subclass the base engine and implement forward_step — the function that actually runs the model forward pass and computes the loss. There are two model types:

Model type	Input	Output
Language model	text / image / video / audio tokens	logits for next token prediction
Value model	text / image / video / audio tokens	scalar value estimate
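
The split between the base level and the full level resembles a template-method pattern: shared plumbing lives in the base class, and each subclass supplies `forward_step`. The sketch below illustrates that split with a one-parameter toy model. The class names `ToyBaseEngine` and `ToyEngineWithValueHead`, the signatures, and the plain SGD update are all assumptions made for illustration, not verl code.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class ToyBaseEngine(ABC):
    """Shared plumbing (state, optimizer step); no forward pass here."""

    def __init__(self, weight: float, lr: float) -> None:
        self.weight = weight
        self.lr = lr

    @abstractmethod
    def forward_step(self, x: float, target: float) -> tuple[float, float]:
        """Return (loss, d_loss/d_weight) for one sample."""

    def train_batch(self, batch: list[tuple[float, float]]) -> float:
        # Average loss and gradient over the batch, then take one SGD step.
        results = [self.forward_step(x, t) for x, t in batch]
        mean_loss = sum(loss for loss, _ in results) / len(results)
        mean_grad = sum(grad for _, grad in results) / len(results)
        self.weight -= self.lr * mean_grad
        return mean_loss


class ToyEngineWithValueHead(ToyBaseEngine):
    """Scalar-output model: prediction = weight * x, squared-error loss."""

    def forward_step(self, x, target):
        err = self.weight * x - target
        return err * err, 2.0 * err * x
```

A language-model variant would subclass the same base and return a token-level loss instead; the training loop in `train_batch` would stay untouched.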

Worker / SPMD Trainer Level

The worker layer (TrainingWorker, ActorRolloutRefWorker) is engine-agnostic. It implements training logic — PPO epochs, gradient accumulation, actor-rollout weight sync — using only the abstract engine APIs. The concrete backend is injected at construction time via EngineRegistry.

Backend Details

FSDP / FSDP2

The FSDP backend wraps any HuggingFace model with PyTorch Fully Sharded Data Parallel, making it trivially easy to add new models. FSDP2 is the next-generation implementation with improved memory management.

Key features:

Works out of the box with any transformers model
Sequence parallelism via DeepSpeed Ulysses (monkey-patched attention)
dtensor_weight_loader for efficient weight synchronization to the rollout engine (vLLM / SGLang) without redundant copies
Parameter and optimizer offload to CPU for memory-constrained setups

Configuration:

actor_rollout_ref:
  actor:
    strategy: fsdp      # or fsdp2
  ref:
    strategy: fsdp
critic:
  strategy: fsdp

Trade-offs:

Monkey patches are sensitive to transformers version changes
MoE models exhibit poor FSDP scaling due to load imbalance across expert shards

Megatron-LM (MCore)

The Megatron-LM backend uses NVIDIA’s MCore library (GPTModel) and provides the best raw scalability for large dense and MoE models.

Key features:

Full 3D parallelism: tensor parallelism (TP), pipeline parallelism (PP), expert parallelism (EP)
Validated on models up to 671B parameters (DeepSeek-V3, Qwen3-MoE)
NPU support via the MindSpeed adapter (mindspeed_megatron)
make_nd_compute_dataproto_dispatch_fn surfaces the PP dimension as an extra DP axis to the single controller, so no backend-specific dispatch logic is needed at the trainer level
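
As a rough aid for sizing a Megatron run, the sketch below computes how many data-parallel replicas remain once tensor and pipeline parallelism are fixed. It is a simplified assumption: it ignores context parallelism and the way expert parallelism is laid out, both of which change the arithmetic in real setups. The function name `data_parallel_size` is hypothetical.

```python
def data_parallel_size(world_size: int, tp: int, pp: int) -> int:
    """Number of data-parallel replicas for a given TP x PP layout.

    Simplified: assumes world_size = TP * PP * DP with no other axes.
    """
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*pp={model_parallel}"
        )
    return world_size // model_parallel
```

For example, 64 GPUs with TP=8 and PP=4 leave two data-parallel replicas under this simplification.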

Configuration:

actor_rollout_ref:
  actor:
    strategy: megatron
critic:
  strategy: megatron

Trade-offs:

Only a limited set of models have an MCore implementation
Adding a new model requires writing a GPTModel-compatible definition

Automodel

The Automodel backend delegates model building, parallelization, optimizer sharding, and checkpointing to nemo_automodel while using verl’s training loop, data pipeline, and loss functions.

Requirements: Automodel r0.3.0, transformers v5.0.0

Key features:

FSDP2 and Tensor Parallelism (TP) out of the box
Native MoE support with Expert Parallelism (EP) via DeepEP
TransformerEngine (TE) integration for optimized attention, linear layers, and RMSNorm
No checkpoint conversion needed — loads any HuggingFace model directly

Limitation: Pipeline parallelism is not yet supported.

Configuration:

actor_rollout_ref:
  actor:
    strategy: automodel

3D-HybridEngine and Weight Synchronization

When actor training and rollout generation are colocated on the same GPUs, verl must reshard model weights from the training layout (e.g., FSDP shards across all ranks) into the inference layout (e.g., vLLM tensor-parallel groups) after every training step, before the next rollout. This resharding mechanism is the 3D-HybridEngine. Rather than writing the full weight tensor to host memory and re-reading it, the hybrid engine exports per-tensor parameters directly via engine.get_per_tensor_param() and passes them to the rollout engine’s update_weights() call in-process. For FSDP, this avoids any redundant copy: the parameters are gathered once and streamed directly into the vLLM/SGLang weight buffers.

# Inside ActorRolloutRefWorker.update_weights (naive/colocated sync)
per_tensor_param, peft_config = self.actor.engine.get_per_tensor_param(
    layered_summon=self.layered_summon,
    base_sync_done=True,
)
await self.rollout.update_weights(
    per_tensor_param,
    peft_config=peft_config,
    base_sync_done=True,
    global_steps=global_steps,
)
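
The per-tensor streaming idea can be illustrated without any distributed runtime. In the toy sketch below, each training rank holds a contiguous slice of every parameter, a generator reassembles one parameter at a time, and the consumer re-splits it for a tensor-parallel inference layout. Plain Python lists stand in for tensors, and the helper names (`iter_full_params`, `split_for_tp`, `toy_update_weights`) are hypothetical. Real FSDP sharding and vLLM/SGLang weight loading are more involved than this.

```python
from __future__ import annotations

from typing import Iterable, Iterator


def iter_full_params(
    shards_by_rank: list[dict[str, list[float]]],
) -> Iterator[tuple[str, list[float]]]:
    """Yield (name, full_param) one parameter at a time, in rank order."""
    for name in shards_by_rank[0]:
        full: list[float] = []
        for shard in shards_by_rank:
            full.extend(shard[name])
        yield name, full  # only one full parameter is materialized at once


def split_for_tp(full: list[float], tp_size: int) -> list[list[float]]:
    """Split a parameter into tp_size equal contiguous slices."""
    if len(full) % tp_size != 0:
        raise ValueError("parameter length must be divisible by tp_size")
    chunk = len(full) // tp_size
    return [full[i * chunk : (i + 1) * chunk] for i in range(tp_size)]


def toy_update_weights(
    params: Iterable[tuple[str, list[float]]], tp_size: int
) -> list[dict[str, list[float]]]:
    """Consume the stream and fill one weight buffer per inference TP rank."""
    buffers: list[dict[str, list[float]]] = [{} for _ in range(tp_size)]
    for name, full in params:
        for rank, piece in enumerate(split_for_tp(full, tp_size)):
            buffers[rank][name] = piece
    return buffers
```

Using a generator means the consumer never needs the whole model in its gathered form, which mirrors the motivation for exporting parameters per tensor rather than as one full state dict.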

For disaggregated async training (trainer and rollout on separate node pools), weights are transferred via the CheckpointEngine.send_weights path instead.

Checkpoint System

Each engine is responsible for saving and loading the complete training state — model weights, optimizer state, and LR scheduler state. The checkpointing flow has two layers:

Intermediate sharded checkpoints — saved by the engine in its native sharded format (FSDP shards, Megatron tensor-parallel slices, etc.) at each save_freq step. These are fast to write and read.
HuggingFace export — engines that use HuggingFace model definitions can merge shards back to HF format using transformers utilities. Engines based on custom model definitions (e.g., Megatron) must use a dedicated merge script (e.g., mbridge).

The engine constructs the model from a HuggingFace config, loads weights from a HuggingFace checkpoint, and then hands off to the sharding strategy.
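
The two checkpoint layers can be sketched with ordinary files: each rank writes its own shard, and a separate merge step rebuilds the full parameters for export. The file naming scheme, the JSON format, and both function names are hypothetical choices for this illustration. Real engines write their native sharded formats and use dedicated merge tooling.

```python
from __future__ import annotations

import json
from pathlib import Path


def save_sharded(ckpt_dir: Path, rank: int, shard: dict[str, list[float]]) -> Path:
    """Write one rank's slice of every parameter to its own file."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    path = ckpt_dir / f"shard_rank{rank:05d}.json"
    path.write_text(json.dumps(shard))
    return path


def merge_shards(ckpt_dir: Path) -> dict[str, list[float]]:
    """Rebuild full parameters by concatenating shards in rank order."""
    merged: dict[str, list[float]] = {}
    # Zero-padded rank numbers make lexicographic order equal rank order.
    for path in sorted(ckpt_dir.glob("shard_rank*.json")):
        for name, values in json.loads(path.read_text()).items():
            merged.setdefault(name, []).extend(values)
    return merged
```

Writing shards is cheap because no rank needs another rank's data. The merge step pays the gathering cost once, at export time.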

EngineRegistry Dispatch Table

The EngineRegistry selects the concrete engine class from the (model_type, backend, device) triple specified in your Hydra config:

model_type	backend	device	Engine class
`language_model`	`fsdp` / `fsdp2`	`cuda` / `npu`	`FSDPEngineWithLMHead`
`language_model`	`megatron`	`cuda`	`MegatronEngineWithLMHead`
`language_model`	`megatron`	`npu`	`MindspeedEngineWithLMHead`
`language_model`	`mindspeed_megatron`	`npu`	`MindSpeedMegatronEngineWithLMHead`
`language_model`	`automodel`	`cuda`	`AutomodelEngineWithLMHead`
`language_model`	`veomni`	`cuda` / `npu`	`VeOmniEngineWithLMHead`
`language_model`	`torchtitan`	`cuda` / `npu`	`TorchTitanEngineWithLMHead`
`value_model`	`fsdp` / `fsdp2`	`cuda` / `npu`	`FSDPEngineWithValueHead`
`value_model`	`megatron`	`cuda`	`MegatronEngineWithValueHead`
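
A dispatch table like the one above can be implemented as a dictionary keyed on the triple, filled by a class decorator. The sketch below is a minimal stand-in and not verl's `EngineRegistry`: the module-level `register` and `get_engine_cls` functions, their signatures, and the `Toy...` classes are assumptions made for illustration.

```python
from __future__ import annotations

_REGISTRY: dict[tuple[str, str, str], type] = {}


def register(model_type: str, backend: str | list[str], device: str | list[str] = "cuda"):
    """Class decorator: register a class under every (backend, device) pair."""
    backends = [backend] if isinstance(backend, str) else backend
    devices = [device] if isinstance(device, str) else device

    def decorator(cls: type) -> type:
        for b in backends:
            for d in devices:
                _REGISTRY[(model_type, b, d)] = cls
        return cls

    return decorator


def get_engine_cls(model_type: str, backend: str, device: str) -> type:
    """Look up the engine class for a (model_type, backend, device) triple."""
    key = (model_type, backend, device)
    if key not in _REGISTRY:
        raise ValueError(f"no engine registered for {key}")
    return _REGISTRY[key]


@register("language_model", backend=["fsdp", "fsdp2"], device=["cuda", "npu"])
class ToyFSDPEngineWithLMHead:
    pass


@register("language_model", backend="megatron", device="cuda")
class ToyMegatronEngineWithLMHead:
    pass
```

Registering one class under several keys is what lets a single table row such as `fsdp` / `fsdp2` on `cuda` / `npu` resolve to the same engine class.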

Extending the Engine

Adding a New Backend

Create the engine folder

Create a new directory at verl/workers/engine/<your_backend>/ and implement transformer_impl.py with a BaseEngine subclass. Register it with @EngineRegistry.register(model_type=..., backend=...).

Add to the SFT test harness

Add the engine config to the GSM8k SFT trainer script at tests/special_e2e/sft/run_sft_engine_gsm8k.sh.

Run the correctness tests

Invoke tests/special_e2e/sft/test_sft_engine_all.sh. This script runs all backends and configurations, comparing loss and gradient norm at step 1 to verify numerical correctness.
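
The kind of check described in the last step can be sketched as a tolerance comparison between a reference backend and the backend under test. This is not the logic of the test script itself: the function name, the metric dictionary layout, and the 1% relative tolerance are assumptions chosen for illustration.

```python
import math


def compare_step_metrics(reference: dict, candidate: dict, rel_tol: float = 1e-2) -> list:
    """Return the names of step-1 metrics that differ beyond rel_tol."""
    mismatches = []
    for name in ("loss", "grad_norm"):
        if not math.isclose(reference[name], candidate[name], rel_tol=rel_tol):
            mismatches.append(name)
    return mismatches
```

An empty result means the new backend reproduces the reference numerics within tolerance. Any returned name points at the metric to investigate first.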

The worker layer (TrainingWorker / ActorRolloutRefWorker) is already engine-agnostic — once your backend is registered and engine_config.strategy is set to its name, verl will use it automatically without any changes to training logic.

Adding a New Model Type

Currently, language_model (logit output) and value_model (scalar output) cover all standard RL use cases. Adding a new model type — for example, a model with simultaneous text and audio output like Qwen3-Omni — requires changes across the engine abstraction. Please open a discussion before proceeding.

Engine Workers

See how TrainingWorker and ActorRolloutRefWorker wrap the engine and expose RPCs to the Ray trainer.

Ray Trainer

Understand how RayPPOTrainer drives the full distributed training loop.
