Reinforcement learning fine-tuning runs are long-lived workloads that can span hours or days across large GPU clusters. Hardware failures, preemptions, and transient errors are common realities at scale. verl’s checkpoint system lets you save the full training state (model weights, optimizer state, LR scheduler, and RNG states) and resume seamlessly from any saved step, minimizing wasted compute.
What Gets Saved
Checkpoint contents are controlled by the checkpoint.save_contents field on each model role (actor, critic, reference). The field accepts any subset of four values:
| Value | Description |
|---|---|
| model | Framework-native sharded model weights. For FSDP: per-rank shards. For Megatron: HF format via mbridge, or Megatron dist checkpoint when use_dist_checkpointing=True. |
| optimizer | Sharded optimizer state (Adam moments, etc.). |
| extra | LR scheduler state, RNG states, and (for Megatron) the serialized TransformerConfig. |
| hf_model | Full HuggingFace format weights consolidated on rank 0. Suitable for inference without any conversion step. |
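
The four values in the table can be checked before a run is launched. The following is a minimal sketch, not verl code: the helper names are hypothetical, and the resumability rule simply encodes the FSDP guidance that model, optimizer, and extra belong together.

```python
# Illustrative helpers only; these names are hypothetical and not part of verl.
ALLOWED_CONTENTS = frozenset({"model", "optimizer", "extra", "hf_model"})
FSDP_CORE_CONTENTS = frozenset({"model", "optimizer", "extra"})


def check_save_contents(save_contents):
    """Return the requested contents as a set, rejecting unknown values."""
    requested = set(save_contents)
    unknown = requested - ALLOWED_CONTENTS
    if unknown:
        raise ValueError(f"unknown save_contents values: {sorted(unknown)}")
    return requested


def is_resumable_fsdp(save_contents):
    """True only when all three core contents are requested together."""
    return FSDP_CORE_CONTENTS <= check_save_contents(save_contents)
```

For example, is_resumable_fsdp(["model", "hf_model"]) is False because the optimizer and extra state would be missing from the checkpoint.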
For FSDP, the model, optimizer, and extra contents are bound together: they are always saved and loaded as a unit. Always include all three to maintain a consistent and resumable checkpoint. Without the optimizer state you cannot resume training; the saved weights are then useful only for inference.
Configuration
Save Frequency
How often (in training iterations) to write a checkpoint. -1 disables periodic checkpointing; a final checkpoint is written at the end of training only.
Root directory for checkpoint storage on local (or NFS-mounted) filesystems. Defaults to checkpoints/{project_name}/{experiment_name}.
Optional HDFS path for remote checkpoint storage. When set, checkpoints are also uploaded to HDFS after each local save.
Maximum number of actor checkpoint steps to retain. Older ones are deleted automatically. null keeps all.
Resuming from a Checkpoint
Resume behavior:
- auto: automatically resume from the latest checkpoint in default_local_dir if one exists; start fresh if not
- disable: always start from scratch, ignoring any existing checkpoints
- resume_path: resume from the explicit path in resume_from_path

Explicit path to resume from when resume_mode=resume_path. Point to a specific global_step_N directory.
Which saved contents are restored on resume is controlled by load_contents.
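
The save and resume settings described above can be modeled in a few lines of Python to make their interaction concrete. This is a sketch under assumptions, not verl's implementation: the field names save_freq, default_hdfs_dir, and max_actor_ckpt_to_keep do not appear in the text and are assumed here, and the default values are placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CheckpointSettings:
    # Field names and defaults are assumptions modeled on the options above.
    save_freq: int = -1
    default_local_dir: str = "checkpoints/{project_name}/{experiment_name}"
    default_hdfs_dir: Optional[str] = None
    max_actor_ckpt_to_keep: Optional[int] = None
    resume_mode: str = "auto"
    resume_from_path: Optional[str] = None

    def validate(self) -> None:
        """Reject unknown resume modes and a missing explicit resume path."""
        if self.resume_mode not in ("auto", "disable", "resume_path"):
            raise ValueError(f"unsupported resume_mode: {self.resume_mode!r}")
        if self.resume_mode == "resume_path" and not self.resume_from_path:
            raise ValueError("resume_mode=resume_path requires resume_from_path")

    def should_save(self, step: int) -> bool:
        """Periodic save check; a non-positive frequency disables it."""
        return self.save_freq > 0 and step % self.save_freq == 0

    def steps_to_delete(self, saved_steps: List[int]) -> List[int]:
        """Oldest steps beyond the retention limit; None keeps everything."""
        if self.max_actor_ckpt_to_keep is None:
            return []
        ordered = sorted(saved_steps)
        excess = len(ordered) - self.max_actor_ckpt_to_keep
        return ordered[:max(excess, 0)]
```

With save_freq=10 and max_actor_ckpt_to_keep=2, steps 10, 20, 30, and 40 are all saved, and steps 10 and 20 become candidates for deletion once step 40 exists.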
FSDP Checkpoint Structure
For FSDP-based training, checkpoints are sharded per rank. The latest_checkpointed_iteration.txt file records the most recently saved step and is used by resume_mode=auto.
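
How the three resume modes and the tracker file fit together can be sketched as follows. This is an illustration rather than verl's loader. It assumes that the tracker file sits directly under the checkpoint root, that it contains a bare integer step, and that step directories share a common name prefix, which is passed in as a parameter.

```python
import os
from typing import Optional

TRACKER_FILE = "latest_checkpointed_iteration.txt"


def resolve_resume_dir(resume_mode: str,
                       default_local_dir: str,
                       resume_from_path: Optional[str] = None,
                       step_dir_prefix: str = "global_step_") -> Optional[str]:
    """Return the checkpoint directory to load, or None to start fresh."""
    if resume_mode == "disable":
        return None
    if resume_mode == "resume_path":
        if not resume_from_path:
            raise ValueError("resume_mode=resume_path requires resume_from_path")
        return resume_from_path
    if resume_mode != "auto":
        raise ValueError(f"unsupported resume_mode: {resume_mode!r}")
    tracker = os.path.join(default_local_dir, TRACKER_FILE)
    if not os.path.isfile(tracker):
        return None  # nothing saved yet, so training starts from scratch
    with open(tracker, encoding="utf-8") as f:
        step = int(f.read().strip())
    # Assumption: step directories are named <prefix><step> under the root.
    return os.path.join(default_local_dir, f"{step_dir_prefix}{step}")
```

Returning None for both disable and an empty root keeps the caller's logic simple: any non-None result is a directory to load.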
Megatron Checkpoint Structure
Megatron uses a more structured layout (schema v2) with a ckpt_contents.json manifest. The manifest is written last during saving, so its presence indicates a fully complete checkpoint.
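
The write-the-manifest-last idea can be illustrated with a small sketch. It is not verl's saver: the payload files and manifest keys are hypothetical, and the write-to-temp-then-rename step is an extra precaution added here so that a crash cannot leave a half-written manifest.

```python
import json
import os

MANIFEST_NAME = "ckpt_contents.json"


def save_with_manifest(ckpt_dir, payload_files, manifest):
    """Write all payload files first and the manifest last."""
    os.makedirs(ckpt_dir, exist_ok=True)
    for name, data in payload_files.items():
        with open(os.path.join(ckpt_dir, name), "wb") as f:
            f.write(data)
    # Write under a temporary name, then rename into place in one step.
    tmp_path = os.path.join(ckpt_dir, MANIFEST_NAME + ".tmp")
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(manifest, f)
    os.replace(tmp_path, os.path.join(ckpt_dir, MANIFEST_NAME))


def is_complete(ckpt_dir):
    """A checkpoint counts as complete only once its manifest exists."""
    return os.path.isfile(os.path.join(ckpt_dir, MANIFEST_NAME))
```

A cleanup job or resume routine can call is_complete on each step directory and skip any directory where the save was interrupted.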
Megatron Backend Options
Megatron model checkpoint behavior is controlled by two flags on actor_rollout_ref.actor.megatron:
- use_mbridge: When True, the Megatron engine builds an mbridge instance that enables saving and loading model weights in HuggingFace format under model/huggingface/. Required for hf_model in save_contents.
- use_dist_checkpointing: When True, Megatron’s dist_checkpointing writes sharded model weights under model/dist_ckpt/. Can be used alongside use_mbridge=True to save both formats in one step.

| use_mbridge | use_dist_checkpointing | save_contents | On-disk Result |
|---|---|---|---|
| ✅ | ❌ | model | HF weights at model/huggingface/ |
| ✅ | ❌ | hf_model | Same HF tree (deduplicated) |
| ✅ | ❌ | model + hf_model | Same HF checkpoint saved once (deduplicated) |
| ❌ | ✅ | model | Megatron shards at model/dist_ckpt/ |
| ❌ | ✅ | hf_model | Error — mbridge required |
| ✅ | ✅ | model | Megatron shards at model/dist_ckpt/ only (no HF export) |
| ✅ | ✅ | model + hf_model | Both: model/dist_ckpt/ and model/huggingface/ |
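
The table can be read as a small decision function. The sketch below encodes only the rows shown; the function name is hypothetical, and for flag combinations the table does not list (for example both flags off) the result is an extrapolation, not documented behavior.

```python
def megatron_on_disk_result(use_mbridge, use_dist_checkpointing, save_contents):
    """Return the model subdirectories written, following the table above."""
    contents = set(save_contents)
    wants_model = "model" in contents
    wants_hf = "hf_model" in contents
    if wants_hf and not use_mbridge:
        raise ValueError("hf_model requires use_mbridge=True")
    paths = []
    # Dist checkpointing takes over the native model format when enabled.
    if wants_model and use_dist_checkpointing:
        paths.append("model/dist_ckpt/")
    # Without dist checkpointing, model is written as HF weights via mbridge;
    # requesting both model and hf_model then yields a single HF tree.
    if wants_hf or (wants_model and use_mbridge and not use_dist_checkpointing):
        paths.append("model/huggingface/")
    return paths
```

Calling it with both flags set and save_contents of model plus hf_model returns both directories, matching the last row.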
Recommended Megatron Configurations
Default / production: keep use_mbridge=True and save all three core contents (model, optimizer, extra).
Exporting to HuggingFace Format
verl provides verl.model_merger to convert sharded FSDP or Megatron checkpoints into a single HuggingFace model directory for inference.
FSDP Checkpoint Merge
Megatron Checkpoint Merge (Single Node)
Megatron Checkpoint Merge (Distributed, Multi-Node)
After merging, target_dir contains a standard HuggingFace model that can be loaded with AutoModelForCausalLM.from_pretrained().
Validate a Merged Checkpoint
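
Before loading a merged directory, a cheap structural pre-check can catch an empty or truncated export. The sketch below is not verl's own validation and does not compare weights. It relies only on common HuggingFace conventions: a parseable config.json and at least one .safetensors or .bin weight file. The function name is hypothetical.

```python
import json
import os


def looks_like_hf_model_dir(target_dir):
    """Cheap structural check: parseable config.json plus a weight file."""
    config_path = os.path.join(target_dir, "config.json")
    if not os.path.isfile(config_path):
        return False
    try:
        with open(config_path, encoding="utf-8") as f:
            json.load(f)
    except (OSError, json.JSONDecodeError):
        return False
    weight_suffixes = (".safetensors", ".bin")
    return any(name.endswith(weight_suffixes) for name in os.listdir(target_dir))
```

Passing this check does not prove the weights are correct; actually loading the directory with AutoModelForCausalLM.from_pretrained() and comparing outputs remains the real test.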
Migrating Pre-v2 Megatron Checkpoints
Older verl releases produced a flatter checkpoint layout that is incompatible with the current v2 loader. Existing checkpoints in that layout need to be migrated before they can be loaded.
Optimizer Checkpoint Format (Megatron)
Megatron optimizer checkpoints support two formats controlled by dist_ckpt_optim_fully_reshardable:
- False (default, DP-reshardable): Faster and lower memory overhead. Supports resuming with different data parallel sizes.
- True (fully-reshardable): Slower but supports resuming with arbitrary parallelism configurations.
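
The trade-off between the two formats can be expressed as a compatibility check. This is a simplified reading of the two bullets, not Megatron logic: it assumes that in the DP-reshardable format only the data parallel size may change between save and resume, and it models parallelism with just three hypothetical fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Parallelism:
    # Simplified, hypothetical description of a parallel layout.
    tp: int  # tensor parallel size
    pp: int  # pipeline parallel size
    dp: int  # data parallel size


def can_resume_optimizer(saved, target, fully_reshardable):
    """Whether an optimizer checkpoint saved under one layout fits another."""
    if fully_reshardable:
        return True  # arbitrary parallelism changes are supported
    # DP-reshardable: assume everything except the DP size must match.
    return (saved.tp, saved.pp) == (target.tp, target.pp)
```

If a run may later move to a different tensor or pipeline parallel layout, this reading suggests paying the slower save cost of the fully-reshardable format up front.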