Checkpointing and Fault Tolerance in verl Training

Reinforcement learning fine-tuning runs are long-lived workloads that can span hours or days across large GPU clusters, where hardware failures, preemptions, and transient errors are routine. verl’s checkpoint system saves the full training state (model weights, optimizer state, LR scheduler, and RNG states) so that a run can resume from any saved step instead of starting over, which limits the compute lost to a failure.

What Gets Saved

Checkpoint contents are controlled by the checkpoint.save_contents field on each model role (actor, critic, reference). The field accepts any subset of four values:

| Value | Description |
|---|---|
| `model` | Framework-native sharded model weights. For FSDP: per-rank shards. For Megatron: HF format via mbridge, or Megatron dist checkpoint when `use_dist_checkpointing=True`. |
| `optimizer` | Sharded optimizer state (Adam moments, etc.). |
| `extra` | LR scheduler state, RNG states, and (for Megatron) the serialized `TransformerConfig`. |
| `hf_model` | Full HuggingFace format weights consolidated on rank 0. Suitable for inference without any conversion step. |

For FSDP, the model, optimizer, and extra contents are bound together: they are saved and loaded as a unit. Include all three in save_contents to get a consistent, resumable checkpoint. If optimizer is omitted, training cannot be resumed from that checkpoint; the saved weights are useful only for inference.
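The rules above can be checked before a run is launched. The following is a minimal sketch, not part of verl: `check_save_contents` and the two constant sets are hypothetical names introduced here, and the check simply encodes the four allowed values and the three contents needed for a resumable checkpoint.

```python
# Hypothetical pre-flight check for a save_contents list (not a verl API).
ALLOWED_CONTENTS = {"model", "optimizer", "extra", "hf_model"}
RESUME_CONTENTS = {"model", "optimizer", "extra"}


def check_save_contents(save_contents):
    """Return a list of problems found; an empty list means the value is fine."""
    problems = []
    requested = set(save_contents)

    # Anything outside the four documented values is a configuration error.
    unknown = requested - ALLOWED_CONTENTS
    if unknown:
        problems.append(f"unknown values: {sorted(unknown)}")

    # A resumable checkpoint needs model, optimizer, and extra together.
    missing = RESUME_CONTENTS - requested
    if missing:
        problems.append(f"not resumable, missing: {sorted(missing)}")

    return problems


print(check_save_contents(["model", "optimizer", "extra"]))  # []
print(check_save_contents(["hf_model"]))
```

A list such as `['hf_model']` is reported as not resumable rather than rejected, since an inference-only export is a legitimate choice.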

Configuration

Save Frequency

trainer:
  save_freq: 100             # save every 100 training iterations
  default_local_dir: checkpoints/${trainer.project_name}/${trainer.experiment_name}
  default_hdfs_dir: null     # optional: hdfs://path/for/remote/storage

actor_rollout_ref:
  actor:
    checkpoint:
      save_contents: ['model', 'optimizer', 'extra']
      load_contents: ['model', 'optimizer', 'extra']

critic:
  checkpoint:
    save_contents: ['model', 'optimizer', 'extra']
    load_contents: ['model', 'optimizer', 'extra']

trainer.save_freq (int, default: -1)

How often (in training iterations) to write a checkpoint. -1 disables periodic checkpointing; a final checkpoint is written at the end of training only.

trainer.default_local_dir (string)

Root directory for checkpoint storage on local (or NFS-mounted) filesystems. Defaults to checkpoints/{project_name}/{experiment_name}.

trainer.default_hdfs_dir (string)

Optional HDFS path for remote checkpoint storage. When set, checkpoints are also uploaded to HDFS after each local save.

trainer.max_actor_ckpt_to_keep (int)

Maximum number of actor checkpoint steps to retain. Older ones are deleted automatically. null keeps all.
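To see how save_freq and max_actor_ckpt_to_keep interact, the retention rule can be modeled in a few lines. This is a sketch of the behavior described above, not verl's implementation; `prune_checkpoints` is a hypothetical helper, and it assumes that "older" means the lowest step numbers.

```python
def prune_checkpoints(saved_steps, max_to_keep):
    """Split saved steps into (kept, deleted), keeping only the newest ones.

    max_to_keep=None mirrors the documented `null` setting and keeps everything.
    """
    ordered = sorted(saved_steps)
    if max_to_keep is None:
        return ordered, []
    if max_to_keep < 1:
        raise ValueError("max_to_keep must be a positive integer or None")
    if len(ordered) <= max_to_keep:
        return ordered, []
    return ordered[-max_to_keep:], ordered[:-max_to_keep]


# save_freq=100 over 500 training iterations produces these steps.
saved = list(range(100, 501, 100))
kept, deleted = prune_checkpoints(saved, max_to_keep=2)
print(kept)     # [400, 500]
print(deleted)  # [100, 200, 300]
```

When sizing disk, plan for one checkpoint more than the retention limit if old steps are removed only after a new one is written; whether verl deletes before or after writing is not stated here.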

Resuming from a Checkpoint

trainer:
  resume_mode: auto          # auto | disable | resume_path
  resume_from_path: null     # set this when resume_mode: resume_path

actor_rollout_ref:
  actor:
    checkpoint:
      load_contents: ['model', 'optimizer', 'extra']

trainer.resume_mode (string, default: "auto")

Resume behavior:

auto — automatically resume from the latest checkpoint in default_local_dir if one exists; start fresh if not
disable — always start from scratch, ignoring any existing checkpoints
resume_path — resume from the explicit path in resume_from_path

trainer.resume_from_path (string)

Explicit path to resume from when resume_mode=resume_path. Point to a specific global_steps_N directory.
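The three resume modes amount to a small decision procedure. The sketch below illustrates that logic and is not verl's code: `resolve_resume_dir` is a hypothetical function, it assumes the latest_checkpointed_iteration.txt tracker at the root of default_local_dir holds a bare integer step, and it takes the step-directory prefix as a parameter because that naming is an assumption about the layout.

```python
from pathlib import Path

TRACKER_FILE = "latest_checkpointed_iteration.txt"


def resolve_resume_dir(resume_mode, local_dir, resume_from_path=None,
                       step_prefix="global_steps_"):
    """Return the checkpoint directory to load, or None to start fresh."""
    if resume_mode == "disable":
        return None

    if resume_mode == "resume_path":
        if not resume_from_path:
            raise ValueError("resume_mode=resume_path requires resume_from_path")
        return Path(resume_from_path)

    if resume_mode == "auto":
        tracker = Path(local_dir) / TRACKER_FILE
        if not tracker.is_file():
            return None  # nothing saved yet, so start from scratch
        # Assumption: the tracker file contains only the step number.
        step = int(tracker.read_text().strip())
        candidate = Path(local_dir) / f"{step_prefix}{step}"
        return candidate if candidate.is_dir() else None

    raise ValueError(f"unknown resume_mode: {resume_mode!r}")
```

Returning None for a missing tracker keeps the first launch and every later relaunch on the same command line, which is what makes `auto` convenient under preemption.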

To load only the model weights from a checkpoint (e.g., after changing optimizer hyperparameters), override load_contents so that the optimizer and extra state are not restored:

python -m verl.trainer.main_ppo \
    trainer.resume_mode=resume_path \
    trainer.resume_from_path=checkpoints/my_project/my_run/global_steps_500 \
    "actor_rollout_ref.actor.checkpoint.load_contents=['model']"

FSDP Checkpoint Structure

For FSDP-based training, checkpoints are sharded per rank and laid out as follows:

checkpoints/{project_name}/{experiment_name}/
├── global_steps_100/
│   ├── actor/
│   │   ├── huggingface/          # config.json, tokenizer files; full HF weights if hf_model is saved
│   │   ├── fsdp_config.json      # world_size and FSDP version metadata
│   │   ├── model_world_size_8_rank_0.pt
│   │   ├── model_world_size_8_rank_1.pt
│   │   ├── ...
│   │   ├── optim_world_size_8_rank_0.pt
│   │   ├── ...
│   │   └── extra_state_world_size_8_rank_0.pt
│   └── critic/
│       ├── huggingface/
│       ├── fsdp_config.json
│       ├── model_world_size_8_rank_0.pt
│       └── ...
└── latest_checkpointed_iteration.txt

Model shards, optimizer states, and extra states are each written as one file per rank, with the world size and rank encoded in the filename. The latest_checkpointed_iteration.txt file records the most recently saved step and is what resume_mode=auto reads to locate the checkpoint to load.
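Because an FSDP checkpoint is only usable when every rank's file is present, it can be worth verifying a directory before resuming from it. The following is a minimal sketch under the filename pattern shown in the tree above; `missing_fsdp_shards` is a hypothetical helper, and the world size is inferred from the filenames rather than from fsdp_config.json, whose schema is not described here.

```python
import re
from pathlib import Path

# Matches names such as model_world_size_8_rank_0.pt from the layout above.
SHARD_RE = re.compile(r"^(model|optim|extra_state)_world_size_(\d+)_rank_(\d+)\.pt$")


def missing_fsdp_shards(role_dir, kinds=("model", "optim", "extra_state")):
    """Return the shard filenames that are expected but absent in role_dir."""
    found = {}          # kind -> set of ranks seen on disk
    world_sizes = set()
    for entry in Path(role_dir).iterdir():
        match = SHARD_RE.match(entry.name)
        if match:
            kind, world, rank = match.group(1), int(match.group(2)), int(match.group(3))
            world_sizes.add(world)
            found.setdefault(kind, set()).add(rank)

    # Mixed or absent world sizes mean the directory cannot be interpreted.
    if len(world_sizes) != 1:
        raise ValueError(f"expected exactly one world size, found {sorted(world_sizes)}")
    world = world_sizes.pop()

    missing = []
    for kind in kinds:
        for rank in range(world):
            if rank not in found.get(kind, set()):
                missing.append(f"{kind}_world_size_{world}_rank_{rank}.pt")
    return missing
```

An empty result means every requested kind has one file per rank; it says nothing about whether the files themselves are intact.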

Megatron Checkpoint Structure

Megatron uses a more structured layout (schema v2) with a ckpt_contents.json manifest:

checkpoints/{project_name}/{experiment_name}/
├── global_steps_100/
│   ├── actor/
│   │   ├── ckpt_contents.json        # manifest: maps logical names to on-disk paths
│   │   ├── transformer_config.json   # serialized Megatron TransformerConfig
│   │   ├── model/
│   │   │   ├── huggingface/          # HF weights (requires use_mbridge=True)
│   │   │   └── dist_ckpt/           # Megatron shards (use_dist_checkpointing=True)
│   │   ├── optimizer/
│   │   │   └── dist_ckpt/           # optimizer + LR scheduler shards
│   │   └── extra/
│   │       └── dist_ckpt/           # RNG state shards
│   └── critic/                      # same layout as actor
└── latest_checkpointed_iteration.txt

The ckpt_contents.json manifest is written last, so its presence indicates that the checkpoint for that role was saved completely. Example manifest:

{
  "schema_version": 2,
  "framework": "megatron",
  "role": "actor",
  "global_step": 100,
  "save_contents": ["model", "optimizer", "extra"],
  "contents": {
    "model":       {"path": "model/huggingface", "format": "huggingface"},
    "optimizer":   {"path": "optimizer/dist_ckpt", "format": "megatron_dist_checkpoint"},
    "lr_scheduler":{"path": "optimizer/dist_ckpt", "format": "megatron_dist_checkpoint"},
    "rng_state":   {"path": "extra/dist_ckpt",     "format": "megatron_dist_checkpoint"}
  }
}
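Tooling that inspects Megatron checkpoints can use the manifest both as a completeness marker and as a path index. The sketch below assumes only the fields shown in the example manifest; `load_manifest` and `resolve_content_paths` are hypothetical helper names, not verl functions.

```python
import json
from pathlib import Path


def load_manifest(role_dir):
    """Return the parsed manifest, or None if the checkpoint is incomplete."""
    manifest_path = Path(role_dir) / "ckpt_contents.json"
    if not manifest_path.is_file():
        # The manifest is written last, so its absence means the save did not finish.
        return None
    return json.loads(manifest_path.read_text())


def resolve_content_paths(role_dir, manifest):
    """Map each logical content name to its location under role_dir."""
    return {
        name: Path(role_dir) / entry["path"]
        for name, entry in manifest["contents"].items()
    }
```

With the example manifest, both `optimizer` and `lr_scheduler` resolve to the same `optimizer/dist_ckpt` directory, so callers should not assume one directory per logical name.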

Megatron Backend Options

Megatron model checkpoint behavior is controlled by two flags on actor_rollout_ref.actor.megatron:

actor_rollout_ref.actor.megatron.use_mbridge (boolean, default: True)

When True, the Megatron engine builds an mbridge instance, which enables saving and loading model weights in HuggingFace format under model/huggingface/. Required for hf_model in save_contents.

actor_rollout_ref.actor.megatron.use_dist_checkpointing (boolean, default: False)

When True, Megatron’s dist_checkpointing writes sharded model weights under model/dist_ckpt/. Can be used alongside use_mbridge=True to save both formats in one step.

The two flags are independent and can be combined. The table below summarizes behavior:

| `use_mbridge` | `use_dist_checkpointing` | `save_contents` | On-disk result |
|---|---|---|---|
| ✅ | ❌ | `model` | HF weights at `model/huggingface/` |
| ✅ | ❌ | `hf_model` | Same HF tree (deduplicated) |
| ✅ | ❌ | `model` + `hf_model` | Same HF checkpoint saved once (deduplicated) |
| ❌ | ✅ | `model` | Megatron shards at `model/dist_ckpt/` |
| ❌ | ✅ | `hf_model` | Error: mbridge required |
| ✅ | ✅ | `model` | Megatron shards at `model/dist_ckpt/` only (no HF export) |
| ✅ | ✅ | `model` + `hf_model` | Both: `model/dist_ckpt/` and `model/huggingface/` |

hf_model in save_contents requires use_mbridge=True. Without it, the checkpoint manager will raise an error at save time. If you need to save HF-format weights without mbridge, use the verl.model_merger tool after training instead.
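The flag combinations in the table can be expressed as a small function, which is handy for validating a configuration before a long run. This is a sketch that reproduces the table rows and nothing more: `megatron_model_dirs` is a hypothetical name, and the case where both flags are False with `model` requested is not covered by the table, so the sketch rejects it as an assumption.

```python
def megatron_model_dirs(use_mbridge, use_dist_checkpointing, save_contents):
    """Return the set of model directories written for a flag combination."""
    wants_model = "model" in save_contents
    wants_hf = "hf_model" in save_contents

    if wants_hf and not use_mbridge:
        raise ValueError("hf_model in save_contents requires use_mbridge=True")

    dirs = set()
    if wants_model:
        if use_dist_checkpointing:
            dirs.add("model/dist_ckpt")      # Megatron shards take precedence
        elif use_mbridge:
            dirs.add("model/huggingface")    # HF weights via mbridge
        else:
            # Assumption: this combination is outside the documented table.
            raise ValueError("combination not covered by the documented table")
    if wants_hf:
        dirs.add("model/huggingface")        # a set deduplicates the shared HF tree
    return dirs


print(sorted(megatron_model_dirs(True, True, ["model", "hf_model"])))
# ['model/dist_ckpt', 'model/huggingface']
```

Using a set for the result mirrors the deduplication in the table: requesting `model` and `hf_model` with mbridge alone yields a single HF tree.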

Recommended Megatron Configurations

Default / production — keep use_mbridge=True and save all three core contents:

actor_rollout_ref:
  actor:
    megatron:
      use_mbridge: True
    checkpoint:
      save_contents: ['model', 'optimizer', 'extra']

HF-only export — only need a deployable checkpoint, not resumable state:

actor_rollout_ref:
  actor:
    checkpoint:
      save_contents: ['hf_model']

Hybrid (resume + HF export) — write both Megatron shards and HF weights in one step:

actor_rollout_ref:
  actor:
    megatron:
      use_mbridge: True
      use_dist_checkpointing: True
    checkpoint:
      save_contents: ['model', 'hf_model', 'optimizer', 'extra']

Exporting to HuggingFace Format

verl provides verl.model_merger to convert sharded FSDP or Megatron checkpoints into a single HuggingFace model directory for inference.

FSDP Checkpoint Merge

python -m verl.model_merger merge \
    --backend fsdp \
    --local_dir checkpoints/my_project/my_run/global_steps_500/actor \
    --target_dir /path/to/merged_hf_model

Megatron Checkpoint Merge (Single Node)

python -m verl.model_merger merge \
    --backend megatron \
    --tie-word-embedding \
    --local_dir checkpoints/my_project/my_run/global_steps_500/actor \
    --target_dir /path/to/merged_hf_model

Megatron Checkpoint Merge (Distributed, Multi-Node)

torchrun --nproc_per_node 1 --nnodes 8 --node_rank ${RANK} \
    -m verl.model_merger merge \
    --backend megatron \
    --tie-word-embedding \
    --local_dir checkpoints/my_project/my_run/global_steps_500/actor \
    --target_dir /path/to/merged_hf_model

Once merged, the target_dir contains a standard HuggingFace model that can be loaded with AutoModelForCausalLM.from_pretrained().
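Loading the merged directory might look like the following sketch. It assumes the `transformers` package is installed; `looks_like_hf_model_dir` and `load_merged_model` are hypothetical helpers, and the pre-check only looks for a config.json plus at least one `.safetensors` or `.bin` weight file, which is the usual shape of a HuggingFace model directory.

```python
from pathlib import Path


def looks_like_hf_model_dir(path):
    """Cheap sanity check before attempting an expensive model load."""
    directory = Path(path)
    has_config = (directory / "config.json").is_file()
    has_weights = any(directory.glob("*.safetensors")) or any(directory.glob("*.bin"))
    return has_config and has_weights


def load_merged_model(target_dir):
    """Load a merged checkpoint with the standard HuggingFace API."""
    if not looks_like_hf_model_dir(target_dir):
        raise FileNotFoundError(f"{target_dir} does not look like a merged HF model")
    # Imported lazily so the directory check works without transformers installed.
    from transformers import AutoModelForCausalLM
    return AutoModelForCausalLM.from_pretrained(target_dir)
```

Failing early on a missing config or weight file gives a clearer error than the one raised deep inside `from_pretrained` when a merge was interrupted.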

Validate a Merged Checkpoint

python -m verl.model_merger test \
    --backend fsdp \
    --local_dir checkpoints/.../global_steps_500/actor \
    --reference_dir /path/to/original_hf_model

Migrating Pre-v2 Megatron Checkpoints

Older verl releases produced a flatter checkpoint layout that is incompatible with the current v2 loader. Migrate an existing checkpoint with:

# Migrate a single step
python scripts/migrate_megatron_checkpoint_layout.py \
    --checkpoint /path/to/global_step_100/actor

# Migrate all steps under a run
python scripts/migrate_megatron_checkpoint_layout.py \
    --checkpoint-root /path/to/run \
    --all-steps

The migration uses hard links by default, so it is fast and does not duplicate disk space.
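The reason hard links are cheap is that a link is a second directory entry for the same file data, not a copy. The sketch below demonstrates that property with the standard library only; it is not the migration script's code, and `link_into_new_layout` is a hypothetical helper. It assumes source and destination are on the same filesystem, which hard links require.

```python
import os
from pathlib import Path


def link_into_new_layout(src_file, dst_file):
    """Expose src_file at dst_file via a hard link instead of copying it."""
    dst = Path(dst_file)
    dst.parent.mkdir(parents=True, exist_ok=True)
    os.link(src_file, dst)  # raises OSError across filesystems
    return dst


def shares_storage(path_a, path_b):
    """True when both paths refer to the same underlying file."""
    return os.path.samefile(path_a, path_b)
```

After linking, `shares_storage(old_path, new_path)` returns True and the data occupies disk space once; removing the old path later leaves the new one intact.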

Optimizer Checkpoint Format (Megatron)

Megatron optimizer checkpoints support two formats controlled by dist_ckpt_optim_fully_reshardable:

False (default, DP-reshardable): Faster and lower memory overhead. Supports resuming with different data parallel sizes.
True (fully-reshardable): Slower but supports resuming with arbitrary parallelism configurations.

When dist_ckpt_optim_fully_reshardable=True, optimizer states are temporarily gathered on data-parallel rank 0 before being re-sharded for storage. For large models this intermediate aggregation can cause CPU OOM. Use the default DP-reshardable format unless you specifically need to change parallelism on resume.
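The choice between the two formats comes down to which parallelism dimensions may differ between the saving run and the resuming run. As a rough sketch of that rule: the `Parallelism` dataclass and `needs_fully_reshardable` are hypothetical, and the three fields are a simplification that ignores any other parallelism dimensions a real configuration may have.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Parallelism:
    """Simplified parallel layout; the field names are illustrative."""
    tensor: int
    pipeline: int
    data: int


def needs_fully_reshardable(saved, resumed):
    """True when something other than the data-parallel size changes."""
    # A data-parallel-only change is handled by the default DP-reshardable format.
    return (saved.tensor, saved.pipeline) != (resumed.tensor, resumed.pipeline)


saved = Parallelism(tensor=4, pipeline=2, data=8)
print(needs_fully_reshardable(saved, Parallelism(tensor=4, pipeline=2, data=16)))  # False
print(needs_fully_reshardable(saved, Parallelism(tensor=8, pipeline=2, data=4)))   # True
```

Since the format is fixed when the checkpoint is written, this decision has to be made for the saving run, based on the layouts you expect to resume with.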
