Training large language models with RL algorithms like PPO and GRPO is memory-intensive: the actor, reference policy, critic, and reward model must all fit in GPU memory simultaneously alongside the rollout engine’s KV cache. verl provides several orthogonal memory optimization techniques that can be combined to scale RL training to very large models on modest hardware.
For most use cases, the recommended combination is FSDP2 + LoRA (rank ≥ 32) + use_remove_padding + use_dynamic_bsz. This gives the best throughput-to-memory ratio without sacrificing convergence quality.

LoRA for RL Fine-Tuning

LoRA (Low-Rank Adaptation) injects trainable low-rank matrices into the pre-trained weights of linear layers, dramatically reducing the number of trainable parameters. In RL training this enables:
  • Fine-tuning 70B+ models on 8×80 GB GPUs
  • Larger effective batch sizes due to reduced optimizer state memory
  • Simpler deployment: only LoRA adapter weights need to be saved and served
  • Compatibility with multi-adapter serving techniques like S-LoRA and CCoE
LoRA introduces a tradeoff: very small ranks hurt convergence, so a lora_rank of at least 32 is recommended. For a 0.5B model at rank 32, convergence speed and final performance are nearly identical to full fine-tuning; for a 32B model, rank 128 achieves the same result. When using LoRA, increase the learning rate by approximately 10× relative to the value used for full fine-tuning.
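The parameter savings can be estimated with simple arithmetic. The sketch below assumes a single linear layer of shape d_in × d_out with a rank-r adapter made of two matrices, A (r × d_in) and B (d_out × r), and the common LoRA convention that the update is scaled by alpha / rank. The function names and the 4096-wide layer are illustrative only and do not come from verl.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter on a d_in x d_out linear layer."""
    # A has shape (rank, d_in) and B has shape (d_out, rank).
    return rank * d_in + d_out * rank


def lora_scale(alpha: float, rank: int) -> float:
    """Scaling applied to the low-rank update (assumed standard alpha / rank convention)."""
    return alpha / rank


# Hypothetical square projection, chosen only to make the numbers concrete.
d_in = d_out = 4096
frozen = d_in * d_out
for rank in (8, 32, 128):
    trainable = lora_param_count(d_in, d_out, rank)
    print(f"rank={rank:>3}: {trainable:,} trainable vs {frozen:,} frozen "
          f"({trainable / frozen:.2%}), scale={lora_scale(32, rank)}")
```

Because optimizer state is kept only for trainable parameters, the same ratio roughly bounds how much optimizer memory the adapter needs relative to full fine-tuning of that layer.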

FSDP Backend Configuration

LoRA is supported via HuggingFace PEFT with both FSDP and FSDP2 backends, and works with both vLLM and SGLang rollout backends.
actor_rollout_ref:
  model:
    lora_rank: 32           # minimum recommended; increase for larger models
    lora_alpha: 32          # scaling factor
    target_modules: "all-linear"  # or specific list: ["q_proj", "v_proj", "k_proj", "o_proj"]

  rollout:
    load_format: "safetensors"  # required: enables rollout engine to load base model
    layered_summon: true        # gather FSDP shards per-layer during sync; reduces peak memory
Optional settings:
actor_rollout_ref:
  model:
    lora_adapter_path: /path/to/adapter  # load a pre-trained adapter for multi-stage training
    lora:
      merge: false  # false = sync adapter deltas natively (recommended for vLLM)
                    # true = merge into base weights before sync (required for SGLang currently)
    use_shm: true   # preload model into /dev/shm for faster weight loading
Reference configuration for Qwen2.5-72B on 8×80 GB GPUs, expressed as command-line overrides appended to the training launch command:
data.train_batch_size=64 \
actor_rollout_ref.model.use_shm=True \
actor_rollout_ref.model.lora_rank=32 \
actor_rollout_ref.model.lora_alpha=32 \
actor_rollout_ref.model.target_modules=all-linear \
actor_rollout_ref.actor.optim.lr=3e-5 \
actor_rollout_ref.actor.fsdp_config.fsdp_size=8 \
actor_rollout_ref.actor.fsdp_config.param_offload=True \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
actor_rollout_ref.rollout.tensor_model_parallel_size=8 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.rollout.max_num_seqs=64 \
actor_rollout_ref.rollout.max_model_len=1536 \
actor_rollout_ref.rollout.max_num_batched_tokens=1536 \
actor_rollout_ref.rollout.load_format=safetensors \
actor_rollout_ref.rollout.layered_summon=True \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
actor_rollout_ref.actor.ulysses_sequence_parallel_size=1

Megatron Backend Configuration

The Megatron backend uses Megatron-Bridge’s native LoRA implementation rather than HuggingFace PEFT. The FSDP-specific keys model.lora_rank and model.lora_alpha are ignored here; set the corresponding values under the model.lora.* namespace instead. Requires Megatron-Bridge ≥ 0.2.0.
actor_rollout_ref:
  actor:
    megatron:
      use_mbridge: true
      vanilla_mbridge: false
  model:
    lora:
      type: lora              # "lora", "vlm_lora", "canonical_lora", or "dora"
      rank: 32
      alpha: 32
      dropout: 0.0
      dropout_position: pre
      merge: false            # false = load separate adapter deltas
      target_modules:
        - linear_qkv          # fused Q/K/V projection
        - linear_proj         # self-attention output projection
        - linear_fc1          # first MLP layer
        - linear_fc2          # second MLP layer
      exclude_modules: []
      lora_A_init_method: xavier
      lora_B_init_method: zero
      adapter_path: null      # set to load a pre-trained adapter
For MLA architectures (e.g. DeepSeek), replace linear_qkv with ["linear_kv_down_proj", "linear_kv_up_proj", "linear_q_down_proj", "linear_q_up_proj", "linear_q_proj"]. MoE routers are excluded from LoRA by default.

Example Scripts

FSDP LoRA Training

examples/tuning/lora/run_qwen3_8b_fsdp.sh — GRPO with LoRA from scratch on Qwen3-8B

FSDP LoRA from Adapter

examples/tuning/lora/run_qwen3_8b_from_adapter_fsdp.sh — resume from a saved adapter

VLM LoRA

examples/tuning/lora/run_qwen2_5_vl_7b_fsdp.sh — LoRA for vision-language models

Megatron MoE LoRA

examples/tuning/lora/run_qwen3_30b_a3b_megatron.sh — LoRA with MoE Megatron backend

FSDP2 Memory Optimization

FSDP2 is the recommended training strategy for new projects. According to PyTorch TorchTitan benchmarks, it provides 7% lower GPU memory usage and a 1.5% throughput improvement with BF16 compared with FSDP1, along with better composability with DTensor and torch.compile. Enable FSDP2 across all training roles:
actor_rollout_ref:
  ref:
    strategy: fsdp2
  actor:
    strategy: fsdp2
critic:
  strategy: fsdp2

CPU Offloading (FSDP2 only)

FSDP2 CPU offloading is compatible with gradient accumulation (unlike FSDP1). Enable parameter and/or optimizer offload to free GPU memory:
actor_rollout_ref:
  actor:
    fsdp_config:
      param_offload: true
      optimizer_offload: true
  ref:
    fsdp_config:
      param_offload: true

Activation Offloading

For further memory reduction during forward passes, enable activation offloading together with gradient checkpointing (FSDP backend only):
actor_rollout_ref:
  model:
    enable_gradient_checkpointing: true
    enable_activation_offload: true
critic:
  model:
    enable_gradient_checkpointing: true
    enable_activation_offload: true

Sequence Packing (Remove Padding)

Variable-length sequences waste compute on padding tokens. Enabling use_remove_padding packs sequences into contiguous token streams, removing all padding overhead:
actor_rollout_ref:
  model:
    use_remove_padding: true
critic:
  model:
    use_remove_padding: true
Supported for Llama, Mistral, Gemma (v1), and Qwen-based models (via the transformers sequence packing implementation). To validate support for a new model architecture, run:
pytest -s tests/models/test_transformer.py
If the test passes, add the model to verl/models/registry.py.
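To illustrate what sequence packing does to a batch, here is a minimal sketch in plain Python. It is not verl's implementation: the function name pack_sequences and the toy token IDs are hypothetical, and real implementations operate on tensors and pass the cumulative boundaries to a variable-length attention kernel.

```python
from itertools import accumulate


def pack_sequences(input_ids, attention_mask):
    """Drop padded positions and return (flat_tokens, cu_seqlens)."""
    flat_tokens = []
    seq_lens = []
    for ids, mask in zip(input_ids, attention_mask):
        kept = [tok for tok, m in zip(ids, mask) if m == 1]
        flat_tokens.extend(kept)
        seq_lens.append(len(kept))
    # Cumulative boundaries: sequence i occupies flat_tokens[cu[i]:cu[i + 1]].
    cu_seqlens = [0] + list(accumulate(seq_lens))
    return flat_tokens, cu_seqlens


PAD = 0
input_ids = [
    [11, 12, 13, PAD, PAD],
    [21, 22, 23, 24, 25],
    [31, PAD, PAD, PAD, PAD],
]
attention_mask = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
]

flat, cu = pack_sequences(input_ids, attention_mask)
print(flat)  # [11, 12, 13, 21, 22, 23, 24, 25, 31]
print(cu)    # [0, 3, 8, 9]
```

In this toy batch the padded layout holds 15 slots while the packed stream holds 9 tokens, so 6 positions of wasted compute are removed.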

Dynamic Batch Size

Instead of fixing the number of samples per micro-batch, use_dynamic_bsz fills each micro-batch up to a configurable token budget per GPU. Micro-batch sizes then adapt to sequence length, which improves throughput for variable-length workloads:
actor_rollout_ref:
  actor:
    use_dynamic_bsz: true
    ppo_max_token_len_per_gpu: 32000  # at least 2× (max_prompt + max_response)
  ref:
    log_prob_max_token_len_per_gpu: 48000  # forward-only; can be larger
critic:
  use_dynamic_bsz: true
  ppo_max_token_len_per_gpu: 64000  # critic can use larger limits than actor
When use_dynamic_bsz=true, you do not need to tune *micro_batch_size_per_gpu — tune the *max_token_len_per_gpu parameters instead.
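The token-budget idea can be sketched with a simple greedy grouping. This is an illustration under assumptions, not verl's actual partitioning logic, which may balance micro-batches differently; pack_by_token_budget and the sample lengths are hypothetical.

```python
def pack_by_token_budget(seq_lens, max_token_len):
    """Greedily group sequence indices so each group's token total stays within max_token_len."""
    micro_batches = []
    current, current_tokens = [], 0
    for idx, n in enumerate(seq_lens):
        if n > max_token_len:
            raise ValueError(
                f"sequence {idx} has {n} tokens, which exceeds the budget of {max_token_len}"
            )
        # Start a new micro-batch when adding this sequence would exceed the budget.
        if current and current_tokens + n > max_token_len:
            micro_batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += n
    if current:
        micro_batches.append(current)
    return micro_batches


seq_lens = [3000, 12000, 800, 15000, 9000, 500, 20000]
batches = pack_by_token_budget(seq_lens, max_token_len=32000)
print(batches)  # [[0, 1, 2, 3], [4, 5, 6]]
print([sum(seq_lens[i] for i in b) for b in batches])  # [30800, 29500]
```

The number of samples per micro-batch varies (four, then three), while the token count per micro-batch stays close to the budget, which is what keeps GPU memory use predictable.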

Ulysses Sequence Parallelism

For long-context training where sequences do not fit on a single GPU, Ulysses sequence parallelism splits each sequence across multiple GPUs for attention computation:
actor_rollout_ref:
  actor:
    ulysses_sequence_parallel_size: 2  # split each sequence across 2 GPUs
  ref:
    ulysses_sequence_parallel_size: 2
critic:
  ulysses_sequence_parallel_size: 2
Different model roles can use different ulysses_sequence_parallel_size values. For sequences longer than 32k tokens, also reduce *micro_batch_size_per_gpu and *max_token_len_per_gpu to avoid OOM.
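The per-GPU footprint under sequence parallelism can be reasoned about with shape bookkeeping alone. The sketch below assumes the layout of the original DeepSpeed-Ulysses design: outside attention each GPU holds a slice of the sequence with all heads, and inside attention an all-to-all exchange gives each GPU the full sequence for a slice of the heads. The helper name and the example sizes are hypothetical, and the divisibility requirement is a simplification (real implementations may pad).

```python
def ulysses_shapes(seq_len: int, num_heads: int, sp_size: int) -> dict:
    """Per-GPU (tokens, heads) held outside and inside attention for a given parallel size."""
    if seq_len % sp_size != 0 or num_heads % sp_size != 0:
        raise ValueError("this sketch assumes seq_len and num_heads are divisible by sp_size")
    return {
        # Sequence is sharded; every GPU sees all attention heads.
        "outside_attention": (seq_len // sp_size, num_heads),
        # After the all-to-all: full sequence, but only a subset of heads per GPU.
        "inside_attention": (seq_len, num_heads // sp_size),
    }


shapes = ulysses_shapes(seq_len=32768, num_heads=32, sp_size=2)
print(shapes["outside_attention"])  # (16384, 32)
print(shapes["inside_attention"])   # (32768, 16)
```

Under these assumptions, the number of attention heads bounds the usable ulysses_sequence_parallel_size, and activations outside attention shrink in proportion to it.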

Memory Optimization for Entropy Computation

The logits tensor (shape [bsz×seq_len, vocab_size]) consumes significant memory during entropy computation. Two options reduce this peak.
Chunked entropy computation (forward pass only):
actor_rollout_ref:
  ref:
    entropy_from_logits_with_chunking: true
    entropy_from_logits_chunk_size: 2048
Entropy checkpointing (recompute during backward, for training):
actor_rollout_ref:
  actor:
    entropy_checkpointing: true
Note that standard enable_gradient_checkpointing does not apply to entropy calculations, so entropy_checkpointing is needed separately for training-phase memory reduction.
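The chunking idea can be shown with a small standard-library sketch. It computes the entropy of softmax(logits) per row as logsumexp(z) minus the probability-weighted mean of z, and processes rows in blocks of chunk_size. This is an assumption-laden illustration rather than verl's code: in a tensor framework the benefit comes from materializing intermediate softmax values for only [chunk_size, vocab_size] elements at a time, which plain Python lists do not reproduce.

```python
import math


def entropy_from_logits(rows):
    """Entropy in nats of softmax(row) for each row of logits."""
    out = []
    for row in rows:
        m = max(row)  # subtract the max for numerical stability
        exps = [math.exp(z - m) for z in row]
        total = sum(exps)
        log_sum_exp = m + math.log(total)
        expected_logit = sum(e * z for e, z in zip(exps, row)) / total
        out.append(log_sum_exp - expected_logit)
    return out


def entropy_from_logits_with_chunking(rows, chunk_size):
    """Same result as entropy_from_logits, computed chunk_size rows at a time."""
    out = []
    for start in range(0, len(rows), chunk_size):
        out.extend(entropy_from_logits(rows[start:start + chunk_size]))
    return out


logits = [
    [0.0, 0.0, 0.0, 0.0],    # uniform over 4 classes: entropy = ln(4)
    [10.0, 0.0, 0.0, 0.0],   # peaked: low entropy
    [1.0, 2.0, 3.0, 4.0],
]
full = entropy_from_logits(logits)
chunked = entropy_from_logits_with_chunking(logits, chunk_size=2)
print(all(math.isclose(a, b) for a, b in zip(full, chunked)))  # True
```

Chunking changes only the peak memory of the computation, not its result, so the chunk size can be tuned purely for memory and speed.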
