Training large language models with RL algorithms like PPO and GRPO is memory-intensive: the actor, reference policy, critic, and reward model must all fit in GPU memory simultaneously alongside the rollout engine’s KV cache. verl provides several orthogonal memory optimization techniques that can be combined to scale RL training to very large models on modest hardware.
LoRA for RL Fine-Tuning
LoRA (Low-Rank Adaptation) injects trainable low-rank matrices into the pre-trained weights of linear layers, dramatically reducing the number of trainable parameters. In RL training this enables:
- Fine-tuning 70B+ models on 8×80 GB GPUs
- Larger effective batch sizes due to reduced optimizer state memory
- Simpler deployment: only LoRA adapter weights need to be saved and served
- Compatibility with multi-adapter serving techniques like SLoRA and CCoE
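To make the parameter savings concrete, here is a minimal arithmetic sketch. It assumes a single hypothetical 4096×4096 linear layer and ignores biases; the function names are illustrative and not part of verl or PEFT.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def dense_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense weight (bias ignored)."""
    return d_in * d_out


# Hypothetical layer size, chosen only for illustration.
d_in, d_out, rank = 4096, 4096, 32
added = lora_trainable_params(d_in, d_out, rank)
dense = dense_params(d_in, d_out)
fraction = added / dense

print(f"LoRA adds {added:,} trainable params vs {dense:,} frozen ({fraction:.2%})")
```

Only the adapter parameters need gradients and optimizer state, which is where the batch-size headroom comes from.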
LoRA introduces a tradeoff: very small ranks hurt convergence. A lora_rank of 32 is recommended as a minimum. For a 0.5B model at rank 32, convergence speed and final performance are nearly identical to full fine-tuning. For a 32B model, rank 128 achieves the same result. Increase the learning rate by approximately 10× when using LoRA.
FSDP Backend Configuration
LoRA is supported via HuggingFace PEFT with both FSDP and FSDP2 backends, and works with both vLLM and SGLang rollout backends.
Megatron Backend Configuration
The Megatron backend uses Megatron-Bridge’s native LoRA implementation (not HuggingFace PEFT). FSDP-specific keys like lora_rank and lora_alpha at the model.* level are ignored; use the lora.* namespace instead.
Requires Megatron-Bridge ≥ 0.2.0.
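As a rough illustration of how a target-module list might be assembled for the Megatron backend, the sketch below swaps linear_qkv for the MLA projections named in this section. The default list and the helper name are assumptions made for illustration; confirm the actual defaults against your Megatron-Bridge version.

```python
# Assumed default target modules for a standard attention/MLP block.
# This list is a hypothetical starting point, not a confirmed default.
DEFAULT_TARGET_MODULES = ["linear_qkv", "linear_proj", "linear_fc1", "linear_fc2"]

# Projections that stand in for linear_qkv on MLA architectures (e.g. DeepSeek).
MLA_ATTENTION_MODULES = [
    "linear_kv_down_proj",
    "linear_kv_up_proj",
    "linear_q_down_proj",
    "linear_q_up_proj",
    "linear_q_proj",
]


def target_modules_for(is_mla: bool, base=DEFAULT_TARGET_MODULES):
    """Return LoRA target modules, replacing linear_qkv when the model uses MLA."""
    if not is_mla:
        return list(base)
    modules = []
    for name in base:
        if name == "linear_qkv":
            modules.extend(MLA_ATTENTION_MODULES)
        else:
            modules.append(name)
    return modules


print(target_modules_for(is_mla=True))
```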
For MLA architectures (e.g. DeepSeek), replace linear_qkv with ["linear_kv_down_proj", "linear_kv_up_proj", "linear_q_down_proj", "linear_q_up_proj", "linear_q_proj"]. MoE routers are excluded from LoRA by default.
Example Scripts
FSDP LoRA Training
examples/tuning/lora/run_qwen3_8b_fsdp.sh: GRPO with LoRA from scratch on Qwen3-8B
FSDP LoRA from Adapter
examples/tuning/lora/run_qwen3_8b_from_adapter_fsdp.sh: resume from a saved adapter
VLM LoRA
examples/tuning/lora/run_qwen2_5_vl_7b_fsdp.sh: LoRA for vision-language models
Megatron MoE LoRA
examples/tuning/lora/run_qwen3_30b_a3b_megatron.sh: LoRA for MoE models with the Megatron backend
FSDP2 Memory Optimization
FSDP2 is the recommended training strategy for new projects. According to PyTorch TorchTitan benchmarks, it provides 7% lower GPU memory usage and a 1.5% throughput improvement with BF16, along with better composability with DTensor and torch.compile.
Enable FSDP2 by selecting it as the training strategy for every training role.
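A minimal sketch of what the per-role settings might look like, written as a helper that emits Hydra-style command-line overrides. The key prefixes and the `strategy` key are assumptions about verl's config layout, so check them against the config files shipped with your verl version before use.

```python
# Assumed config prefixes for each training role (verify against your verl version).
ROLE_PREFIXES = [
    "actor_rollout_ref.actor",
    "actor_rollout_ref.ref",
    "critic",
    "reward_model",
]


def strategy_overrides(strategy: str, prefixes=ROLE_PREFIXES):
    """Build one '<prefix>.strategy=<strategy>' override per training role."""
    return [f"{prefix}.strategy={strategy}" for prefix in prefixes]


overrides = strategy_overrides("fsdp2")
# Print in a form that can be pasted after a launch command.
print(" \\\n    ".join(overrides))
```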
CPU Offloading (FSDP2 only)
FSDP2 CPU offloading is compatible with gradient accumulation (unlike FSDP1). Enable parameter and/or optimizer offload to free GPU memory.
Activation Offloading
For further memory reduction during forward passes, enable activation offloading together with gradient checkpointing (FSDP backend only).
Sequence Packing (Remove Padding)
Variable-length sequences waste compute on padding tokens. Enabling use_remove_padding packs sequences into contiguous token streams, removing all padding overhead.
Model support for sequence packing is defined in verl/models/registry.py.
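The packing idea can be sketched in a few lines. This is a conceptual illustration, not verl's implementation: it concatenates sequences into one token stream and records cumulative sequence lengths (the `cu_seqlens` convention used by variable-length attention kernels) so that sequence boundaries remain recoverable.

```python
from itertools import accumulate


def pad_batch(seqs, pad_id=0):
    """Pad every sequence to the longest length in the batch."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in seqs]


def pack_batch(seqs):
    """Concatenate sequences and record the boundaries as cumulative lengths."""
    tokens = [tok for s in seqs for tok in s]
    cu_seqlens = [0] + list(accumulate(len(s) for s in seqs))
    return tokens, cu_seqlens


# Toy token ids, chosen only for illustration.
seqs = [[11, 12, 13], [21], [31, 32, 33, 34, 35]]

padded = pad_batch(seqs)
tokens, cu_seqlens = pack_batch(seqs)

padded_tokens = sum(len(row) for row in padded)
packed_tokens = len(tokens)
print(f"padded: {padded_tokens} tokens, packed: {packed_tokens} tokens")
print("cu_seqlens:", cu_seqlens)
```

Sequence i occupies `tokens[cu_seqlens[i]:cu_seqlens[i + 1]]`, so no padding positions are ever processed.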
Dynamic Batch Size
Instead of specifying a fixed number of samples per micro-batch, use_dynamic_bsz fills each micro-batch up to a configurable token budget per GPU. This makes micro-batch sizes adaptive to sequence length, improving throughput for variable-length workloads.
With use_dynamic_bsz=true, you do not need to tune *micro_batch_size_per_gpu; tune the *max_token_len_per_gpu parameters instead.
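The token-budget idea can be illustrated with a simple greedy grouping. This is a sketch under stated assumptions: the function name is hypothetical, and verl's actual partitioning may balance micro-batches differently.

```python
def dynamic_micro_batches(seq_lens, max_token_len):
    """Greedily group sequence indices so each group stays within the token budget."""
    batches, current, current_tokens = [], [], 0
    for idx, length in enumerate(seq_lens):
        if length > max_token_len:
            raise ValueError(f"sequence {idx} ({length} tokens) exceeds the budget")
        # Start a new micro-batch when adding this sequence would overflow the budget.
        if current and current_tokens + length > max_token_len:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += length
    if current:
        batches.append(current)
    return batches


# Hypothetical sequence lengths in tokens.
seq_lens = [900, 100, 4000, 300, 250, 3500]
batches = dynamic_micro_batches(seq_lens, max_token_len=4096)
print(batches)
```

Short sequences end up sharing a micro-batch while a long one gets its own, so the per-step token count stays close to the budget regardless of how lengths are distributed.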
Ulysses Sequence Parallelism
For long-context training where sequences do not fit on a single GPU, Ulysses sequence parallelism splits each sequence across multiple GPUs for attention computation. The degree of splitting is set through the ulysses_sequence_parallel_size values. For sequences longer than 32k tokens, also reduce *micro_batch_size_per_gpu and *max_token_len_per_gpu to avoid OOM.
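To show what the split means for per-GPU tensor sizes, the sketch below follows the general DeepSpeed-Ulysses scheme: activations are sharded along the sequence dimension outside attention, then an all-to-all exchange re-shards them along the head dimension so each GPU attends over the full sequence for a subset of heads. The function and the example sizes are illustrative assumptions, not verl code.

```python
def ulysses_shapes(seq_len: int, num_heads: int, sp_size: int):
    """Per-GPU (tokens, heads) before and after the all-to-all around attention."""
    if seq_len % sp_size or num_heads % sp_size:
        raise ValueError("this sketch requires seq_len and num_heads divisible by sp_size")
    return {
        # Outside attention: each GPU holds a slice of the sequence, all heads.
        "outside_attention": (seq_len // sp_size, num_heads),
        # Inside attention: each GPU holds the full sequence, a slice of the heads.
        "inside_attention": (seq_len, num_heads // sp_size),
    }


# Hypothetical model and sequence sizes.
shapes = ulysses_shapes(seq_len=32768, num_heads=32, sp_size=4)
print(shapes)
```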
Memory Optimization for Entropy Computation
The logits tensor (shape [bsz×seq_len, vocab_size]) consumes significant memory during entropy computation. Two options reduce this peak:
Chunked entropy computation (forward pass only):
enable_gradient_checkpointing does not apply to entropy calculations, so entropy_checkpointing is needed separately for training-phase memory reduction.
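To illustrate why the logits tensor dominates and how chunking helps, here is a conceptual pure-Python sketch. It is not verl's implementation: the sizes are hypothetical, and the point is only that softmax temporaries are created for one chunk of rows at a time rather than for the whole tensor.

```python
import math


def logits_bytes(rows: int, vocab_size: int, bytes_per_elem: int) -> int:
    """Memory needed to hold a [rows, vocab_size] logits tensor."""
    return rows * vocab_size * bytes_per_elem


def row_entropy(logits):
    """Entropy of softmax(logits), using H = log Z - sum(p_i * logit_i)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    log_z = m + math.log(z)
    return log_z - sum((e / z) * x for e, x in zip(exps, logits))


def chunked_entropy(logit_rows, chunk_size: int):
    """Compute per-row entropy chunk by chunk, so temporaries cover one chunk only."""
    entropies = []
    for start in range(0, len(logit_rows), chunk_size):
        chunk = logit_rows[start:start + chunk_size]
        entropies.extend(row_entropy(row) for row in chunk)
    return entropies


# Hypothetical sizes: 16384 token positions, 128000-entry vocabulary, fp32.
full = logits_bytes(16384, 128000, 4)
print(f"full logits: {full / 2**30:.4f} GiB")

rows = [[0.0, 0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [5.0, -5.0]]
print(chunked_entropy(rows, chunk_size=2))
```

Chunking changes only the peak size of the intermediates, not the result, so the chunk size can be tuned purely for memory.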