verl (Volcano Engine Reinforcement Learning) is an open-source RL training library for large language models (LLMs), implementing the HybridFlow paper. It combines a hybrid programming model with state-of-the-art training and inference backends to make RL post-training both accessible and production-ready.
Installation
Set up verl with Docker or pip, and choose your training and inference backends.
Quickstart
Run your first PPO training job on GSM8K in minutes.
HybridFlow Concepts
Understand the programming model that powers verl’s flexibility.
Algorithm Reference
Explore PPO, GRPO, DAPO, RLOO, and the full algorithm library.
Why verl?
verl is designed around three principles: flexibility, efficiency, and production-readiness.

Flexible by design. The HybridFlow programming model lets you express complex RL training dataflows (rollout, advantage computation, policy updates) in a few lines of Python. Algorithms like PPO, GRPO, DAPO, RLOO, and ReMax are all first-class citizens. Extending to a new algorithm means implementing a new dataflow, not rewriting the infrastructure.

High throughput. verl integrates SOTA training backends (FSDP, FSDP2, Megatron-LM) with SOTA inference backends (vLLM, SGLang). The 3D-HybridEngine reshards actor model weights between training and generation phases to eliminate memory redundancy and reduce communication overhead.

Production-ready. verl scales from single-GPU experiments to clusters of hundreds of GPUs, supports models up to 671B parameters with expert parallelism, and runs on NVIDIA, AMD (ROCm), and Ascend hardware.

Key Features
Multiple RL Algorithms
PPO, GRPO, DAPO, RLOO, ReMax, REINFORCE++, SPIN, SPPO, and more — all configurable via YAML.
Flexible Training Backends
FSDP, FSDP2, and Megatron-LM for training, with automatic weight resharding between phases.
Best-in-Class Inference
vLLM and SGLang rollout backends with tensor parallelism, paged attention, and continuous batching.
Multi-turn & Agentic RL
Server-based async rollout, tool calling, LangGraph integration, and multi-turn conversation support.
VLM Support
Vision-language model RL with Qwen2.5-VL, Kimi-VL, and multi-modal reward functions.
Broad Hardware Support
NVIDIA GPUs, AMD ROCm (MI300X/MI325X/MI355X), and Ascend NPUs are all supported.
Supported Algorithms
verl provides first-class implementations of the following RL algorithms:

| Algorithm | Description |
|---|---|
| PPO | Proximal Policy Optimization with GAE, critic model, KL control |
| GRPO | Group Relative Policy Optimization — critic-free, group-based advantage |
| DAPO | Decoupled Clip and Dynamic Sampling Policy Optimization — SOTA open-source RL |
| RLOO | REINFORCE Leave-One-Out baseline |
| ReMax | Reward-maximizing baseline with greedy rollouts |
| REINFORCE++ | Improved REINFORCE with variance reduction |
| SPIN | Self-play fine-tuning |
| SPPO | Self-play preference optimization |
| DrGRPO | GRPO variant eliminating length bias |
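
The critic-free rows in the table (GRPO and RLOO) both replace a learned value function with a baseline computed from a group of responses to the same prompt. The sketch below illustrates the two baselines on plain Python lists. The function names are hypothetical and are not verl's API, and the choice of population standard deviation plus a small epsilon is an assumption that may differ from verl's implementation.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: center each reward on the group mean
    and scale by the group standard deviation (assumed population std)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards):
    """Leave-one-out advantages: the baseline for sample i is the mean
    reward of all *other* samples in the same group."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Four sampled responses to one prompt, scored 1.0 (correct) or 0.0 (wrong).
group_rewards = [1.0, 0.0, 0.0, 1.0]
print(grpo_advantages(group_rewards))
print(rloo_advantages(group_rewards))
```

In both cases correct responses receive positive advantages and incorrect ones negative, without training a critic model.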
Getting Started
Install verl
Pull the official Docker image or install from source. See Installation.
Prepare your dataset
Convert your dataset to Parquet format with prompt/answer fields. See Data Preparation.
Implement a reward function
Write a scoring function for your task, or use a pre-built one. See Reward Functions.
Launch training
Run PPO or GRPO with a YAML config. See Quickstart.
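
The dataset and reward steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not verl's actual schema or API: the `prompt` and `answer` column names simply follow the wording of the dataset step, all function names are hypothetical, and the Parquet writer assumes pandas with a Parquet engine such as pyarrow is installed. The exact record layout and reward-function signature verl expects are described in Data Preparation and Reward Functions.

```python
import re


def build_records(examples):
    """Turn (question, answer) pairs into rows with prompt/answer fields."""
    return [{"prompt": question, "answer": answer} for question, answer in examples]


def write_parquet(records, path):
    """Write rows to a Parquet file (assumes pandas and pyarrow are available)."""
    import pandas as pd  # imported lazily so the rest of the sketch has no dependencies

    pd.DataFrame(records).to_parquet(path, index=False)


def extract_final_number(text):
    """Return the last number that appears in the text, or None if there is none."""
    # Drop commas so that thousands separators such as "1,000" parse as one number.
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None


def score(response, answer):
    """Hypothetical rule-based reward: 1.0 if the final number matches, else 0.0."""
    predicted = extract_final_number(response)
    if predicted is None:
        return 0.0
    return 1.0 if float(predicted) == float(answer) else 0.0


records = build_records([("What is 6 * 7?", "42")])
print(records)
print(score("6 * 7 = 42", records[0]["answer"]))
print(score("I think it is 41.", records[0]["answer"]))
# To persist the rows: write_parquet(records, "train.parquet")
```

A rule-based scorer like this suits tasks with a single verifiable answer; tasks without one would need a different scoring approach.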