verl (Volcano Engine Reinforcement Learning) is an open-source RL training library for large language models (LLMs), implementing the HybridFlow paper. It combines a hybrid programming model with state-of-the-art training and inference backends to make RL post-training both accessible and production-ready.

Installation

Set up verl with Docker or pip, then choose your training and inference backends.

Quickstart

Run your first PPO training job on GSM8K in minutes.

HybridFlow Concepts

Understand the programming model that powers verl’s flexibility.

Algorithm Reference

Explore PPO, GRPO, DAPO, RLOO, and the full algorithm library.

Why verl?

verl is designed around three principles: flexibility, efficiency, and production-readiness.

Flexible by design. The HybridFlow programming model lets you express complex RL training dataflows (rollout, advantage computation, policy updates) in a few lines of Python. Algorithms like PPO, GRPO, DAPO, RLOO, and ReMax are all first-class citizens. Extending to a new algorithm means implementing a new dataflow, not rewriting the infrastructure.

High throughput. verl integrates SOTA training backends (FSDP, FSDP2, Megatron-LM) with SOTA inference backends (vLLM, SGLang). The 3D-HybridEngine reshards actor model weights between training and generation phases to eliminate memory redundancy and reduce communication overhead.

Production-ready. verl scales from single-GPU experiments to clusters of hundreds of GPUs, supports models up to 671B parameters with expert parallelism, and runs on NVIDIA, AMD (ROCm), and Ascend hardware.
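To make the idea of an RL training dataflow concrete, here is a minimal, framework-free sketch of the three stages named above: rollout, advantage computation, and policy update. It does not use verl's API. Every name in it (`Sample`, `rollout`, `score`, `compute_advantages`, `train_step`) is hypothetical, and the batch-mean baseline is a stand-in for whatever advantage estimator an algorithm defines.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    """One generated response plus the values attached to it later in the flow."""
    prompt: str
    response: str
    reward: float = 0.0
    advantage: float = 0.0


def rollout(prompts: List[str], generate: Callable[[str], str], n: int) -> List[Sample]:
    """Generation stage: draw n responses for every prompt."""
    return [Sample(p, generate(p)) for p in prompts for _ in range(n)]


def score(batch: List[Sample], reward_fn: Callable[[str, str], float]) -> List[Sample]:
    """Reward stage: attach a scalar reward to each sample."""
    for s in batch:
        s.reward = reward_fn(s.prompt, s.response)
    return batch


def compute_advantages(batch: List[Sample]) -> List[Sample]:
    """Advantage stage: subtract a batch-mean baseline (illustrative choice only)."""
    baseline = sum(s.reward for s in batch) / len(batch)
    for s in batch:
        s.advantage = s.reward - baseline
    return batch


def train_step(
    prompts: List[str],
    generate: Callable[[str], str],
    reward_fn: Callable[[str, str], float],
    update_policy: Callable[[List[Sample]], None],
    n: int = 4,
) -> List[Sample]:
    """One iteration of the dataflow: rollout -> score -> advantages -> update."""
    batch = compute_advantages(score(rollout(prompts, generate, n), reward_fn))
    update_policy(batch)
    return batch


if __name__ == "__main__":
    # Toy stand-ins: "generation" repeats the prompt, reward is response length.
    out = train_step(["a", "bb"], lambda p: p * 2, lambda p, r: float(len(r)), lambda b: None, n=2)
    print([(s.response, s.advantage) for s in out])
```

In this sketch, switching algorithms would mean replacing `compute_advantages` or `update_policy` while the rest of the loop stays the same. That is the sense in which a new algorithm is a new dataflow and not new infrastructure.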

Key Features

Multiple RL Algorithms

PPO, GRPO, DAPO, RLOO, ReMax, REINFORCE++, SPIN, SPPO, and more — all configurable via YAML.

Flexible Training Backends

FSDP, FSDP2, and Megatron-LM for training, with automatic weight resharding between phases.

Best-in-Class Inference

vLLM and SGLang rollout backends with tensor parallelism, paged attention, and continuous batching.

Multi-turn & Agentic RL

Server-based async rollout, tool calling, LangGraph integration, and multi-turn conversation support.

VLM Support

Vision-language model RL with Qwen2.5-VL, Kimi-VL, and multi-modal reward functions.

Broad Hardware Support

NVIDIA GPUs, AMD ROCm (MI300X/MI325X/MI355X), and Ascend NPUs are all supported.

Supported Algorithms

verl provides first-class implementations of the following RL algorithms:

| Algorithm | Description |
| --- | --- |
| PPO | Proximal Policy Optimization with GAE, critic model, KL control |
| GRPO | Group Relative Policy Optimization: critic-free, group-based advantage |
| DAPO | Decoupled Clip and Dynamic Sampling Policy Optimization: SOTA open-source RL |
| RLOO | REINFORCE Leave-One-Out baseline |
| ReMax | Reward-maximizing baseline with greedy rollouts |
| REINFORCE++ | Improved REINFORCE with variance reduction |
| SPIN | Self-play fine-tuning |
| SPPO | Self-play preference optimization |
| DrGRPO | GRPO variant eliminating length bias |
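
Two of the critic-free entries above differ mainly in how they turn a group of rewards for the same prompt into advantages. The sketch below shows the textbook form of each, not verl's implementation. The function names are hypothetical. The use of the population standard deviation and the `eps` value are assumptions, and implementations differ on both.

```python
from statistics import mean, pstdev
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantage: standardize each reward within its group.

    A_i = (r_i - mean(r)) / (std(r) + eps)
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantage: baseline is the mean of the *other* rewards.

    A_i = r_i - (sum(r) - r_i) / (k - 1)
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two responses per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


if __name__ == "__main__":
    group = [1.0, 0.0, 0.0, 1.0]  # e.g. correct/incorrect rewards for 4 samples
    print(grpo_advantages(group))
    print(rloo_advantages(group))
```

Neither function needs a learned value model, which is what "critic-free" refers to: the baseline comes from sibling samples of the same prompt.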

Getting Started

1. Install verl

Pull the official Docker image or install from source. See Installation.

2. Prepare your dataset

Convert your dataset to Parquet format with prompt/answer fields. See Data Preparation.

3. Implement a reward function

Write a scoring function for your task, or use a pre-built one. See Reward Functions.

4. Launch training

Run PPO or GRPO with a YAML config. See Quickstart.
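
Steps 2 and 3 above can be sketched roughly as follows for a math-style task. This is an illustration under stated assumptions, not the schema or reward interface verl requires: the field names `prompt` and `answer` simply mirror the wording above, and the helper names (`build_rows`, `write_parquet`, `score_response`) are hypothetical. See Data Preparation and Reward Functions for the actual formats. Writing Parquet assumes pandas and a Parquet engine such as pyarrow are installed.

```python
import re
from typing import Dict, Iterable, List, Tuple

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def build_rows(examples: Iterable[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Turn (question, reference answer) pairs into prompt/answer records."""
    return [{"prompt": q, "answer": a} for q, a in examples]


def write_parquet(rows: List[Dict[str, str]], path: str) -> None:
    """Write records to a Parquet file (requires pandas plus pyarrow or fastparquet)."""
    import pandas as pd  # imported lazily so the rest of the sketch is stdlib-only

    pd.DataFrame(rows).to_parquet(path, index=False)


def score_response(response: str, answer: str) -> float:
    """Rule-based reward: 1.0 if the last number in the response equals the answer."""
    numbers = _NUMBER.findall(response.replace(",", ""))
    if not numbers:
        return 0.0
    try:
        target = float(answer.replace(",", ""))
    except ValueError:
        return 0.0  # non-numeric reference answers are out of scope for this sketch
    return 1.0 if float(numbers[-1]) == target else 0.0


if __name__ == "__main__":
    rows = build_rows([("What is 6 * 7?", "42")])
    print(rows)
    print(score_response("6 * 7 = 42", rows[0]["answer"]))
    # write_parquet(rows, "train.parquet")  # uncomment once pandas/pyarrow are available
```

A rule-based scorer like this suits tasks with a single checkable answer. Open-ended tasks usually need a different kind of reward.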

Community & Citation

verl is developed by the ByteDance Seed team and maintained by the verl community. It has been adopted by researchers at Alibaba, NVIDIA, UC Berkeley, Tsinghua University, and many others. If you use verl in your research, please cite HybridFlow: A Flexible and Efficient RLHF Framework. Join the community on GitHub, Slack, or WeChat.
