Agentic RL extends standard single-turn RL training by letting the model interact with tools and environments over multiple conversation turns before receiving a final reward signal. verl supports this through a server-based asynchronous rollout architecture that separates the inference engine from agent logic, preventing GPU idle time during tool execution and enabling large-scale multi-turn training.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/verl-project/verl/llms.txt
Use this file to discover all available pages before exploring further.
The inference engine uses a token-based
generate API rather than a standard chat completion API. This is essential for training correctness: text-to-token conversion is not always reversible (e.g. "<think>" re-tokenized differs from the original model output), so training must use the exact tokens produced during rollout to correctly compute advantages.System Architecture
The agentic rollout system has three main components that work together:| Component | Role |
|---|---|
| AgentLoop | Client-side component that implements agent functions and tool orchestration |
| LLMServerClient | Inference gateway that provides the generate interface to the AgentLoop |
| AsyncServer | Server-side component; each instance connects to one DP group of the inference engine |
- SGLang: Uses
async_generateon the engine’s first GPU in each TP group, called via Ray actor - vLLM: Uses the
generateinterface with ZMQ communication to TP group GPUs, callable directly in AsyncServer
Enabling Multi-Turn Rollout
To activate multi-turn mode, set the following in your rollout configuration:Implementing a Custom Tool
Tools are the primary way agents interact with the environment. verl provides two APIs:BaseTool for stateful tools that need lifecycle management, and @function_tool for simple stateless functions.
Using BaseTool
Create a class that extends
verl.tools.base_tool.BaseTool and implement the create, execute, and release lifecycle methods:from typing import Any, Optional
from verl.tools.base_tool import BaseTool, ToolResponse
class MySearchTool(BaseTool):
async def create(self, instance_id: Optional[str] = None, **kwargs) -> tuple[str, ToolResponse]:
# Initialize per-trajectory state (e.g. open a sandbox)
return instance_id, ToolResponse(text="Search tool ready.")
async def execute(
self, instance_id: str, parameters: dict[str, Any], **kwargs
) -> tuple[ToolResponse, float, dict]:
# Execute the tool and return (response, reward, metrics)
# parameters contains the tool call arguments, e.g. {"query": "..."}
result = await search(parameters["query"])
return ToolResponse(text=result), 0.0, {}
async def release(self, instance_id: str, **kwargs) -> None:
# Tear down per-trajectory state
pass
tools:
- class_name: "my_module.MySearchTool"
config:
type: native
tool_schema:
name: search
description: Search the web for information
parameters:
type: object
properties:
query:
type: string
description: The search query
required:
- query
The tool agent loop selects behavior based on a per-sample
agent_name field. Prepare your dataset with this field set:Using @function_tool (Stateless Tools)
For tools that don’t need create/release lifecycle hooks, the @function_tool decorator is the simpler option. verl infers the JSON schema automatically from the function’s type annotations and Google-style docstring:
- Type annotations (
str,int,list[T],Optional[X],Literal["a", "b"]) map to JSON Schema types - Per-parameter descriptions come from the
Args:docstring section - Parameters without defaults are marked
required - Sync functions are dispatched via
asyncio.to_thread; async functions run directly *args/**kwargsare not supported — useparam: list[T]instead
str → ToolResponse(text=...), dict → JSON-serialized ToolResponse, ToolResponse → passed through, (response, reward) or (response, reward, metrics) tuples are accepted.
function_tool_path and tool_config_path can coexist — the AgentLoopWorker merges both registries at startup (name collisions raise an error).
Multi-Modal Tool Outputs
If your tool produces images or videos, return them inside aToolResponse:
return_multi_modal_inputs: false in your dataset config to prevent pre-processed modalities from conflicting with dynamically generated tool outputs:
Multi-Turn Tokenization
Correctly attributing loss masks in multi-turn conversations requires a delta-based tokenization strategy. When the model generates a new assistant message at turni, verl:
- Applies the chat template to messages
[:i]withadd_generation_prompt=True - Applies the chat template to messages
[:i+1]withadd_generation_prompt=False - Tokenizes only the delta between the two strings
<think> reasoning blocks from earlier turns when rendering the chat template. For these models, verl falls back to a fixed two-message base conversation to anchor the delta:
Running the Multi-Turn GSM8K Example
During training you may see
"Failed to decode tool call" messages in the console when the model generates malformed tool call syntax. This is expected behavior and does not indicate a training error — the reward for that step will simply be zero.LangGraph Agent Framework
For more complex agent topologies, verl integrates with LangGraph through aReactAgentLoop adapter:
| Component | Role |
|---|---|
| ChatModel | LangChain LLM object that adapts to the generate API from LLMServerClient |
| ReactAgentLoop | Agent adapter layer with default naive LangGraph support; derive new classes to implement custom agents with a run function |
| AsyncServer | Server connected to one DP group of the inference engine |
recipe/langgraph_agent/example/README.md for a full LangGraph walkthrough.