Agentic RL extends standard single-turn RL training by letting the model interact with tools and environments over multiple conversation turns before receiving a final reward signal. verl supports this through a server-based asynchronous rollout architecture that separates the inference engine from agent logic, preventing GPU idle time during tool execution and enabling large-scale multi-turn training.
The inference engine uses a token-based generate API rather than a standard chat completion API. This is essential for training correctness: text-to-token conversion is not always reversible (e.g. "<think>" re-tokenized differs from the original model output), so training must use the exact tokens produced during rollout to correctly compute advantages.
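The round-trip problem can be illustrated with a toy tokenizer. This is only a sketch: the five-entry vocabulary and the greedy longest-match `encode` below are hypothetical stand-ins for a real tokenizer, chosen so that "<think>" can be spelled either as one token or as three pieces.

```python
# Toy vocabulary (hypothetical): "<think>" exists as a single token
# and can also be spelled from the pieces "<", "think", ">".
VOCAB = ["<think>", "<", "think", ">", "ok"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def decode(token_ids: list[int]) -> str:
    return "".join(VOCAB[i] for i in token_ids)


def encode(text: str) -> list[int]:
    """Greedy longest-match tokenization, standing in for a real tokenizer."""
    ids, pos = [], 0
    by_length = sorted(VOCAB, key=len, reverse=True)
    while pos < len(text):
        match = next(tok for tok in by_length if text.startswith(tok, pos))
        ids.append(TOKEN_TO_ID[match])
        pos += len(match)
    return ids


# Suppose the model sampled "<think>" piece by piece during rollout.
rollout_ids = [TOKEN_TO_ID[t] for t in ["<", "think", ">", "ok"]]
text = decode(rollout_ids)
retokenized_ids = encode(text)

print(text)             # <think>ok
print(rollout_ids)      # [1, 2, 3, 4]
print(retokenized_ids)  # [0, 4]
```

Both ID sequences decode to the same string, but only `rollout_ids` is what the policy actually sampled, so a chat-completion API that returns text alone would lose the information training needs.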

System Architecture

The agentic rollout system has three main components that work together:
| Component | Role |
| --- | --- |
| AgentLoop | Client-side component that implements agent functions and tool orchestration |
| LLMServerClient | Inference gateway that provides the generate interface to the AgentLoop |
| AsyncServer | Server-side component; each instance connects to one DP group of the inference engine |
The AsyncServer has separate implementations for SGLang and vLLM:
  • SGLang: Uses async_generate on the engine’s first GPU in each TP group, called via Ray actor
  • vLLM: Uses the generate interface with ZMQ communication to TP group GPUs, callable directly in AsyncServer
An asyncio coroutine mechanism allows multiple rollout requests to execute concurrently. While one request waits for a tool call to return, other requests continue generating. This avoids GPU idle time and improves throughput, especially when a few tool calls take much longer than the rest.
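The effect of overlapping tool waits can be sketched with plain asyncio. Everything here is a stand-in: `fake_generate` and `fake_tool_call` are hypothetical functions that only sleep, and the real AgentLoop and AsyncServer are not involved.

```python
import asyncio
import time


async def fake_generate(request_id: int) -> str:
    # Stand-in for a call to the inference server.
    await asyncio.sleep(0.01)
    return f"tokens-{request_id}"


async def fake_tool_call(request_id: int) -> str:
    # Stand-in for a slow tool; awaiting hands control back to the event loop.
    await asyncio.sleep(0.1)
    return f"tool-result-{request_id}"


async def rollout(request_id: int) -> list[str]:
    """One trajectory: generate, call a tool, then generate again."""
    first = await fake_generate(request_id)
    observation = await fake_tool_call(request_id)
    second = await fake_generate(request_id)
    return [first, observation, second]


async def run_all(n: int) -> list[list[str]]:
    # gather schedules all trajectories on one event loop, so their waits overlap.
    return await asyncio.gather(*(rollout(i) for i in range(n)))


start = time.perf_counter()
trajectories = asyncio.run(run_all(4))
elapsed = time.perf_counter() - start
print(len(trajectories), f"{elapsed:.2f}s")
```

Run one after another, the four trajectories would need roughly 4 × 0.12 s. Because each `await` yields while a tool is pending, the total stays close to the duration of a single trajectory.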

Enabling Multi-Turn Rollout

To activate multi-turn mode, set the following in your rollout configuration:
actor_rollout_ref:
  rollout:
    multi_turn:
      enable: true
    name: sglang
For the agent loop, two additional options are required:
data:
  return_raw_chat: true

actor_rollout_ref:
  rollout:
    mode: async

Implementing a Custom Tool

Tools are the primary way agents interact with the environment. verl provides two APIs: BaseTool for stateful tools that need lifecycle management, and @function_tool for simple stateless functions.

Using BaseTool

1. Subclass BaseTool

Create a class that extends verl.tools.base_tool.BaseTool and implement the create, execute, and release lifecycle methods:

from typing import Any, Optional
from uuid import uuid4

from verl.tools.base_tool import BaseTool, ToolResponse

class MySearchTool(BaseTool):
    async def create(self, instance_id: Optional[str] = None, **kwargs) -> tuple[str, ToolResponse]:
        # Initialize per-trajectory state (e.g. open a sandbox)
        # Generate an ID when the caller does not supply one, so a str is always returned
        instance_id = instance_id or str(uuid4())
        return instance_id, ToolResponse(text="Search tool ready.")

    async def execute(
        self, instance_id: str, parameters: dict[str, Any], **kwargs
    ) -> tuple[ToolResponse, float, dict]:
        # Execute the tool and return (response, reward, metrics)
        # parameters contains the tool call arguments, e.g. {"query": "..."}
        # `search` is a placeholder for your own async search function
        result = await search(parameters["query"])
        return ToolResponse(text=result), 0.0, {}

    async def release(self, instance_id: str, **kwargs) -> None:
        # Tear down per-trajectory state
        pass

2. Write a tool configuration YAML

Describe your tool’s schema and implementation class:

tools:
  - class_name: "my_module.MySearchTool"
    config:
      type: native
    tool_schema:
      name: search
      description: Search the web for information
      parameters:
        type: object
        properties:
          query:
            type: string
            description: The search query
        required:
          - query

3. Register the tool config in your rollout config

actor_rollout_ref:
  rollout:
    tool_kwargs:
      tools_config_file: /path/to/tools.yaml

4. Add agent_name to your dataset

The tool agent loop selects behavior based on a per-sample agent_name field. Prepare your dataset with this field set:

python examples/data_preprocess/gsm8k_tool_agent_loop.py
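
To see how the create, execute, and release methods described above fit together over one trajectory, here is a minimal sketch that runs without verl installed. `FakeToolResponse`, `CounterTool`, and `run_trajectory` are hypothetical names for illustration; the real agent loop drives these calls for you.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Optional
from uuid import uuid4


@dataclass
class FakeToolResponse:
    # Stand-in for verl's ToolResponse so the sketch is self-contained.
    text: str = ""


class CounterTool:
    """Hypothetical stateful tool that counts calls per trajectory."""

    def __init__(self) -> None:
        self._calls: dict[str, int] = {}

    async def create(self, instance_id: Optional[str] = None, **kwargs) -> tuple[str, FakeToolResponse]:
        instance_id = instance_id or str(uuid4())
        self._calls[instance_id] = 0  # per-trajectory state
        return instance_id, FakeToolResponse(text="ready")

    async def execute(
        self, instance_id: str, parameters: dict[str, Any], **kwargs
    ) -> tuple[FakeToolResponse, float, dict]:
        self._calls[instance_id] += 1
        text = f"echo: {parameters['query']}"
        return FakeToolResponse(text=text), 0.0, {"calls": self._calls[instance_id]}

    async def release(self, instance_id: str, **kwargs) -> None:
        self._calls.pop(instance_id, None)


async def run_trajectory(tool: CounterTool) -> tuple[str, dict]:
    instance_id, _ = await tool.create()
    try:
        await tool.execute(instance_id, {"query": "a"})
        response, _reward, metrics = await tool.execute(instance_id, {"query": "b"})
        return response.text, metrics
    finally:
        # Always release, even if a tool call raised.
        await tool.release(instance_id)


tool = CounterTool()
last_text, last_metrics = asyncio.run(run_trajectory(tool))
print(last_text, last_metrics)  # echo: b {'calls': 2}
```

Keying state by `instance_id` is what lets one tool object serve many concurrent trajectories without their state leaking into each other.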

Using @function_tool (Stateless Tools)

For tools that don’t need create/release lifecycle hooks, the @function_tool decorator is the simpler option. verl infers the JSON schema automatically from the function’s type annotations and Google-style docstring:
from verl.tools.function_tool import function_tool

@function_tool
def get_weather(city: str) -> dict:
    """Get the current weather for a city.

    Args:
        city: The city to look up, e.g. "Tokyo" or "San Francisco".
    """
    return {"temperature_c": 17.3, "condition": "drizzle"}

@function_tool("calculator")  # explicit name overrides function name
def calculator(expression: str) -> str:
    """Evaluate a Python-style arithmetic expression.

    Args:
        expression: A Python-style arithmetic expression, e.g. "(3+4)*5".
    """
    return str(eval(expression, {"__builtins__": {}}, {}))
Configure it in the rollout config:
actor_rollout_ref:
  rollout:
    mode: async
    multi_turn:
      enable: true
      format: hermes
      function_tool_path: path/to/your_tools.py
    agent:
      default_agent_loop: tool_agent
Schema inference rules:
  • Type annotations (str, int, list[T], Optional[X], Literal["a", "b"]) map to JSON Schema types
  • Per-parameter descriptions come from the Args: docstring section
  • Parameters without defaults are marked required
  • Sync functions are dispatched via asyncio.to_thread; async functions run directly
  • *args / **kwargs are not supported — use param: list[T] instead
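The rules above can be approximated in a few lines of standard-library Python. This is a sketch of the idea, not verl's implementation: `annotation_to_schema`, `infer_schema`, and the sample `search` function are hypothetical, and docstring parsing for descriptions is left out.

```python
import inspect
from typing import Literal, Optional, Union, get_args, get_origin, get_type_hints

PRIMITIVES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def annotation_to_schema(annotation) -> dict:
    """Map a type annotation to a JSON Schema fragment."""
    origin, args = get_origin(annotation), get_args(annotation)
    if annotation in PRIMITIVES:
        return {"type": PRIMITIVES[annotation]}
    if origin is list:
        return {"type": "array", "items": annotation_to_schema(args[0])}
    if origin is Literal:
        return {"enum": list(args)}
    if origin is Union and type(None) in args:  # Optional[X]
        inner = [a for a in args if a is not type(None)]
        return annotation_to_schema(inner[0])
    raise TypeError(f"unsupported annotation: {annotation!r}")


def infer_schema(fn) -> dict:
    hints = get_type_hints(fn)
    properties, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        if param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
            raise TypeError("*args / **kwargs are not supported")
        properties[name] = annotation_to_schema(hints[name])
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default means required
    return {
        "name": fn.__name__,
        "parameters": {"type": "object", "properties": properties, "required": required},
    }


def search(
    query: str,
    top_k: int = 5,
    tags: Optional[list[str]] = None,
    mode: Literal["fast", "deep"] = "fast",
) -> str:
    return query


schema = infer_schema(search)
print(schema["parameters"]["required"])  # ['query']
```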
Return value normalization:
  • str → ToolResponse(text=...)
  • dict → JSON-serialized ToolResponse
  • ToolResponse → passed through
  • (response, reward) or (response, reward, metrics) tuples are accepted
function_tool_path and tool_config_path can coexist: the AgentLoopWorker merges both registries at startup, and name collisions raise an error.
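One way to picture the return-value normalization is the sketch below. It is not verl's code: `FakeToolResponse` and `normalize` are hypothetical, and the default reward of 0.0 and empty metrics dict for bare return values are assumptions made for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class FakeToolResponse:
    # Stand-in for verl's ToolResponse so the sketch is self-contained.
    text: str = ""


def normalize(result) -> tuple[FakeToolResponse, float, dict]:
    """Coerce a tool's return value to (response, reward, metrics)."""
    reward, metrics = 0.0, {}  # assumed defaults when the tool returns no tuple
    if isinstance(result, tuple):
        if len(result) == 2:
            result, reward = result
        elif len(result) == 3:
            result, reward, metrics = result
        else:
            raise ValueError("expected (response, reward) or (response, reward, metrics)")
    if isinstance(result, FakeToolResponse):
        response = result
    elif isinstance(result, str):
        response = FakeToolResponse(text=result)
    elif isinstance(result, dict):
        response = FakeToolResponse(text=json.dumps(result))
    else:
        raise TypeError(f"unsupported return type: {type(result).__name__}")
    return response, float(reward), metrics


print(normalize("35"))
print(normalize(({"condition": "drizzle"}, 1, {"latency_ms": 12})))
```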

Multi-Modal Tool Outputs

If your tool produces images or videos, return them inside a ToolResponse:
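from typing import Any

from verl.tools.base_tool import ToolResponse
from verl.utils.dataset.vision_utils import process_image, process_video

async def execute(
    self, instance_id: str, parameters: dict[str, Any], **kwargs
) -> tuple[ToolResponse, float, dict]:
    # raw_image / raw_video are placeholders for whatever your tool produced
    img = process_image(raw_image)
    video = process_video(raw_video)
    # Use "image"/"video" keys (not "images"/"videos"), even for list inputs
    return ToolResponse(image=[img], video=[video], text="Result."), 0.0, {}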
Also set return_multi_modal_inputs: false in your dataset config to prevent pre-processed modalities from conflicting with dynamically generated tool outputs:
data:
  return_multi_modal_inputs: false

Multi-Turn Tokenization

Correctly attributing loss masks in multi-turn conversations requires a delta-based tokenization strategy. When the model generates a new assistant message at turn i, verl:
  1. Applies the chat template to messages[:i] with add_generation_prompt=True
  2. Applies the chat template to messages[:i+1] with add_generation_prompt=False
  3. Tokenizes only the delta between the two rendered strings
This ensures the loss mask covers only assistant-generated tokens:
prev = tokenizer.apply_chat_template(
    messages[:i], add_generation_prompt=True, tokenize=False
)
curr = tokenizer.apply_chat_template(
    messages[:i+1], add_generation_prompt=False, tokenize=False
)
delta_ids = tokenizer.encode(curr[len(prev):], add_special_tokens=False)
token_ids += delta_ids
loss_mask += [1] * len(delta_ids)  # mask only the new assistant tokens
A tokenization sanity check runs by default at the end of each rollout, comparing delta-based results against full tokenization. Control this with:
actor_rollout_ref:
  rollout:
    multi_turn:
      tokenization_sanity_check_mode: "strict"  # or "ignore_strippable" or "disable"
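The delta-based strategy and the end-of-rollout sanity check can be tried out on a toy setup. In this sketch the ChatML-like `apply_chat_template` and the character-level `encode` are hypothetical stand-ins for a real tokenizer, and `tokenize_conversation` is an illustrative helper, not verl's implementation.

```python
def apply_chat_template(messages: list[dict], add_generation_prompt: bool) -> str:
    """Toy ChatML-like template; real templates ship with the tokenizer."""
    text = "".join(f"<|{m['role']}|>{m['content']}<|end|>" for m in messages)
    return (text + "<|assistant|>") if add_generation_prompt else text


def encode(text: str) -> list[int]:
    # Character-level stand-in for tokenizer.encode(..., add_special_tokens=False).
    return [ord(ch) for ch in text]


def tokenize_conversation(messages: list[dict]) -> tuple[list[int], list[int]]:
    token_ids, loss_mask = [], []
    rendered = ""  # text that has already been tokenized
    for i, message in enumerate(messages):
        is_assistant = message["role"] == "assistant"
        prev = apply_chat_template(messages[:i], add_generation_prompt=is_assistant)
        if is_assistant:
            # The generation prompt belongs to the prompt side: mask 0.
            prompt_ids = encode(prev[len(rendered):])
            token_ids += prompt_ids
            loss_mask += [0] * len(prompt_ids)
        curr = apply_chat_template(messages[: i + 1], add_generation_prompt=False)
        delta_ids = encode(curr[len(prev):])
        token_ids += delta_ids
        loss_mask += [int(is_assistant)] * len(delta_ids)
        rendered = curr
    return token_ids, loss_mask


messages = [
    {"role": "user", "content": "2+2?"},
    {"role": "assistant", "content": "4"},
    {"role": "user", "content": "3+3?"},
    {"role": "assistant", "content": "6"},
]
token_ids, loss_mask = tokenize_conversation(messages)

# Sanity check: the deltas must add up to the full tokenization.
full_ids = encode(apply_chat_template(messages, add_generation_prompt=False))
assert token_ids == full_ids

trained_text = "".join(chr(t) for t, m in zip(token_ids, loss_mask) if m)
print(trained_text)  # 4<|end|>6<|end|>
```

With a character-level tokenizer the sanity check passes trivially. With a real subword tokenizer, tokens can merge across the boundary between two deltas, which is the kind of mismatch the check is meant to surface.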
Special case — models that strip reasoning content: Qwen/QwQ-32B and Qwen3 series remove internal <think> reasoning blocks from earlier turns when rendering the chat template. For these models, verl falls back to a fixed two-message base conversation to anchor the delta:
BASE_CHAT_HISTORY = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "I am a user."}
]
prev = tokenizer.apply_chat_template(
    BASE_CHAT_HISTORY, add_generation_prompt=True, tokenize=False
)
curr = tokenizer.apply_chat_template(
    [*BASE_CHAT_HISTORY, messages[i]], add_generation_prompt=False, tokenize=False
)
To align rollout with production inference (which strips reasoning from past turns), enable:
actor_rollout_ref:
  rollout:
    multi_turn:
      use_inference_chat_template: true
This is a trade-off: leaving the option off keeps earlier-turn reasoning in the rollout prompt, which can overflow the context window in long multi-turn conversations, while turning it on matches production inference by stripping that reasoning.

Running the Multi-Turn GSM8K Example

# Install mlflow for rollout trace visualization
pip install mlflow

# Preprocess GSM8K and add the "agent_name" field
python examples/data_preprocess/gsm8k_tool_agent_loop.py

# Train with tool calls and mlflow trace
bash examples/sglang_multiturn/run_qwen2_5_3b_gsm8k_tool_agent_mlflow_fsdp.sh

# View traces in your browser
mlflow ui -h 0.0.0.0 -p 5000 --backend-store-uri sqlite:////tmp/mlruns.db
# Open http://<your-ip>:5000
During training you may see "Failed to decode tool call" messages in the console when the model generates malformed tool call syntax. This is expected behavior and does not indicate a training error — the reward for that step will simply be zero.
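The failure mode behind that message can be reproduced with a small parser. This sketch assumes Hermes-style output, where a tool call is a JSON object wrapped in `<tool_call>` tags; `parse_tool_call` is a hypothetical helper and not verl's parser.

```python
import json
import re
from typing import Optional

# Assumed Hermes-style wrapper: <tool_call>{"name": ..., "arguments": ...}</tool_call>
TOOL_CALL_PATTERN = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_call(text: str) -> Optional[dict]:
    """Return the decoded tool call, or None if the model output is malformed."""
    match = TOOL_CALL_PATTERN.search(text)
    if match is None:
        return None
    try:
        call = json.loads(match.group(1))
    except json.JSONDecodeError:
        print("Failed to decode tool call")
        return None
    if not isinstance(call, dict) or "name" not in call:
        return None
    return call


good = parse_tool_call('<tool_call>{"name": "search", "arguments": {"query": "x"}}</tool_call>')
bad = parse_tool_call('<tool_call>{"name": "search", "arguments": </tool_call>')
print(good)
print(bad)
```

Returning None instead of raising keeps the rollout alive, so a malformed call costs the model that step rather than crashing the run.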

LangGraph Agent Framework

For more complex agent topologies, verl integrates with LangGraph through a ReactAgentLoop adapter:
| Component | Role |
| --- | --- |
| ChatModel | LangChain LLM object that adapts to the generate API from LLMServerClient |
| ReactAgentLoop | Agent adapter layer with default naive LangGraph support; derive new classes to implement custom agents with a run function |
| AsyncServer | Server connected to one DP group of the inference engine |
See recipe/langgraph_agent/example/README.md for a full LangGraph walkthrough.
