Multi-Turn and Agentic RL Training with verl

Agentic RL extends standard single-turn RL training by letting the model interact with tools and environments over multiple conversation turns before receiving a final reward signal. verl supports this through a server-based asynchronous rollout architecture that separates the inference engine from agent logic, preventing GPU idle time during tool execution and enabling large-scale multi-turn training.

The inference engine uses a token-based generate API rather than a standard chat completion API. This is essential for training correctness: text-to-token conversion is not always reversible (e.g. "<think>" re-tokenized differs from the original model output), so training must use the exact tokens produced during rollout to correctly compute advantages.
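
To see why a round trip through text is unsafe, the sketch below uses a toy greedy longest-match tokenizer. The vocabulary and the `encode`/`decode` helpers are hypothetical and not part of verl. The model emits "<think>" as three tokens, but re-encoding the decoded text yields a single, different token.

```python
# Toy vocabulary (hypothetical): "<think>" exists both as one token and as pieces.
VOCAB = {"<think>": 0, "<": 1, "think": 2, ">": 3, "ok": 4}
ID_TO_TEXT = {token_id: piece for piece, token_id in VOCAB.items()}


def decode(ids: list[int]) -> str:
    return "".join(ID_TO_TEXT[i] for i in ids)


def encode(text: str) -> list[int]:
    """Greedy longest-match tokenization over VOCAB."""
    pieces = sorted(VOCAB, key=len, reverse=True)
    ids, pos = [], 0
    while pos < len(text):
        for piece in pieces:
            if text.startswith(piece, pos):
                ids.append(VOCAB[piece])
                pos += len(piece)
                break
        else:
            raise ValueError(f"cannot tokenize at position {pos}")
    return ids


rollout_ids = [1, 2, 3, 4]   # tokens the model actually sampled
text = decode(rollout_ids)   # the text a chat completion API would return
retokenized = encode(text)   # what training would see after re-tokenizing
```

Training on `retokenized` instead of `rollout_ids` would score a sequence the policy never sampled, which is why the token-based API matters.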

System Architecture

The agentic rollout system has three main components that work together:

Component	Role
AgentLoop	Client-side component that implements agent functions and tool orchestration
LLMServerClient	Inference gateway that provides the `generate` interface to the AgentLoop
AsyncServer	Server-side component; each instance connects to one DP group of the inference engine

The AsyncServer has separate implementations for SGLang and vLLM:

SGLang: Uses async_generate on the engine’s first GPU in each TP group, called via Ray actor
vLLM: Uses the generate interface with ZMQ communication to TP group GPUs, callable directly in AsyncServer

An asyncio coroutine mechanism lets multiple rollout requests execute concurrently. While one request waits for a tool call to return, other requests continue generating. This keeps the GPU from sitting idle and improves throughput when a few tool calls take much longer than the rest.
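
The effect can be illustrated with plain asyncio. In this minimal sketch, `fake_generate` and `fake_tool` are hypothetical stand-ins (simulated with `asyncio.sleep`) for the inference call and a tool call; they are not verl APIs. A slow tool call in one trajectory does not block the others.

```python
import asyncio


async def fake_generate(prompt: str) -> str:
    """Stand-in for a generate request to the inference server."""
    await asyncio.sleep(0.01)
    return f"{prompt} -> tool_call"


async def fake_tool(call: str, delay: float) -> str:
    """Stand-in for a tool call that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return f"result({call})"


async def run_trajectory(idx: int, tool_delay: float, log: list) -> str:
    call = await fake_generate(f"prompt{idx}")
    log.append(("tool_start", idx))
    observation = await fake_tool(call, tool_delay)
    log.append(("tool_end", idx))
    return observation


async def run_batch(tool_delays: list[float]) -> tuple[list[str], list]:
    log: list = []
    # gather schedules all trajectories on one event loop; each await yields control.
    results = await asyncio.gather(
        *(run_trajectory(i, delay, log) for i, delay in enumerate(tool_delays))
    )
    return list(results), log


# Trajectory 0 has a slow tool call; trajectories 1 and 2 finish their tools first.
results, log = asyncio.run(run_batch([0.2, 0.001, 0.001]))
```

The log records trajectories 1 and 2 finishing their tool calls while trajectory 0 is still waiting.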

Enabling Multi-Turn Rollout

To activate multi-turn mode, set the following in your rollout configuration:

actor_rollout_ref:
  rollout:
    name: sglang
    multi_turn:
      enable: true

For the agent loop, two additional options are required:

data:
  return_raw_chat: true

actor_rollout_ref:
  rollout:
    mode: async

Implementing a Custom Tool

Tools are the primary way agents interact with the environment. verl provides two APIs: BaseTool for stateful tools that need lifecycle management, and @function_tool for simple stateless functions.

Using BaseTool

Subclass BaseTool

Create a class that extends verl.tools.base_tool.BaseTool and implement the create, execute, and release lifecycle methods:

from typing import Any, Optional
from uuid import uuid4

from verl.tools.base_tool import BaseTool, ToolResponse


async def search(query: str) -> str:
    # Placeholder backend; replace with a call to your search service.
    return f"No results for: {query}"


class MySearchTool(BaseTool):
    async def create(self, instance_id: Optional[str] = None, **kwargs) -> tuple[str, ToolResponse]:
        # Initialize per-trajectory state (e.g. open a sandbox).
        # Generate an ID when the caller does not supply one.
        instance_id = instance_id or str(uuid4())
        return instance_id, ToolResponse(text="Search tool ready.")

    async def execute(
        self, instance_id: str, parameters: dict[str, Any], **kwargs
    ) -> tuple[ToolResponse, float, dict]:
        # Execute the tool and return (response, reward, metrics).
        # parameters contains the tool call arguments, e.g. {"query": "..."}
        result = await search(parameters["query"])
        return ToolResponse(text=result), 0.0, {}

    async def release(self, instance_id: str, **kwargs) -> None:
        # Tear down per-trajectory state
        pass
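
The lifecycle is easiest to see from the caller's side. The following self-contained sketch does not import verl: `ToolResponse` and `CounterTool` are simplified stand-ins written for illustration, and the `run_trajectory` driver is hypothetical (in verl, the agent loop drives the real calls). It shows `create` allocating state keyed by `instance_id`, `execute` using that state, and `release` cleaning it up.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Optional
from uuid import uuid4


@dataclass
class ToolResponse:
    """Simplified stand-in for verl's ToolResponse (text only)."""
    text: str = ""


class CounterTool:
    """Counts how many times each trajectory has called the tool."""

    def __init__(self) -> None:
        self._calls: dict[str, int] = {}

    async def create(self, instance_id: Optional[str] = None, **kwargs) -> tuple[str, ToolResponse]:
        instance_id = instance_id or str(uuid4())
        self._calls[instance_id] = 0
        return instance_id, ToolResponse(text="ready")

    async def execute(
        self, instance_id: str, parameters: dict[str, Any], **kwargs
    ) -> tuple[ToolResponse, float, dict]:
        self._calls[instance_id] += 1
        n = self._calls[instance_id]
        return ToolResponse(text=f"call {n}: {parameters['query']}"), 0.0, {"calls": n}

    async def release(self, instance_id: str, **kwargs) -> None:
        self._calls.pop(instance_id, None)


async def run_trajectory(tool: CounterTool, queries: list[str]) -> list[str]:
    instance_id, _ = await tool.create()
    try:
        outputs = []
        for query in queries:
            response, _reward, _metrics = await tool.execute(instance_id, {"query": query})
            outputs.append(response.text)
        return outputs
    finally:
        # Always release, even if a tool call raises.
        await tool.release(instance_id)


tool = CounterTool()
outputs = asyncio.run(run_trajectory(tool, ["a", "b"]))
```

Keying state by `instance_id` lets one tool object serve many concurrent trajectories without their state mixing.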

Write a tool configuration YAML

Describe your tool’s schema and implementation class:

tools:
  - class_name: "my_module.MySearchTool"
    config:
      type: native
    tool_schema:
      name: search
      description: Search the web for information
      parameters:
        type: object
        properties:
          query:
            type: string
            description: The search query
        required:
          - query

Then point the rollout configuration at this file:

actor_rollout_ref:
  rollout:
    multi_turn:
      tool_config_path: /path/to/tools.yaml
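
Because the `parameters` entry of `tool_schema` is a JSON-Schema-style object, tool call arguments can be checked against it before a tool runs. The sketch below is a hand-rolled illustration that covers only `required` and primitive `type` checks. `validate_arguments` is a hypothetical helper, not a verl function.

```python
# Python types accepted for each JSON Schema primitive (illustrative subset).
JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

# Mirrors the parameters block of the search tool schema shown above.
SEARCH_PARAMETERS = {
    "type": "object",
    "properties": {"query": {"type": "string", "description": "The search query"}},
    "required": ["query"],
}


def validate_arguments(parameters_schema: dict, arguments: dict) -> list[str]:
    """Return a list of problems; an empty list means the arguments look valid."""
    problems = []
    properties = parameters_schema.get("properties", {})
    for name in parameters_schema.get("required", []):
        if name not in arguments:
            problems.append(f"missing required argument: {name}")
    for name, value in arguments.items():
        if name not in properties:
            problems.append(f"unexpected argument: {name}")
            continue
        expected = JSON_TYPES.get(properties[name].get("type"))
        if expected is not None and not isinstance(value, expected):
            problems.append(f"wrong type for argument: {name}")
    return problems
```

For example, `validate_arguments(SEARCH_PARAMETERS, {})` reports the missing `query` argument, while `{"query": "verl"}` passes.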

Add agent_name to your dataset

The tool agent loop selects behavior based on a per-sample agent_name field. Prepare your dataset with this field set:

python examples/data_preprocess/gsm8k_tool_agent_loop.py
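
Conceptually, the preprocessing step attaches an `agent_name` to every sample. The sketch below shows the idea on plain dictionaries. The sample layout and the `add_agent_name` helper are assumptions for illustration and do not reproduce the actual script; `tool_agent` is used because it is the agent loop name that appears in the rollout config in this guide.

```python
def add_agent_name(sample: dict, agent_name: str = "tool_agent") -> dict:
    """Return a copy of the sample with the agent_name field set."""
    return {**sample, "agent_name": agent_name}


# Hypothetical samples; real datasets carry more fields.
samples = [
    {"prompt": [{"role": "user", "content": "What is 12 * 7?"}]},
    {"prompt": [{"role": "user", "content": "What is 9 + 16?"}]},
]
tagged = [add_agent_name(sample) for sample in samples]
```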

Using `@function_tool` (Stateless Tools)

For tools that don’t need create/release lifecycle hooks, the @function_tool decorator is the simpler option. verl infers the JSON schema automatically from the function’s type annotations and Google-style docstring:

from verl.tools.function_tool import function_tool

@function_tool
def get_weather(city: str) -> dict:
    """Get the current weather for a city.

    Args:
        city: The city to look up, e.g. "Tokyo" or "San Francisco".
    """
    return {"temperature_c": 17.3, "condition": "drizzle"}

@function_tool("calculator")  # explicit name overrides function name
def calculator(expression: str) -> str:
    """Evaluate a Python-style arithmetic expression.

    Args:
        expression: A Python-style arithmetic expression, e.g. "(3+4)*5".
    """
    return str(eval(expression, {"__builtins__": {}}, {}))
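
Note that `eval` with an empty `__builtins__` is not a sandbox, since attribute access on literals can still reach arbitrary objects. If the calculator will see untrusted model output, one option is to walk the expression's syntax tree and allow only arithmetic nodes. The following sketch uses the standard `ast` module. `safe_arithmetic` is a hypothetical helper name, and it could serve as the body of the `calculator` tool.

```python
import ast
import operator

_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _evaluate(node: ast.AST):
    # Only numeric literals and the whitelisted operators are accepted.
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_evaluate(node.operand))
    raise ValueError("unsupported expression")


def safe_arithmetic(expression: str) -> str:
    """Evaluate +, -, *, / on numeric literals; reject everything else."""
    tree = ast.parse(expression, mode="eval")
    return str(_evaluate(tree.body))
```

Names, attribute access, and function calls all fall through to the `ValueError` branch, so an input such as `().__class__` is rejected instead of evaluated.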

Point the rollout config at the file that defines your tools:

actor_rollout_ref:
  rollout:
    mode: async
    multi_turn:
      enable: true
      format: hermes
      function_tool_path: path/to/your_tools.py
    agent:
      default_agent_loop: tool_agent

Schema inference rules:

Type annotations (str, int, list[T], Optional[X], Literal["a", "b"]) map to JSON Schema types
Per-parameter descriptions come from the Args: docstring section
Parameters without defaults are marked required
Sync functions are dispatched via asyncio.to_thread; async functions run directly
*args / **kwargs are not supported — use param: list[T] instead
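
A rough sketch of how such inference could work is shown below. It is not verl's implementation: `annotation_to_schema` and `build_parameters_schema` are hypothetical helpers, docstring parsing is omitted, and `Literal` values are assumed to be strings.

```python
import inspect
from typing import Literal, Optional, Union, get_args, get_origin, get_type_hints

PRIMITIVES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def annotation_to_schema(annotation) -> dict:
    """Map a type annotation to a JSON Schema fragment."""
    if annotation in PRIMITIVES:
        return {"type": PRIMITIVES[annotation]}
    origin, args = get_origin(annotation), get_args(annotation)
    if origin is list:
        return {"type": "array", "items": annotation_to_schema(args[0])}
    if origin is Literal:
        return {"type": "string", "enum": list(args)}
    if origin is Union:
        # Optional[X] is Union[X, None]; describe X itself.
        non_none = [a for a in args if a is not type(None)]
        if len(non_none) == 1:
            return annotation_to_schema(non_none[0])
    raise TypeError(f"unsupported annotation: {annotation!r}")


def build_parameters_schema(func) -> dict:
    hints = get_type_hints(func)
    properties, required = {}, []
    for name, param in inspect.signature(func).parameters.items():
        if param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
            raise TypeError("*args / **kwargs are not supported")
        properties[name] = annotation_to_schema(hints[name])
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default, so the caller must supply it
    return {"type": "object", "properties": properties, "required": required}


def lookup(
    city: str,
    days: int = 1,
    units: Literal["c", "f"] = "c",
    tags: Optional[list[str]] = None,
) -> dict:
    return {}


schema = build_parameters_schema(lookup)
```

Here only `city` ends up in `required`, because every other parameter has a default.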

Return value normalization:

str → ToolResponse(text=...)
dict → JSON-serialized ToolResponse
ToolResponse → passed through
(response, reward) or (response, reward, metrics) tuples are accepted

function_tool_path and tool_config_path can coexist: the AgentLoopWorker merges both registries at startup, and name collisions raise an error.

Multi-Modal Tool Outputs

If your tool produces images or videos, return them inside a ToolResponse:

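from typing import Any

from verl.tools.base_tool import ToolResponse
from verl.utils.dataset.vision_utils import process_image, process_video

# Method of a BaseTool subclass; raw_image and raw_video stand for whatever
# media your tool produced for this call.
async def execute(self, instance_id: str, parameters: dict[str, Any], **kwargs) -> tuple[ToolResponse, float, dict]:
    img = process_image(raw_image)
    video = process_video(raw_video)
    # Use "image"/"video" keys (not "images"/"videos") for list inputs
    return ToolResponse(image=[img], video=[video], text="Result."), 0.0, {}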

Also set return_multi_modal_inputs: false in your dataset config to prevent pre-processed modalities from conflicting with dynamically generated tool outputs:

data:
  return_multi_modal_inputs: false

Multi-Turn Tokenization

Correctly attributing loss masks in multi-turn conversations requires a delta-based tokenization strategy. When the model generates a new assistant message at turn i, verl:

Applies the chat template to messages [:i] with add_generation_prompt=True
Applies the chat template to messages [:i+1] with add_generation_prompt=False
Tokenizes only the delta between the two strings

This ensures the loss mask covers only assistant-generated tokens:

prev = tokenizer.apply_chat_template(
    messages[:i], add_generation_prompt=True, tokenize=False
)
curr = tokenizer.apply_chat_template(
    messages[:i+1], add_generation_prompt=False, tokenize=False
)
# Tokenize only the newly generated assistant text.
delta_ids = tokenizer.encode(curr[len(prev):], add_special_tokens=False)
token_ids += delta_ids
loss_mask += [1] * len(delta_ids)  # one mask entry per new token
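
The same idea can be run end to end on a whole conversation with a toy template and tokenizer. Everything in this sketch is a hypothetical stand-in: `render` mimics a chat template, `encode` emits one token per character, and the handling of non-assistant messages (their delta ends with the generation prompt and is masked out) is an assumption made so the pieces concatenate cleanly when user or tool turns alternate with assistant turns.

```python
GENERATION_PROMPT = "<|assistant|>"


def render(messages: list[dict], add_generation_prompt: bool) -> str:
    """Toy chat template: one tagged line per message."""
    text = "".join(f"<|{m['role']}|>{m['content']}<|end|>\n" for m in messages)
    if add_generation_prompt:
        text += GENERATION_PROMPT
    return text


def encode(text: str) -> list[int]:
    """Toy tokenizer: one token per character."""
    return [ord(ch) for ch in text]


def tokenize_conversation(messages: list[dict]) -> tuple[list[int], list[int]]:
    token_ids: list[int] = []
    loss_mask: list[int] = []
    for i, message in enumerate(messages):
        is_assistant = message["role"] == "assistant"
        # Assistant deltas start after the generation prompt; other deltas end with it.
        prev = render(messages[:i], add_generation_prompt=is_assistant)
        curr = render(messages[: i + 1], add_generation_prompt=not is_assistant)
        delta_ids = encode(curr[len(prev):])
        token_ids += delta_ids
        loss_mask += [1 if is_assistant else 0] * len(delta_ids)
    return token_ids, loss_mask


messages = [
    {"role": "user", "content": "2+2?"},
    {"role": "assistant", "content": "calc"},
    {"role": "tool", "content": "4"},
    {"role": "assistant", "content": "4"},
]
token_ids, loss_mask = tokenize_conversation(messages)

# Sanity check in the spirit of the one described in this section: compare with full tokenization.
full_ids = encode(render(messages, add_generation_prompt=False))
trained_text = "".join(chr(t) for t, m in zip(token_ids, loss_mask) if m)
```

Only the assistant contents and their end markers carry a mask of 1; the user turn, the tool turn, and the generation prompts are masked out.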

A tokenization sanity check runs by default at the end of each rollout, comparing delta-based results against full tokenization. Control this with:

actor_rollout_ref:
  rollout:
    multi_turn:
      tokenization_sanity_check_mode: "strict"  # or "ignore_strippable" or "disable"

Special case — models that strip reasoning content: Qwen/QwQ-32B and Qwen3 series remove internal <think> reasoning blocks from earlier turns when rendering the chat template. For these models, verl falls back to a fixed two-message base conversation to anchor the delta:

BASE_CHAT_HISTORY = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "I am a user."}
]
prev = tokenizer.apply_chat_template(
    BASE_CHAT_HISTORY, add_generation_prompt=True, tokenize=False
)
curr = tokenizer.apply_chat_template(
    [*BASE_CHAT_HISTORY, messages[i]], add_generation_prompt=False, tokenize=False
)
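
Why the fixed base is needed can be seen with a toy template that strips `<think>` blocks from every assistant message except the last one. The template, the `render` helper, and the messages are hypothetical stand-ins and do not reproduce the real Qwen templates, and the generation-prompt flags used for the tool message are an assumption. The point is only that once the rendering of earlier turns changes between calls, `prev` is no longer a prefix of `curr`, whereas a fixed base conversation keeps the delta stable.

```python
import re

BASE_CHAT_HISTORY = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "I am a user."},
]


def render(messages: list[dict], add_generation_prompt: bool) -> str:
    """Toy template that drops <think> blocks from all but the final message."""
    parts = []
    for idx, m in enumerate(messages):
        content = m["content"]
        if m["role"] == "assistant" and idx < len(messages) - 1:
            content = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL)
        parts.append(f"<|{m['role']}|>{content}<|end|>\n")
    text = "".join(parts)
    if add_generation_prompt:
        text += "<|assistant|>"
    return text


messages = [
    {"role": "user", "content": "2+2?"},
    {"role": "assistant", "content": "<think>add them</think>calc"},
    {"role": "tool", "content": "4"},
]

# History-based delta for the tool message at index 2: the earlier assistant
# turn is rendered with reasoning in prev but without it in curr.
prev = render(messages[:2], add_generation_prompt=False)
curr = render(messages[:3], add_generation_prompt=True)
history_is_prefix = curr.startswith(prev)

# Fixed-base delta: independent of how earlier turns are rendered.
base_prev = render(BASE_CHAT_HISTORY, add_generation_prompt=False)
base_curr = render([*BASE_CHAT_HISTORY, messages[2]], add_generation_prompt=True)
delta = base_curr[len(base_prev):]
```

With the history-based approach, slicing `curr[len(prev):]` would cut at the wrong offset; the fixed base yields exactly the rendered tool message.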

To align rollout with production inference (which strips reasoning from past turns), enable:

actor_rollout_ref:
  rollout:
    multi_turn:
      use_inference_chat_template: true

Note the trade-off: with this option disabled, the reasoning content of earlier turns stays in the prompt, so rollout no longer matches production inference and long reasoning traces can overflow the context window in multi-turn conversations.

Running the Multi-Turn GSM8K Example

# Install mlflow for rollout trace visualization
pip install mlflow

# Preprocess GSM8K and add the "agent_name" field
python examples/data_preprocess/gsm8k_tool_agent_loop.py

# Train with tool calls and mlflow trace
bash examples/sglang_multiturn/run_qwen2_5_3b_gsm8k_tool_agent_mlflow_fsdp.sh

# View traces in your browser
mlflow ui -h 0.0.0.0 -p 5000 --backend-store-uri sqlite:////tmp/mlruns.db
# Open http://<your-ip>:5000

During training you may see "Failed to decode tool call" messages in the console when the model generates malformed tool call syntax. This is expected behavior and does not indicate a training error — the reward for that step will simply be zero.
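
A rough sketch of how such a decode failure arises: in the Hermes convention, each tool call is a JSON object wrapped in `<tool_call>` tags, and a call whose JSON does not parse cannot be dispatched. The `parse_tool_calls` helper below is a hypothetical illustration, not verl's parser.

```python
import json
import re

TOOL_CALL_PATTERN = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> tuple[list[dict], int]:
    """Return (valid tool calls, number of malformed calls)."""
    calls, failures = [], 0
    for raw in TOOL_CALL_PATTERN.findall(text):
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            failures += 1  # malformed JSON: skip this call
            continue
        if isinstance(call, dict) and "name" in call and "arguments" in call:
            calls.append(call)
        else:
            failures += 1  # valid JSON but not a tool call object
    return calls, failures


good = '<tool_call>\n{"name": "search", "arguments": {"query": "verl"}}\n</tool_call>'
bad = '<tool_call>\n{"name": "search", "arguments": {"query": \n</tool_call>'
```

Early in training the model often produces output like `bad`, so occasional failures of this kind are normal.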

LangGraph Agent Framework

For more complex agent topologies, verl integrates with LangGraph through a ReactAgentLoop adapter:

Component	Role
ChatModel	LangChain LLM object that adapts to the `generate` API from `LLMServerClient`
ReactAgentLoop	Agent adapter layer with default naive LangGraph support; derive new classes to implement custom agents with a `run` function
AsyncServer	Server connected to one DP group of the inference engine

See recipe/langgraph_agent/example/README.md for a full LangGraph walkthrough.
