Documentation Index
Fetch the complete documentation index at: https://mintlify.com/primeintellect-ai/verifiers/llms.txt
Use this file to discover all available pages before exploring further.
ToolEnv
Environment for tasks where the model can call Python functions as tools.
Overview
ToolEnv enables LLMs to call Python functions with all arguments exposed to the model. Key features:
- Stateless tools: Each tool call is independent and idempotent
- Automatic schema generation: Function signatures are converted to tool definitions
- Error handling: Configurable error formatting and stop-on-error behavior
- Tool metrics: Automatic tracking of tool call counts
For tools requiring per-rollout state (e.g., sandbox IDs, database connections), use StatefulToolEnv instead.
Inheritance
Environment
└── MultiTurnEnv
└── ToolEnv
└── StatefulToolEnv
Constructor
ToolEnv(
tools: list[Callable] | None = None,
max_turns: int = 10,
error_formatter: Callable[[Exception], str] = lambda e: f"{e}",
stop_errors: list[type[Exception]] | None = None,
**kwargs
)
Parameters
List of Python functions to expose as tools. Function signatures and docstrings are used to generate tool schemas.
Maximum number of turns before stopping.
error_formatter
Callable[[Exception], str]
default:"lambda e: f'{e}'"
Function to format exceptions into error messages shown to the model.
stop_errors
list[type[Exception]] | None
List of exception types that should stop the rollout (raise ToolParseError or ToolCallError).
All other parameters are inherited from MultiTurnEnv.
Core Methods
async def call_tool(
tool_name: str,
tool_args: dict,
tool_call_id: str,
**kwargs
) -> ToolMessage
Execute a tool and return the result as a ToolMessage. Override to customize tool execution.
Name of the tool to call.
Arguments parsed from the model’s tool call.
Unique ID for this tool call.
Returns: ToolMessage - Message containing tool result or error.
env_response
async def env_response(
messages: vf.Messages,
state: vf.State,
**kwargs
) -> vf.Messages
Process tool calls from the model’s response. Implemented by ToolEnv - do not override unless you need custom behavior.
Conversation history including model’s tool calls.
Returns: vf.Messages - List of ToolMessage objects with results.
def add_tool(tool: Callable)
Dynamically add a tool to the environment.
Python function to add as a tool.
def remove_tool(tool: Callable)
Remove a tool from the environment.
Python function to remove.
Stop Conditions
@vf.stop
async def no_tools_called(state: vf.State) -> bool
Stops if the model’s last message was an assistant message with no tool calls.
Inherits all stop conditions from MultiTurnEnv.
Built-in Rubric
ToolEnv includes ToolMonitorRubric which tracks:
total_tool_calls: Total number of tool calls made
{tool_name}_calls: Number of calls to each specific tool
Example Usage
Basic Calculator
import verifiers as vf
def add(a: float, b: float) -> float:
"""Add two numbers."""
return a + b
def multiply(a: float, b: float) -> float:
"""Multiply two numbers."""
return a * b
def divide(a: float, b: float) -> float:
"""Divide two numbers."""
if b == 0:
raise ValueError("Cannot divide by zero")
return a / b
def load_environment():
# Create dataset
dataset = vf.Environment.make_dataset(
[
{"question": "What is (10 + 5) * 3?", "answer": "45"},
{"question": "What is 100 / 4?", "answer": "25"},
]
)
def correct_answer(answer: str, completion: vf.Messages) -> float:
"""Check if final answer matches expected answer."""
completion_text = str(completion)
return 1.0 if answer in completion_text else 0.0
return vf.ToolEnv(
tools=[add, multiply, divide],
dataset=dataset,
rubric=vf.Rubric(correct_answer),
system_prompt="Use the available tools to solve the math problem.",
max_turns=5
)
# Usage
env = load_environment()
results = await env.evaluate(
client=vf.ClientConfig(provider="openai", api_key="sk-..."),
model="gpt-4",
num_examples=2
)
print(f"Accuracy: {results['metadata']['avg_reward']}")
print(f"Avg tool calls: {results['metadata']['avg_total_tool_calls']}")
With Error Handling
import verifiers as vf
class DivisionError(Exception):
"""Custom error for division problems."""
pass
def divide(a: float, b: float) -> float:
"""Divide two numbers."""
if b == 0:
raise DivisionError("Cannot divide by zero")
return a / b
def load_environment():
dataset = vf.Environment.make_dataset(
[{"question": "What is 10 / 0?", "answer": "error"}]
)
def error_formatter(e: Exception) -> str:
"""Format errors for the model."""
if isinstance(e, DivisionError):
return "Error: Division by zero is not allowed."
return f"Error: {str(e)}"
def handles_error(completion: vf.Messages) -> float:
"""Reward if model acknowledges the error."""
text = str(completion).lower()
return 1.0 if "error" in text or "cannot" in text else 0.0
return vf.ToolEnv(
tools=[divide],
dataset=dataset,
rubric=vf.Rubric(handles_error),
error_formatter=error_formatter,
# Don't stop on DivisionError, let model handle it
stop_errors=[], # Empty list = no errors cause stop
max_turns=3
)
With Stop Errors
import verifiers as vf
class CriticalError(Exception):
pass
def risky_operation(value: int) -> str:
if value < 0:
raise CriticalError("Negative values not allowed")
return f"Result: {value * 2}"
def load_environment():
dataset = vf.Environment.make_dataset(
[{"question": "Process the value -5"}]
)
return vf.ToolEnv(
tools=[risky_operation],
dataset=dataset,
rubric=vf.Rubric(lambda completion: 0.0),
# Stop rollout immediately if CriticalError occurs
stop_errors=[CriticalError],
max_turns=5
)
# When CriticalError is raised, the rollout stops and
# state["error"] contains a ToolCallError
import verifiers as vf
import sqlite3
def query_users(name: str) -> list[dict]:
"""Query users by name."""
# Stateless query (creates new connection each time)
conn = sqlite3.connect("users.db")
cursor = conn.execute("SELECT * FROM users WHERE name LIKE ?", (f"%{name}%",))
results = [{"id": row[0], "name": row[1]} for row in cursor.fetchall()]
conn.close()
return results
def query_orders(user_id: int) -> list[dict]:
"""Query orders for a user."""
conn = sqlite3.connect("users.db")
cursor = conn.execute("SELECT * FROM orders WHERE user_id = ?", (user_id,))
results = [{"id": row[0], "total": row[1]} for row in cursor.fetchall()]
conn.close()
return results
def load_environment():
dataset = vf.Environment.make_dataset(
[
{"question": "How many orders does user 'Alice' have?", "answer": "3"},
]
)
def correct_count(answer: str, completion: vf.Messages) -> float:
return 1.0 if answer in str(completion) else 0.0
return vf.ToolEnv(
tools=[query_users, query_orders],
dataset=dataset,
rubric=vf.Rubric(correct_count),
system_prompt="Use the database tools to answer questions.",
max_turns=10
)
import verifiers as vf
import httpx
def get_weather(city: str) -> dict:
"""Get current weather for a city."""
# Stateless API call
response = httpx.get(f"https://api.weather.com/v1/current?city={city}")
return response.json()
def get_forecast(city: str, days: int = 3) -> dict:
"""Get weather forecast for a city."""
response = httpx.get(
f"https://api.weather.com/v1/forecast?city={city}&days={days}"
)
return response.json()
def load_environment():
dataset = vf.Environment.make_dataset(
[
{"question": "Will it rain in London tomorrow?", "answer": "yes"},
]
)
def mentions_rain(answer: str, completion: vf.Messages) -> float:
text = str(completion).lower()
answer_lower = answer.lower()
return 1.0 if answer_lower in text else 0.0
return vf.ToolEnv(
tools=[get_weather, get_forecast],
dataset=dataset,
rubric=vf.Rubric(mentions_rain),
max_turns=5
)
import verifiers as vf
def base_tool() -> str:
return "base"
env = vf.ToolEnv(
tools=[base_tool],
dataset=dataset,
rubric=vf.Rubric(reward_fn)
)
# Add tool dynamically
def new_tool(x: int) -> int:
"""New tool added at runtime."""
return x * 2
env.add_tool(new_tool)
# Remove tool
env.remove_tool(base_tool)
Tools are automatically converted to schema using function signatures and docstrings:
def search(query: str, max_results: int = 10) -> list[str]:
"""Search for documents matching the query.
Args:
query: Search query string
max_results: Maximum number of results to return
"""
return ["result1", "result2"]
Generates schema:
{
"name": "search",
"description": "Search for documents matching the query.",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string", "description": "Search query string"},
"max_results": {"type": "integer", "description": "Maximum number of results to return", "default": 10}
},
"required": ["query"]
}
}
Common Patterns
All tool calls should be independent:
def good_tool(x: int) -> int:
# No shared state, idempotent
return x * 2
# Avoid global state
state = {}
def bad_tool(x: int) -> int:
state["count"] = state.get("count", 0) + 1 # Bad!
return x * state["count"]
For stateful tools, use StatefulToolEnv.
Custom Error Messages
Format errors to guide the model:
def error_formatter(e: Exception) -> str:
if isinstance(e, ValueError):
return f"Invalid input: {e}. Please provide a valid number."
elif isinstance(e, KeyError):
return f"Key not found: {e}. Available keys: X, Y, Z."
return f"Error: {e}"
env = vf.ToolEnv(
tools=[...],
error_formatter=error_formatter,
...
)
def efficiency_reward(state: vf.State) -> float:
"""Reward fewer tool calls."""
metrics = state["metrics"]
num_calls = metrics.get("total_tool_calls", 0)
if state["reward"] == 1.0: # Correct answer
return 1.0 / (1 + num_calls) # Fewer calls = higher reward
return 0.0
When to Use
Use ToolEnv for:
- Stateless function calling (calculators, converters, queries)
- API clients (each call is independent)
- Read-only database queries
- File reading operations
- Any idempotent tool
Use StatefulToolEnv for:
- Tools requiring per-rollout state (sandbox IDs, sessions)
- Database transactions
- File writing in isolated environments
- Any tool where state must persist across calls
See Also