This example demonstrates how to create an environment where models can solve math problems by writing and executing Python code. The environment provides a sandboxed Python REPL with scientific computing libraries.

Overview

The Math Python environment combines:
  • Dataset: MATH competition problems (or custom math datasets)
  • Tools: Python REPL with numpy, sympy, scipy
  • Evaluation: Symbolic math verification using \boxed{} answer format
  • Sandbox: Isolated execution environment with configurable resources

Complete Implementation

Here’s the full working implementation from environments/math_python/math_python.py:
import verifiers as vf
from verifiers.utils.data_utils import extract_boxed_answer, load_example_dataset


def load_environment(
    dataset_name: str = "math",
    dataset_split: str = "train",
    num_train_examples: int = -1,
    max_turns: int = 100,
    max_startup_wait_seconds: int = 60,
    pip_install_packages: str = "numpy sympy scipy",
    sandbox_cpu_cores: int = 1,
    sandbox_memory_gb: int = 2,
    sandbox_disk_size_gb: int = 5,
    sandbox_gpu_count: int = 0,
    sandbox_timeout_minutes: int = 60,
    sandbox_timeout_per_command_seconds: int = 60,
    sandbox_client_max_workers: int = 50,
    **kwargs,
):
    dataset = load_example_dataset(dataset_name, dataset_split, n=num_train_examples)
    pip_install_prompt = (
        f"In addition to the Python standard library, you have access to: {pip_install_packages}."
        if pip_install_packages.strip()
        else "You may only use the Python standard library."
    )
    system_prompt = (
        "Use Python for all calculations. Give your answer inside \\boxed{}."
    )
    system_prompt += "\n\n" + pip_install_prompt

    parser = vf.Parser(extract_fn=extract_boxed_answer)
    math_rubric = vf.MathRubric(parser=parser)
    return vf.PythonEnv(
        dataset=dataset,
        system_prompt=system_prompt,
        parser=parser,
        rubric=math_rubric,
        max_turns=max_turns,
        # python env args
        max_startup_wait_seconds=max_startup_wait_seconds,
        pip_install_packages=pip_install_packages,
        # sandbox env args
        cpu_cores=sandbox_cpu_cores,
        memory_gb=sandbox_memory_gb,
        disk_size_gb=sandbox_disk_size_gb,
        gpu_count=sandbox_gpu_count,
        timeout_minutes=sandbox_timeout_minutes,
        timeout_per_command_seconds=sandbox_timeout_per_command_seconds,
        sandbox_client_max_workers=sandbox_client_max_workers,
        **kwargs,
    )

How It Works

1. Dataset Loading

The environment uses the load_example_dataset utility to load math problems:
dataset = load_example_dataset("math", "train", n=num_train_examples)
Supported datasets:
  • "math" - MATH competition problems (training: 7,500 problems)
  • "math500" - MATH-500 benchmark (500 test problems)
  • "aime2024", "aime2025" - AIME competition problems
  • "gsm8k" - Grade school math (see GSM8K example)
Dataset format:
{
    "question": "What is the value of $\\sqrt{3^2 + 4^2}$?",
    "answer": "5"
}
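
If you bring a custom dataset, it can help to check each record against the shape shown above before building the environment. The following is a minimal sketch, assuming records are plain dicts with string `question` and `answer` fields; `validate_record` is a hypothetical helper for illustration, not part of verifiers, and the second record is made up.

```python
from typing import Any, Dict


def validate_record(record: Dict[str, Any]) -> None:
    """Raise ValueError unless question/answer are non-empty strings.

    Hypothetical helper for illustration; not part of verifiers.
    """
    for key in ("question", "answer"):
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"record needs a non-empty string field: {key!r}")


custom_records = [
    {"question": "What is the value of $\\sqrt{3^2 + 4^2}$?", "answer": "5"},
    {"question": "What is $2^{10}$?", "answer": "1024"},  # made-up example
]

for record in custom_records:
    validate_record(record)  # passes silently when the record is well formed
```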

2. System Prompt

The system prompt:
  • Instructs the model to use Python for calculations
  • Asks for the final answer in \boxed{} notation
  • Lists the available packages (numpy, sympy, scipy by default)
system_prompt = (
    "Use Python for all calculations. Give your answer inside \\boxed{}.\n\n"
    "In addition to the Python standard library, you have access to: numpy sympy scipy."
)

3. Answer Parsing

The extract_boxed_answer function extracts content from LaTeX \boxed{} notation:
parser = vf.Parser(extract_fn=extract_boxed_answer)

# Example: "The answer is \\boxed{42}" → "42"
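
To make the extraction step concrete, here is a simplified sketch of brace-matched \boxed{} extraction. It is not the library's extract_boxed_answer: the name `extract_last_boxed` and the choice to return None when no box is found are assumptions made for illustration.

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None.

    Hypothetical stand-in, not the verifiers implementation.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    content_start = start + len(marker)
    depth = 1  # we are already inside the opening brace of \boxed{
    for i in range(content_start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[content_start:i]
    return None  # braces never balanced


print(extract_last_boxed("The answer is \\boxed{42}"))
```

Counting braces rather than using a simple regex keeps nested LaTeX such as \boxed{\frac{1}{2}} intact.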

4. Math Verification

MathRubric provides symbolic math verification:
math_rubric = vf.MathRubric(parser=parser)
Features:
  • Symbolic equivalence checking (e.g., “1/2” equals “0.5”)
  • LaTeX expression normalization
  • Floating-point tolerance for numerical answers
  • Returns 1.0 for correct answers, 0.0 otherwise
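
The idea behind equivalence checking can be illustrated with the standard library alone. The sketch below is a numeric-only stand-in, not MathRubric's implementation: `answers_match` is a hypothetical function that compares plain numbers exactly via `fractions.Fraction`, applies a relative tolerance, and falls back to string equality for anything it cannot parse.

```python
import math
from fractions import Fraction


def answers_match(predicted: str, expected: str, rel_tol: float = 1e-9) -> float:
    """Return 1.0 if two answer strings denote the same number, else 0.0.

    Hypothetical sketch; real symbolic verification handles far more cases.
    """
    try:
        # Fraction parses "1/2", "0.5", and "5" into exact rationals.
        p = Fraction(predicted.strip())
        e = Fraction(expected.strip())
    except (ValueError, ZeroDivisionError):
        # Not a plain number: fall back to exact string comparison.
        return 1.0 if predicted.strip() == expected.strip() else 0.0
    if p == e or math.isclose(float(p), float(e), rel_tol=rel_tol):
        return 1.0
    return 0.0


print(answers_match("1/2", "0.5"))
```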

5. Python Sandbox Environment

PythonEnv provides:
  • Isolated execution environment (Docker container)
  • Persistent Python REPL session
  • Pre-installed packages (numpy, sympy, scipy)
  • Configurable resources (CPU, memory, disk)
  • Automatic cleanup after rollouts
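
"Persistent REPL session" means that variables defined in one tool call are still available in the next. The toy class below illustrates only that property; it runs code in-process with `exec`, offers no isolation at all, and is not how PythonEnv is implemented. `TinyRepl` is a hypothetical name.

```python
import contextlib
import io


class TinyRepl:
    """Toy in-process stand-in for a persistent Python session.

    Illustration only: no sandboxing, never run untrusted code this way.
    """

    def __init__(self) -> None:
        # One namespace shared by every call, so state carries over.
        self.namespace: dict = {}

    def run(self, code: str) -> str:
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, self.namespace)
        return buffer.getvalue()


repl = TinyRepl()
repl.run("import math\nresult = math.sqrt(3**2 + 4**2)")
output = repl.run("print(result)")  # `result` survives from the first call
```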

Example Interaction

User: What is the value of $\sqrt{3^2 + 4^2}$?
Assistant: I’ll use Python to calculate this.
import math
result = math.sqrt(3**2 + 4**2)
print(result)
Tool Output: 5.0
Assistant: The value is $\boxed{5}$.
Result: ✓ Correct (reward = 1.0)

Running the Environment

Installation

# Install from environments directory
prime env install math-python

Quick Evaluation

# Evaluate with 10 problems
prime eval run math-python \
  -m openai/gpt-4.1-mini \
  -b https://api.openai.com/v1 \
  -k OPENAI_API_KEY \
  -n 10 \
  -r 1

Custom Configuration

# Use MATH-500 benchmark with more resources
prime eval run math-python \
  -m openai/gpt-4.1-mini \
  -a '{
    "dataset_name": "math500",
    "dataset_split": "test",
    "sandbox_cpu_cores": 2,
    "sandbox_memory_gb": 4,
    "pip_install_packages": "numpy sympy scipy matplotlib"
  }' \
  -n 50 \
  -r 4
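
Hand-writing JSON inside shell quotes is easy to get wrong. One option, sketched here, is to build the `-a` argument from a Python dict and let `json.dumps` and `shlex.join` handle the quoting. The flags and values are the ones used in the command above; the script only prints the command and does not run it.

```python
import json
import shlex

# Environment arguments, mirroring the custom configuration above.
env_args = {
    "dataset_name": "math500",
    "dataset_split": "test",
    "sandbox_cpu_cores": 2,
    "sandbox_memory_gb": 4,
    "pip_install_packages": "numpy sympy scipy matplotlib",
}

command = [
    "prime", "eval", "run", "math-python",
    "-m", "openai/gpt-4.1-mini",
    "-a", json.dumps(env_args),
    "-n", "50",
    "-r", "4",
]

# shlex.join quotes the JSON so it can be pasted into a POSIX shell.
print(shlex.join(command))
```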

Configuration Options

| Parameter | Default | Description |
|---|---|---|
| dataset_name | "math" | Dataset to use (math, math500, aime2024, etc.) |
| dataset_split | "train" | Dataset split (train, test) |
| num_train_examples | -1 | Number of examples (-1 = all) |
| max_turns | 100 | Maximum interaction turns |
| pip_install_packages | "numpy sympy scipy" | Space-separated package list |
| sandbox_cpu_cores | 1 | CPU cores for sandbox |
| sandbox_memory_gb | 2 | Memory in GB |
| sandbox_disk_size_gb | 5 | Disk size in GB |
| sandbox_timeout_minutes | 60 | Sandbox lifetime timeout |

Key Features

Sandboxed Execution

  • Isolation: Each rollout gets a fresh sandbox container
  • Security: No access to host filesystem or network (by default)
  • Resource limits: Configurable CPU, memory, and disk quotas
  • Automatic cleanup: Containers are destroyed after rollouts

Package Management

Customize available packages:
env = load_environment(
    pip_install_packages="numpy sympy scipy matplotlib pandas"
)
Or restrict to standard library only:
env = load_environment(
    pip_install_packages=""  # Empty string = standard library only
)

Multi-Turn Interaction

The environment supports iterative problem-solving:
  1. Model writes Python code
  2. Code executes in sandbox
  3. Model sees output and continues reasoning
  4. Repeats until model provides final answer or hits max_turns
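
The loop above can be sketched with stubs in place of the model and the sandbox. Everything here is hypothetical scaffolding for illustration (`run_episode`, `scripted_model`, `fake_execute`, and the dict-based action format); the real environment manages messages, tool calls, and sandbox state itself.

```python
from typing import Callable, List, Optional, Tuple

Transcript = List[Tuple[str, str]]  # (code, output) pairs seen so far


def run_episode(
    model_step: Callable[[Transcript], dict],
    execute: Callable[[str], str],
    max_turns: int = 100,
) -> Optional[str]:
    """Alternate model and executor until a final answer or max_turns."""
    transcript: Transcript = []
    for _ in range(max_turns):
        action = model_step(transcript)
        if "answer" in action:
            return action["answer"]  # model committed to a final answer
        output = execute(action["code"])  # steps 1 and 2: write and run code
        transcript.append((action["code"], output))  # step 3: model sees output
    return None  # hit max_turns without an answer


def scripted_model(transcript: Transcript) -> dict:
    """Stub model: run one calculation, then answer with its output."""
    if not transcript:
        return {"code": "print(3**2 + 4**2)"}
    return {"answer": transcript[-1][1].strip()}


def fake_execute(code: str) -> str:
    """Stub executor that returns a canned result instead of running code."""
    return "25\n"


final = run_episode(scripted_model, fake_execute, max_turns=5)
```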

Metrics Tracked

  • correct_answer: 1.0 if answer matches ground truth, 0.0 otherwise
  • num_turns: Number of model-environment interactions
  • sandbox_ready_wait_time: Time to initialize sandbox (seconds)
  • sandbox_command_execution_time: Total time executing Python code
  • python_ready_wait_time: Time to start Python REPL

Advanced Usage

Custom Answer Extraction

Provide your own answer extraction logic:
def custom_extract_answer(text: str) -> str:
    """Extract answer from custom format."""
    if "ANSWER:" in text:
        return text.split("ANSWER:")[1].strip()
    return text

parser = vf.Parser(extract_fn=custom_extract_answer)
rubric = vf.MathRubric(parser=parser)
env = vf.PythonEnv(
    dataset=dataset,
    parser=parser,
    rubric=rubric,
    system_prompt="Solve the problem and format your answer as ANSWER: <value>"
)

Custom Reward Functions

Add additional reward signals:
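async def efficiency_bonus(state, answer) -> float:
    """Small bonus when the final message contains the answer in under 5 turns."""
    num_turns = state.get("turn", 0)
    completion = state.get("completion") or []
    # Guard against an empty completion before reading the last message.
    final_message = (completion[-1].get("content") or "").strip() if completion else ""
    if answer in final_message and num_turns < 5:
        return 0.2  # Bonus for solving quickly
    return 0.0

math_rubric.add_reward_func(efficiency_bonus, weight=1.0)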

Related Examples

  • GSM8K - Single-turn math reasoning without code execution
  • Wiki Search - Tool environment with custom tools
  • Browser Examples - More complex stateful environments

Next Steps

  • Learn about Environments to understand the architecture
  • See Sandboxes for more on containerized execution
  • Explore Rubrics for custom evaluation logic
