Math with Python Execution

This example demonstrates how to create an environment where models can solve math problems by writing and executing Python code. The environment provides a sandboxed Python REPL with scientific computing libraries.

Overview

The Math Python environment combines:

Dataset: MATH competition problems (or custom math datasets)
Tools: Python REPL with numpy, sympy, scipy
Evaluation: Symbolic math verification using \boxed{} answer format
Sandbox: Isolated execution environment with configurable resources

Complete Implementation

Here’s the full working implementation from environments/math_python/math_python.py:

import verifiers as vf
from verifiers.utils.data_utils import extract_boxed_answer, load_example_dataset


def load_environment(
    dataset_name: str = "math",
    dataset_split: str = "train",
    num_train_examples: int = -1,
    max_turns: int = 100,
    max_startup_wait_seconds: int = 60,
    pip_install_packages: str = "numpy sympy scipy",
    sandbox_cpu_cores: int = 1,
    sandbox_memory_gb: int = 2,
    sandbox_disk_size_gb: int = 5,
    sandbox_gpu_count: int = 0,
    sandbox_timeout_minutes: int = 60,
    sandbox_timeout_per_command_seconds: int = 60,
    sandbox_client_max_workers: int = 50,
    **kwargs,
):
    dataset = load_example_dataset(dataset_name, dataset_split, n=num_train_examples)
    pip_install_prompt = (
        f"In addition to the Python standard library, you have access to: {pip_install_packages}."
        if pip_install_packages.strip()
        else "You may only use the Python standard library."
    )
    system_prompt = (
        "Use Python for all calculations. Give your answer inside \\boxed{}."
    )
    system_prompt += "\n\n" + pip_install_prompt

    parser = vf.Parser(extract_fn=extract_boxed_answer)
    math_rubric = vf.MathRubric(parser=parser)
    return vf.PythonEnv(
        dataset=dataset,
        system_prompt=system_prompt,
        parser=parser,
        rubric=math_rubric,
        max_turns=max_turns,
        # python env args
        max_startup_wait_seconds=max_startup_wait_seconds,
        pip_install_packages=pip_install_packages,
        # sandbox env args
        cpu_cores=sandbox_cpu_cores,
        memory_gb=sandbox_memory_gb,
        disk_size_gb=sandbox_disk_size_gb,
        gpu_count=sandbox_gpu_count,
        timeout_minutes=sandbox_timeout_minutes,
        timeout_per_command_seconds=sandbox_timeout_per_command_seconds,
        sandbox_client_max_workers=sandbox_client_max_workers,
        **kwargs,
    )

How It Works

1. Dataset Loading

The environment uses the load_example_dataset utility to load math problems:

dataset = load_example_dataset("math", "train", n=num_train_examples)

Supported datasets:

"math" - MATH competition problems (training: 7,500 problems)
"math500" - MATH-500 benchmark (500 test problems)
"aime2024", "aime2025" - AIME competition problems
"gsm8k" - Grade school math (see GSM8K example)

Dataset format:

{
    "question": "What is the value of $\\sqrt{3^2 + 4^2}$?",
    "answer": "5"
}

2. System Prompt

The system prompt instructs the model to:

Use Python for calculations
Format final answers using \boxed{} notation
Lists available packages (numpy, sympy, scipy by default)

system_prompt = (
    "Use Python for all calculations. Give your answer inside \\boxed{}.\n\n"
    "In addition to the Python standard library, you have access to: numpy sympy scipy."
)

3. Answer Parsing

The extract_boxed_answer function extracts content from LaTeX \boxed{} notation:

parser = vf.Parser(extract_fn=extract_boxed_answer)

# Example: "The answer is \\boxed{42}" → "42"

4. Math Verification

MathRubric provides symbolic math verification:

math_rubric = vf.MathRubric(parser=parser)

Features:

Symbolic equivalence checking (e.g., “1/2” equals “0.5”)
LaTeX expression normalization
Floating-point tolerance for numerical answers
Returns 1.0 for correct answers, 0.0 otherwise

5. Python Sandbox Environment

PythonEnv provides:

Isolated execution environment (Docker container)
Persistent Python REPL session
Pre-installed packages (numpy, sympy, scipy)
Configurable resources (CPU, memory, disk)
Automatic cleanup after rollouts

Example Interaction

Model Interaction
Dataset Sample
Expected Solution

User: What is the value of

\sqrt{3^2 + 4^2}

?Assistant: I’ll use Python to calculate this.

import math
result = math.sqrt(3**2 + 4**2)
print(result)

Tool Output: 5.0Assistant: The value is

\boxed{5}

Result: ✓ Correct (reward = 1.0)

{
    "question": "Find the largest prime factor of $9879$.",
    "answer": "89"
}

def largest_prime_factor(n):
    i = 2
    while i * i <= n:
        if n % i:
            i += 1
        else:
            n //= i
    return n

result = largest_prime_factor(9879)
print(result)  # Output: 89

Final answer:

\boxed{89}

Running the Environment

Installation

# Install from environments directory
prime env install math-python

Quick Evaluation

# Evaluate with 10 problems
prime eval run math-python \
  -m openai/gpt-4.1-mini \
  -b https://api.openai.com/v1 \
  -k OPENAI_API_KEY \
  -n 10 \
  -r 1

Custom Configuration

# Use MATH-500 benchmark with more resources
prime eval run math-python \
  -m openai/gpt-4.1-mini \
  -a '{
    "dataset_name": "math500",
    "dataset_split": "test",
    "sandbox_cpu_cores": 2,
    "sandbox_memory_gb": 4,
    "pip_install_packages": "numpy sympy scipy matplotlib"
  }' \
  -n 50 \
  -r 4

Configuration Options

Parameter	Default	Description
`dataset_name`	`"math"`	Dataset to use (math, math500, aime2024, etc.)
`dataset_split`	`"train"`	Dataset split (train, test)
`num_train_examples`	`-1`	Number of examples (-1 = all)
`max_turns`	`100`	Maximum interaction turns
`pip_install_packages`	`"numpy sympy scipy"`	Space-separated package list
`sandbox_cpu_cores`	`1`	CPU cores for sandbox
`sandbox_memory_gb`	`2`	Memory in GB
`sandbox_disk_size_gb`	`5`	Disk size in GB
`sandbox_timeout_minutes`	`60`	Sandbox lifetime timeout

Key Features

Sandboxed Execution

Isolation: Each rollout gets a fresh sandbox container
Security: No access to host filesystem or network (by default)
Resource limits: Configurable CPU, memory, and disk quotas
Automatic cleanup: Containers are destroyed after rollouts

Package Management

Customize available packages:

env = load_environment(
    pip_install_packages="numpy sympy scipy matplotlib pandas"
)

Or restrict to standard library only:

env = load_environment(
    pip_install_packages=""  # Empty string = standard library only
)

Multi-Turn Interaction

The environment supports iterative problem-solving:

Model writes Python code
Code executes in sandbox
Model sees output and continues reasoning
Repeats until model provides final answer or hits max_turns

Metrics Tracked

correct_answer: 1.0 if answer matches ground truth, 0.0 otherwise
num_turns: Number of model-environment interactions
sandbox_ready_wait_time: Time to initialize sandbox (seconds)
sandbox_command_execution_time: Total time executing Python code
python_ready_wait_time: Time to start Python REPL

Advanced Usage

Custom Answer Extraction

Provide your own answer extraction logic:

def custom_extract_answer(text: str) -> str:
    """Extract answer from custom format."""
    if "ANSWER:" in text:
        return text.split("ANSWER:")[1].strip()
    return text

parser = vf.Parser(extract_fn=custom_extract_answer)
rubric = vf.MathRubric(parser=parser)
env = vf.PythonEnv(
    dataset=dataset,
    parser=parser,
    rubric=rubric,
    system_prompt="Solve the problem and format your answer as ANSWER: <value>"
)

Custom Reward Functions

Add additional reward signals:

async def efficiency_bonus(state, answer) -> float:
    """Reward shorter solutions."""
    num_turns = state.get("turn", 0)
    is_correct = state.get("completion", [])[-1].get("content", "").strip()
    if answer in is_correct and num_turns < 5:
        return 0.2  # Bonus for solving quickly
    return 0.0

math_rubric.add_reward_func(efficiency_bonus, weight=1.0)

GSM8K - Single-turn math reasoning without code execution
Wiki Search - Tool environment with custom tools
Browser Examples - More complex stateful environments

Next Steps

Learn about Environments to understand the architecture
See Sandboxes for more on containerized execution
Explore Rubrics for custom evaluation logic

Example Environments

Math with Python Execution

Overview

Complete Implementation

How It Works

1. Dataset Loading

2. System Prompt

3. Answer Parsing

4. Math Verification

5. Python Sandbox Environment

Example Interaction

Running the Environment

Installation

Quick Evaluation

Custom Configuration

Configuration Options

Key Features

Sandboxed Execution

Package Management

Multi-Turn Interaction

Metrics Tracked

Advanced Usage

Custom Answer Extraction

Custom Reward Functions

Next Steps

Build docs developers (and LLMs) love

Example Environments

Documentation Index

​Overview

​Complete Implementation

​How It Works

​1. Dataset Loading

​2. System Prompt

​3. Answer Parsing

​4. Math Verification

​5. Python Sandbox Environment

​Example Interaction

​Running the Environment

​Installation

​Quick Evaluation

​Custom Configuration

​Configuration Options

​Key Features

​Sandboxed Execution

​Package Management

​Multi-Turn Interaction

​Metrics Tracked

​Advanced Usage

​Custom Answer Extraction

​Custom Reward Functions

​Related Examples

​Next Steps

Build docs developers (and LLMs) love

Overview

Complete Implementation

How It Works

1. Dataset Loading

2. System Prompt

3. Answer Parsing

4. Math Verification

5. Python Sandbox Environment

Example Interaction

Running the Environment

Installation

Quick Evaluation

Custom Configuration

Configuration Options

Key Features

Sandboxed Execution

Package Management

Multi-Turn Interaction

Metrics Tracked

Advanced Usage

Custom Answer Extraction

Custom Reward Functions

Related Examples

Next Steps