GSM8K Math Reasoning

This example demonstrates the simplest type of Verifiers environment: single-turn question answering with symbolic verification. Models solve grade school math word problems using natural language reasoning.

Overview

The GSM8K environment provides:

Dataset: 8.5K grade school math word problems
Format: Single-turn response (no tools or code execution)
Evaluation: Symbolic math verification of final answers
Difficulty: Elementary to middle school level

Complete Implementation

Here’s the full working implementation from environments/gsm8k/gsm8k.py:

import verifiers as vf
from verifiers.utils.data_utils import (
    BOXED_SYSTEM_PROMPT,
    load_example_dataset,
)


def load_environment(
    system_prompt: str = BOXED_SYSTEM_PROMPT,
    num_train_examples=-1,
    num_eval_examples=-1,
):
    dataset = load_example_dataset("gsm8k", split="train")
    if num_train_examples != -1:
        dataset = dataset.select(range(num_train_examples))
    eval_dataset = load_example_dataset("gsm8k", split="test")
    if num_eval_examples != -1:
        eval_dataset = eval_dataset.select(range(num_eval_examples))

    rubric = vf.MathRubric()
    vf_env = vf.SingleTurnEnv(
        dataset=dataset,
        eval_dataset=eval_dataset,
        system_prompt=system_prompt,
        parser=rubric.parser,
        rubric=rubric,
    )
    return vf_env

How It Works

1. Dataset Loading

GSM8K contains grade school math word problems:

dataset = load_example_dataset("gsm8k", split="train")
eval_dataset = load_example_dataset("gsm8k", split="test")

Dataset size:

Training: 7,473 problems
Test: 1,319 problems

Dataset format:

{
    "question": "Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
    "answer": "18"  # Extracted from "#### 18"
}

2. System Prompt

The default system prompt uses the standard BOXED_SYSTEM_PROMPT:

BOXED_SYSTEM_PROMPT = "Please reason step by step, and put your final answer within \\boxed{}."

This encourages:

Step-by-step reasoning (chain of thought)
Clear final answer formatting using \boxed{}

3. Evaluation with MathRubric

MathRubric handles answer extraction and verification:

rubric = vf.MathRubric()

Features:

Extracts answers from \boxed{} notation
Symbolic equivalence checking (handles different representations)
Numerical tolerance for floating-point answers
Returns 1.0 for correct, 0.0 for incorrect

4. Single-Turn Environment

SingleTurnEnv is the simplest environment type:

Model receives question
Model generates single response
Response is evaluated
No multi-turn interaction or tools

vf_env = vf.SingleTurnEnv(
    dataset=dataset,
    eval_dataset=eval_dataset,
    system_prompt=system_prompt,
    parser=rubric.parser,
    rubric=rubric,
)

Example Interactions

Example 1: Ducks
Example 2: Store
Example 3: Incorrect

Question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers’ market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers’ market?Model Response: Let me work through this step-by-step:

Janet’s ducks lay 16 eggs per day
She eats 3 eggs for breakfast
She uses 4 eggs for muffins
Eggs remaining: 16 - 3 - 4 = 9 eggs
She sells each egg for $2
Total earnings: 9 × $2 =$ 18

Therefore, Janet makes

\boxed{18}

dollars every day at the farmers’ market.Result: ✓ Correct (reward = 1.0)

Running the Environment

Installation

# Install from environments directory
prime env install gsm8k

Quick Evaluation

# Evaluate on 20 test problems with 5 rollouts each
prime eval run gsm8k \
  -m openai/gpt-4.1-mini \
  -b https://api.openai.com/v1 \
  -k OPENAI_API_KEY \
  -n 20 \
  -r 5

Training Dataset

# Use training split for data generation
prime eval run gsm8k \
  -m openai/gpt-4.1-mini \
  --split train \
  -n 100 \
  -r 8

Custom Configuration

# Limit dataset size and customize prompt
prime eval run gsm8k \
  -m openai/gpt-4.1-mini \
  -a '{
    "num_train_examples": 100,
    "num_eval_examples": 50,
    "system_prompt": "Solve the math problem step-by-step. Show your work and put your final numerical answer in \\\\boxed{}."
  }' \
  -n 50 \
  -r 4

Configuration Options

Parameter	Default	Description
`system_prompt`	`BOXED_SYSTEM_PROMPT`	Instructions for the model
`num_train_examples`	`-1`	Number of training examples (-1 = all)
`num_eval_examples`	`-1`	Number of eval examples (-1 = all)

Key Features

Separate Train/Eval Datasets

GSM8K demonstrates proper train/eval split:

dataset = load_example_dataset("gsm8k", split="train")        # 7,473 problems
eval_dataset = load_example_dataset("gsm8k", split="test")   # 1,319 problems

When running prime eval run, the eval dataset is used by default.

Answer Format Flexibility

MathRubric handles various answer formats:

# All equivalent:
"\\boxed{18}"      # Standard LaTeX
"\\boxed{18.0}"    # Floating point
"\\boxed{18.00}"   # Extra precision
"$\\boxed{18}$"    # With dollar signs

Symbolic Verification

MathRubric uses symbolic math verification:

# These are all considered correct for answer "0.5":
"\\boxed{0.5}"     # Decimal
"\\boxed{1/2}"     # Fraction
"\\boxed{0.50}"    # Extra zeros
"\\boxed{.5}"      # No leading zero

Metrics Tracked

correct_answer: 1.0 if answer matches ground truth, 0.0 otherwise
reward: Same as correct_answer for this simple environment

Advanced Usage

Custom System Prompts

Experiment with different prompting strategies:

# Chain of thought with explicit steps
COT_PROMPT = """Solve this math problem step-by-step.

1. Identify what is being asked
2. List the given information
3. Perform the calculations
4. State your final answer in \\boxed{}
"""

env = load_environment(system_prompt=COT_PROMPT)

Filtering Dataset

Create custom dataset subsets:

import verifiers as vf
from verifiers.utils.data_utils import load_example_dataset

def load_environment():
    dataset = load_example_dataset("gsm8k", split="test")
    
    # Filter for problems containing specific keywords
    def is_about_money(example):
        question = example["question"].lower()
        return "dollar" in question or "$" in question
    
    dataset = dataset.filter(is_about_money)
    
    rubric = vf.MathRubric()
    return vf.SingleTurnEnv(
        dataset=dataset,
        system_prompt="Solve this money-related math problem.",
        rubric=rubric,
    )

Adding Metrics

Track additional metrics beyond correctness:

import verifiers as vf

def load_environment():
    dataset = load_example_dataset("gsm8k", split="test")
    rubric = vf.MathRubric()
    
    # Track response length
    async def response_length(completion) -> float:
        return float(len(completion[-1]["content"]))
    
    # Track number of steps
    async def num_steps(completion) -> float:
        content = completion[-1]["content"]
        # Count numbered steps like "1.", "2.", etc.
        import re
        steps = re.findall(r'^\d+\.', content, re.MULTILINE)
        return float(len(steps))
    
    rubric.add_metric(response_length)
    rubric.add_metric(num_steps)
    
    return vf.SingleTurnEnv(
        dataset=dataset,
        rubric=rubric,
    )

Comparison with Math Python

Feature	GSM8K	Math Python
Interaction	Single-turn	Multi-turn
Tools	None	Python REPL
Sandbox	No	Yes (containerized)
Difficulty	Elementary/Middle school	Competition math
Reasoning	Natural language	Code + reasoning
Startup time	Instant	~10-30s (sandbox)
Use case	Quick evaluation	Complex computation

Math Python - Math problems with code execution
Wordle - Multi-turn game environment
Wiki Search - Tool-based question answering

Next Steps

Learn about SingleTurnEnv architecture
Explore MathRubric for symbolic verification
See Datasets for loading and formatting data

Example Environments

GSM8K Math Reasoning

Overview

Complete Implementation

How It Works

1. Dataset Loading

2. System Prompt

3. Evaluation with MathRubric

4. Single-Turn Environment

Example Interactions

Running the Environment

Installation

Quick Evaluation

Training Dataset

Custom Configuration

Configuration Options

Key Features

Separate Train/Eval Datasets

Answer Format Flexibility

Symbolic Verification

Metrics Tracked

Advanced Usage

Custom System Prompts

Filtering Dataset

Adding Metrics

Comparison with Math Python

Next Steps

Build docs developers (and LLMs) love

Example Environments

Documentation Index

​Overview

​Complete Implementation

​How It Works

​1. Dataset Loading

​2. System Prompt

​3. Evaluation with MathRubric

​4. Single-Turn Environment

​Example Interactions

​Running the Environment

​Installation

​Quick Evaluation

​Training Dataset

​Custom Configuration

​Configuration Options

​Key Features

​Separate Train/Eval Datasets

​Answer Format Flexibility

​Symbolic Verification

​Metrics Tracked

​Advanced Usage

​Custom System Prompts

​Filtering Dataset

​Adding Metrics

​Comparison with Math Python

​Related Examples

​Next Steps

Build docs developers (and LLMs) love

Overview

Complete Implementation

How It Works

1. Dataset Loading

2. System Prompt

3. Evaluation with MathRubric

4. Single-Turn Environment

Example Interactions

Running the Environment

Installation

Quick Evaluation

Training Dataset

Custom Configuration

Configuration Options

Key Features

Separate Train/Eval Datasets

Answer Format Flexibility

Symbolic Verification

Metrics Tracked

Advanced Usage

Custom System Prompts

Filtering Dataset

Adding Metrics

Comparison with Math Python

Related Examples

Next Steps