Documentation Index
Fetch the complete documentation index at: https://mintlify.com/primeintellect-ai/verifiers/llms.txt
Use this file to discover all available pages before exploring further.
This example demonstrates the simplest type of Verifiers environment: single-turn question answering with symbolic verification. Models solve grade school math word problems using natural language reasoning.
Overview
The GSM8K environment provides:
- Dataset: 8.5K grade school math word problems
- Format: Single-turn response (no tools or code execution)
- Evaluation: Symbolic math verification of final answers
- Difficulty: Elementary to middle school level
Complete Implementation
Here’s the full working implementation from environments/gsm8k/gsm8k.py:
import verifiers as vf
from verifiers.utils.data_utils import (
BOXED_SYSTEM_PROMPT,
load_example_dataset,
)
def load_environment(
system_prompt: str = BOXED_SYSTEM_PROMPT,
num_train_examples=-1,
num_eval_examples=-1,
):
dataset = load_example_dataset("gsm8k", split="train")
if num_train_examples != -1:
dataset = dataset.select(range(num_train_examples))
eval_dataset = load_example_dataset("gsm8k", split="test")
if num_eval_examples != -1:
eval_dataset = eval_dataset.select(range(num_eval_examples))
rubric = vf.MathRubric()
vf_env = vf.SingleTurnEnv(
dataset=dataset,
eval_dataset=eval_dataset,
system_prompt=system_prompt,
parser=rubric.parser,
rubric=rubric,
)
return vf_env
How It Works
1. Dataset Loading
GSM8K contains grade school math word problems:
dataset = load_example_dataset("gsm8k", split="train")
eval_dataset = load_example_dataset("gsm8k", split="test")
Dataset size:
- Training: 7,473 problems
- Test: 1,319 problems
Dataset format:
{
"question": "Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
"answer": "18" # Extracted from "#### 18"
}
2. System Prompt
The default system prompt uses the standard BOXED_SYSTEM_PROMPT:
BOXED_SYSTEM_PROMPT = "Please reason step by step, and put your final answer within \\boxed{}."
This encourages:
- Step-by-step reasoning (chain of thought)
- Clear final answer formatting using
\boxed{}
3. Evaluation with MathRubric
MathRubric handles answer extraction and verification:
Features:
- Extracts answers from
\boxed{} notation
- Symbolic equivalence checking (handles different representations)
- Numerical tolerance for floating-point answers
- Returns 1.0 for correct, 0.0 for incorrect
4. Single-Turn Environment
SingleTurnEnv is the simplest environment type:
- Model receives question
- Model generates single response
- Response is evaluated
- No multi-turn interaction or tools
vf_env = vf.SingleTurnEnv(
dataset=dataset,
eval_dataset=eval_dataset,
system_prompt=system_prompt,
parser=rubric.parser,
rubric=rubric,
)
Example Interactions
Example 1: Ducks
Example 2: Store
Example 3: Incorrect
Question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers’ market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers’ market?Model Response:
Let me work through this step-by-step:
- Janet’s ducks lay 16 eggs per day
- She eats 3 eggs for breakfast
- She uses 4 eggs for muffins
- Eggs remaining: 16 - 3 - 4 = 9 eggs
- She sells each egg for $2
- Total earnings: 9 × 2=18
Therefore, Janet makes 18 dollars every day at the farmers’ market.Result: ✓ Correct (reward = 1.0) Question: A clothing store has an inventory of 34 ties, 40 belts, 63 black shirts, and 42 white shirts. The number of jeans in the store is two-thirds the sum of black and white shirts, and the number of scarves is half the number of the sum of ties and belts. How many more jeans are there than scarves?Model Response:
Let’s break this down:
-
Black shirts: 63
-
White shirts: 42
-
Sum of shirts: 63 + 42 = 105
-
Number of jeans: (2/3) × 105 = 70
-
Ties: 34
-
Belts: 40
-
Sum of ties and belts: 34 + 40 = 74
-
Number of scarves: 74 / 2 = 37
-
Difference: 70 - 37 = 33
The answer is 33.Result: ✓ Correct (reward = 1.0) Question: If there are 10 eggs in a basket and you take away 3, how many eggs do you have?Model Response:
If I start with 10 eggs and take away 3:
10 - 3 = 7I have 7 eggs.Ground Truth: 3 (the eggs you took)Result: ✗ Incorrect (reward = 0.0)Note: This is a trick question - you have the 3 eggs you took, not the 7 remaining in the basket.
Running the Environment
Installation
# Install from environments directory
prime env install gsm8k
Quick Evaluation
# Evaluate on 20 test problems with 5 rollouts each
prime eval run gsm8k \
-m openai/gpt-4.1-mini \
-b https://api.openai.com/v1 \
-k OPENAI_API_KEY \
-n 20 \
-r 5
Training Dataset
# Use training split for data generation
prime eval run gsm8k \
-m openai/gpt-4.1-mini \
--split train \
-n 100 \
-r 8
Custom Configuration
# Limit dataset size and customize prompt
prime eval run gsm8k \
-m openai/gpt-4.1-mini \
-a '{
"num_train_examples": 100,
"num_eval_examples": 50,
"system_prompt": "Solve the math problem step-by-step. Show your work and put your final numerical answer in \\\\boxed{}."
}' \
-n 50 \
-r 4
Configuration Options
| Parameter | Default | Description |
|---|
system_prompt | BOXED_SYSTEM_PROMPT | Instructions for the model |
num_train_examples | -1 | Number of training examples (-1 = all) |
num_eval_examples | -1 | Number of eval examples (-1 = all) |
Key Features
Separate Train/Eval Datasets
GSM8K demonstrates proper train/eval split:
dataset = load_example_dataset("gsm8k", split="train") # 7,473 problems
eval_dataset = load_example_dataset("gsm8k", split="test") # 1,319 problems
When running prime eval run, the eval dataset is used by default.
MathRubric handles various answer formats:
# All equivalent:
"\\boxed{18}" # Standard LaTeX
"\\boxed{18.0}" # Floating point
"\\boxed{18.00}" # Extra precision
"$\\boxed{18}$" # With dollar signs
Symbolic Verification
MathRubric uses symbolic math verification:
# These are all considered correct for answer "0.5":
"\\boxed{0.5}" # Decimal
"\\boxed{1/2}" # Fraction
"\\boxed{0.50}" # Extra zeros
"\\boxed{.5}" # No leading zero
Metrics Tracked
correct_answer: 1.0 if answer matches ground truth, 0.0 otherwise
reward: Same as correct_answer for this simple environment
Advanced Usage
Custom System Prompts
Experiment with different prompting strategies:
# Chain of thought with explicit steps
COT_PROMPT = """Solve this math problem step-by-step.
1. Identify what is being asked
2. List the given information
3. Perform the calculations
4. State your final answer in \\boxed{}
"""
env = load_environment(system_prompt=COT_PROMPT)
Filtering Dataset
Create custom dataset subsets:
import verifiers as vf
from verifiers.utils.data_utils import load_example_dataset
def load_environment():
dataset = load_example_dataset("gsm8k", split="test")
# Filter for problems containing specific keywords
def is_about_money(example):
question = example["question"].lower()
return "dollar" in question or "$" in question
dataset = dataset.filter(is_about_money)
rubric = vf.MathRubric()
return vf.SingleTurnEnv(
dataset=dataset,
system_prompt="Solve this money-related math problem.",
rubric=rubric,
)
Adding Metrics
Track additional metrics beyond correctness:
import verifiers as vf
def load_environment():
dataset = load_example_dataset("gsm8k", split="test")
rubric = vf.MathRubric()
# Track response length
async def response_length(completion) -> float:
return float(len(completion[-1]["content"]))
# Track number of steps
async def num_steps(completion) -> float:
content = completion[-1]["content"]
# Count numbered steps like "1.", "2.", etc.
import re
steps = re.findall(r'^\d+\.', content, re.MULTILINE)
return float(len(steps))
rubric.add_metric(response_length)
rubric.add_metric(num_steps)
return vf.SingleTurnEnv(
dataset=dataset,
rubric=rubric,
)
Comparison with Math Python
| Feature | GSM8K | Math Python |
|---|
| Interaction | Single-turn | Multi-turn |
| Tools | None | Python REPL |
| Sandbox | No | Yes (containerized) |
| Difficulty | Elementary/Middle school | Competition math |
| Reasoning | Natural language | Code + reasoning |
| Startup time | Instant | ~10-30s (sandbox) |
| Use case | Quick evaluation | Complex computation |
Next Steps