What Benchmarks Test
The benchmark suite evaluates models on fundamental capabilities through three distinct tests. Each benchmark measures the model’s ability to follow precise instructions and produce exact output without adding extraneous information:
- String Manipulation: Tests the model’s ability to accurately reverse strings
- Mathematical Reasoning: Validates arithmetic capabilities with large integers
- Exact Replication: Measures how precisely models can reproduce strings without modification
Available Benchmarks
String Reversal
Tests string reversal capability with random alphanumeric strings
Integer Addition
Evaluates large integer arithmetic accuracy
String Rehearsal
Validates exact string repetition without modification
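The three benchmarks above each start from randomly generated input. A minimal sketch of what that generation might look like follows; the function names and default sizes are illustrative assumptions, not the suite's actual code:

```python
import random
import string

def make_reversal_input(length=20):
    """Random alphanumeric string for the String Reversal benchmark."""
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choices(alphabet, k=length))

def make_addition_input(digits=30):
    """Two large random integers for the Integer Addition benchmark."""
    a = random.randrange(10 ** (digits - 1), 10 ** digits)
    b = random.randrange(10 ** (digits - 1), 10 ** digits)
    return a, b

def make_rehearsal_input(length=20):
    """Random string the model must repeat verbatim for String Rehearsal."""
    return make_reversal_input(length)
```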
General Configuration
All benchmarks share common configuration parameters defined in main.py:
- Tries: Number of iterations to run for each benchmark. More tries provide more statistically significant results.
- Timeout: Time in seconds to wait before stopping a response. Prevents infinite loops when models enter “death spirals”.
- Max output tokens: Maximum output tokens allowed per response. Can be increased for reasoning models.
- Reasoning effort: Reasoning effort level for models that support it (e.g., “low”, “medium”, “high”).
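Collected together, the shared configuration might look roughly like this; the key names are assumptions for illustration, and the real names live in main.py:

```python
# Illustrative defaults only -- see main.py for the actual parameter names.
CONFIG = {
    "tries": 10,                   # iterations per benchmark; more tries -> tighter statistics
    "timeout_seconds": 120,        # cut off responses that run too long ("death spirals")
    "max_output_tokens": 2048,     # per-response cap; raise for reasoning models
    "reasoning_effort": "medium",  # "low" | "medium" | "high", if the model supports it
}
```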
How Benchmarks Work
Each benchmark follows this workflow:
- Generate Random Input: Creates random test data (strings, integers, etc.)
- Construct Prompt: Builds a specific prompt instructing the model on the expected output
- API Call: Sends the prompt to the LLM via OpenAI-compatible API
- Validation: Compares the model’s response against expected output
- Logging: Records success/failure, duration, and reasoning (if available)
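The five steps above can be sketched as a single iteration; the helper names are hypothetical, and the toy `model_call` stub stands in for the real OpenAI-compatible API request:

```python
import time

def run_once(model_call, make_input, build_prompt, validate):
    """One benchmark iteration: generate -> prompt -> call -> validate -> record."""
    data = make_input()
    prompt = build_prompt(data)
    start = time.time()
    response = model_call(prompt)  # the real suite sends this via an OpenAI-compatible API
    return {
        "input": data,
        "success": validate(data, response),
        "duration": time.time() - start,
        "response": response,  # the suite also records reasoning, when available
    }

# Usage with a stand-in "model" that reverses the last token of the prompt:
res = run_once(
    model_call=lambda p: p.split()[-1][::-1],  # toy stub, not a real API call
    make_input=lambda: "abc123",
    build_prompt=lambda s: f"Reverse exactly, output only the result: {s}",
    validate=lambda s, out: out.strip() == s[::-1],
)
```

Each such record is what gets serialized to the JSON results files.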
All benchmark results are saved to JSON files in the results/ directory, with detailed logs in the logs/ directory.
Result Structure
Each benchmark produces a structured JSON record containing the success/failure outcome, the response duration, and the model’s reasoning (when available).
Success Criteria
Benchmarks use strict validation:
- Exact Match: Most benchmarks require exact string matching (after stripping whitespace)
- No Extra Output: Models must not include explanations, quotes, or additional text
- Numeric Validation: Integer benchmarks validate that output is parseable as a number
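A minimal sketch of these criteria, assuming the suite strips only surrounding whitespace before comparing (function names are illustrative):

```python
def validate_exact(expected: str, response: str) -> bool:
    """Exact match after stripping surrounding whitespace; any extra text fails."""
    return response.strip() == expected

def validate_integer(expected: int, response: str) -> bool:
    """Response must parse as an integer and equal the expected value."""
    try:
        return int(response.strip()) == expected
    except ValueError:
        return False
```

Note that even a polite preamble like “The answer is …” fails exact-match validation, which is the point: the benchmarks test instruction-following as much as the underlying capability.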