Result File
Benchmark results are saved to JSON files in the `results/` directory.
File Naming
Result files are named using the pattern `result_{model}_{timestamp}.json`:

- `{model}` - the model name with forward slashes removed (e.g., `gpt-4` or `deepseek-r132b`)
- `{timestamp}` - run start time in the format `YYYY-MM-DD_HH-MM-SS`

Example: `result_gpt-4_2026-03-03_14-30-45.json`
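A minimal sketch of how such a filename could be derived. The function name and the exact sanitization rule are assumptions for illustration, not taken from the source:

```python
from datetime import datetime

def result_filename(model: str, start_time: datetime) -> str:
    """Build a result file path like results/result_gpt-4_2026-03-03_14-30-45.json."""
    safe_model = model.replace("/", "")  # strip forward slashes from the model name
    timestamp = start_time.strftime("%Y-%m-%d_%H-%M-%S")
    return f"results/result_{safe_model}_{timestamp}.json"

print(result_filename("gpt-4", datetime(2026, 3, 3, 14, 30, 45)))
# -> results/result_gpt-4_2026-03-03_14-30-45.json
```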
JSON Structure
The result file contains two main sections: `header` and `benchmarkresults`.
Header Format
The header is built by the `build_header()` function (`main.py:110-122`) and contains metadata about the benchmark run.
Metadata about the benchmark run configuration and timing.
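The exact header keys are not listed in this section, so the following `build_header()`-style sketch uses illustrative field names that may differ from those in `main.py`:

```python
import json
from datetime import datetime

def build_header(model: str, tries: int, start_time: datetime) -> dict:
    """Sketch of a build_header()-style helper; field names here are assumptions."""
    return {
        "model": model,        # model requested for the run
        "tries": tries,        # configured iterations per benchmark
        "start_time": start_time.strftime("%Y-%m-%d_%H-%M-%S"),
    }

header = build_header("gpt-4", 100, datetime(2026, 3, 3, 14, 30, 45))
print(json.dumps(header, indent=2))
```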
Benchmark Results Structure
The `benchmarkresults` object contains results for three benchmark types. Each benchmark has the same structure.
Contains results for all benchmark tests.
Benchmark Object
Each benchmark type follows this structure:

Human-readable name of the benchmark test. One of:

- `"String Reversal"` - tests reversing random strings
- `"Add two intigers"` - tests adding two large integers (note: typo in source)
- `"String Rehearsal"` - tests repeating strings exactly
The number of test iterations configured for this benchmark run.
Array of individual test result objects. See Test Result Object for structure.
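The field keys for a benchmark object are not named in this section, so the dictionary below uses assumed key names (`name`, `tries`, `results`) purely to illustrate the shape:

```python
# Illustrative shape of one benchmark object; all key names are assumptions.
benchmark = {
    "name": "String Reversal",  # human-readable benchmark name
    "tries": 100,               # configured number of test iterations
    "results": [],              # one test result object appended per iteration
}

print(benchmark)
```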
Test Result Object
Each individual test result contains the input, output, timing, and status information.

String Reversal Result
The random input string that was provided to the model to reverse (2-30 characters).
The model’s actual response output.
Time in seconds that the model took to generate the response, including any reasoning time.
The model’s reasoning trace, if the model supports reasoning and produced a reasoning output. Only present for reasoning-capable models.
The actual model identifier returned by the API for this response.
Test result status. One of:
- `"success"` - model output matched the expected result
- `"fail"` - model output did not match the expected result
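A sketch of how the success/fail status for a reversal test could be computed; the helper name and the whitespace-stripping step are assumptions, not the benchmark's actual comparison logic:

```python
def reversal_status(input_string: str, model_output: str) -> str:
    """Return "success" if the model's output is the exact reversal, else "fail"."""
    expected = input_string[::-1]  # Python slice reversal
    return "success" if model_output.strip() == expected else "fail"

print(reversal_status("hello", "olleh"))  # success
print(reversal_status("hello", "hello"))  # fail
```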
Integer Addition Result
The first random integer provided to the model (2-30 digits).
The second random integer provided to the model (2-30 digits).
The model’s response, which should contain only the sum.
Time in seconds that the model took to generate the response.
The model’s reasoning trace, if available.
The actual model identifier returned by the API.
Test result status (`"success"` or `"fail"`).

String Rehearsal Result
The random input string that was provided to the model to repeat exactly (10-500 characters).
The model’s actual response output.
Time in seconds that the model took to generate the response.
The model’s reasoning trace, if available.
The actual model identifier returned by the API.
Test result status (`"success"` or `"fail"`).

Complete Example
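The complete example is not reproduced in this section, so the structure below is an illustrative reconstruction; all key names and values are assumptions based on the fields described above:

```python
import json

# Illustrative reconstruction of a full result file; key names are assumptions.
result_file = {
    "header": {
        "model": "gpt-4",
        "start_time": "2026-03-03_14-30-45",
    },
    "benchmarkresults": {
        "string_reversal": {
            "name": "String Reversal",
            "tries": 1,
            "results": [
                {
                    "input": "hello",
                    "output": "olleh",
                    "time": 1.23,
                    "model": "gpt-4-0613",
                    "status": "success",
                }
            ],
        },
    },
}

print(json.dumps(result_file, indent=2))
```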
Log Files
Logs are written to the `logs/` directory.
Log File Naming
- Timestamped log: `log_YYYY-MM-DD_HH-MM-SS.txt` - persistent log kept for each run
- Recent log: `log_recent.txt` - cleared and overwritten on each new run for easy access
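One way to maintain both log files, sketched under the assumption that plain file writes are used (the helper names are hypothetical):

```python
from datetime import datetime
from pathlib import Path

def open_logs(start_time: datetime, log_dir: str = "logs"):
    """Open the persistent timestamped log and log_recent.txt for a new run."""
    Path(log_dir).mkdir(exist_ok=True)
    stamp = start_time.strftime("%Y-%m-%d_%H-%M-%S")
    timestamped = open(Path(log_dir) / f"log_{stamp}.txt", "a")
    recent = open(Path(log_dir) / "log_recent.txt", "w")  # "w" clears the previous run's log
    return timestamped, recent

def log(message: str, files) -> None:
    """Write one entry to every open log file."""
    for f in files:
        f.write(message + "\n")
        f.flush()
```

Opening `log_recent.txt` in `"w"` mode is what gives the "cleared on each new run" behavior, while the timestamped file is opened in append mode so each run keeps its own record.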
Log Format
Each log entry follows a consistent format.

Log Contents
Log files include:

- Benchmark start messages (e.g., "Starting new string reversal eval with 100 tries")
- Individual test start messages (e.g., "Starting test 0")
- Success messages (e.g., "Success: 1")
- Failure messages with expected vs. actual output
- Reasoning traces from models (prefixed with `[Reasoning]:`)
- API errors and exceptions