simpe
The main command to run the simpE benchmark suite. Entry point: main:main (from main.py)
Behavior
When you run uv run simpe, the tool:
- Creates logs/ and results/ directories if they don’t exist
- Removes the previous log_recent.txt file
- Runs three benchmark tests in sequence:
  - String Reversal: Tests the model’s ability to reverse random strings (2-30 characters)
  - Integer Addition: Tests the model’s ability to add two large integers (2-30 digits each)
  - String Rehearsal: Tests the model’s ability to repeat strings exactly (10-500 characters)
- Executes each benchmark for the configured number of tries (default: 100)
- Saves results to results/result_{model}_{timestamp}.json
- Logs all activity to logs/log_{timestamp}.txt and logs/log_recent.txt
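To make the three benchmark tasks concrete, here is a minimal sketch of how one test case per benchmark could be generated and checked. It is not taken from main.py: the function names, the alphanumeric alphabet, and the exact-match check after stripping whitespace are all assumptions made for illustration. Only the length ranges come from the list above.

```python
import random
import string

# Assumed alphabet; the real tool may draw from a different character set.
ALPHABET = string.ascii_letters + string.digits


def random_text(rng: random.Random, low: int, high: int) -> str:
    """Return a random string whose length is between low and high inclusive."""
    return "".join(rng.choices(ALPHABET, k=rng.randint(low, high)))


def random_integer(rng: random.Random, digits: int) -> int:
    """Return a random integer with exactly `digits` digits (no leading zero)."""
    first = rng.choice("123456789")
    rest = "".join(rng.choices(string.digits, k=digits - 1))
    return int(first + rest)


def make_reversal_case(rng: random.Random) -> tuple[str, str]:
    """String Reversal: 2-30 characters, expected answer is the reversed string."""
    text = random_text(rng, 2, 30)
    return text, text[::-1]


def make_addition_case(rng: random.Random) -> tuple[tuple[int, int], int]:
    """Integer Addition: two integers of 2-30 digits each, expected answer is the sum."""
    a = random_integer(rng, rng.randint(2, 30))
    b = random_integer(rng, rng.randint(2, 30))
    return (a, b), a + b


def make_rehearsal_case(rng: random.Random) -> tuple[str, str]:
    """String Rehearsal: 10-500 characters, expected answer is the same string."""
    text = random_text(rng, 10, 500)
    return text, text


def is_correct(expected, model_answer: str) -> bool:
    """Assumed scoring rule: exact match after trimming surrounding whitespace."""
    return model_answer.strip() == str(expected)
```

Passing a seeded random.Random instance keeps the generated cases reproducible between runs, which can help when comparing models on the same inputs.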
Configuration
The command is configured by editing the variables at the top of main.py (lines 14-23). See Configuration Options for details.
Output
The command provides real-time console output showing:
- Current benchmark name
- Progress (completed/total tries)
- Success percentage
- Thinking time (when model is reasoning)
analyze
The analysis command to review and get statistics from benchmark results. Entry point: analyze_results:main (from analyze_results.py)
Behavior
When you run uv run analyze, the tool:
- Scans the results/ directory for all .json result files
- Presents an interactive selection menu to choose which result file to analyze
- Displays comprehensive statistics:
Accuracy Analysis
Shows success percentage for each benchmark type:
- String Reversal accuracy
- Integer Addition accuracy
- String Rehearsal accuracy
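As a rough illustration of the accuracy figure, the sketch below computes a success percentage from a list of pass/fail outcomes. The function name and the idea of storing outcomes as booleans are assumptions; the actual layout of the result files is not described here.

```python
def success_percentage(outcomes: list[bool]) -> float:
    """Return the share of successful tries as a percentage (0.0 for no tries)."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical outcomes for one benchmark: three successes out of four tries.
example = [True, True, False, True]
print(f"String Reversal accuracy: {success_percentage(example):.1f}%")
```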
Reasoning Pattern Analysis
Counts occurrences of specific reasoning patterns per response:
- “wait”
- “pause”
- “hold on”
- “actually”
- “no,”
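The pattern counting above could be approximated as follows. This is a sketch, not the code in analyze_results.py: it assumes case-insensitive, non-overlapping substring matching, which means a pattern is also counted when it appears inside a longer word (for example "wait" in "awaits"). The name count_patterns is hypothetical.

```python
# The five patterns listed above.
PATTERNS = ("wait", "pause", "hold on", "actually", "no,")


def count_patterns(response: str) -> dict[str, int]:
    """Count how often each pattern occurs in one response (case-insensitive)."""
    lowered = response.lower()
    return {pattern: lowered.count(pattern) for pattern in PATTERNS}


counts = count_patterns("Wait, no, actually... wait. Hold on")
print(counts)
```

If whole-word matching is preferred, the substring count could be swapped for a regular expression with word boundaries, at the cost of handling the trailing comma in "no," separately.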
Reasoning Length Statistics
For each benchmark, displays:
- Character Statistics: Average, median, minimum, and maximum characters in reasoning traces
- Word Count Statistics: Average, median, minimum, and maximum word count
- Word Length: Average and median word length across all reasoning traces
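The length statistics listed above can be sketched with the standard library. The function names are hypothetical, and splitting words on whitespace is an assumption; the real tool may tokenize differently.

```python
from statistics import mean, median


def summarize(values: list[int]) -> dict[str, float]:
    """Average, median, minimum, and maximum of a non-empty list of numbers."""
    return {
        "average": mean(values),
        "median": median(values),
        "minimum": min(values),
        "maximum": max(values),
    }


def reasoning_length_stats(traces: list[str]) -> dict[str, dict[str, float]]:
    """Compute character, word count, and word length statistics for reasoning traces."""
    words_per_trace = [trace.split() for trace in traces]  # whitespace split (assumed)
    char_counts = [len(trace) for trace in traces]
    word_counts = [len(words) for words in words_per_trace]
    word_lengths = [len(word) for words in words_per_trace for word in words]
    return {
        "characters": summarize(char_counts),
        "words": summarize(word_counts),
        "word_length": {
            "average": mean(word_lengths),
            "median": median(word_lengths),
        },
    }


stats = reasoning_length_stats(["wait no", "let me check again"])
print(stats["characters"])
```

Note that mean and median raise statistics.StatisticsError on empty input, so a caller would need to skip these statistics when a result file has no reasoning traces.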
Requirements
- At least one result file in the results/ directory
- Result files must contain reasoning traces for full statistics (optional for accuracy analysis)
Interactive Interface
Uses questionary to provide a user-friendly selection interface with arrow key navigation.
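A selection flow of this kind might look like the sketch below. The helper names and the prompt text are assumptions, not the contents of analyze_results.py; questionary.select(...).ask() is the library's standard arrow-key prompt and returns None if the user cancels.

```python
from pathlib import Path
from typing import Optional


def list_result_files(results_dir: str = "results") -> list[Path]:
    """Return all .json files in the results directory, sorted by name."""
    return sorted(Path(results_dir).glob("*.json"))


def choose_result_file(results_dir: str = "results") -> Optional[Path]:
    """Ask the user to pick a result file; return None if none exist or on cancel."""
    import questionary  # third-party dependency, imported only when prompting

    files = list_result_files(results_dir)
    if not files:
        print(f"No result files found in {results_dir}/")
        return None
    choice = questionary.select(
        "Select a result file to analyze:",  # prompt text is an assumption
        choices=[path.name for path in files],
    ).ask()
    return Path(results_dir) / choice if choice else None
```

Keeping the directory scan separate from the prompt makes the file discovery easy to test without an interactive terminal.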