CLI Commands

simpe

The main command to run the simpE benchmark suite.

uv run simpe

Entry Point: main:main (from main.py)

Behavior

When you run uv run simpe, the tool:

1. Creates the logs/ and results/ directories if they don’t exist
2. Removes the previous log_recent.txt file
3. Runs three benchmark tests in sequence:
   - String Reversal: Tests the model’s ability to reverse random strings (2-30 characters)
   - Integer Addition: Tests the model’s ability to add two large integers (2-30 digits each)
   - String Rehearsal: Tests the model’s ability to repeat strings exactly (10-500 characters)
4. Executes each benchmark for the configured number of tries (default: 100)
5. Saves results to results/result_{model}_{timestamp}.json
6. Logs all activity to logs/log_{timestamp}.txt and logs/log_recent.txt
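
The steps above can be sketched in Python as follows. This is only an illustration of the described behavior, not the code in main.py: the function names prepare_run and make_case, the benchmark keys, the timestamp format, and the alphanumeric character set are all assumptions made for the example.

```python
import random
import string
from datetime import datetime
from pathlib import Path


def prepare_run(base, model):
    """Create logs/ and results/, remove the old log_recent.txt, return output paths."""
    base = Path(base)
    logs, results = base / "logs", base / "results"
    logs.mkdir(parents=True, exist_ok=True)
    results.mkdir(parents=True, exist_ok=True)

    recent = logs / "log_recent.txt"
    if recent.exists():
        recent.unlink()

    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # assumed timestamp format
    return {
        "log": logs / f"log_{stamp}.txt",
        "log_recent": recent,
        "result": results / f"result_{model}_{stamp}.json",
    }


def make_case(benchmark, rng):
    """Return (input, expected_answer) for one try of the named benchmark."""
    alphabet = string.ascii_letters + string.digits  # assumed character set

    if benchmark == "string_reversal":
        text = "".join(rng.choice(alphabet) for _ in range(rng.randint(2, 30)))
        return text, text[::-1]

    if benchmark == "integer_addition":
        def random_int():
            digits = rng.randint(2, 30)
            # Smallest and largest integers with exactly `digits` digits.
            return rng.randint(10 ** (digits - 1), 10 ** digits - 1)

        a, b = random_int(), random_int()
        return f"{a} + {b}", str(a + b)

    if benchmark == "string_rehearsal":
        text = "".join(rng.choice(alphabet) for _ in range(rng.randint(10, 500)))
        return text, text

    raise ValueError(f"unknown benchmark: {benchmark}")
```

A try counts as a success when the model's answer equals the expected answer returned by make_case; repeating that for the configured number of tries gives the success percentage for each benchmark.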

Configuration

The command is configured by editing the variables at the top of main.py (lines 14-23). See Configuration Options for details.

Output

The command provides real-time console output showing:

Current benchmark name
Progress (completed/total tries)
Success percentage
Thinking time (when model is reasoning)

Final results are saved in JSON format. See Output Format for the complete structure.

analyze

The analysis command, used to review benchmark results and compute statistics from them.

uv run analyze

Entry Point: analyze_results:main (from analyze_results.py)

Behavior

When you run uv run analyze, the tool:

Scans the results/ directory for all .json result files
Presents an interactive selection menu to choose which result file to analyze
Displays comprehensive statistics:

Accuracy Analysis

Shows success percentage for each benchmark type:

String Reversal accuracy
Integer Addition accuracy
String Rehearsal accuracy
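
As a rough illustration, the success percentage for a benchmark is the share of tries whose answer was correct. The sketch below assumes each benchmark's outcomes are available as a list of booleans; that shape and the name accuracy_percent are hypothetical and not taken from analyze_results.py.

```python
def accuracy_percent(outcomes):
    """Percentage of successful tries; 0.0 when there are no tries."""
    outcomes = list(outcomes)
    if not outcomes:
        return 0.0
    return 100.0 * sum(bool(o) for o in outcomes) / len(outcomes)


# Hypothetical per-benchmark outcomes (True = the model's answer matched).
example = {
    "String Reversal": [True, True, False, True],
    "Integer Addition": [True, False],
    "String Rehearsal": [True, True, True],
}

for name, outcomes in example.items():
    print(f"{name}: {accuracy_percent(outcomes):.1f}%")
```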

Reasoning Pattern Analysis

Counts occurrences of specific reasoning patterns per response:

“wait”
“pause”
“hold on”
“actually”
“no,”
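
One plausible way to count these patterns is case-insensitive substring matching on each reasoning trace, as in the sketch below. The matching rule is an assumption (the real tool may use word boundaries or another rule), and the names PATTERNS, count_patterns, and average_per_response are chosen only for this example.

```python
PATTERNS = ["wait", "pause", "hold on", "actually", "no,"]


def count_patterns(reasoning):
    """Count non-overlapping, case-insensitive occurrences of each pattern in one trace."""
    text = reasoning.lower()
    return {pattern: text.count(pattern) for pattern in PATTERNS}


def average_per_response(traces):
    """Average number of occurrences of each pattern across a list of traces."""
    traces = list(traces)
    if not traces:
        return {pattern: 0.0 for pattern in PATTERNS}
    totals = {pattern: 0 for pattern in PATTERNS}
    for trace in traces:
        for pattern, n in count_patterns(trace).items():
            totals[pattern] += n
    return {pattern: totals[pattern] / len(traces) for pattern in PATTERNS}
```

Note that plain substring matching also counts a pattern inside a longer word (for example, “wait” inside “await”), so a stricter matcher would give lower counts.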

Reasoning Length Statistics

For each benchmark, displays:

Character Statistics: Average, median, minimum, and maximum characters in reasoning traces
Word Count Statistics: Average, median, minimum, and maximum word count
Word Length: Average and median word length across all reasoning traces
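
These statistics can be computed with Python's standard statistics module. The sketch below assumes words are separated by whitespace and that the reasoning traces for one benchmark are given as a list of strings; the function name length_stats and the output keys are hypothetical.

```python
from statistics import mean, median


def length_stats(traces):
    """Character, word-count, and word-length statistics for a list of reasoning traces."""
    traces = list(traces)
    if not traces:
        raise ValueError("at least one reasoning trace is required")

    def summary(values):
        return {
            "avg": mean(values),
            "median": median(values),
            "min": min(values),
            "max": max(values),
        }

    words_per_trace = [trace.split() for trace in traces]  # whitespace-separated words
    word_lengths = [len(word) for words in words_per_trace for word in words]

    return {
        "characters": summary([len(trace) for trace in traces]),
        "words": summary([len(words) for words in words_per_trace]),
        # Word length is pooled across all traces; None if no trace contains a word.
        "word_length": (
            {"avg": mean(word_lengths), "median": median(word_lengths)}
            if word_lengths
            else None
        ),
    }
```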

Requirements

At least one result file in the results/ directory
Reasoning traces in the result file are required for the reasoning pattern and length statistics; accuracy analysis works without them

Interactive Interface

The file selection menu is built with questionary and supports arrow-key navigation.
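
A minimal sketch of such a menu, assuming the result files sit directly in results/ and are offered by file name. The helper names list_result_files and choose_result_file are hypothetical; questionary.select(...).ask() is the library's standard selection prompt and returns None if the user cancels.

```python
from pathlib import Path


def list_result_files(results_dir="results"):
    """Return all .json files in the results directory, sorted by name."""
    return sorted(Path(results_dir).glob("*.json"))


def choose_result_file(results_dir="results"):
    """Show an arrow-key menu of result files and return the chosen path, or None."""
    import questionary  # third-party dependency, imported only when the menu is shown

    files = list_result_files(results_dir)
    if not files:
        raise SystemExit(f"No result files found in {results_dir}/")

    choice = questionary.select(
        "Select a result file to analyze:",
        choices=[f.name for f in files],
    ).ask()
    return Path(results_dir) / choice if choice else None
```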
