What Benchmarks Test
The benchmark suite evaluates models on fundamental capabilities through three distinct tests. Each benchmark measures the model’s ability to follow precise instructions and produce exact output without adding extraneous information:
- String Manipulation: Tests the model’s ability to accurately reverse strings
- Mathematical Reasoning: Validates arithmetic capabilities with large integers
- Exact Replication: Measures how precisely models can reproduce strings without modification
Available Benchmarks
String Reversal
Tests string reversal capability with random alphanumeric strings
Integer Addition
Evaluates large integer arithmetic accuracy
String Rehearsal
Validates exact string repetition without modification
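The three benchmarks above each start from randomly generated input. A minimal sketch of what that generation might look like follows; the function names and default sizes are illustrative assumptions, not the suite's actual code:

```python
import random
import string

def make_reversal_input(length=20):
    """Random alphanumeric string for the String Reversal benchmark."""
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choices(alphabet, k=length))

def make_addition_input(digits=30):
    """Two large random integers for the Integer Addition benchmark."""
    a = random.randrange(10 ** (digits - 1), 10 ** digits)
    b = random.randrange(10 ** (digits - 1), 10 ** digits)
    return a, b

def make_rehearsal_input(length=20):
    """Random string the model must repeat verbatim for String Rehearsal."""
    return make_reversal_input(length)
```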
General Configuration
All benchmarks share common configuration parameters defined in main.py:
- Tries: Number of iterations to run for each benchmark. More tries provide more statistically significant results.
- Timeout: Time in seconds to wait before stopping a response. Prevents infinite loops when models enter “death spirals”.
- Max output tokens: Maximum output tokens allowed per response. Can be increased for reasoning models.
- Reasoning effort: Reasoning effort level for models that support it (e.g., “low”, “medium”, “high”).
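Collected together, the shared configuration might look roughly like this; the key names are assumptions for illustration, and the real names live in main.py:

```python
# Illustrative defaults only -- see main.py for the actual parameter names.
CONFIG = {
    "tries": 10,                   # iterations per benchmark; more tries -> tighter statistics
    "timeout_seconds": 120,        # cut off responses that run too long ("death spirals")
    "max_output_tokens": 2048,     # per-response cap; raise for reasoning models
    "reasoning_effort": "medium",  # "low" | "medium" | "high", if the model supports it
}
```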
How Benchmarks Work
Each benchmark follows this workflow:
- Generate Random Input: Creates random test data (strings, integers, etc.)
- Construct Prompt: Builds a specific prompt instructing the model on the expected output
- API Call: Sends the prompt to the LLM via OpenAI-compatible API
- Validation: Compares the model’s response against expected output
- Logging: Records success/failure, duration, and reasoning (if available)
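The five steps above can be sketched as a single iteration; the helper names are hypothetical, and the toy `model_call` stub stands in for the real OpenAI-compatible API request:

```python
import time

def run_once(model_call, make_input, build_prompt, validate):
    """One benchmark iteration: generate -> prompt -> call -> validate -> record."""
    data = make_input()
    prompt = build_prompt(data)
    start = time.time()
    response = model_call(prompt)  # the real suite sends this via an OpenAI-compatible API
    return {
        "input": data,
        "success": validate(data, response),
        "duration": time.time() - start,
        "response": response,  # the suite also records reasoning, when available
    }

# Usage with a stand-in "model" that reverses the last token of the prompt:
res = run_once(
    model_call=lambda p: p.split()[-1][::-1],  # toy stub, not a real API call
    make_input=lambda: "abc123",
    build_prompt=lambda s: f"Reverse exactly, output only the result: {s}",
    validate=lambda s, out: out.strip() == s[::-1],
)
```

Each such record is what gets serialized to the JSON results files.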
All benchmark results are saved to JSON files in the results/ directory, with detailed logs in the logs/ directory.
Result Structure
Each benchmark produces a structured JSON record containing the success/failure outcome, the response duration, and the model’s reasoning (when available).
Success Criteria
Benchmarks use strict validation:
- Exact Match: Most benchmarks require exact string matching (after stripping whitespace)
- No Extra Output: Models must not include explanations, quotes, or additional text
- Numeric Validation: Integer benchmarks validate that output is parseable as a number
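A minimal sketch of these criteria, assuming the suite strips only surrounding whitespace before comparing (function names are illustrative):

```python
def validate_exact(expected: str, response: str) -> bool:
    """Exact match after stripping surrounding whitespace; any extra text fails."""
    return response.strip() == expected

def validate_integer(expected: int, response: str) -> bool:
    """Response must parse as an integer and equal the expected value."""
    try:
        return int(response.strip()) == expected
    except ValueError:
        return False
```

Note that even a polite preamble like “The answer is …” fails exact-match validation, which is the point: the benchmarks test instruction-following as much as the underlying capability.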