All configuration is done by editing variables at the top of main.py (lines 14-23). These variables control API connection, model behavior, and test execution.

API Configuration

llm
string
default: ""
The model identifier to use for benchmarking. Leave it empty ("") to use the currently loaded model; this empty-string behavior only works with LM-Studio.
Example:
llm = "gpt-4"
llm = ""  # Uses currently loaded model in LM-Studio
baseurl
string
default: "http://127.0.0.1:1234/v1"
The base URL for the OpenAI-compatible API endpoint. The default points to a local LM-Studio instance.
Example:
baseurl = "http://127.0.0.1:1234/v1"  # LM-Studio default
baseurl = "https://api.openai.com/v1"  # OpenAI API
reasoning_effort
string
default: "low"
Controls the reasoning effort level for models that support reasoning; it affects how much the model thinks before responding.
Valid values:
  • "low" - Minimal reasoning effort
  • "medium" - Moderate reasoning effort
  • "high" - Maximum reasoning effort
Example:
reasoning_effort = "low"
reasoning_effort = "medium"
reasoning_effort = "high"
This value is passed to the API as reasoning: {"effort": reasoning_effort, "summary": "detailed"} (a combined request sketch appears under max_tokens below).

Test Execution

tries
int
default: 100
The number of test iterations to run for each benchmark. Each benchmark (String Reversal, Integer Addition, String Rehearsal) will execute this many times.
Example:
tries = 100  # Runs 100 tests per benchmark (300 total tests)
tries = 50   # Runs 50 tests per benchmark (150 total tests)
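The iteration structure implied above, as a small runnable sketch; run_benchmark is a hypothetical stand-in for a single benchmark call, not the script's actual function:
benchmarks = ["String Reversal", "Integer Addition", "String Rehearsal"]
tries = 100

def run_benchmark(name: str, iteration: int) -> None:
    # Hypothetical placeholder for one API call and its scoring
    print(f"{name}: try {iteration + 1}")

for benchmark in benchmarks:
    for i in range(tries):
        run_benchmark(benchmark, i)  # 3 benchmarks * 100 tries = 300 calls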
timeout_time
int
default: 400
Time in seconds to wait before aborting the current answer, designed to prevent multi-thousand-token responses when the LLM gets into a death spiral.
Note: This feature is marked as “Todo: Implement this” in the source code (lines 20-21) and is not currently enforced.
Example:
timeout_time = 400  # 400 seconds (6.67 minutes)
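Since the timeout is not yet enforced in the source, the sketch below shows one way it could be wired up, using the per-request timeout that the openai client constructor accepts. This is a possible approach, not the script's implementation:
from openai import OpenAI

baseurl = "http://127.0.0.1:1234/v1"
timeout_time = 400

# The client raises openai.APITimeoutError if a request exceeds timeout_time seconds
client = OpenAI(base_url=baseurl, api_key="lm-studio", timeout=timeout_time)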
max_tokens
int
default: 512
Maximum number of output tokens the model can generate per response. The default is calculated as 512 * 1.
Recommendation: You may want to increase this value when using reasoning LLMs, as they often require more tokens for their reasoning traces.
Note: There’s a todo (lines 22-23) to implement automatic increase/disabling for reasoning LLMs after a warmup run.
Example:
max_tokens = 512      # Standard output limit
max_tokens = 1024     # Increased for reasoning models
max_tokens = 2048     # High limit for complex reasoning
This value is passed to the API as max_output_tokens=max_tokens
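Putting the pieces together, a sketch of how these values map onto a Responses API call, assuming the openai Python package and a server that implements the Responses API (the prompt and hard-coded values are illustrative):
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:1234/v1", api_key="lm-studio")

response = client.responses.create(
    model="gpt-4",
    input="Reverse the string 'benchmark'.",
    reasoning={"effort": "low", "summary": "detailed"},  # from reasoning_effort
    max_output_tokens=512,                               # from max_tokens
)
print(response.output_text)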

Directory Configuration

These are defined at the top of main.py (lines 10-11) but are not typically modified:
  • logs_directory = "logs" - Directory where log files are written
  • results_directory = "results" - Directory where result JSON files are saved
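A short sketch of how these directories might be prepared and used with the standard library; the file name and payload are illustrative, not the script's actual output scheme:
import json
import os

logs_directory = "logs"
results_directory = "results"

# Create both directories if they don't already exist
os.makedirs(logs_directory, exist_ok=True)
os.makedirs(results_directory, exist_ok=True)

# Illustrative result write; see Output Format for the real file layout
with open(os.path.join(results_directory, "example_run.json"), "w") as f:
    json.dump({"llm": "gpt-4", "tries": 100}, f, indent=2)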

Configuration Example

Here’s a complete configuration example for a high-effort reasoning model:
# Model and API settings
llm = "deepseek-r1:32b"
baseurl = "http://127.0.0.1:1234/v1"
reasoning_effort = "high"

# Test execution settings
tries = 200
timeout_time = 600
max_tokens = 2048

Notes

  • All configuration changes require editing the source file main.py
  • Changes take effect the next time you run the simpe command
  • The configuration is saved in the result file header for reference (see Output Format)
