What It Tests
This benchmark assesses:
- Character-level Manipulation: Ability to process strings character by character
- Output Precision: Following instructions to output only the reversed string
- Consistency: Performance across multiple random inputs
Implementation Details
The benchmark is implemented in the string_reversal() function (main.py:128-191).
Random String Generation
For each iteration, a random alphanumeric string is generated. Its length is chosen at random between 2 and 30 characters, and its characters are drawn at random from:
- Uppercase letters (A-Z)
- Lowercase letters (a-z)
- Digits (0-9)
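The generation step described above can be sketched roughly as follows. This is a minimal illustration, not the code from main.py: the helper name random_alphanumeric and its parameters are hypothetical, and it assumes both length bounds (2 and 30) are inclusive.

```python
import random
import string

# Alphabet described above: A-Z, a-z, and 0-9 (62 characters in total).
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits


def random_alphanumeric(min_len=2, max_len=30, rng=random):
    """Return a random alphanumeric string with length in [min_len, max_len].

    Hypothetical helper; the inclusive bounds are an assumption.
    """
    length = rng.randint(min_len, max_len)  # randint includes both endpoints
    return "".join(rng.choice(ALPHABET) for _ in range(length))


sample = random_alphanumeric()
print(sample, len(sample))
```

Passing a seeded random.Random instance as rng makes the generated inputs reproducible between runs, which can help when comparing models on the same strings.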
Prompt Template
The model receives a fixed prompt that explicitly instructs it to output only the reversed string, with no additional text, explanations, or quotes.
Success Criteria
The benchmark validates responses using exact string matching:
- The output exactly matches the reversed input string (using Python’s [::-1] slice)
- Leading and trailing whitespace is stripped before comparison
- No additional characters, quotes, or explanations are present
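A check along these lines would satisfy the three criteria above. It is a sketch under stated assumptions: the function name is_correct_reversal is hypothetical, and the actual comparison in string_reversal() may be written differently.

```python
def is_correct_reversal(original: str, response: str) -> bool:
    """Return True if the response, once stripped, is exactly the reversed input.

    Hypothetical helper illustrating the success criteria.
    """
    expected = original[::-1]            # reverse via slice with step -1
    return response.strip() == expected  # only surrounding whitespace is ignored


print(is_correct_reversal("abc123", "321cba\n"))   # trailing newline is stripped
print(is_correct_reversal("abc123", '"321cba"'))   # quotes are extra characters
```

Because the comparison is an exact match, no separate rule is needed for quotes or explanations: any extra character makes the strings differ.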
Example
Input String
Prompt Sent to Model
Expected Output
Result Recording
Each test result is recorded with:
Failure Cases
Common failure modes include:
- Extra Quotes: "K9mX7Ba" (includes quotes)
- Explanation: The reversed string is: K9mX7Ba
- Wrong Reversal: aB7KmX9 (incorrect character order)
- Case Errors: k9mx7ba (wrong case)
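When reviewing failed runs, it can help to label which of the failure modes above a response falls into. The sketch below is a hypothetical diagnostic, not part of the benchmark itself, which only records pass or fail. The function name classify_response and its heuristics are assumptions, and the expected value K9mX7Ba is taken from the failure examples above.

```python
def classify_response(expected: str, response: str) -> str:
    """Label a response with a rough failure category (hypothetical heuristics)."""
    answer = response.strip()
    if answer == expected:
        return "correct"
    if answer.strip("\"'") == expected:
        return "extra quotes"
    if answer.lower() == expected.lower():
        return "case error"
    if expected in answer:
        return "explanation"
    if sorted(answer) == sorted(expected):
        # Same characters, different order.
        return "wrong reversal"
    return "other"


expected = "K9mX7Ba"
for response in ['"K9mX7Ba"', "The reversed string is: K9mX7Ba", "aB7KmX9", "k9mx7ba"]:
    print(repr(response), "->", classify_response(expected, response))
```

The order of the checks matters: the quote and case checks run before the substring check so that a quoted answer is not reported as an explanation.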
Performance Metrics
The benchmark tracks:
- Success Rate: Percentage of correct reversals across all tries
- Duration: Time taken for each response
- Reasoning: Optional reasoning traces from reasoning-capable models
Results are logged to logs/log_[timestamp].txt and aggregated in results/result_[model]_[timestamp].json.
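As an illustration of how the success rate could be aggregated from per-try records, consider the sketch below. The field names success and duration are assumptions about the record layout rather than the documented schema, and the sample values are made up for the example.

```python
def success_rate(results):
    """Percentage of records whose 'success' flag is true; 0.0 for an empty list."""
    if not results:
        return 0.0
    passed = sum(1 for r in results if r["success"])
    return 100.0 * passed / len(results)


# Made-up records for illustration only; field names are hypothetical.
results = [
    {"success": True, "duration": 0.8},
    {"success": False, "duration": 1.2},
    {"success": True, "duration": 0.5},
    {"success": True, "duration": 0.9},
]

rate = success_rate(results)
mean_duration = sum(r["duration"] for r in results) / len(results)
print(f"success rate: {rate:.1f}%, mean duration: {mean_duration:.2f}s")
```

Returning 0.0 for an empty result list is a design choice made here to avoid a division by zero; a real aggregator might instead report the rate as undefined.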