What It Tests
This benchmark assesses:
- Exact Repetition: Ability to output text character-for-character without changes
- Instruction Adherence: Following the directive to not modify the input
- Self-Control: Resisting the urge to correct, format, or explain
- Long String Handling: Managing strings up to 500 characters
This benchmark is particularly revealing because many LLMs are trained to be “helpful” by reformatting or explaining, which causes them to fail this simple task.
Implementation Details
The benchmark is implemented in the string_rehearsal() function (main.py:272-335).
Random String Generation
For each iteration, a random alphanumeric string is generated:
- Random length between 10 and 500 characters
- Randomly generated content containing:
- Uppercase letters (A-Z)
- Lowercase letters (a-z)
- Digits (0-9)
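The generation step described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code; the function name generate_test_string is hypothetical.

```python
import random
import string


def generate_test_string(min_len: int = 10, max_len: int = 500) -> str:
    """Generate a random alphanumeric string with a random length.

    Illustrative sketch of the behaviour described above: uppercase
    letters, lowercase letters, and digits, 10-500 characters long.
    """
    length = random.randint(min_len, max_len)
    alphabet = string.ascii_uppercase + string.ascii_lowercase + string.digits
    return "".join(random.choices(alphabet, k=length))
```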
Prompt Template
The model receives a fixed prompt wrapping the generated string. The prompt uses the word “exactly” and explicitly states “without modifying it” to emphasize that no changes should be made.
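The exact prompt text is not reproduced here; the only details given are that it uses “exactly” and “without modifying it”. A prompt in that spirit might be built like this (wording hypothetical):

```python
def build_prompt(test_string: str) -> str:
    """Wrap a test string in an instruction to repeat it verbatim.

    Illustrative wording only; the benchmark's real prompt is known to
    use "exactly" and "without modifying it", but its full text may differ.
    """
    return (
        "Repeat the following string exactly, without modifying it. "
        "Output nothing else:\n" + test_string
    )
```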
Success Criteria
The benchmark validates responses using exact string matching:
- The output exactly matches the input string character-for-character
- Leading and trailing whitespace is stripped before comparison
- No characters are added, removed, or modified
- No additional text, quotes, or explanations are present
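The criteria above reduce to a single strict comparison after whitespace stripping. A minimal sketch (function name is illustrative, not the benchmark's own):

```python
def is_exact_repetition(expected: str, response: str) -> bool:
    """Validate per the criteria above: strip leading/trailing whitespace
    from the response, then require a character-for-character match."""
    return response.strip() == expected
```

Note that stripping only applies to the outer edges of the response; any added quotes, preamble, or internal formatting still causes a failure.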
Example
Input String
Prompt Sent to Model
Expected Output
Result Recording
Each test result is recorded with:
Failure Cases
Common failure modes include:
- Added Quotes: "aB7Xm9KpQrStUvWxYz..." (includes quotes)
- Explanatory Text: Here is the string: aB7Xm9KpQrStUvWxYz...
- Formatting Changes: Adding line breaks or spacing for “readability”
- Character Substitution: Changing characters deemed “confusing” (like 0 vs O)
- Truncation: Not outputting the full string for very long inputs
- Case Changes: Converting to all uppercase or lowercase
- Unicode Issues: Mangling or modifying character encoding
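The failure modes above could be labelled heuristically when analysing results. The following classifier is purely illustrative and not part of the benchmark:

```python
def classify_failure(expected: str, response: str) -> str:
    """Heuristically label a response with one of the failure modes above.

    Hypothetical helper for post-hoc analysis; checks are ordered from
    most to least specific.
    """
    r = response.strip()
    if r == expected:
        return "success"
    if r.strip("\"'") == expected:          # quotes wrapped around the string
        return "added_quotes"
    if expected in r:                        # string intact, extra text added
        return "explanatory_text"
    if r.replace("\n", "").replace(" ", "") == expected:
        return "formatting_changes"          # line breaks / spacing inserted
    if r.lower() == expected.lower():
        return "case_changes"
    if expected.startswith(r) and len(r) < len(expected):
        return "truncation"
    return "other"
```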
Performance Metrics
The benchmark tracks:
- Success Rate: Percentage of exact repetitions across all tries
- Duration: Time taken for each response (longer strings may take more time)
- Reasoning: Optional reasoning traces from reasoning-capable models
- Failure Patterns: Types of modifications models tend to make
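The success-rate metric above is a straightforward aggregation. A sketch, assuming each result is a dict with a boolean "success" key (a shape this document does not specify):

```python
def success_rate(results: list[dict]) -> float:
    """Return the percentage of exact repetitions across all tries.

    Assumes each result dict carries a boolean 'success' field; the
    actual result schema is not documented here.
    """
    if not results:
        return 0.0
    return 100.0 * sum(r["success"] for r in results) / len(results)
```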
Why This Matters
String rehearsal is deceptively simple but tests critical capabilities:
- Literal Instruction Following: Many tasks require exact output without interpretation
- API Integration: Real-world APIs often require exact string formatting
- Data Processing: ETL tasks need precise string handling
- Code Generation: Programming requires exact syntax without “helpful” modifications
Models with high reasoning capabilities sometimes perform worse on this benchmark because they overthink the task. The best performance comes from models that can suppress their instinct to “improve” the output.
String Length Impact
Performance typically degrades with string length:
- 10-50 characters: Most models perform well
- 50-200 characters: Moderate difficulty, some models start adding explanations
- 200-500 characters: High difficulty, truncation and formatting changes more common
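When aggregating results by difficulty, the bands above map to a simple bucketing function. Illustrative only; the boundary values 50 and 200 overlap in the prose, so the cutoffs chosen here are an assumption:

```python
def length_bucket(n: int) -> str:
    """Map a string length to the difficulty bands described above.

    Boundary handling (<= 50, <= 200) is an assumption; the text lists
    overlapping ranges.
    """
    if n <= 50:
        return "10-50"
    if n <= 200:
        return "50-200"
    return "200-500"
```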
Results are logged to logs/log_[timestamp].txt and aggregated in results/result_[model]_[timestamp].json along with the other benchmark results.