What It Tests
This benchmark assesses:
- Exact Repetition: Ability to output text character-for-character without changes
- Instruction Adherence: Following the directive to not modify the input
- Self-Control: Resisting the urge to correct, format, or explain
- Long String Handling: Managing strings up to 500 characters
This benchmark is particularly revealing because many LLMs are trained to be “helpful” by reformatting or explaining, which causes them to fail this simple task.
Implementation Details
The benchmark is implemented in the string_rehearsal() function (main.py:272-335).
Random String Generation
For each iteration, a random alphanumeric string is generated:
- Random length between 10 and 500 characters
- Randomly generated content containing:
- Uppercase letters (A-Z)
- Lowercase letters (a-z)
- Digits (0-9)
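The generation step described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code; the function name generate_test_string is hypothetical.

```python
import random
import string


def generate_test_string(min_len: int = 10, max_len: int = 500) -> str:
    """Generate a random alphanumeric string with a random length.

    Illustrative sketch of the behaviour described above: uppercase
    letters, lowercase letters, and digits, 10-500 characters long.
    """
    length = random.randint(min_len, max_len)
    alphabet = string.ascii_uppercase + string.ascii_lowercase + string.digits
    return "".join(random.choices(alphabet, k=length))
```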
Prompt Template
The model receives a fixed prompt wrapping the generated string. The prompt uses the word “exactly” and explicitly states “without modifying it” to emphasize that no changes should be made.
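The exact prompt text is not reproduced here; the only details given are that it uses “exactly” and “without modifying it”. A prompt in that spirit might be built like this (wording hypothetical):

```python
def build_prompt(test_string: str) -> str:
    """Wrap a test string in an instruction to repeat it verbatim.

    Illustrative wording only; the benchmark's real prompt is known to
    use "exactly" and "without modifying it", but its full text may differ.
    """
    return (
        "Repeat the following string exactly, without modifying it. "
        "Output nothing else:\n" + test_string
    )
```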
Success Criteria
The benchmark validates responses using exact string matching:
- The output exactly matches the input string character-for-character
- Leading and trailing whitespace is stripped before comparison
- No characters are added, removed, or modified
- No additional text, quotes, or explanations are present
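The criteria above reduce to a single strict comparison after whitespace stripping. A minimal sketch (function name is illustrative, not the benchmark's own):

```python
def is_exact_repetition(expected: str, response: str) -> bool:
    """Validate per the criteria above: strip leading/trailing whitespace
    from the response, then require a character-for-character match."""
    return response.strip() == expected
```

Note that stripping only applies to the outer edges of the response; any added quotes, preamble, or internal formatting still causes a failure.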
Example
Input String
Prompt Sent to Model
Expected Output
Result Recording
Each test result is recorded with:
Failure Cases
Common failure modes include:
- Added Quotes: "aB7Xm9KpQrStUvWxYz..." (includes quotes)
- Explanatory Text: Here is the string: aB7Xm9KpQrStUvWxYz...
- Formatting Changes: Adding line breaks or spacing for “readability”
- Character Substitution: Changing characters deemed “confusing” (like 0 vs O)
- Truncation: Not outputting the full string for very long inputs
- Case Changes: Converting to all uppercase or lowercase
- Unicode Issues: Mangling or modifying character encoding
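The failure modes above could be labelled heuristically when analysing results. The following classifier is purely illustrative and not part of the benchmark:

```python
def classify_failure(expected: str, response: str) -> str:
    """Heuristically label a response with one of the failure modes above.

    Hypothetical helper for post-hoc analysis; checks are ordered from
    most to least specific.
    """
    r = response.strip()
    if r == expected:
        return "success"
    if r.strip("\"'") == expected:          # quotes wrapped around the string
        return "added_quotes"
    if expected in r:                        # string intact, extra text added
        return "explanatory_text"
    if r.replace("\n", "").replace(" ", "") == expected:
        return "formatting_changes"          # line breaks / spacing inserted
    if r.lower() == expected.lower():
        return "case_changes"
    if expected.startswith(r) and len(r) < len(expected):
        return "truncation"
    return "other"
```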
Performance Metrics
The benchmark tracks:
- Success Rate: Percentage of exact repetitions across all tries
- Duration: Time taken for each response (longer strings may take more time)
- Reasoning: Optional reasoning traces from reasoning-capable models
- Failure Patterns: Types of modifications models tend to make
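The success-rate metric above is a straightforward aggregation. A sketch, assuming each result is a dict with a boolean "success" key (a shape this document does not specify):

```python
def success_rate(results: list[dict]) -> float:
    """Return the percentage of exact repetitions across all tries.

    Assumes each result dict carries a boolean 'success' field; the
    actual result schema is not documented here.
    """
    if not results:
        return 0.0
    return 100.0 * sum(r["success"] for r in results) / len(results)
```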
Why This Matters
String rehearsal is deceptively simple but tests critical capabilities:
- Literal Instruction Following: Many tasks require exact output without interpretation
- API Integration: Real-world APIs often require exact string formatting
- Data Processing: ETL tasks need precise string handling
- Code Generation: Programming requires exact syntax without “helpful” modifications
Models with high reasoning capabilities sometimes perform worse on this benchmark because they overthink the task. The best performance comes from models that can suppress their instinct to “improve” the output.
String Length Impact
Performance typically degrades with string length:
- 10-50 characters: Most models perform well
- 50-200 characters: Moderate difficulty, some models start adding explanations
- 200-500 characters: High difficulty, truncation and formatting changes more common
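When aggregating results by difficulty, the bands above map to a simple bucketing function. Illustrative only; the boundary values 50 and 200 overlap in the prose, so the cutoffs chosen here are an assumption:

```python
def length_bucket(n: int) -> str:
    """Map a string length to the difficulty bands described above.

    Boundary handling (<= 50, <= 200) is an assumption; the text lists
    overlapping ranges.
    """
    if n <= 50:
        return "10-50"
    if n <= 200:
        return "50-200"
    return "200-500"
```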
Results are logged to logs/log_[timestamp].txt and aggregated in results/result_[model]_[timestamp].json along with the other benchmark results.