The integer addition benchmark evaluates a model’s arithmetic capabilities by testing the addition of large, randomly generated integers. The benchmark is particularly challenging because it tests numerical reasoning with integers up to 30 digits long.

What It Tests

This benchmark assesses:
  • Large Integer Arithmetic: Ability to correctly add numbers with 2-30 digits
  • Numerical Accuracy: Precise calculation without rounding or approximation errors
  • Output Format Compliance: Returning only the numeric result without explanations

Implementation Details

The benchmark is implemented in the add_two_ints() function (main.py:194-269):

Random Integer Generation

For each iteration, two random integers are generated:
import random
import string

# Pick an independent digit count for each operand
int1_length = random.randint(2, 30)
int2_length = random.randint(2, 30)

# Build each operand digit by digit, then parse the string as an integer
int1 = int(''.join(random.choice(string.digits) for _ in range(int1_length)))
int2 = int(''.join(random.choice(string.digits) for _ in range(int2_length)))
int1_length (int): random length between 2 and 30 digits for the first integer
int2_length (int): random length between 2 and 30 digits for the second integer
Integers can be up to 30 digits long, which is significantly larger than standard 64-bit integer limits. This tests whether models can handle arithmetic beyond typical programming language integer sizes.
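
To illustrate why this exceeds native integer types, the following standalone sketch (not part of the benchmark code) compares a 30-digit value against the unsigned 64-bit maximum, which has only 20 digits. Python's built-in integers are arbitrary precision, so the addition itself never overflows:

big = int("9" * 30)          # a 30-digit integer
print(big > 2**64 - 1)       # True: well beyond the unsigned 64-bit maximum
print(len(str(2**64 - 1)))   # 20: the largest unsigned 64-bit value has 20 digits
print(big + 1)               # exact arbitrary-precision addition, no overflow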

Prompt Template

The model receives this exact prompt:
prompt = f"Provide the sum of the two numbers. Don't output anything else. Only output the sum of the two numbers without anything additional. Only output the final number, no calculation, no explanation, just the final number without any text.: \"{int1}\" \"{int2}\""
The prompt explicitly instructs the model to output only the numeric sum with no calculations, working, explanations, or additional text.

Success Criteria

The benchmark validates responses using integer comparison with error handling:
errval = False

try:
    # Strip surrounding whitespace, parse the response as an integer,
    # and compare it exactly against the true sum
    if int(calresult["response"].strip()) == int1 + int2:
        success = True
    else:
        success = False
except (ValueError, TypeError):
    # The response could not be parsed as an integer at all
    success = False
    log_message(f"Test {t+1} failed. Output contained non-numeric characters")
    errval = True
A response is marked as a success only if:
  • The output can be parsed as an integer (no non-numeric characters)
  • The parsed integer exactly equals int1 + int2
Leading and trailing whitespace is stripped before parsing, so surrounding whitespace alone does not cause a failure.

Error Handling

The benchmark handles two types of failures:
  1. ValueError/TypeError: Output contains non-numeric characters or cannot be parsed
  2. Incorrect Sum: Output is numeric but mathematically incorrect
Responses with explanations, calculations, or any non-numeric characters will fail with a ValueError and be logged as containing non-numeric characters.

Example

Input Integers

int1 = 123456789012345
int2 = 987654321098765

Prompt Sent to Model

Provide the sum of the two numbers. Don't output anything else. 
Only output the sum of the two numbers without anything additional. 
Only output the final number, no calculation, no explanation, 
just the final number without any text.: "123456789012345" "987654321098765"

Expected Output

1111111110111110
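
Because Python integers are arbitrary precision, the expected sum can be verified directly:

print(123456789012345 + 987654321098765)   # 1111111110111110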

Result Recording

Each test result is recorded with:
{
  "int1": 123456789012345,
  "int2": 987654321098765,
  "duration_seconds": 2.456,
  "response": "1111111110111110",
  "model": "model-name",
  "status": "success",
  "reasoning": "optional reasoning trace"
}
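
A minimal sketch of how one such record might be assembled, assuming the field names above; the timing logic, the placeholder response, and the "failure" status string are illustrative assumptions, not taken from main.py:

import json
import time

start = time.monotonic()
response = "1111111110111110"   # stand-in for the model's raw output
int1, int2 = 123456789012345, 987654321098765

record = {
    "int1": int1,
    "int2": int2,
    "duration_seconds": round(time.monotonic() - start, 3),
    "response": response,
    "model": "model-name",
    "status": "success" if int(response.strip()) == int1 + int2 else "failure",
}
print(json.dumps(record, indent=2))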

Failure Cases

Common failure modes include (a short parsing demonstration follows this list):
  1. Showing Work: 123456789012345 + 987654321098765 = 1111111110111110
  2. Explanation: The sum is 1111111110111110
  3. Incorrect Calculation: 1111111110111111 (off by one)
  4. Scientific Notation: 1.11111111e15 (not accepted)
  5. Rounding Errors: 1111111110111000 (lost precision)
  6. Non-numeric Output: One quadrillion, one hundred eleven trillion...
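
Python's int() rejects every non-integer form above, which is why such outputs fail at the parsing step rather than at the comparison. A small standalone demonstration:

for output in [
    "1111111110111110",              # passes: a plain integer
    "1.11111111e15",                 # fails: scientific notation
    "The sum is 1111111110111110",   # fails: explanatory text
]:
    try:
        print(int(output.strip()))
    except ValueError:
        print(f"rejected: {output!r}")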

Performance Metrics

The benchmark tracks:
  • Success Rate: Percentage of correct additions across all tries
  • Duration: Time taken for each calculation
  • Error Types: Whether failures are due to parsing errors or incorrect arithmetic
  • Reasoning: Optional reasoning traces from reasoning-capable models
This benchmark is particularly revealing for identifying models that struggle with large number arithmetic or have trouble following strict output format requirements.
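
Given a list of result records in the shape shown under Result Recording, the headline metrics reduce to a few lines. A hypothetical sketch (the sample records and the "failure" status string are assumptions; the actual reporting code in main.py may differ):

# Hypothetical records in the shape shown under "Result Recording"
results = [
    {"status": "success", "duration_seconds": 2.456},
    {"status": "failure", "duration_seconds": 1.903},
]

total = len(results)
successes = sum(1 for r in results if r["status"] == "success")
mean_duration = sum(r["duration_seconds"] for r in results) / total

print(f"Success rate: {successes / total:.1%}")    # 50.0% for this sample
print(f"Mean duration: {mean_duration:.3f}s")      # mean of the sample durations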

Difficulty Factors

The challenge of this benchmark increases with:
  • Number Size: Larger integers (approaching 30 digits) are more difficult
  • Carry Operations: Numbers requiring many carry operations are harder (see the sketch after this list)
  • Output Discipline: Models must resist explaining their work
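
To make the carry factor concrete, this standalone sketch (not part of the benchmark) counts the carries produced by a schoolbook digit-by-digit addition; the example pair from above generates a carry in 14 of its 15 digit positions:

def count_carries(a: int, b: int) -> int:
    """Count the carries produced when adding a and b digit by digit."""
    carries = carry = 0
    while a or b or carry:
        digit_sum = a % 10 + b % 10 + carry
        carry = 1 if digit_sum >= 10 else 0
        carries += carry
        a //= 10
        b //= 10
    return carries

print(count_carries(123456789012345, 987654321098765))  # 14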
