Numerically Stable Softmax in C: Three-Pass Algorithm

Softmax converts a vector of raw logit scores into a probability distribution: all outputs are in the range (0, 1) and they sum to exactly 1.0. The naive formula — compute exp(x[i]) for every element and divide by the sum — fails in practice because exp(large_number) overflows to infinity in float32. The numerically stable version subtracts the maximum value before exponentiating, which keeps every input to exp at or below zero and therefore bounded by 1. The mathematical result is identical because the subtracted constant cancels in the numerator and denominator.

Why Numerical Stability Matters

Consider an input vector [1000, 1001, 1002]. The naive softmax computes exp(1000), exp(1001), exp(1002) — all of which are +inf in 32-bit float. The result is NaN (inf / inf). The stable version subtracts the max (1002) first, giving [-2, -1, 0], then computes exp(-2), exp(-1), exp(0) — all perfectly representable finite values. The output probabilities are exactly the same as the mathematically correct answer.

exp(x[i] - max) / sum(exp(x[j] - max))
= exp(x[i]) * exp(-max) / (sum(exp(x[j])) * exp(-max))
= exp(x[i]) / sum(exp(x[j]))

The exp(-max) factor cancels. Stability is free.

Function Signature

void softmax(float* out, float* x, float* weight, float* bias,
             float eps, int B, int T, int C);

Parameter	Shape	Description
`out`	`(B, T, C)`	Output probabilities — each token’s C-dim distribution
`x`	`(B, T, C)`	Input logits — one C-dim vector per token
`weight`	`(C,)`	Unused in softmax; included for API consistency with layernorm
`bias`	`(C,)`	Unused in softmax; included for API consistency with layernorm
`eps`	—	Unused in softmax; included for API consistency with layernorm
`B`	—	Batch size
`T`	—	Sequence length
`C`	—	Number of classes / logit dimension

The Three-Pass Algorithm

Find the maximum value

Scan all C elements of the token’s vector and track the largest value. This is the value that will be subtracted before exponentiation.

float max_val = x_bt[0];
for (int i = 1; i < C; i++) {
    if (x_bt[i] > max_val) max_val = x_bt[i];
}

Subtract max, exponentiate, and sum

For each element, compute exp(x[i] - max_val) and write it to the output buffer while accumulating the sum. This single pass both fills the output with un-normalized probabilities and computes the denominator.

float sum = 0;
for (int i = 0; i < C; i++) {
    sum += exp(x_bt[i] - max_val);
    out_bt[i] = exp(x_bt[i] - max_val);
}

Normalize by the sum

Divide each output element by the accumulated sum to produce a proper probability distribution.

for (int i = 0; i < C; i++) {
    out_bt[i] = out_bt[i] / sum;
}

Full Implementation

void softmax(float* out, float* x, float* weight, float* bias,
             float eps, int B, int T, int C) {

    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            // Get a pointer to token (b, t)'s C-dim vector
            float* x_bt = x + b * T * C + t * C;

            // Step 1: Find the max value over C elements
            float max_val = x_bt[0];
            for (int i = 1; i < C; i++) {
                if (x_bt[i] > max_val) max_val = x_bt[i];
            }

            // Step 2: Subtract max, take exp of each, sum them up
            float sum = 0;
            float* out_bt = out + b * T * C + t * C;
            for (int i = 0; i < C; i++) {
                sum += exp(x_bt[i] - max_val);
                out_bt[i] = exp(x_bt[i] - max_val);
            }

            // Step 3: Divide each by the sum
            for (int i = 0; i < C; i++) {
                out_bt[i] = out_bt[i] / sum;
            }
        }
    }
}

Worked Example

int main() {
    float A[]    = {1, 2, 3, 4};   // input logits, shape (1, 1, 4)
    float out[4] = {0};
    float weights[] = {1, 1, 1, 1};
    float bias[]    = {1, 2, 3, 4};
    float eps = 1e-5;

    softmax(out, A, weights, bias, eps, 1, 1, 4);

    for (int i = 0; i < 4; i++) {
        cout << out[i] << endl;
    }
}

Output:

The probabilities sum to 1.0. The largest logit (4) receives the largest probability (~0.64) and the smallest logit (1) receives the smallest (~0.03). The distribution is exponentially peaked — a small difference in logit value produces a large difference in probability.

The same three-pass pattern appears inside attention_forward in train_gpt2_annotated.c for normalizing the Q·K attention scores. There, the input is a T-dimensional row of raw scores for one query token across all key positions, and the softmax is computed over that row to produce the attention weights. The max-subtract is done during the score computation pass (pass 1), the exp-and-sum in pass 2, and the normalization in pass 3 — directly mirroring the standalone softmax kernel here.

Overview

C++ Core

LLM Kernels in C

GPU Fundamentals

Transformer in C++

Numerically Stable Softmax in C: Three-Pass Algorithm

Why Numerical Stability Matters

Function Signature

The Three-Pass Algorithm

Full Implementation

Worked Example

Build docs developers (and LLMs) love

Overview

C++ Core

LLM Kernels in C

GPU Fundamentals

Transformer in C++

Documentation Index

​Why Numerical Stability Matters

​Function Signature

​The Three-Pass Algorithm

​Full Implementation

​Worked Example

Build docs developers (and LLMs) love

Why Numerical Stability Matters

Function Signature

The Three-Pass Algorithm

Full Implementation

Worked Example