Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/VrajPatel105/cpp-gpu-inference/llms.txt

Use this file to discover all available pages before exploring further.

Softmax converts a vector of raw logit scores into a probability distribution: all outputs are in the range (0, 1) and they sum to exactly 1.0. The naive formula — compute exp(x[i]) for every element and divide by the sum — fails in practice because exp(large_number) overflows to infinity in float32. The numerically stable version subtracts the maximum value before exponentiating, which keeps every input to exp at or below zero and therefore bounded by 1. The mathematical result is identical because the subtracted constant cancels in the numerator and denominator.

Why Numerical Stability Matters

Consider an input vector [1000, 1001, 1002]. The naive softmax computes exp(1000), exp(1001), exp(1002) — all of which are +inf in 32-bit float. The result is NaN (inf / inf). The stable version subtracts the max (1002) first, giving [-2, -1, 0], then computes exp(-2), exp(-1), exp(0) — all perfectly representable finite values. The output probabilities are exactly the same as the mathematically correct answer.
exp(x[i] - max) / sum(exp(x[j] - max))
= exp(x[i]) * exp(-max) / (sum(exp(x[j])) * exp(-max))
= exp(x[i]) / sum(exp(x[j]))
The exp(-max) factor cancels. Stability is free.

Function Signature

void softmax(float* out, float* x, float* weight, float* bias,
             float eps, int B, int T, int C);
ParameterShapeDescription
out(B, T, C)Output probabilities — each token’s C-dim distribution
x(B, T, C)Input logits — one C-dim vector per token
weight(C,)Unused in softmax; included for API consistency with layernorm
bias(C,)Unused in softmax; included for API consistency with layernorm
epsUnused in softmax; included for API consistency with layernorm
BBatch size
TSequence length
CNumber of classes / logit dimension

The Three-Pass Algorithm

1

Find the maximum value

Scan all C elements of the token’s vector and track the largest value. This is the value that will be subtracted before exponentiation.
float max_val = x_bt[0];
for (int i = 1; i < C; i++) {
    if (x_bt[i] > max_val) max_val = x_bt[i];
}
2

Subtract max, exponentiate, and sum

For each element, compute exp(x[i] - max_val) and write it to the output buffer while accumulating the sum. This single pass both fills the output with un-normalized probabilities and computes the denominator.
float sum = 0;
for (int i = 0; i < C; i++) {
    sum += exp(x_bt[i] - max_val);
    out_bt[i] = exp(x_bt[i] - max_val);
}
3

Normalize by the sum

Divide each output element by the accumulated sum to produce a proper probability distribution.
for (int i = 0; i < C; i++) {
    out_bt[i] = out_bt[i] / sum;
}

Full Implementation

void softmax(float* out, float* x, float* weight, float* bias,
             float eps, int B, int T, int C) {

    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            // Get a pointer to token (b, t)'s C-dim vector
            float* x_bt = x + b * T * C + t * C;

            // Step 1: Find the max value over C elements
            float max_val = x_bt[0];
            for (int i = 1; i < C; i++) {
                if (x_bt[i] > max_val) max_val = x_bt[i];
            }

            // Step 2: Subtract max, take exp of each, sum them up
            float sum = 0;
            float* out_bt = out + b * T * C + t * C;
            for (int i = 0; i < C; i++) {
                sum += exp(x_bt[i] - max_val);
                out_bt[i] = exp(x_bt[i] - max_val);
            }

            // Step 3: Divide each by the sum
            for (int i = 0; i < C; i++) {
                out_bt[i] = out_bt[i] / sum;
            }
        }
    }
}

Worked Example

int main() {
    float A[]    = {1, 2, 3, 4};   // input logits, shape (1, 1, 4)
    float out[4] = {0};
    float weights[] = {1, 1, 1, 1};
    float bias[]    = {1, 2, 3, 4};
    float eps = 1e-5;

    softmax(out, A, weights, bias, eps, 1, 1, 4);

    for (int i = 0; i < 4; i++) {
        cout << out[i] << endl;
    }
}
Output:
0.0320586
0.0871443
0.236883
0.643914
The probabilities sum to 1.0. The largest logit (4) receives the largest probability (~0.64) and the smallest logit (1) receives the smallest (~0.03). The distribution is exponentially peaked — a small difference in logit value produces a large difference in probability.
The same three-pass pattern appears inside attention_forward in train_gpt2_annotated.c for normalizing the Q·K attention scores. There, the input is a T-dimensional row of raw scores for one query token across all key positions, and the softmax is computed over that row to produce the attention weights. The max-subtract is done during the score computation pass (pass 1), the exp-and-sum in pass 2, and the normalization in pass 3 — directly mirroring the standalone softmax kernel here.

Build docs developers (and LLMs) love