Softmax converts a vector of raw logit scores into a probability distribution: all outputs are in the rangeDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/VrajPatel105/cpp-gpu-inference/llms.txt
Use this file to discover all available pages before exploring further.
(0, 1) and they sum to exactly 1.0. The naive formula — compute exp(x[i]) for every element and divide by the sum — fails in practice because exp(large_number) overflows to infinity in float32. The numerically stable version subtracts the maximum value before exponentiating, which keeps every input to exp at or below zero and therefore bounded by 1. The mathematical result is identical because the subtracted constant cancels in the numerator and denominator.
Why Numerical Stability Matters
Consider an input vector[1000, 1001, 1002]. The naive softmax computes exp(1000), exp(1001), exp(1002) — all of which are +inf in 32-bit float. The result is NaN (inf / inf). The stable version subtracts the max (1002) first, giving [-2, -1, 0], then computes exp(-2), exp(-1), exp(0) — all perfectly representable finite values. The output probabilities are exactly the same as the mathematically correct answer.
exp(-max) factor cancels. Stability is free.
Function Signature
| Parameter | Shape | Description |
|---|---|---|
out | (B, T, C) | Output probabilities — each token’s C-dim distribution |
x | (B, T, C) | Input logits — one C-dim vector per token |
weight | (C,) | Unused in softmax; included for API consistency with layernorm |
bias | (C,) | Unused in softmax; included for API consistency with layernorm |
eps | — | Unused in softmax; included for API consistency with layernorm |
B | — | Batch size |
T | — | Sequence length |
C | — | Number of classes / logit dimension |
The Three-Pass Algorithm
Find the maximum value
Scan all C elements of the token’s vector and track the largest value. This is the value that will be subtracted before exponentiation.
Subtract max, exponentiate, and sum
For each element, compute
exp(x[i] - max_val) and write it to the output buffer while accumulating the sum. This single pass both fills the output with un-normalized probabilities and computes the denominator.Full Implementation
Worked Example
1.0. The largest logit (4) receives the largest probability (~0.64) and the smallest logit (1) receives the smallest (~0.03). The distribution is exponentially peaked — a small difference in logit value produces a large difference in probability.
The same three-pass pattern appears inside
attention_forward in train_gpt2_annotated.c for normalizing the Q·K attention scores. There, the input is a T-dimensional row of raw scores for one query token across all key positions, and the softmax is computed over that row to produce the attention weights. The max-subtract is done during the score computation pass (pass 1), the exp-and-sum in pass 2, and the normalization in pass 3 — directly mirroring the standalone softmax kernel here.