Layer normalization stabilizes training by normalizing each token’s embedding vector independently. Without it, activations in a deep transformer drift in scale across layers, gradients explode or vanish, and training diverges. Unlike batch normalization, which normalizes across the batch dimension, layer normalization normalizes across the channel dimension C for a single (b, t) token, making it independent of batch size and safe to use during inference. It runs before every attention block and every feed-forward block in the GPT-2 architecture.
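In symbols, the operation applied to the vector x of one (b, t) token can be written as below. This is the standard layer normalization definition rather than a formula taken from this codebase; μ and σ² are notation introduced here, while γ, β, and ε correspond to the weight, bias, and eps parameters.

$$
\mu = \frac{1}{C}\sum_{i=0}^{C-1} x_i, \qquad
\sigma^2 = \frac{1}{C}\sum_{i=0}^{C-1} (x_i - \mu)^2, \qquad
y_i = \gamma_i \, \frac{x_i - \mu}{\sqrt{\sigma^2 + \varepsilon}} + \beta_i
$$

Both sums run over the C channels of a single token only, so no statistic depends on other tokens or on other sequences in the batch.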
Function Signature
| Parameter | Shape | Description |
|---|---|---|
| out | (B, T, C) | Output tensor: normalized, scaled, and shifted activations |
| x | (B, T, C) | Input tensor: one C-dimensional vector per token |
| weight | (C,) | Scale (γ): learned, one value per channel |
| bias | (C,) | Shift (β): learned, one value per channel |
| eps | scalar | Small constant (typically 1e-5) added to the variance to prevent division by zero |
| B | scalar | Batch size |
| T | scalar | Sequence length |
| C | scalar | Embedding dimension (channels) |
The Four-Step Algorithm
Layer normalization transforms each token’s C-dimensional vector through four sequential passes: mean, variance, reciprocal standard deviation, and the final normalize, scale, and shift.
Reciprocal Standard Deviation (rstd)
Compute 1 / sqrt(var + eps). The eps term prevents a divide-by-zero when all inputs are identical (variance = 0). Precomputing this reciprocal once per token replaces the C per-channel divisions in the final step with multiplications, which are typically cheaper.
Full Implementation
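A minimal sketch of the four passes in C++, assuming contiguous row-major float buffers. The function name `layernorm_forward`, the parameter order, and the local name `out_bt` are assumptions made for illustration; only the parameter names come from the table above. The variance here is the population variance (divided by C).

```cpp
#include <cmath>

// Hypothetical name and parameter order, chosen for illustration only.
// x and out are contiguous (B, T, C) buffers; weight and bias have length C.
void layernorm_forward(float* out, const float* x, const float* weight,
                       const float* bias, float eps, int B, int T, int C) {
    for (int b = 0; b < B; ++b) {
        for (int t = 0; t < T; ++t) {
            // Pointers to the start of this token's C-dimensional vector.
            const float* x_bt = x + b * T * C + t * C;
            float* out_bt = out + b * T * C + t * C;

            // Pass 1: mean over the C channels.
            float mean = 0.0f;
            for (int i = 0; i < C; ++i) {
                mean += x_bt[i];
            }
            mean /= C;

            // Pass 2: population variance (divide by C, not C - 1).
            float var = 0.0f;
            for (int i = 0; i < C; ++i) {
                const float d = x_bt[i] - mean;
                var += d * d;
            }
            var /= C;

            // Pass 3: reciprocal standard deviation, computed once per token.
            const float rstd = 1.0f / std::sqrt(var + eps);

            // Pass 4: normalize, then scale by weight and shift by bias.
            for (int i = 0; i < C; ++i) {
                out_bt[i] = (x_bt[i] - mean) * rstd * weight[i] + bias[i];
            }
        }
    }
}
```

The mean and variance are computed in separate passes here for readability; a single-pass formulation is possible but is not what this sketch shows.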
Loop Structure and Flat Indexing
The outer two loops iterate over every token position: B × T positions in total. For each position, the four steps operate on a single C-dimensional vector. The flat-index expression that locates that vector is b*T*C + t*C + 0: the general offset b*T*C + t*C + c with the channel offset c dropped, because x_bt is a pointer to the start of the vector. Individual channel elements are then x_bt[i] for i in 0..C-1.
The output pointer uses the identical formula, offset from out instead of x.
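The indexing can be sketched as two small helpers, assuming a contiguous row-major (B, T, C) buffer. The names `flat_index` and `token_ptr` are hypothetical and introduced only for this example.

```cpp
#include <cstddef>

// Hypothetical helper: offset of channel c of token (b, t) in a contiguous
// (B, T, C) buffer. (b*T + t)*C + c is the same value as b*T*C + t*C + c.
inline std::size_t flat_index(int b, int t, int c, int T, int C) {
    return (static_cast<std::size_t>(b) * T + t) * C + c;
}

// Hypothetical helper: pointer to the first channel of token (b, t).
// The channel offset is 0, so the result plays the role of x_bt.
inline const float* token_ptr(const float* x, int b, int t, int T, int C) {
    return x + flat_index(b, t, 0, T, C);
}
```

With T = 3 and C = 4, token (b = 1, t = 2) starts at offset 1*3*4 + 2*4 = 20, and its channels occupy offsets 20 through 23.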
Worked Example
The input vector [1, 2, 3, 4] has mean 2.5 and variance 1.25. After normalizing and applying the identity scale (weight = [1,1,1,1]) plus the additive bias [1,2,3,4], each output channel is the normalized value shifted by the corresponding bias value.
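Carrying the arithmetic through gives the numbers below, assuming eps = 1e-5 (the typical value, not one stated for this example) and rounding to four decimals. The symbol x̂ for the normalized vector is notation introduced here.

$$
\begin{aligned}
\text{rstd} &= \frac{1}{\sqrt{1.25 + 10^{-5}}} \approx 0.8944 \\
\hat{x} &= (x - 2.5) \cdot \text{rstd} \approx [-1.3416,\ -0.4472,\ 0.4472,\ 1.3416] \\
\text{out} &= \hat{x} \cdot [1, 1, 1, 1] + [1, 2, 3, 4] \approx [-0.3416,\ 1.5528,\ 3.4472,\ 5.3416]
\end{aligned}
$$

The normalized values are symmetric around zero because the input is evenly spaced around its mean; the bias then moves each channel by a different amount.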