Layer normalization stabilizes training by normalizing each token's embedding vector independently. Without it, activations in a deep transformer can drift in scale across layers, gradients can explode or vanish, and training may diverge. Unlike batch normalization, which normalizes across the batch dimension, layer normalization normalizes across the channel dimension C for a single (b, t) token. This makes it independent of batch size, so it behaves the same way during training and inference. It runs before every attention block and every feed-forward block in the GPT-2 architecture.

Function Signature

void layernorm(float* out, float* x, float* weight, float* bias,
               float eps, int B, int T, int C);
| Parameter | Shape | Description |
| --- | --- | --- |
| out | (B, T, C) | Output tensor: normalized, scaled, and shifted activations |
| x | (B, T, C) | Input tensor: one C-dimensional vector per token |
| weight | (C,) | Scale (γ): learned, one value per channel |
| bias | (C,) | Shift (β): learned, one value per channel |
| eps | scalar | Small constant (typically 1e-5) to prevent division by zero |
| B | scalar | Batch size |
| T | scalar | Sequence length |
| C | scalar | Embedding dimension (channels) |

The Four-Step Algorithm

Layer normalization transforms each token's C-dimensional vector in four sequential steps. Steps 1, 2, and 4 each loop over the C channels; step 3 is a single scalar computation.
1. Mean

Sum all C elements of the token vector and divide by C.
mean = (x[0] + x[1] + ... + x[C-1]) / C

2. Variance

Compute the average squared deviation from the mean.
var = ((x[0] - mean)^2 + (x[1] - mean)^2 + ... + (x[C-1] - mean)^2) / C

3. Reciprocal Standard Deviation (rstd)

Compute 1 / sqrt(var + eps). The eps term prevents a divide-by-zero when all inputs are identical (variance = 0). This precomputed reciprocal replaces a division in the final step with a multiplication, which is faster.
rstd = 1 / sqrt(var + eps)

4. Normalize, Scale, and Shift

Apply the learned affine transform to each channel.
out[i] = (x[i] - mean) * rstd * weight[i] + bias[i]
weight[i] is the per-channel scale (γ) and bias[i] is the per-channel shift (β). Both are learned during training.
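One way to sanity-check the four steps is to confirm that, with unit weights and zero bias, the output of a single token has a mean near 0 and a variance near 1. The sketch below is an illustration only: `normalize_token`, `mean_of`, and `variance_of` are hypothetical helper names that do not appear in the original code, and `std::vector` is used purely for convenience.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: applies the four steps to one C-dimensional token vector.
std::vector<float> normalize_token(const std::vector<float>& x,
                                   const std::vector<float>& weight,
                                   const std::vector<float>& bias,
                                   float eps) {
    const int C = static_cast<int>(x.size());
    assert(C > 0 && weight.size() == x.size() && bias.size() == x.size());

    // Step 1: mean
    float mean = 0.0f;
    for (float v : x) mean += v;
    mean /= C;

    // Step 2: variance (average squared deviation from the mean)
    float var = 0.0f;
    for (float v : x) var += (v - mean) * (v - mean);
    var /= C;

    // Step 3: reciprocal standard deviation
    const float rstd = 1.0f / std::sqrt(var + eps);

    // Step 4: normalize, scale, shift
    std::vector<float> out(C);
    for (int i = 0; i < C; i++) {
        out[i] = (x[i] - mean) * rstd * weight[i] + bias[i];
    }
    return out;
}

// Hypothetical helper: arithmetic mean of a vector.
float mean_of(const std::vector<float>& v) {
    float s = 0.0f;
    for (float e : v) s += e;
    return s / static_cast<float>(v.size());
}

// Hypothetical helper: population variance (divides by the element count).
float variance_of(const std::vector<float>& v) {
    const float m = mean_of(v);
    float s = 0.0f;
    for (float e : v) s += (e - m) * (e - m);
    return s / static_cast<float>(v.size());
}
```

Calling `normalize_token({1, 2, 3, 4}, {1, 1, 1, 1}, {0, 0, 0, 0}, 1e-5f)` should give a result whose `mean_of` is close to 0 and whose `variance_of` is slightly below 1, because eps inflates the denominator a little.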

Full Implementation

#include <cmath>  // std::sqrt

void layernorm(float* out, float* x, float* weight, float* bias,
               float eps, int B, int T, int C) {

    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            // Get a pointer to token (b, t)'s C-dim vector
            float* x_bt = x + b * T * C + t * C;

            // Step 1: calculate the mean
            float total_sum = 0.0f;
            for (int i = 0; i < C; i++) {
                total_sum += x_bt[i];
            }
            float mean = total_sum / C;

            // Step 2: calculate the variance
            float sum = 0.0f;
            for (int i = 0; i < C; i++) {
                sum += (x_bt[i] - mean) * (x_bt[i] - mean);
            }
            float var = sum / C;

            // Step 3: calculate the rstd (float overload of sqrt, no double promotion)
            float rstd = 1.0f / std::sqrt(var + eps);

            // Step 4: Normalize + Scale + Shift
            float* out_bt = out + b * T * C + t * C;
            for (int i = 0; i < C; i++) {
                out_bt[i] = (x_bt[i] - mean) * rstd * weight[i] + bias[i];
            }
        }
    }
}

Loop Structure and Flat Indexing

The outer two loops iterate over every token position: B × T positions in total. For each position, the four steps operate on a single C-dimensional vector. The flat-index expression that locates that vector is:
float* x_bt = x + b * T * C + t * C;
This is the row-major 3D index b*T*C + t*C + c evaluated at c = 0. Because x_bt points to the start of the token's vector, the channel offset is applied afterwards: individual channel elements are x_bt[i] for i in 0..C-1. The output pointer uses the identical formula:
float* out_bt = out + b * T * C + t * C;
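The indexing rule can be checked in isolation. The following is a minimal sketch, assuming a contiguous row-major (B, T, C) buffer; `flat_index` and `token_ptr` are hypothetical helper names introduced here for illustration and are not part of the original code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: flat offset of element (b, t, c) in a row-major
// (B, T, C) tensor. Equivalent to b*T*C + t*C + c.
std::size_t flat_index(int b, int t, int c, int T, int C) {
    assert(b >= 0 && t >= 0 && t < T && c >= 0 && c < C);
    return (static_cast<std::size_t>(b) * T + t) * C + c;
}

// Hypothetical helper: pointer to the start of token (b, t)'s C-dim vector,
// i.e. the same address as x + b*T*C + t*C.
float* token_ptr(float* x, int b, int t, int T, int C) {
    return x + flat_index(b, t, 0, T, C);
}
```

For example, with T = 3 and C = 4, token (b = 1, t = 2) starts at offset 1*3*4 + 2*4 = 20, so `token_ptr(x, 1, 2, 3, 4)[1]` refers to `x[21]`.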

Worked Example

#include <iostream>

// Uses layernorm() as defined in the Full Implementation section.
int main() {
    float A[]       = {1, 2, 3, 4};   // input, shape (1, 1, 4): one token, four channels
    float weights[] = {1, 1, 1, 1};   // scale: all ones (no rescaling)
    float bias[]    = {1, 2, 3, 4};   // shift: channel-wise offset
    float out[4]    = {0};
    float eps       = 1e-5f;

    layernorm(out, A, weights, bias, eps, 1, 1, 4);

    for (int i = 0; i < 4; i++) {
        std::cout << out[i] << '\n';
    }
    return 0;
}
The input [1, 2, 3, 4] has mean 2.5 and variance 1.25, so rstd = 1 / sqrt(1.25 + 1e-5) ≈ 0.8944 and the normalized values are approximately [-1.3416, -0.4472, 0.4472, 1.3416]. The identity scale (weight = [1,1,1,1]) leaves them unchanged, and the additive bias [1,2,3,4] shifts each channel by its own offset, giving an output of approximately [-0.3416, 1.5528, 3.4472, 5.3416].
The eps parameter (typically 1e-5) is small enough that its effect on typical activations is negligible, but it is essential when all inputs to a token are identical, for example a zero-initialized embedding during early training. In that case var = 0, so without eps rstd = 1 / sqrt(0) would be infinite, and multiplying it by the zero deviations (x[i] - mean) would produce NaN in every output channel. Always include it.
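The effect of eps on a constant input can be reproduced with a few lines. This is a sketch under the assumption of IEEE 754 floating-point arithmetic without fast-math optimizations; `first_normalized` is a hypothetical helper introduced only for this illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: normalizes one token vector (no scale or shift)
// and returns the normalized value of its first channel.
float first_normalized(const float* x, int C, float eps) {
    assert(C > 0);

    float mean = 0.0f;
    for (int i = 0; i < C; i++) mean += x[i];
    mean /= C;

    float var = 0.0f;
    for (int i = 0; i < C; i++) var += (x[i] - mean) * (x[i] - mean);
    var /= C;

    // When var + eps == 0 this is 1 / 0, which evaluates to +infinity.
    const float rstd = 1.0f / std::sqrt(var + eps);

    // For a constant input the deviation is 0, so this is 0 * rstd:
    // 0 when rstd is finite, NaN when rstd is infinite.
    return (x[0] - mean) * rstd;
}
```

With the constant input {3, 3, 3, 3}, `first_normalized(x, 4, 1e-5f)` returns 0, while `first_normalized(x, 4, 0.0f)` returns NaN, which can be detected with `std::isnan`.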
