Layer Normalization: Per-Token Normalization in C++

Layer normalization stabilizes training by normalizing each token’s embedding vector independently. Without it, activations in a deep transformer can drift in scale from layer to layer, gradients can explode or vanish, and training may diverge. Unlike batch normalization, which normalizes across the batch dimension, layer normalization normalizes across the channel dimension C for a single (b, t) token. Its statistics therefore do not depend on batch size, and it behaves identically during training and inference. In the GPT-2 architecture it runs before every attention block and every feed-forward block.

Function Signature

void layernorm(float* out, float* x, float* weight, float* bias,
               float eps, int B, int T, int C);

Parameter	Shape	Description
`out`	`(B, T, C)`	Output tensor — normalized, scaled, and shifted activations
`x`	`(B, T, C)`	Input tensor — one C-dimensional vector per token
`weight`	`(C,)`	Scale (γ) — learned, one value per channel
`bias`	`(C,)`	Shift (β) — learned, one value per channel
`eps`	—	Small constant (typically `1e-5`) to prevent division by zero
`B`	—	Batch size
`T`	—	Sequence length
`C`	—	Embedding dimension (channels)

The Four-Step Algorithm

Layer normalization transforms each token’s C-dimensional vector in four sequential steps. Steps 1, 2, and 4 each loop over the C channels; step 3 is a single scalar computation.

Mean

Sum all C elements of the token vector and divide by C.

mean = (x[0] + x[1] + ... + x[C-1]) / C

Variance

Compute the average squared deviation from the mean.

var = ((x[0] - mean)^2 + (x[1] - mean)^2 + ... + (x[C-1] - mean)^2) / C

Reciprocal Standard Deviation (rstd)

Compute 1 / sqrt(var + eps). The eps term prevents a divide-by-zero when all inputs are identical (variance = 0). This precomputed reciprocal replaces a division in the final step with a multiplication, which is faster.

rstd = 1 / sqrt(var + eps)

Normalize, Scale, and Shift

Apply the learned affine transform to each channel.

out[i] = (x[i] - mean) * rstd * weight[i] + bias[i]

weight[i] is the per-channel scale (γ) and bias[i] is the per-channel shift (β). Both are learned during training.
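
As a quick sanity check on the four steps, the sketch below applies them to a single token held in a std::vector. The names layernorm_token, mean_of, and variance_of are hypothetical helpers introduced for illustration; they are not part of the original code. The point is only that, with weight set to all ones and bias set to all zeros, the output should have a mean close to 0 and a variance close to 1.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original code): the four steps
// applied to one token's vector.
std::vector<float> layernorm_token(const std::vector<float>& x,
                                   const std::vector<float>& weight,
                                   const std::vector<float>& bias,
                                   float eps) {
    assert(!x.empty() && x.size() == weight.size() && x.size() == bias.size());
    const float C = static_cast<float>(x.size());

    float mean = 0.0f;                                  // Step 1: mean
    for (float v : x) mean += v;
    mean /= C;

    float var = 0.0f;                                   // Step 2: variance
    for (float v : x) var += (v - mean) * (v - mean);
    var /= C;

    const float rstd = 1.0f / std::sqrt(var + eps);     // Step 3: rstd

    std::vector<float> out(x.size());                   // Step 4: normalize, scale, shift
    for (std::size_t i = 0; i < x.size(); i++) {
        out[i] = (x[i] - mean) * rstd * weight[i] + bias[i];
    }
    return out;
}

// Mean of a vector, used only to inspect the result.
float mean_of(const std::vector<float>& v) {
    assert(!v.empty());
    float s = 0.0f;
    for (float e : v) s += e;
    return s / static_cast<float>(v.size());
}

// Population variance of a vector (divides by the element count).
float variance_of(const std::vector<float>& v) {
    const float m = mean_of(v);
    float s = 0.0f;
    for (float e : v) s += (e - m) * (e - m);
    return s / static_cast<float>(v.size());
}
```

For an input such as {3, -1, 7, 0.5}, mean_of(out) should land near 0 and variance_of(out) should land slightly below 1, because eps is added to the variance before the square root.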

Full Implementation

#include <cmath>  // for sqrt

void layernorm(float* out, float* x, float* weight, float* bias,
               float eps, int B, int T, int C) {

    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            // Get a pointer to token (b, t)'s C-dim vector
            float* x_bt = x + b * T * C + t * C;

            // Step 1: calculate the mean
            float total_sum = 0;
            for (int i = 0; i < C; i++) {
                total_sum += x_bt[i];
            }
            float mean = total_sum / C;

            // Step 2: calculate the variance
            float sum = 0;
            for (int i = 0; i < C; i++) {
                sum += (x_bt[i] - mean) * (x_bt[i] - mean);
            }
            float var = sum / C;

            // Step 3: calculate the rstd
            float rstd = 1.0f / std::sqrt(var + eps);

            // Step 4: Normalize + Scale + Shift
            float* out_bt = out + b * T * C + t * C;
            for (int i = 0; i < C; i++) {
                out_bt[i] = (x_bt[i] - mean) * rstd * weight[i] + bias[i];
            }
        }
    }
}

Loop Structure and Flat Indexing

The outer two loops iterate over every token position: B × T positions in total. For each position, the four steps operate on a single C-dimensional vector. The flat-index expression that locates that vector is:

float* x_bt = x + b * T * C + t * C;

This is the 3D indexing rule b*T*C + t*C + c evaluated at c = 0: the channel offset is dropped because x_bt points to the start of the token's vector. Individual channel elements are then x_bt[i] for i in 0..C-1. The output pointer uses the identical formula:

float* out_bt = out + b * T * C + t * C;
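
The indexing rule can also be written out as a pair of small helpers. This is a minimal sketch assuming a contiguous, row-major (B, T, C) buffer; flat_index and token_ptr are hypothetical names that do not appear in the original code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the original code): flat offset of
// element (b, t, c) in a contiguous row-major (B, T, C) buffer.
std::size_t flat_index(int b, int t, int c, int T, int C) {
    assert(b >= 0 && t >= 0 && c >= 0 && t < T && c < C);
    return static_cast<std::size_t>(b) * T * C
         + static_cast<std::size_t>(t) * C
         + static_cast<std::size_t>(c);
}

// Pointer to the start of token (b, t)'s vector: the c = 0 case,
// equivalent to x + b * T * C + t * C.
float* token_ptr(float* x, int b, int t, int T, int C) {
    return x + flat_index(b, t, 0, T, C);
}
```

With B = 2, T = 3, C = 4, token (1, 0) starts at offset 12, so token_ptr(x, 1, 0, 3, 4)[2] reads the same element as x[flat_index(1, 0, 2, 3, 4)], which is x[14].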

Worked Example

#include <iostream>

// Assumes layernorm() from the Full Implementation section is defined above.
int main() {
    float A[]       = {1, 2, 3, 4};   // input, shape (1, 1, 4): one token, four channels
    float weights[] = {1, 1, 1, 1};   // scale: all ones (no rescaling)
    float bias[]    = {1, 2, 3, 4};   // shift: channel-wise offset
    float out[4]    = {0};
    float eps       = 1e-5f;

    layernorm(out, A, weights, bias, eps, 1, 1, 4);

    // Print one output channel per line
    for (int i = 0; i < 4; i++) {
        std::cout << out[i] << std::endl;
    }
    return 0;
}

The input [1, 2, 3, 4] has mean 2.5 and variance 1.25, so rstd = 1 / sqrt(1.25 + 1e-5) ≈ 0.8944. The normalized values are approximately [-1.342, -0.447, 0.447, 1.342]. The identity scale (weight = [1, 1, 1, 1]) leaves them unchanged, and adding the bias [1, 2, 3, 4] channel by channel gives an output of approximately [-0.342, 1.553, 3.447, 5.342].
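
The worked example can also be turned into a small self-check. The sketch below repeats the same algorithm as the layernorm function so that it compiles on its own; run_worked_example and close_to are hypothetical helpers added for illustration, and the 1e-3 tolerance is an arbitrary choice.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Same algorithm as layernorm in the Full Implementation section,
// repeated so that this sketch compiles on its own.
void layernorm(float* out, float* x, float* weight, float* bias,
               float eps, int B, int T, int C) {
    assert(C > 0);
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            float* x_bt = x + b * T * C + t * C;
            float* out_bt = out + b * T * C + t * C;

            float mean = 0.0f;
            for (int i = 0; i < C; i++) mean += x_bt[i];
            mean /= C;

            float var = 0.0f;
            for (int i = 0; i < C; i++) var += (x_bt[i] - mean) * (x_bt[i] - mean);
            var /= C;

            float rstd = 1.0f / std::sqrt(var + eps);
            for (int i = 0; i < C; i++) {
                out_bt[i] = (x_bt[i] - mean) * rstd * weight[i] + bias[i];
            }
        }
    }
}

// Hypothetical helper: runs the worked example and returns the four outputs.
std::vector<float> run_worked_example() {
    float x[]      = {1, 2, 3, 4};
    float weight[] = {1, 1, 1, 1};
    float bias[]   = {1, 2, 3, 4};
    std::vector<float> out(4, 0.0f);
    layernorm(out.data(), x, weight, bias, 1e-5f, 1, 1, 4);
    return out;
}

// Hypothetical helper: true if a and b differ by less than tol.
bool close_to(float a, float b, float tol = 1e-3f) {
    return std::fabs(a - b) < tol;
}
```

Comparing each element of run_worked_example() against -0.3416, 1.5528, 3.4472, and 5.3416 with close_to should succeed for all four channels.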

The eps parameter (typically 1e-5) is small enough to have a negligible effect on normal activations, but it is essential when all inputs to a token are identical, for example a zero-initialized embedding during early training. Without eps, var = 0, so rstd = 1 / sqrt(0) is a division by zero: with IEEE 754 floats rstd becomes infinite, and multiplying it by (x[i] - mean) = 0 produces NaN outputs. Always include it.
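
The zero-variance case can be reproduced in isolation. This sketch assumes IEEE 754 floating point and no fast-math compiler flags; rstd_from_var and normalize_one are hypothetical helpers that isolate step 3 and the normalization part of step 4.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not from the original code): step 3 in isolation.
float rstd_from_var(float var, float eps) {
    assert(var >= 0.0f && eps >= 0.0f);
    return 1.0f / std::sqrt(var + eps);
}

// Hypothetical helper: one normalized value, (x - mean) * rstd,
// before the scale and shift are applied.
float normalize_one(float x, float mean, float rstd) {
    return (x - mean) * rstd;
}
```

Under those assumptions, rstd_from_var(0.0f, 0.0f) is infinite and normalize_one(2.0f, 2.0f, rstd) is NaN, whereas rstd_from_var(0.0f, 1e-5f) is about 316 and the normalized value is exactly 0, so each output channel falls back to bias[i].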
