The encoder and decoder blocks are the repeating units of the transformer. They are not new primitives — they are specific compositions of components already covered: attention, layer normalization, feed-forward networks, and residual connections. Understanding a block means understanding the order in which those components are called, which buffers are passed between them, and what the data looks like at each step. This page walks through the feed-forward network, residual connection, and then the full encoder and decoder block implementations as they appear in model.cpp.

Feed-forward network

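Every block ends with a position-wise feed-forward network (FFN), applied after its attention sub-layer or sub-layers: two linear projections with a ReLU nonlinearity in between. The hidden dimension d_ff, passed in by the caller, is set to 4 × d_model, expanding the representation into a wider space before projecting back down. "Position-wise" means the same weights are applied to each of the B × T token vectors independently, which is why both matmul calls treat the input as a single matrix with B*T rows.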

Function signature

void feedforward_forward(float* out, float* x, float* W1, float* b1, float* W2, float* b2,
                         int B, int T, int d_model, int d_ff);

Implementation

void feedforward_forward(float* out, float* x, float* W1, float* b1, float* W2, float* b2,
                         int B, int T, int d_model, int d_ff){
    float* intermediate = new float[B * T * d_ff]();

    // First matmul: x * W1 + b1 -> intermediate  (d_model -> d_ff)
    matmul(x, W1, b1, intermediate, B*T, d_model, d_ff);

    // ReLU: clamp negative values to zero
    for(int ele = 0; ele < (B*T*d_ff); ele++){
        if(intermediate[ele] < 0) intermediate[ele] = 0;
    }

    // Second matmul: intermediate * W2 + b2 -> out  (d_ff -> d_model)
    matmul(intermediate, W2, b2, out, B*T, d_ff, d_model);

    delete[] intermediate;
}
The three steps map directly to the formula FFN(x) = max(0, xW₁ + b₁)W₂ + b₂:
  1. matmul(x, W1, b1, intermediate, ...) — project from d_model to d_ff, add bias b1.
  2. In-place ReLU loop — every negative value becomes zero.
  3. matmul(intermediate, W2, b2, out, ...) — project back from d_ff to d_model, add bias b2.
The single intermediate buffer is allocated with new[] (the trailing () value-initializes it to zero), holds the expanded B*T × d_ff representation, and is freed with delete[] before returning.
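The three steps can be checked by hand on a tiny input. The sketch below is not the project's code: matmul_sketch is a hypothetical stand-in whose argument order and row-major layout are assumptions inferred from the call sites above, and ffn_sketch swaps the raw buffer for a std::vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the project's matmul (assumed semantics):
// out[M x N] = A[M x K] * W[K x N] + bias[N], all matrices row-major.
void matmul_sketch(const float* A, const float* W, const float* bias, float* out,
                   int M, int K, int N) {
    assert(M > 0 && K > 0 && N > 0);
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            float acc = bias[n];
            for (int k = 0; k < K; ++k) {
                acc += A[m * K + k] * W[k * N + n];
            }
            out[m * N + n] = acc;
        }
    }
}

// Same three steps as feedforward_forward: expand, ReLU, project back.
// rows plays the role of B*T.
std::vector<float> ffn_sketch(const std::vector<float>& x,
                              const std::vector<float>& W1, const std::vector<float>& b1,
                              const std::vector<float>& W2, const std::vector<float>& b2,
                              int rows, int d_model, int d_ff) {
    std::vector<float> hidden(static_cast<std::size_t>(rows) * d_ff);
    std::vector<float> out(static_cast<std::size_t>(rows) * d_model);

    // Step 1: d_model -> d_ff, plus bias b1.
    matmul_sketch(x.data(), W1.data(), b1.data(), hidden.data(), rows, d_model, d_ff);

    // Step 2: in-place ReLU.
    for (float& h : hidden) {
        if (h < 0.0f) h = 0.0f;
    }

    // Step 3: d_ff -> d_model, plus bias b2.
    matmul_sketch(hidden.data(), W2.data(), b2.data(), out.data(), rows, d_ff, d_model);
    return out;
}
```

With one token x = (1, -2), W1 = [[1, 0, -1], [0, 1, 1]], b1 = (0, 0, 4), W2 = [[1, 2], [3, 4], [5, 6]] and b2 = (0.5, -0.5), the hidden vector is (1, -2, 1) before ReLU and (1, 0, 1) after it, so the output is (6.5, 7.5).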

Residual connection

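The residual connection is the simplest function in the codebase: an element-wise add over all B × T × d_model values. The three nested loops spell out the flattened row-major index, visiting every element exactly once: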
void residual(float* out, float* x, float* sublayer_out, int B, int T, int d_model){
    for(int b = 0; b<B; b++){
        for(int t = 0; t<T; t++){
            for(int row = 0; row < d_model; row++){
                out[b*T*d_model + t*d_model + row] =
                    x[b*T*d_model + t*d_model + row] +
                    sublayer_out[b*T*d_model + t*d_model + row];
            }
        }
    }
}
x is the input to the sub-layer: the block input for the first sub-layer, the previous layernorm output for later ones. sublayer_out is the result of attention or FFN. Their sum is the residual-connected output. This skip connection gives the input signal a direct path around every sub-layer, so each sub-layer only has to produce an additive update. That direct path is a key reason deep transformer stacks can be trained stably.
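Because the index b*T*d_model + t*d_model + row takes each value from 0 to B*T*d_model - 1 exactly once, the triple loop could equivalently be written as a single pass. The following is a minimal sketch of that equivalent form; residual_flat is a hypothetical name, not a function in model.cpp.

```cpp
#include <cassert>
#include <cstddef>

// Flat equivalent of residual(): one loop over the contiguous buffer.
// Assumes all three buffers hold B*T*d_model floats in the same layout.
void residual_flat(float* out, const float* x, const float* sublayer_out,
                   int B, int T, int d_model) {
    assert(B >= 0 && T >= 0 && d_model >= 0);
    const std::size_t n = static_cast<std::size_t>(B) * T * d_model;
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = x[i] + sublayer_out[i];
    }
}
```

Computing the element count in std::size_t also avoids int overflow for large B*T*d_model products, which the nested-loop index arithmetic does not guard against.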

Encoder block

The encoder block applies two sub-layers in sequence, each wrapped in a residual connection and layer normalization. The variant used here is Post-LN (normalize after the residual add), matching the original Transformer paper, "Attention Is All You Need". Order of operations:
  1. Self-attention on x (no causal mask)
  2. Residual: residual1 = x + attn_out
  3. LayerNorm: norm1 = layernorm(residual1)
  4. Feed-forward on norm1
  5. Residual: residual2 = norm1 + ff_out
  6. LayerNorm → output
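To make the Post-LN ordering concrete, here is a small sketch that wraps an arbitrary sub-layer in both orderings. It is illustrative only: layernorm_sketch is a simplified stand-in (single vector, gamma = 1, beta = 0), and post_ln_step, pre_ln_step, and Sublayer are hypothetical names. Pre-LN appears purely for contrast and is not used by the code on this page.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<float>;
using Sublayer = std::function<Vec(const Vec&)>;

// Simplified layer norm over one token vector.
// Assumption: gamma = 1 and beta = 0, unlike the project's layernorm.
Vec layernorm_sketch(const Vec& v, float eps = 1e-5f) {
    assert(!v.empty());
    float mean = 0.0f;
    for (float e : v) mean += e;
    mean /= static_cast<float>(v.size());
    float var = 0.0f;
    for (float e : v) var += (e - mean) * (e - mean);
    var /= static_cast<float>(v.size());
    Vec out(v.size());
    for (std::size_t i = 0; i < v.size(); ++i) {
        out[i] = (v[i] - mean) / std::sqrt(var + eps);
    }
    return out;
}

// Post-LN (the order used by encoder_block): LayerNorm(x + Sublayer(x)).
Vec post_ln_step(const Vec& x, const Sublayer& f) {
    Vec fx = f(x);
    assert(fx.size() == x.size());
    Vec sum(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) sum[i] = x[i] + fx[i];
    return layernorm_sketch(sum);
}

// Pre-LN, shown only for contrast: x + Sublayer(LayerNorm(x)).
Vec pre_ln_step(const Vec& x, const Sublayer& f) {
    Vec fx = f(layernorm_sketch(x));
    assert(fx.size() == x.size());
    Vec out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = x[i] + fx[i];
    return out;
}
```

In the Post-LN form the value handed to the next sub-layer is always a normalized vector, which is why steps 4 and 5 above read from norm1 rather than from residual1.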

Function signature

void encoder_block(float* out, float* x,
                   float* Wq, float* Wk, float* Wv, float* Wo,
                   float* W1, float* b1, float* W2, float* b2,
                   float* gamma1, float* beta1,
                   float* gamma2, float* beta2,
                   float eps, int B, int T, int num_heads, int d_model, int d_ff);

Implementation

void encoder_block(float* out, float* x,
                   float* Wq, float* Wk, float* Wv, float* Wo,
                   float* W1, float* b1, float* W2, float* b2,
                   float* gamma1, float* beta1,
                   float* gamma2, float* beta2,
                   float eps, int B, int T, int num_heads, int d_model, int d_ff){

    float* attn_out  = new float[B * T * d_model];
    float* residual1 = new float[B * T * d_model];
    float* norm1     = new float[B * T * d_model];
    float* ff_out    = new float[B * T * d_model];
    float* residual2 = new float[B * T * d_model];

    // Sub-layer 1: self-attention + residual + layernorm
    attention_forward(attn_out, x, x, x, Wq, Wk, Wv, Wo, B, T, num_heads, d_model, false);
    residual(residual1, x, attn_out, B, T, d_model);
    layernorm(norm1, residual1, gamma1, beta1, eps, B, T, d_model);

    // Sub-layer 2: FFN + residual + layernorm
    feedforward_forward(ff_out, norm1, W1, b1, W2, b2, B, T, d_model, d_ff);
    residual(residual2, norm1, ff_out, B, T, d_model);
    layernorm(out, residual2, gamma2, beta2, eps, B, T, d_model);

    delete[] attn_out;
    delete[] residual1;
    delete[] norm1;
    delete[] ff_out;
    delete[] residual2;
}
Notice that attention_forward is called with x, x, x for the x, k_input, and v_input parameters: encoder self-attention derives queries, keys, and values from the same input. The final argument (causal = false) disables the causal mask, so every token can attend to every other token in both directions.

Decoder block

The decoder block has three sub-layers instead of two. The extra sub-layer is cross-attention, which connects the decoder to the encoder’s output. The decoder also uses a causal mask on its self-attention sub-layer to prevent each token from attending to future positions — required for autoregressive generation where each output token can only depend on previously generated tokens. Order of operations:
  1. Masked self-attention on x (causal=true)
  2. Residual + LayerNorm → norm1
  3. Cross-attention: queries from norm1, keys/values from enc_out (causal=false)
  4. Residual + LayerNorm → norm2
  5. Feed-forward on norm2
  6. Residual + LayerNorm → output
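The effect of the causal mask in step 1 can be illustrated on a single T × T score matrix. How attention_forward applies the mask internally is not shown on this page, so the following is a sketch of one common approach (setting future positions to negative infinity before the softmax), with hypothetical function names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// scores is a T x T row-major matrix of attention logits for one head.
// Row i is the query position, column j the key position.
void apply_causal_mask(std::vector<float>& scores, int T) {
    assert(scores.size() == static_cast<std::size_t>(T) * T);
    const float neg_inf = -std::numeric_limits<float>::infinity();
    for (int i = 0; i < T; ++i) {
        for (int j = i + 1; j < T; ++j) {
            // Position i may not look at any later position j > i.
            scores[static_cast<std::size_t>(i) * T + j] = neg_inf;
        }
    }
}

// Row-wise softmax with max subtraction; exp(-inf) evaluates to 0,
// so masked positions receive exactly zero weight.
void softmax_rows(std::vector<float>& scores, int T) {
    assert(scores.size() == static_cast<std::size_t>(T) * T);
    for (int i = 0; i < T; ++i) {
        float* row = scores.data() + static_cast<std::size_t>(i) * T;
        const float row_max = *std::max_element(row, row + T);
        float sum = 0.0f;
        for (int j = 0; j < T; ++j) {
            row[j] = std::exp(row[j] - row_max);
            sum += row[j];
        }
        for (int j = 0; j < T; ++j) row[j] /= sum;
    }
}
```

For T = 3 and all-zero scores, the rows after masking and softmax are (1, 0, 0), (0.5, 0.5, 0), and roughly (0.333, 0.333, 0.333): each position spreads its attention over itself and earlier positions only.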

Function signature

void decoder_block(float* out, float* x, float* enc_out,
                   float* Wq1, float* Wk1, float* Wv1, float* Wo1,  // masked self-attn weights
                   float* Wq2, float* Wk2, float* Wv2, float* Wo2,  // cross-attn weights
                   float* W1, float* b1, float* W2, float* b2,
                   float* gamma1, float* beta1,
                   float* gamma2, float* beta2,
                   float* gamma3, float* beta3,
                   float eps, int B, int T, int num_heads, int d_model, int d_ff);
The decoder block takes two separate sets of attention weights, each with query, key, value, and output projections: Wq1/Wk1/Wv1/Wo1 for the masked self-attention sub-layer and Wq2/Wk2/Wv2/Wo2 for the cross-attention sub-layer. There are also three layernorm parameter pairs (gamma1/beta1, gamma2/beta2, gamma3/beta3), one per sub-layer.

Implementation

void decoder_block(float* out, float* x, float* enc_out,
                   float* Wq1, float* Wk1, float* Wv1, float* Wo1,
                   float* Wq2, float* Wk2, float* Wv2, float* Wo2,
                   float* W1, float* b1, float* W2, float* b2,
                   float* gamma1, float* beta1,
                   float* gamma2, float* beta2,
                   float* gamma3, float* beta3,
                   float eps, int B, int T, int num_heads, int d_model, int d_ff){

    float* mask_attn_out  = new float[B * T * d_model];
    float* residual1      = new float[B * T * d_model];
    float* norm1          = new float[B * T * d_model];
    float* cross_attn_out = new float[B * T * d_model];
    float* residual2      = new float[B * T * d_model];
    float* norm2          = new float[B * T * d_model];
    float* ffn_out        = new float[B * T * d_model];
    float* residual3      = new float[B * T * d_model];

    // Sub-layer 1: masked self-attention + residual + layernorm
    attention_forward(mask_attn_out, x, x, x,
                      Wq1, Wk1, Wv1, Wo1, B, T, num_heads, d_model, true);
    residual(residual1, x, mask_attn_out, B, T, d_model);
    layernorm(norm1, residual1, gamma1, beta1, eps, B, T, d_model);

    // Sub-layer 2: cross-attention + residual + layernorm
    // norm1 provides queries; enc_out provides keys and values
    attention_forward(cross_attn_out, norm1, enc_out, enc_out,
                      Wq2, Wk2, Wv2, Wo2, B, T, num_heads, d_model, false);
    residual(residual2, norm1, cross_attn_out, B, T, d_model);
    layernorm(norm2, residual2, gamma2, beta2, eps, B, T, d_model);

    // Sub-layer 3: FFN + residual + layernorm
    feedforward_forward(ffn_out, norm2, W1, b1, W2, b2, B, T, d_model, d_ff);
    residual(residual3, norm2, ffn_out, B, T, d_model);
    layernorm(out, residual3, gamma3, beta3, eps, B, T, d_model);

    delete[] mask_attn_out;
    delete[] residual1;
    delete[] norm1;
    delete[] cross_attn_out;
    delete[] residual2;
    delete[] norm2;
    delete[] ffn_out;
    delete[] residual3;
}
The cross-attention call attention_forward(cross_attn_out, norm1, enc_out, enc_out, ...) passes enc_out for both k_input and v_input. The decoder's normalized hidden state norm1 provides only the queries. This is the mechanism by which the decoder "reads" the encoder's representation in every decoder block. Note that the call passes the same T for both inputs, so as written it assumes enc_out has the same sequence length as x.
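The data flow of cross-attention can be isolated in a stripped-down sketch: a single head with no learned projections, where queries come from one sequence and keys and values from another. This is not how attention_forward is implemented. The function name and the separate lengths T_q and T_kv are hypothetical, chosen to show that the output always has one row per query.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// q:  [T_q  x d] decoder-side vectors (the role played by norm1).
// kv: [T_kv x d] encoder-side vectors (the role played by enc_out),
//     used as both keys and values. No Wq/Wk/Wv/Wo in this sketch.
std::vector<float> cross_attention_sketch(const std::vector<float>& q,
                                          const std::vector<float>& kv,
                                          int T_q, int T_kv, int d) {
    assert(T_q > 0 && T_kv > 0 && d > 0);
    assert(q.size() == static_cast<std::size_t>(T_q) * d);
    assert(kv.size() == static_cast<std::size_t>(T_kv) * d);

    std::vector<float> out(q.size(), 0.0f);
    std::vector<float> w(static_cast<std::size_t>(T_kv));
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));

    for (int i = 0; i < T_q; ++i) {
        // Scaled dot-product scores of query i against every encoder position.
        float max_score = -std::numeric_limits<float>::infinity();
        for (int j = 0; j < T_kv; ++j) {
            float dot = 0.0f;
            for (int c = 0; c < d; ++c) dot += q[i * d + c] * kv[j * d + c];
            w[j] = dot * scale;
            max_score = std::max(max_score, w[j]);
        }
        // Softmax over encoder positions (no causal mask).
        float sum = 0.0f;
        for (int j = 0; j < T_kv; ++j) {
            w[j] = std::exp(w[j] - max_score);
            sum += w[j];
        }
        // Output row i is a weighted average of encoder vectors.
        for (int j = 0; j < T_kv; ++j) {
            for (int c = 0; c < d; ++c) {
                out[i * d + c] += (w[j] / sum) * kv[j * d + c];
            }
        }
    }
    return out;
}
```

Each output row is a mixture of encoder vectors selected by a decoder query, which is the sense in which the decoder reads the encoder's representation.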
Each block allocates all its intermediate buffers at entry and frees them all before returning. The only data written outside the function is the final out buffer provided by the caller. Buffers from one block are therefore completely freed before the next block in the stack starts, so peak scratch memory at any one moment is one block's working set (plus whatever temporaries its callees allocate, such as the intermediate buffer in feedforward_forward), not the sum over the entire stack.
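A rough size for that working set follows from the allocations shown above. The helpers below are a hypothetical back-of-the-envelope sketch: they count the buffers each block allocates plus the intermediate buffer inside feedforward_forward, and leave out any scratch memory inside attention_forward and layernorm, whose implementations are not shown on this page.

```cpp
#include <cassert>
#include <cstddef>

// Number of floats in one activation buffer of shape [B, T, width].
std::size_t activation_floats(int B, int T, int width) {
    assert(B > 0 && T > 0 && width > 0);
    return static_cast<std::size_t>(B) * T * width;
}

// encoder_block: 5 buffers of B*T*d_model floats, plus the B*T*d_ff
// buffer that feedforward_forward holds while it runs.
std::size_t encoder_block_scratch_bytes(int B, int T, int d_model, int d_ff) {
    return (5 * activation_floats(B, T, d_model) +
            activation_floats(B, T, d_ff)) * sizeof(float);
}

// decoder_block: 8 buffers of B*T*d_model floats, plus the same
// FFN intermediate buffer.
std::size_t decoder_block_scratch_bytes(int B, int T, int d_model, int d_ff) {
    return (8 * activation_floats(B, T, d_model) +
            activation_floats(B, T, d_ff)) * sizeof(float);
}
```

For an illustrative configuration of B = 1, T = 128, d_model = 512, d_ff = 2048 with 4-byte floats, this gives 2.25 MiB for an encoder block and 3 MiB for a decoder block, independent of how many blocks are stacked.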
