The encoder and decoder blocks are the repeating units of the transformer. They are not new primitives; they are specific compositions of components already covered: attention, layer normalization, feed-forward networks, and residual connections. Understanding a block means understanding the order in which those components are called, which buffers are passed between them, and what the data looks like at each step. This page walks through the feed-forward network, the residual connection, and then the full encoder and decoder block implementations as they appear in model.cpp.
Feed-forward network
Each attention sub-layer is followed by a position-wise feed-forward network: two linear projections with a ReLU nonlinearity in between. The hidden dimension d_ff is set to 4 × d_model, expanding the representation into a wider space before projecting back down.
Function signature
Implementation
The implementation computes FFN(x) = max(0, xW₁ + b₁)W₂ + b₂ in three steps:
- matmul(x, W1, b1, intermediate, ...): project from d_model to d_ff, add bias b1.
- In-place ReLU loop: every negative value becomes zero.
- matmul(intermediate, W2, b2, out, ...): project back from d_ff to d_model, add bias b2.
The intermediate buffer is allocated with new, used for the expanded representation, then freed before returning.
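The three steps can be sketched as follows. This is a minimal CPU illustration, not the code from model.cpp: matmul_bias_sketch is a hypothetical stand-in for the codebase's matmul, and its trailing size parameters and the row-major weight layout are assumptions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the codebase's matmul: out = x * W + b.
// Assumed row-major layout: x is [rows x in_dim], W is [in_dim x out_dim].
void matmul_bias_sketch(const float* x, const float* W, const float* b, float* out,
                        std::size_t rows, std::size_t in_dim, std::size_t out_dim) {
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < out_dim; ++c) {
            float acc = b[c];  // start from the bias term
            for (std::size_t k = 0; k < in_dim; ++k) {
                acc += x[r * in_dim + k] * W[k * out_dim + c];
            }
            out[r * out_dim + c] = acc;
        }
    }
}

// Sketch of FFN(x) = max(0, x W1 + b1) W2 + b2.
void feed_forward_sketch(const float* x, const float* W1, const float* b1,
                         const float* W2, const float* b2, float* out,
                         std::size_t seq_len, std::size_t d_model, std::size_t d_ff) {
    assert(seq_len > 0 && d_model > 0 && d_ff > 0);

    // Scratch buffer for the expanded [seq_len x d_ff] representation.
    float* intermediate = new float[seq_len * d_ff];

    // Step 1: project from d_model to d_ff and add b1.
    matmul_bias_sketch(x, W1, b1, intermediate, seq_len, d_model, d_ff);

    // Step 2: in-place ReLU, every negative value becomes zero.
    for (std::size_t i = 0; i < seq_len * d_ff; ++i) {
        if (intermediate[i] < 0.0f) intermediate[i] = 0.0f;
    }

    // Step 3: project back from d_ff to d_model and add b2.
    matmul_bias_sketch(intermediate, W2, b2, out, seq_len, d_ff, d_model);

    delete[] intermediate;  // freed before returning
}
```

Because the function is position-wise, each of the seq_len rows is transformed independently with the same weights; only the attention sub-layer mixes information across positions.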
Residual connection
The residual connection is the simplest function in the codebase: a flat element-wise add. x is the block’s input before the sub-layer, and sublayer_out is the result of attention or FFN. Their sum is the residual-connected output. This skip connection keeps the original input signal available to every layer, which helps deep transformer stacks remain numerically stable.
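A flat element-wise add of this kind might look like the sketch below. The function name and the single element-count parameter n (assumed to be seq_len × d_model) are illustrative assumptions, not the signature from model.cpp.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: out[i] = x[i] + sublayer_out[i] over a flat buffer.
// n is the total number of elements, assumed to be seq_len * d_model.
void residual_add_sketch(const float* x, const float* sublayer_out,
                         float* out, std::size_t n) {
    assert(x != nullptr && sublayer_out != nullptr && out != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = x[i] + sublayer_out[i];
    }
}
```

Treating the [seq_len × d_model] matrix as one flat array works because the add has no notion of rows or columns.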
Encoder block
The encoder block applies two sub-layers in sequence, each wrapped in a residual connection and layer normalization. The variant used here is Post-LN (normalize after residual add), matching the original paper. Order of operations:
- Self-attention on x (no causal mask)
- Residual: residual1 = x + attn_out
- LayerNorm: norm1 = layernorm(residual1)
- Feed-forward on norm1
- Residual: residual2 = norm1 + ff_out
- LayerNorm → output
Function signature
Implementation
attention_forward is called with x, x, x for the x, k_input, and v_input parameters — encoder self-attention reads queries, keys, and values all from the same input. The causal=false flag means all tokens can attend to all other tokens in both directions.
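The Post-LN wiring of the encoder block can be sketched independently of the sub-layer internals. In the illustration below, the attention and feed-forward sub-layers are passed in as callables, and std::vector replaces the raw buffers; the type aliases, function names, and parameter order are all assumptions made for the sketch, not the interface in model.cpp.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Buffer = std::vector<float>;
// Hypothetical sub-layer interface: reads `in`, writes an equally sized `out`.
using SubLayer = std::function<void(const Buffer& in, Buffer& out)>;

// Row-wise layer normalization with learned scale (gamma) and shift (beta).
void layernorm_sketch(const Buffer& in, Buffer& out, const Buffer& gamma,
                      const Buffer& beta, std::size_t d_model, float eps = 1e-5f) {
    const std::size_t rows = in.size() / d_model;
    for (std::size_t r = 0; r < rows; ++r) {
        const float* row = in.data() + r * d_model;
        float mean = 0.0f;
        for (std::size_t c = 0; c < d_model; ++c) mean += row[c];
        mean /= static_cast<float>(d_model);
        float var = 0.0f;
        for (std::size_t c = 0; c < d_model; ++c) var += (row[c] - mean) * (row[c] - mean);
        var /= static_cast<float>(d_model);
        const float inv_std = 1.0f / std::sqrt(var + eps);
        for (std::size_t c = 0; c < d_model; ++c) {
            out[r * d_model + c] = gamma[c] * (row[c] - mean) * inv_std + beta[c];
        }
    }
}

// Post-LN encoder block: each sub-layer is followed by residual add, then LayerNorm.
void encoder_block_sketch(const Buffer& x, Buffer& out,
                          const SubLayer& self_attention, const SubLayer& feed_forward,
                          const Buffer& gamma1, const Buffer& beta1,
                          const Buffer& gamma2, const Buffer& beta2,
                          std::size_t d_model) {
    assert(d_model > 0 && x.size() % d_model == 0);
    const std::size_t n = x.size();
    // All intermediates live only for the duration of this call.
    Buffer attn_out(n), residual1(n), norm1(n), ff_out(n), residual2(n);

    self_attention(x, attn_out);                                  // 1. self-attention on x
    for (std::size_t i = 0; i < n; ++i) residual1[i] = x[i] + attn_out[i];      // 2. residual
    layernorm_sketch(residual1, norm1, gamma1, beta1, d_model);   // 3. LayerNorm
    feed_forward(norm1, ff_out);                                  // 4. feed-forward on norm1
    for (std::size_t i = 0; i < n; ++i) residual2[i] = norm1[i] + ff_out[i];    // 5. residual
    out.resize(n);
    layernorm_sketch(residual2, out, gamma2, beta2, d_model);     // 6. LayerNorm -> output
}
```

Note that the second residual adds ff_out to norm1, not to the original x: in Post-LN, the normalized output of the first sub-layer is the input that the second sub-layer's skip connection carries forward.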
Decoder block
The decoder block has three sub-layers instead of two. The extra sub-layer is cross-attention, which connects the decoder to the encoder’s output. The decoder also uses a causal mask on its self-attention sub-layer to prevent each token from attending to future positions; this is required for autoregressive generation, where each output token can only depend on previously generated tokens. Order of operations:
- Masked self-attention on x (causal=true)
- Residual + LayerNorm → norm1
- Cross-attention: queries from norm1, keys/values from enc_out (causal=false)
- Residual + LayerNorm → norm2
- Feed-forward on norm2
- Residual + LayerNorm → output
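To make the causal mask in the first step concrete, the sketch below shows one common way such a mask is applied to a score matrix before softmax. It is an illustration under assumptions: the function name and the row-major [seq_len × seq_len] layout are hypothetical, and the codebase's attention_forward may implement masking differently.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical sketch: mask a row-major [seq_len x seq_len] score matrix in place.
// Entry (i, j) is the score of query position i against key position j.
// Every j > i (a future position) is set to -infinity, so a subsequent
// softmax assigns it zero weight.
void apply_causal_mask_sketch(std::vector<float>& scores, std::size_t seq_len) {
    assert(scores.size() == seq_len * seq_len);
    const float neg_inf = -std::numeric_limits<float>::infinity();
    for (std::size_t i = 0; i < seq_len; ++i) {
        for (std::size_t j = i + 1; j < seq_len; ++j) {
            scores[i * seq_len + j] = neg_inf;
        }
    }
}
```

The diagonal is left untouched, so each position can still attend to itself and to everything before it.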
Function signature
The decoder block takes two sets of attention weights: Wq1/Wk1/Wv1/Wo1 for the masked self-attention sub-layer and Wq2/Wk2/Wv2/Wo2 for the cross-attention sub-layer. There are also three layernorm parameter pairs (gamma1/beta1, gamma2/beta2, gamma3/beta3), one per sub-layer.
Implementation
attention_forward(cross_attn_out, norm1, enc_out, enc_out, ...) passes enc_out for both k_input and v_input. The decoder’s normalized hidden state norm1 provides only the query — this is the mechanism by which the decoder “reads” the encoder’s representation at every layer.
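The decoder wiring, including where enc_out enters, can be sketched as follows. As with any simplified illustration, the names here are assumptions: the Attention and FeedForward callables, the NormParams struct, and the add_and_norm helper are hypothetical and stand in for the raw-buffer functions in model.cpp.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Buffer = std::vector<float>;
// Hypothetical attention interface: (query input, key input, value input) -> out.
using Attention = std::function<void(const Buffer& q_in, const Buffer& k_in,
                                     const Buffer& v_in, Buffer& out)>;
using FeedForward = std::function<void(const Buffer& in, Buffer& out)>;

struct NormParams {  // one gamma/beta pair per LayerNorm
    Buffer gamma;
    Buffer beta;
};

// Residual add followed by row-wise LayerNorm: layernorm(x + sublayer_out).
Buffer add_and_norm(const Buffer& x, const Buffer& sublayer_out,
                    const NormParams& p, std::size_t d_model, float eps = 1e-5f) {
    Buffer out(x.size());
    for (std::size_t r = 0; r < x.size() / d_model; ++r) {
        const std::size_t base = r * d_model;
        float mean = 0.0f;
        for (std::size_t c = 0; c < d_model; ++c) {
            out[base + c] = x[base + c] + sublayer_out[base + c];  // residual add
            mean += out[base + c];
        }
        mean /= static_cast<float>(d_model);
        float var = 0.0f;
        for (std::size_t c = 0; c < d_model; ++c) {
            var += (out[base + c] - mean) * (out[base + c] - mean);
        }
        var /= static_cast<float>(d_model);
        const float inv_std = 1.0f / std::sqrt(var + eps);
        for (std::size_t c = 0; c < d_model; ++c) {
            out[base + c] = p.gamma[c] * (out[base + c] - mean) * inv_std + p.beta[c];
        }
    }
    return out;
}

Buffer decoder_block_sketch(const Buffer& x, const Buffer& enc_out,
                            const Attention& masked_self_attn, const Attention& cross_attn,
                            const FeedForward& ffn, const NormParams& ln1,
                            const NormParams& ln2, const NormParams& ln3,
                            std::size_t d_model) {
    assert(d_model > 0 && x.size() % d_model == 0);
    Buffer self_out(x.size()), cross_out(x.size()), ff_out(x.size());

    masked_self_attn(x, x, x, self_out);             // Q, K, V all from decoder input
    Buffer norm1 = add_and_norm(x, self_out, ln1, d_model);

    cross_attn(norm1, enc_out, enc_out, cross_out);  // Q from decoder, K and V from encoder
    Buffer norm2 = add_and_norm(norm1, cross_out, ln2, d_model);

    ffn(norm2, ff_out);
    return add_and_norm(norm2, ff_out, ln3, d_model);
}
```

The encoder sequence may have a different length than the decoder sequence. Only the keys and values take the encoder's length, while the cross-attention output keeps one row per decoder position because there is one query per decoder position.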
Each block allocates all its intermediate buffers at entry and frees them all before returning. The only data written outside the function is the final out buffer provided by the caller. This means buffers from one block are completely freed before the next block in the N-block loop starts, so the peak memory used for intermediates at any one moment is the size of one block’s working set, not the entire stack.
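One way a caller could drive such blocks is sketched below. It assumes the caller owns two activation buffers and swaps them between iterations; the Block alias and run_stack_sketch are hypothetical names, and the actual N-block loop in the codebase may manage its buffers differently.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

using Buffer = std::vector<float>;
// Hypothetical block interface: reads `in`, writes `out`, and releases any
// scratch memory it allocated before returning.
using Block = std::function<void(const Buffer& in, Buffer& out)>;

// Runs the blocks in order. Only two activation buffers exist across the
// whole loop; each block's own scratch lives only for the duration of its call.
Buffer run_stack_sketch(Buffer x, const std::vector<Block>& blocks) {
    assert(!x.empty());
    Buffer next(x.size());
    for (const Block& block : blocks) {
        block(x, next);     // block writes its result into the caller's buffer
        std::swap(x, next); // this block's output becomes the next block's input
    }
    return x;
}
```

With this arrangement, the memory held between blocks is two activation buffers regardless of N, and the per-block scratch is added only while a block is running.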