attention_forward is the core computational kernel of the transformer. It implements full multi-head attention in pure C++ — no tensor library, no reshape primitives, no broadcasting. The function handles three distinct use cases through two boolean-like parameters: encoder self-attention (no mask, same input for Q/K/V), decoder masked self-attention (causal mask applied, same input for Q/K/V), and decoder cross-attention (queries come from the decoder, keys and values come from the encoder output). All three modes go through the same code path.
Function signature
| Parameter | Description |
|---|---|
| out | Output buffer, shape B × T × d_model |
| x | Query source: the current layer’s input |
| k_input | Key source: equals x for self-attention, encoder output for cross-attention |
| v_input | Value source: equals x for self-attention, encoder output for cross-attention |
| Wq, Wk, Wv | Query, key, value projection weight matrices, each d_model × d_model |
| Wo | Output projection weight matrix, d_model × d_model |
| B, T | Batch size and sequence length |
| num_heads | Number of attention heads |
| d_model | Model embedding dimension |
| causal | If true, apply causal masking (future positions receive weight 0) |
The algorithm
Allocate working buffers
Six buffers are heap-allocated at the start. None of them escape the function.
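A minimal sketch of the allocation step, assuming float elements and the sizes implied by the shapes on this page: B × T × d_model for Q, K, V, and attn_out, and one T × T matrix per batch element and head for scores and attn_weights. The struct and function names are hypothetical and are not taken from the source.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: element counts for the six working buffers.
struct AttentionBufferSizes {
    std::size_t per_token;  // Q, K, V, attn_out: B * T * d_model each
    std::size_t per_pair;   // scores, attn_weights: B * num_heads * T * T each
};

inline AttentionBufferSizes attention_buffer_sizes(int B, int T, int num_heads, int d_model) {
    assert(B > 0 && T > 0 && num_heads > 0 && d_model > 0);
    AttentionBufferSizes s;
    s.per_token = static_cast<std::size_t>(B) * T * d_model;
    s.per_pair = static_cast<std::size_t>(B) * num_heads * T * T;
    return s;
}

// Allocates the six buffers with new[] and releases each with delete[],
// so nothing escapes the function. Returns the total number of floats allocated.
inline std::size_t allocate_and_release(int B, int T, int num_heads, int d_model) {
    const AttentionBufferSizes s = attention_buffer_sizes(B, T, num_heads, d_model);

    float* Q = new float[s.per_token];
    float* K = new float[s.per_token];
    float* V = new float[s.per_token];
    float* scores = new float[s.per_pair];
    float* attn_weights = new float[s.per_pair];
    float* attn_out = new float[s.per_token];

    // ... the attention computation would run here ...

    delete[] Q;
    delete[] K;
    delete[] V;
    delete[] scores;
    delete[] attn_weights;
    delete[] attn_out;

    return 4 * s.per_token + 2 * s.per_pair;
}
```

Under these assumptions the two score buffers grow quadratically with T, so they dominate memory use for long sequences.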
Project inputs into Q, K, V
Three matrix multiplications produce the query, key, and value representations. Each call treats the B×T token sequence as a 2D matrix of shape (B*T) × d_model and multiplies it by the corresponding weight matrix. Crucially, k_input and v_input may differ from x; this is what enables cross-attention.
Split into heads via flat indexing
There is no actual reshape or memory copy to split heads. The head dimension is implicit in the flat index formula b*T*d_model + t*d_model + h*d_k + i, where d_k = d_model / num_heads. The Q, K, and V buffers are laid out in memory in the order [batch, token, head, head_dim], so iterating over (b, t, h, i) with this formula naturally isolates each head’s slice without any data movement.
Compute scaled dot-product scores
For every (batch, query-token, head, key-token) combination, compute the dot product between the Q and K vectors of dimension d_k, then scale by 1/sqrt(d_k).
Apply causal mask
Immediately after scaling, if causal is true, the score for any key position t2 that comes after the query position t is set to -1e9, a finite stand-in for negative infinity. After softmax, -1e9 becomes effectively zero, so future tokens receive no attention weight.
Softmax over key positions
A numerically stable three-pass softmax is applied independently to each row of the score matrix (one row = all key positions for a single query token and head): one pass for the row maximum, one for the exponentials and their sum, and one for normalization.
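The three passes can be sketched for a single row as follows. The function name and signature are hypothetical; only the max-subtraction technique itself is assumed to match the source.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: softmax over one row of n scores.
// `scores` and `weights` may point to the same buffer.
inline void softmax_row(const float* scores, float* weights, int n) {
    assert(n > 0);

    // Pass 1: find the row maximum so the largest exponent becomes exp(0) = 1.
    float max_val = scores[0];
    for (int j = 1; j < n; ++j) {
        if (scores[j] > max_val) max_val = scores[j];
    }

    // Pass 2: exponentiate the shifted scores and accumulate their sum.
    float sum = 0.0f;
    for (int j = 0; j < n; ++j) {
        weights[j] = std::exp(scores[j] - max_val);
        sum += weights[j];
    }

    // Pass 3: normalize so the row sums to 1.
    for (int j = 0; j < n; ++j) {
        weights[j] /= sum;
    }
}
```

Subtracting the maximum leaves the result mathematically unchanged but keeps every exponent at or below zero, so large scores cannot overflow and a masked score of -1e9 underflows cleanly to zero.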
Weighted sum of V
The attention weights are used to compute a weighted sum of value vectors. The result is written into attn_out using the same head-splitting index formula.
Head splitting via flat indexing
The key insight is that multi-head attention does not require physically splitting Q, K, V into separate arrays. Because C++ arrays are row-major flat memory, the logical layout [batch, token, head, head_dim] is equivalent to interleaving all heads in the innermost two dimensions. The index formula b*T*d_model + t*d_model + h*d_k + i selects head h’s slice for batch b, token t, dimension i directly from the single Q buffer. No copy, no reshape, no extra allocation.
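As a small illustration, the offset implied by the [batch, token, head, head_dim] layout can be wrapped in a helper. The function name is hypothetical, and the formula is derived from the layout described here rather than copied from the source.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: flat offset of element i of head h for token t in batch b,
// for a buffer laid out as [batch, token, head, head_dim] in row-major order.
inline std::size_t head_index(int b, int t, int h, int i,
                              int T, int num_heads, int d_k) {
    assert(b >= 0 && t >= 0 && t < T);
    assert(h >= 0 && h < num_heads && i >= 0 && i < d_k);
    const std::size_t d_model = static_cast<std::size_t>(num_heads) * d_k;
    // Equivalent to b*T*d_model + t*d_model + h*d_k + i.
    return (static_cast<std::size_t>(b) * T + t) * d_model
           + static_cast<std::size_t>(h) * d_k
           + static_cast<std::size_t>(i);
}
```

For example, with T = 5, num_heads = 4, and d_k = 8 (so d_model = 32), consecutive heads of the same token start 8 floats apart and consecutive tokens start 32 floats apart, which is why a head's slice can be read in place.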
Causal masking
The mask is applied score-by-score inside the innermost loop. Setting a score to -1e9 before softmax is equivalent to setting its attention weight to zero: exp(-1e9 - max) ≈ 0. This ensures that the token at position t can only attend to positions 0 … t, which is required for autoregressive generation in the decoder’s self-attention sub-layer.
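The scaled scores and the mask can be sketched together for one batch element. This is an illustration under assumptions: the function name is hypothetical, and the [head, query, key] layout of the scores buffer is a guess, since this page does not specify it.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical sketch for a single batch element.
// Q, K:   [T, num_heads, d_k] flat, i.e. offset t*d_model + h*d_k + i.
// scores: [num_heads, T, T] flat (assumed layout).
inline void masked_scores(const float* Q, const float* K, float* scores,
                          int T, int num_heads, int d_k, bool causal) {
    assert(T > 0 && num_heads > 0 && d_k > 0);
    const int d_model = num_heads * d_k;
    const float scale = 1.0f / std::sqrt(static_cast<float>(d_k));

    for (int h = 0; h < num_heads; ++h) {
        for (int t = 0; t < T; ++t) {        // query position
            for (int t2 = 0; t2 < T; ++t2) { // key position
                float dot = 0.0f;
                for (int i = 0; i < d_k; ++i) {
                    dot += Q[t * d_model + h * d_k + i] * K[t2 * d_model + h * d_k + i];
                }
                float s = dot * scale;
                // Causal mask: a query may not see keys that come after it.
                if (causal && t2 > t) s = -1e9f;
                scores[(static_cast<std::size_t>(h) * T + t) * T + t2] = s;
            }
        }
    }
}
```

Because the mask overwrites the score rather than skipping the entry, the subsequent softmax needs no special case for masked positions.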
Cross-attention
When attention_forward is used for cross-attention in the decoder, x is the decoder’s current hidden state (the query source), while k_input and v_input are the encoder’s final output. Passing different buffers for k_input/v_input is all that is required.
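To show how one code path can serve all three modes, here is a compact end-to-end sketch. It is not the source implementation: the name attention_forward_sketch, the parameter order (taken from the table on this page), the float element type, and the row-major (input × output) weight layout are all assumptions. For brevity it uses std::vector and a single reusable score row instead of the six new[]-allocated buffers.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: out = in * W, with in (rows x d) and W (d x d), both row-major.
inline void project(const float* in, const float* W, float* out, int rows, int d) {
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < d; ++c) {
            float acc = 0.0f;
            for (int k = 0; k < d; ++k) acc += in[r * d + k] * W[k * d + c];
            out[r * d + c] = acc;
        }
    }
}

inline void attention_forward_sketch(float* out, const float* x,
                                     const float* k_input, const float* v_input,
                                     const float* Wq, const float* Wk,
                                     const float* Wv, const float* Wo,
                                     int B, int T, int num_heads, int d_model,
                                     bool causal) {
    assert(B > 0 && T > 0 && num_heads > 0 && d_model % num_heads == 0);
    const int d_k = d_model / num_heads;
    const int rows = B * T;
    const std::size_t n = static_cast<std::size_t>(rows) * d_model;
    const float scale = 1.0f / std::sqrt(static_cast<float>(d_k));

    std::vector<float> Q(n), K(n), V(n), attn_out(n);
    std::vector<float> w(T);  // one score row, reused for every (b, h, t)

    // Queries come from x; keys and values come from k_input and v_input.
    project(x, Wq, Q.data(), rows, d_model);
    project(k_input, Wk, K.data(), rows, d_model);
    project(v_input, Wv, V.data(), rows, d_model);

    for (int b = 0; b < B; ++b) {
        for (int h = 0; h < num_heads; ++h) {
            for (int t = 0; t < T; ++t) {
                const std::size_t q_off = (static_cast<std::size_t>(b) * T + t) * d_model + h * d_k;
                // Scaled dot-product scores with optional causal mask.
                for (int t2 = 0; t2 < T; ++t2) {
                    const std::size_t k_off = (static_cast<std::size_t>(b) * T + t2) * d_model + h * d_k;
                    float dot = 0.0f;
                    for (int i = 0; i < d_k; ++i) dot += Q[q_off + i] * K[k_off + i];
                    w[t2] = (causal && t2 > t) ? -1e9f : dot * scale;
                }
                // Stable softmax over key positions.
                float max_val = w[0];
                for (int t2 = 1; t2 < T; ++t2) if (w[t2] > max_val) max_val = w[t2];
                float sum = 0.0f;
                for (int t2 = 0; t2 < T; ++t2) { w[t2] = std::exp(w[t2] - max_val); sum += w[t2]; }
                // Weighted sum of V, written into this head's slice of attn_out.
                for (int i = 0; i < d_k; ++i) {
                    float acc = 0.0f;
                    for (int t2 = 0; t2 < T; ++t2) {
                        const std::size_t v_off = (static_cast<std::size_t>(b) * T + t2) * d_model + h * d_k;
                        acc += (w[t2] / sum) * V[v_off + i];
                    }
                    attn_out[q_off + i] = acc;
                }
            }
        }
    }
    project(attn_out.data(), Wo, out, rows, d_model);
}

// Call patterns (buffer names are placeholders):
//   encoder self-attention:  attention_forward_sketch(out, x, x, x, ..., /*causal=*/false);
//   decoder self-attention:  attention_forward_sketch(out, x, x, x, ..., /*causal=*/true);
//   decoder cross-attention: attention_forward_sketch(out, dec, enc, enc, ..., /*causal=*/false);
```

Note that a single T is used for both queries and keys, so this sketch assumes the encoder and decoder sequences have the same length.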
All six intermediate buffers (Q, K, V, scores, attn_weights, attn_out) are heap-allocated with new[] at the start and freed with delete[] before returning. The function writes its final result only into the caller-provided out buffer. There is no RAII wrapper; allocation and deallocation are explicit and symmetric.