attention_forward is the core computational kernel of the transformer. It implements full multi-head attention in pure C++, with no tensor library, no reshape primitives, and no broadcasting. One function covers three use cases, selected by the causal flag and by which buffers are passed as k_input and v_input: encoder self-attention (no mask, same input for Q/K/V), decoder masked self-attention (causal mask applied, same input for Q/K/V), and decoder cross-attention (queries come from the decoder, keys and values come from the encoder output). All three modes go through the same code path.

Function signature

void attention_forward(float* out, float* x, float* k_input, float* v_input,
                       float* Wq, float* Wk, float* Wv, float* Wo,
                       int B, int T, int num_heads, int d_model, bool causal);
Parameter | Description
--- | ---
out | Output buffer, shape B × T × d_model
x | Query source: the current layer’s input
k_input | Key source: equals x for self-attention, encoder output for cross-attention
v_input | Value source: equals x for self-attention, encoder output for cross-attention
Wq, Wk, Wv | Query, key, and value projection weight matrices, each d_model × d_model
Wo | Output projection weight matrix, d_model × d_model
B, T | Batch size and sequence length
num_heads | Number of attention heads
d_model | Model embedding dimension
causal | If true, apply causal masking (future positions receive weight 0)

The algorithm

1. Allocate working buffers

Six buffers are heap-allocated at the start, and none of them escape the function. The trailing () on each new expression value-initializes every element to zero.
float* Q          = new float[B * T * d_model]();
float* K          = new float[B * T * d_model]();
float* V          = new float[B * T * d_model]();
float* scores     = new float[B * num_heads * T * T]();
float* attn_weights = new float[B * num_heads * T * T]();
float* attn_out   = new float[B * T * d_model]();
2. Project inputs into Q, K, V

Three matrix multiplications produce the query, key, and value representations. Crucially, k_input and v_input may differ from x — this is what enables cross-attention.
matmul(x,       Wq, nullptr, Q, B*T, d_model, d_model);
matmul(k_input, Wk, nullptr, K, B*T, d_model, d_model);
matmul(v_input, Wv, nullptr, V, B*T, d_model, d_model);
Each call treats the B×T token sequence as a 2D matrix of shape (B*T) × d_model and multiplies it by the corresponding weight matrix. No copy is involved, since the input is already stored as B*T consecutive rows of d_model floats.
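The matmul helper itself is not shown on this page. As a minimal sketch, assuming its arguments are (input, weight, bias, output, rows, in_dim, out_dim), that the weight matrix is stored row-major as in_dim × out_dim, and that a nullptr bias means no bias is added, a reference version might look like the following. The name matmul_ref and the layout are assumptions for illustration, not the project's actual implementation.

```cpp
#include <cstddef>

// Hypothetical reference matmul (assumed semantics, not the project's code):
// out[r, c] = sum_k in[r, k] * W[k, c], plus bias[c] when bias is non-null.
// Assumed layout: in is rows x in_dim, W is in_dim x out_dim, both row-major.
void matmul_ref(const float* in, const float* W, const float* bias,
                float* out, int rows, int in_dim, int out_dim) {
    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < out_dim; c++) {
            float acc = (bias != nullptr) ? bias[c] : 0.0f;
            for (int k = 0; k < in_dim; k++) {
                acc += in[(std::size_t)r * in_dim + k]
                     * W[(std::size_t)k * out_dim + c];
            }
            out[(std::size_t)r * out_dim + c] = acc;
        }
    }
}
```

With rows = B*T and in_dim = out_dim = d_model, this matches the shape of the three projection calls above.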
3. Split into heads via flat indexing

There is no actual reshape or memory copy to split heads. The head dimension is implicit in the flat index formula:
Q[b, t, h, i]  →  Q[b*T*num_heads*d_k + t*num_heads*d_k + h*d_k + i]
where d_k = d_model / num_heads. The Q, K, and V buffers are laid out in memory in the order [batch, token, head, head_dim], so iterating over (b, t, h, i) with this formula naturally isolates each head’s slice without any data movement.
int d_k = d_model / num_heads;
4. Compute scaled dot-product scores

For every (batch, query-token, head, key-token) combination, compute the dot product between Q and K vectors of dimension d_k, then scale by 1/sqrt(d_k):
for(int b = 0; b<B; b++){
    for(int t = 0; t<T; t++){
        for(int h = 0; h<num_heads; h++){
            for(int t2 = 0; t2<T; t2++){
                float val = 0;
                for(int i = 0; i<d_k; i++){
                    val += Q[b*T*num_heads*d_k + t*num_heads*d_k + h*d_k + i]
                         * K[b*T*num_heads*d_k + t2*num_heads*d_k + h*d_k + i];
                }
                scores[b*num_heads*T*T + h*T*T + t*T + t2]  = val;
                scores[b*num_heads*T*T + h*T*T + t*T + t2] /= sqrt(d_k);
5. Apply causal mask

Immediately after scaling, if causal is true, any key position t2 that comes after the query position t has its score overwritten with -1e9f, a large negative value that stands in for negative infinity. After softmax, the corresponding weight is effectively zero, so future tokens receive no attention.
                if(causal){
                    if(t2 > t) scores[b*num_heads*T*T + h*T*T + t*T + t2] = -1e9f;
                }
            } // end of t2 loop
6. Softmax over key positions

A numerically stable three-pass softmax is applied independently to each row of the score matrix (one row = all key positions for a single query token and head):
            // step 1: find max
            float* score_slice = scores + b*num_heads*T*T + h*T*T + t*T;
            float* attn_slice  = attn_weights + b*num_heads*T*T + h*T*T + t*T;
            float max_val = score_slice[0];
            for(int i = 1; i < T; i++){
                if(score_slice[i] > max_val) max_val = score_slice[i];
            }
            // step 2: subtract max, exp, accumulate sum
            float sum = 0;
            for(int i = 0; i < T; i++){
                sum += exp(score_slice[i] - max_val);
                attn_slice[i] = exp(score_slice[i] - max_val);
            }
            // step 3: normalize
            for(int i = 0; i < T; i++){
                attn_slice[i] = attn_slice[i] / sum;
            }
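The same three passes can be isolated into a standalone helper, which makes the stability property easy to check on its own. This is a sketch rather than the project's code: softmax_row is a hypothetical name, and unlike the excerpt above it evaluates each exponential only once.

```cpp
#include <cmath>

// Hypothetical helper: numerically stable softmax over one row of n scores.
// Subtracting the row maximum keeps every exponent <= 0, so exp cannot overflow.
void softmax_row(const float* scores, float* weights, int n) {
    // pass 1: find the row maximum
    float max_val = scores[0];
    for (int i = 1; i < n; i++) {
        if (scores[i] > max_val) max_val = scores[i];
    }
    // pass 2: exponentiate the shifted scores and accumulate their sum
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        weights[i] = std::exp(scores[i] - max_val);
        sum += weights[i];
    }
    // pass 3: normalize so the row sums to 1
    for (int i = 0; i < n; i++) {
        weights[i] /= sum;
    }
}
```

Without the max subtraction, scores around 1000 would overflow std::exp in single precision; with it, two equal scores of any magnitude still produce weights of 0.5 each.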
7. Weighted sum of V

The attention weights are used to compute a weighted sum of value vectors. The result is written into attn_out using the same head-splitting index formula:
            for(int i = 0; i<d_k; i++){
                float val = 0;
                for(int t2 = 0; t2<T; t2++){
                    val += attn_weights[b*num_heads*T*T + h*T*T + t*T + t2]
                         * V[b*T*num_heads*d_k + t2*num_heads*d_k + h*d_k + i];
                }
                attn_out[b*T*num_heads*d_k + t*num_heads*d_k + h*d_k + i] = val;
            }
        } // end of h loop
    } // end of t loop
} // end of b loop
8. Output projection and cleanup

All heads are already concatenated in attn_out by virtue of the flat layout. A final matrix multiply by Wo mixes the head outputs:
matmul(attn_out, Wo, nullptr, out, B*T, d_model, d_model);
Then every buffer allocated at the start is freed:
delete[] Q;
delete[] K;
delete[] V;
delete[] scores;
delete[] attn_weights;
delete[] attn_out;

Head splitting via flat indexing

The key insight is that multi-head attention does not require physically splitting Q, K, and V into separate arrays. Each projected buffer is a row-major (B*T) × d_model matrix, and because d_model = num_heads × d_k, each token's row of d_model floats can be read as num_heads consecutive chunks of d_k floats. The logical layout [batch, token, head, head_dim] is therefore just another way of indexing the same memory. The index formula
Q[b*T*num_heads*d_k + t*num_heads*d_k + h*d_k + i]
reads element i of head h’s slice for batch b and token t directly from the single Q buffer. No copy, no reshape, no extra allocation.
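To keep the index arithmetic in one place, the formula can be wrapped in a small helper. This is a sketch: head_index is a hypothetical name that does not appear in the source, and it assumes all indices are non-negative and in range.

```cpp
#include <cstddef>

// Hypothetical helper: flat offset of element [b, t, h, i] in a buffer laid
// out as [batch, token, head, head_dim]. Algebraically identical to
// b*T*num_heads*d_k + t*num_heads*d_k + h*d_k + i.
inline std::size_t head_index(int b, int t, int h, int i,
                              int T, int num_heads, int d_k) {
    return (((std::size_t)b * T + t) * num_heads + h) * d_k + i;
}
```

For example, with T = 4, num_heads = 2, and d_k = 3 (so d_model = 6), head_index(1, 2, 1, 0, 4, 2, 3) evaluates to 39: token row 6 starts at offset 36, and head 1 starts 3 floats into that row. The d_k floats of one head for one token are contiguous, which is why the dot-product loop over i walks memory sequentially.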

Causal masking

The mask is applied score-by-score inside the key-position (t2) loop, immediately after each score is scaled:
if(causal && t2 > t) scores[b*num_heads*T*T + h*T*T + t*T + t2] = -1e9f;
Setting the score to -1e9 before softmax is equivalent in practice to setting the attention weight to zero: exp(-1e9 - max) underflows to 0 in single precision for any realistic max. This ensures that a token at position t can only attend to positions 0 … t, which is required for autoregressive generation in the decoder’s self-attention sub-layer.
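A small worked example makes the effect concrete. The sketch below uses a hypothetical helper, causal_softmax_row, which applies the same -1e9f mask to one row of scores and then the same stable softmax. It illustrates the behavior and is not code from the source.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper: attention weights for query position t over key
// positions 0..T-1, using the same -1e9f mask value as attention_forward.
std::vector<float> causal_softmax_row(std::vector<float> row, int t) {
    const int T = static_cast<int>(row.size());
    // mask every key position that lies after the query position
    for (int t2 = t + 1; t2 < T; t2++) row[t2] = -1e9f;
    // stable softmax: subtract the max, exponentiate, normalize
    float max_val = row[0];
    for (int i = 1; i < T; i++) {
        if (row[i] > max_val) max_val = row[i];
    }
    float sum = 0.0f;
    for (int i = 0; i < T; i++) {
        row[i] = std::exp(row[i] - max_val);
        sum += row[i];
    }
    for (int i = 0; i < T; i++) row[i] /= sum;
    return row;
}
```

With four equal scores and query position t = 1, the two visible positions share the weight equally (0.5 each) and the two future positions receive exactly 0, because exp(-1e9f) underflows to zero in single precision.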

Cross-attention

When attention_forward is used for cross-attention in the decoder, x is the decoder’s current hidden state (the query source), while k_input and v_input are the encoder’s final output:
// Inside decoder_block:
attention_forward(cross_attn_out,
                  norm1,    // queries from decoder
                  enc_out,  // keys from encoder
                  enc_out,  // values from encoder
                  Wq2, Wk2, Wv2, Wo2,
                  B, T, num_heads, d_model,
                  false);   // no causal mask for cross-attention
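The function needs no modification: passing different pointers for k_input and v_input is all that is required. Note that the signature has a single T, so this code path assumes the encoder output has the same sequence length as the decoder input.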
All six intermediate buffers (Q, K, V, scores, attn_weights, attn_out) are heap-allocated with new at the start and freed with delete[] before returning. The function writes its final result only into the caller-provided out buffer. There is no RAII wrapper — allocation and deallocation are explicit and symmetric.
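If the manual new/delete[] pairing is a concern (for example, a leak if an early return is added later), one alternative is to hold the buffers in std::vector. The sketch below is not what the source does; AttentionBuffers is a hypothetical type shown only to illustrate the idea.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical RAII alternative to the six new[]/delete[] pairs.
// std::vector<float>(n) value-initializes every element to 0.0f, matching
// `new float[n]()`, and frees its memory automatically when it goes out of scope.
struct AttentionBuffers {
    std::vector<float> Q, K, V, scores, attn_weights, attn_out;

    AttentionBuffers(int B, int T, int num_heads, int d_model)
        : Q((std::size_t)B * T * d_model),
          K((std::size_t)B * T * d_model),
          V((std::size_t)B * T * d_model),
          scores((std::size_t)B * num_heads * T * T),
          attn_weights((std::size_t)B * num_heads * T * T),
          attn_out((std::size_t)B * T * d_model) {}
};
```

Inside the function, buf.Q.data() would then be passed wherever the original code passes the raw Q pointer, and the delete[] block at the end would no longer be needed.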
d_model must be exactly divisible by num_heads; otherwise d_k = d_model / num_heads truncates silently and the head index math produces incorrect results. Because both values are runtime arguments, the check has to happen at runtime: add assert(d_model % num_heads == 0), or an equivalent check, at the top of the function before any production use.
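Since assert is compiled out when NDEBUG is defined, a check that must survive release builds could throw instead. A minimal sketch, using a hypothetical helper name checked_head_dim that is not part of the source:

```cpp
#include <stdexcept>

// Hypothetical guard: returns d_k = d_model / num_heads, or throws if the
// division would not be exact (or num_heads is not positive).
int checked_head_dim(int d_model, int num_heads) {
    if (num_heads <= 0 || d_model % num_heads != 0) {
        throw std::invalid_argument("d_model must be divisible by num_heads");
    }
    return d_model / num_heads;
}
```

Replacing the plain division with int d_k = checked_head_dim(d_model, num_heads); would reject a configuration such as d_model = 10, num_heads = 3 instead of silently computing d_k = 3 and leaving one float per token unused.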
