Multi-Head Attention with Causal Masking in C++

attention_forward is the core computational kernel of the transformer. It implements full multi-head attention in pure C++, with no tensor library, no reshape primitives, and no broadcasting. The function covers three use cases, selected by the causal flag and by which buffers are passed as k_input and v_input: encoder self-attention (no mask, same input for Q/K/V), decoder masked self-attention (causal mask applied, same input for Q/K/V), and decoder cross-attention (queries come from the decoder, keys and values come from the encoder output). All three modes go through the same code path.

Function signature

void attention_forward(float* out, float* x, float* k_input, float* v_input,
                       float* Wq, float* Wk, float* Wv, float* Wo,
                       int B, int T, int num_heads, int d_model, bool causal);

Parameter	Description
`out`	Output buffer, shape `B × T × d_model`
`x`	Query source — the current layer’s input
`k_input`	Key source — equals `x` for self-attention, encoder output for cross-attention
`v_input`	Value source — equals `x` for self-attention, encoder output for cross-attention
`Wq`, `Wk`, `Wv`	Query, key, value projection weight matrices, each `d_model × d_model`
`Wo`	Output projection weight matrix, `d_model × d_model`
`B`, `T`	Batch size and sequence length; the same `T` applies to the query source and to the key/value sources
`num_heads`	Number of attention heads
`d_model`	Model embedding dimension
`causal`	If `true`, apply causal masking (future positions receive weight 0)

The algorithm

Allocate working buffers

Six buffers are heap-allocated at the start. None of them escape the function.

float* Q          = new float[B * T * d_model]();
float* K          = new float[B * T * d_model]();
float* V          = new float[B * T * d_model]();
float* scores     = new float[B * num_heads * T * T]();
float* attn_weights = new float[B * num_heads * T * T]();
float* attn_out   = new float[B * T * d_model]();
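
To get a feel for how much memory these six buffers take, the sizes can be added up in a small helper. This is a sketch only: attention_workspace_floats is a hypothetical name that does not appear in the original source, and it uses std::size_t rather than int so the products cannot overflow for large shapes.

```cpp
#include <cstddef>

// Total number of floats held by the six working buffers.
// Hypothetical helper for illustration; not part of the original function.
std::size_t attention_workspace_floats(std::size_t B, std::size_t T,
                                       std::size_t num_heads,
                                       std::size_t d_model) {
    const std::size_t token_buf = B * T * d_model;        // Q, K, V, attn_out
    const std::size_t score_buf = B * num_heads * T * T;  // scores, attn_weights
    return 4 * token_buf + 2 * score_buf;
}
```

For B = 2, T = 128, num_heads = 8, d_model = 512 this comes to 1,048,576 floats, or 4 MiB at 4 bytes each. The two score buffers grow with T squared, so they dominate as sequences get longer.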

Project inputs into Q, K, V

Three matrix multiplications produce the query, key, and value representations. Crucially, k_input and v_input may differ from x — this is what enables cross-attention.

matmul(x,       Wq, nullptr, Q, B*T, d_model, d_model);
matmul(k_input, Wk, nullptr, K, B*T, d_model, d_model);
matmul(v_input, Wv, nullptr, V, B*T, d_model, d_model);

Each call treats the B × T token sequence as a 2D row-major matrix of shape (B*T) × d_model and multiplies it by the corresponding weight matrix. Nothing is reshaped or copied first; the flat buffer is passed to matmul as-is.

Split into heads via flat indexing

There is no actual reshape or memory copy to split heads. The head dimension is implicit in the flat index formula:

Q[b, t, h, i]  →  Q[b*T*num_heads*d_k + t*num_heads*d_k + h*d_k + i]

where d_k = d_model / num_heads. The Q, K, and V buffers are laid out in memory in the order [batch, token, head, head_dim], so iterating over (b, t, h, i) with this formula naturally isolates each head’s slice without any data movement.

int d_k = d_model / num_heads;
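
The index formula can be wrapped in a small helper to make the layout explicit. This is a sketch for illustration: head_index is a hypothetical name, and the original code writes the formula inline instead.

```cpp
#include <cstddef>

// Flat offset of element (b, t, h, i) in a buffer laid out as
// [batch, token, head, head_dim]. Hypothetical helper, not in the source.
inline std::size_t head_index(int b, int t, int h, int i,
                              int T, int num_heads, int d_k) {
    return static_cast<std::size_t>(b) * T * num_heads * d_k
         + static_cast<std::size_t>(t) * num_heads * d_k
         + static_cast<std::size_t>(h) * d_k
         + static_cast<std::size_t>(i);
}
```

When d_model equals num_heads * d_k, this is the same offset as (b*T + t)*d_model + h*d_k + i: the start of token t's row in the (B*T) × d_model matmul output, plus the start of head h's chunk within that row.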

Compute scaled dot-product scores

For every (batch, query-token, head, key-token) combination, compute the dot product between Q and K vectors of dimension d_k, then scale by 1/sqrt(d_k):

for(int b = 0; b<B; b++){
    for(int t = 0; t<T; t++){
        for(int h = 0; h<num_heads; h++){
            for(int t2 = 0; t2<T; t2++){
                float val = 0;
                for(int i = 0; i<d_k; i++){
                    val += Q[b*T*num_heads*d_k + t*num_heads*d_k + h*d_k + i]
                         * K[b*T*num_heads*d_k + t2*num_heads*d_k + h*d_k + i];
                }
                scores[b*num_heads*T*T + h*T*T + t*T + t2]  = val;
                scores[b*num_heads*T*T + h*T*T + t*T + t2] /= sqrt(d_k);

Apply causal mask

Immediately after scaling, if causal is true, any key position t2 that comes after the query position t has its score overwritten with -1e9, a large finite stand-in for negative infinity. After softmax, that score maps to a weight of effectively zero, so future tokens receive no attention weight.

                if(causal){
                    if(t2 > t) scores[b*num_heads*T*T + h*T*T + t*T + t2] = -1e9f;
                }

Softmax over key positions

A numerically stable three-pass softmax is applied independently to each row of the score matrix (one row = all key positions for a single query token and head):

            // step 1: find max
            float* score_slice = scores + b*num_heads*T*T + h*T*T + t*T;
            float* attn_slice  = attn_weights + b*num_heads*T*T + h*T*T + t*T;
            float max_val = score_slice[0];
            for(int i = 1; i < T; i++){
                if(score_slice[i] > max_val) max_val = score_slice[i];
            }
            // step 2: subtract max, exp, accumulate sum
            float sum = 0;
            for(int i = 0; i < T; i++){
                attn_slice[i] = exp(score_slice[i] - max_val);  // compute exp once
                sum += attn_slice[i];
            }
            // step 3: normalize
            for(int i = 0; i < T; i++){
                attn_slice[i] = attn_slice[i] / sum;
            }
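
The same three passes can be pulled out into a standalone function, which makes the stability property easy to check in isolation. This is a minimal sketch: softmax_row is a hypothetical helper, whereas the original performs these steps inline on score_slice and attn_slice.

```cpp
#include <cmath>

// Numerically stable softmax over one row of n scores.
// Hypothetical standalone helper mirroring the three passes above.
void softmax_row(const float* scores, float* weights, int n) {
    // Pass 1: find the row maximum.
    float max_val = scores[0];
    for (int i = 1; i < n; i++) {
        if (scores[i] > max_val) max_val = scores[i];
    }
    // Pass 2: exponentiate the shifted scores and accumulate their sum.
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        weights[i] = std::exp(scores[i] - max_val);
        sum += weights[i];
    }
    // Pass 3: normalize so the row sums to 1.
    for (int i = 0; i < n; i++) {
        weights[i] /= sum;
    }
}
```

Subtracting the maximum means the largest exponent is always exp(0) = 1, so scores such as 1000 do not overflow, and sum is at least 1, so the division is safe.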

Weighted sum of V

The attention weights are used to compute a weighted sum of value vectors. The result is written into attn_out using the same head-splitting index formula:

            for(int i = 0; i<d_k; i++){
                float val = 0;
                for(int t2 = 0; t2<T; t2++){
                    val += attn_weights[b*num_heads*T*T + h*T*T + t*T + t2]
                         * V[b*T*num_heads*d_k + t2*num_heads*d_k + h*d_k + i];
                }
                attn_out[b*T*num_heads*d_k + t*num_heads*d_k + h*d_k + i] = val;
            }

Output projection and cleanup

All heads are already concatenated in attn_out by virtue of the flat layout. A final matrix multiply by Wo mixes the head outputs:

matmul(attn_out, Wo, nullptr, out, B*T, d_model, d_model);

Then every buffer allocated at the start is freed:

delete[] Q;
delete[] K;
delete[] V;
delete[] scores;
delete[] attn_weights;
delete[] attn_out;
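
For reference, the steps above can be assembled into one compilable function. Treat this as a sketch under stated assumptions rather than the original file. The source never shows matmul, so the version here (row-major, out = in · W plus an optional bias) is an assumption inferred from the call sites. The index arithmetic is also written in the shorter form (b*T + t)*d_model + h*d_k, which matches the walkthrough's formula whenever d_model is a multiple of num_heads.

```cpp
#include <cmath>

// ASSUMPTION: stand-in for the source's matmul, which is not shown.
// Computes out[M x N] = in[M x K] * W[K x N] (+ bias[N] if non-null), row-major.
void matmul(const float* in, const float* W, const float* bias, float* out,
            int M, int K, int N) {
    for (int m = 0; m < M; m++) {
        for (int n = 0; n < N; n++) {
            float acc = bias ? bias[n] : 0.0f;
            for (int k = 0; k < K; k++) acc += in[m*K + k] * W[k*N + n];
            out[m*N + n] = acc;
        }
    }
}

void attention_forward(float* out, float* x, float* k_input, float* v_input,
                       float* Wq, float* Wk, float* Wv, float* Wo,
                       int B, int T, int num_heads, int d_model, bool causal) {
    float* Q            = new float[B * T * d_model]();
    float* K            = new float[B * T * d_model]();
    float* V            = new float[B * T * d_model]();
    float* scores       = new float[B * num_heads * T * T]();
    float* attn_weights = new float[B * num_heads * T * T]();
    float* attn_out     = new float[B * T * d_model]();

    matmul(x,       Wq, nullptr, Q, B*T, d_model, d_model);
    matmul(k_input, Wk, nullptr, K, B*T, d_model, d_model);
    matmul(v_input, Wv, nullptr, V, B*T, d_model, d_model);

    const int d_k = d_model / num_heads;  // assumes d_model % num_heads == 0
    const float scale = 1.0f / std::sqrt(static_cast<float>(d_k));

    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            for (int h = 0; h < num_heads; h++) {
                const float* q = Q + (b*T + t)*d_model + h*d_k;
                float* score_slice = scores       + ((b*num_heads + h)*T + t)*T;
                float* attn_slice  = attn_weights + ((b*num_heads + h)*T + t)*T;

                // Scaled dot-product scores, with the causal mask applied inline.
                for (int t2 = 0; t2 < T; t2++) {
                    const float* k = K + (b*T + t2)*d_model + h*d_k;
                    float val = 0.0f;
                    for (int i = 0; i < d_k; i++) val += q[i] * k[i];
                    score_slice[t2] = (causal && t2 > t) ? -1e9f : val * scale;
                }

                // Numerically stable softmax over key positions.
                float max_val = score_slice[0];
                for (int i = 1; i < T; i++) {
                    if (score_slice[i] > max_val) max_val = score_slice[i];
                }
                float sum = 0.0f;
                for (int i = 0; i < T; i++) {
                    attn_slice[i] = std::exp(score_slice[i] - max_val);
                    sum += attn_slice[i];
                }
                for (int i = 0; i < T; i++) attn_slice[i] /= sum;

                // Weighted sum of value vectors for this head.
                for (int i = 0; i < d_k; i++) {
                    float val = 0.0f;
                    for (int t2 = 0; t2 < T; t2++) {
                        val += attn_slice[t2] * V[(b*T + t2)*d_model + h*d_k + i];
                    }
                    attn_out[(b*T + t)*d_model + h*d_k + i] = val;
                }
            }
        }
    }

    matmul(attn_out, Wo, nullptr, out, B*T, d_model, d_model);

    delete[] Q; delete[] K; delete[] V;
    delete[] scores; delete[] attn_weights; delete[] attn_out;
}
```

A quick sanity check is to pass identity matrices for all four weights and the same buffer for x, k_input, and v_input. With causal set to true, the first token can only attend to itself, so its output row should equal its input row.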

Head splitting via flat indexing

The key insight is that multi-head attention does not require physically splitting Q, K, V into separate arrays. Each buffer is flat row-major memory with the logical layout [batch, token, head, head_dim], so every token's d_model-wide row already consists of num_heads consecutive chunks of d_k floats, one chunk per head. The index formula

Q[b*T*num_heads*d_k + t*num_heads*d_k + h*d_k + i]

reads head h’s slice for batch b, token t, dimension i directly from the single Q buffer. No copy, no reshape, no extra allocation.

Causal masking

The mask is applied score-by-score inside the key-position loop (over t2), right after each score has been scaled:

if(causal && t2 > t) scores[b*num_heads*T*T + h*T*T + t*T + t2] = -1e9f;

Setting the score to -1e9 before softmax has the same practical effect as setting the attention weight to zero: exp(-1e9 - max) ≈ 0. This ensures the token at position t can only attend to positions 0 … t, which is required for autoregressive generation in the decoder’s self-attention sub-layer.
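
The effect of the mask is easiest to see on a toy case where every unmasked score is 0. The sketch below is purely illustrative; causal_uniform_weights is a hypothetical helper and is not part of attention_forward. It fills a T × T weight matrix the same way the kernel would for one head.

```cpp
#include <cmath>
#include <vector>

// Attention weights for T positions when every unmasked score is 0 and the
// causal mask writes -1e9f above the diagonal. Hypothetical demo helper.
std::vector<float> causal_uniform_weights(int T) {
    std::vector<float> w(T * T);
    for (int t = 0; t < T; t++) {
        float sum = 0.0f;
        for (int t2 = 0; t2 < T; t2++) {
            const float score = (t2 > t) ? -1e9f : 0.0f;
            // The diagonal entry is never masked, so the row max is 0 and
            // no max-subtraction is needed in this toy case.
            w[t*T + t2] = std::exp(score);
            sum += w[t*T + t2];
        }
        for (int t2 = 0; t2 < T; t2++) w[t*T + t2] /= sum;
    }
    return w;
}
```

The result is lower-triangular: row t spreads its weight evenly over positions 0 through t, and every entry above the diagonal is zero.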

Cross-attention

When attention_forward is used for cross-attention in the decoder, x is the decoder’s current hidden state (the query source), while k_input and v_input are the encoder’s final output:

// Inside decoder_block:
attention_forward(cross_attn_out,
                  norm1,    // queries from decoder
                  enc_out,  // keys from encoder
                  enc_out,  // values from encoder
                  Wq2, Wk2, Wv2, Wo2,
                  B, T, num_heads, d_model,
                  false);   // no causal mask for cross-attention

The function needs no modification — passing different pointers for k_input/v_input is all that is required.

All six intermediate buffers (Q, K, V, scores, attn_weights, attn_out) are heap-allocated with new[] at the start and freed with delete[] before returning. The function writes its final result only into the caller-provided out buffer. There is no RAII wrapper: allocation and deallocation are explicit and symmetric.
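
If explicit delete[] calls are a concern, one possible alternative is to let a smart pointer own each buffer. This is not what the original code does; make_zeroed is a hypothetical helper, shown only to illustrate the trade-off, and it requires C++14 or later.

```cpp
#include <cstddef>
#include <memory>

// Zero-initialized float buffer that frees itself when it goes out of scope.
// std::make_unique<float[]>(n) value-initializes the array, matching the
// behaviour of new float[n]() in the original code.
std::unique_ptr<float[]> make_zeroed(std::size_t n) {
    return std::make_unique<float[]>(n);
}
```

A buffer created as auto Q = make_zeroed(B * T * d_model) would be passed to matmul as Q.get(), and the six delete[] lines would no longer be needed.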

d_model must be exactly divisible by num_heads. Otherwise d_k = d_model / num_heads truncates silently, num_heads * d_k no longer equals d_model, and the head index math reads from the wrong offsets. Because both values are runtime parameters, add a runtime check such as assert(d_model % num_heads == 0) before any production use.
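
One way to enforce this is a small checked helper that computes d_k. The sketch below is an assumption about how such a guard might look: checked_head_dim is a hypothetical name, and it throws rather than asserts so the check survives builds compiled with NDEBUG.

```cpp
#include <stdexcept>

// Returns d_model / num_heads, rejecting configurations that would truncate.
// Hypothetical helper; the original function performs no such check.
int checked_head_dim(int d_model, int num_heads) {
    if (d_model <= 0 || num_heads <= 0 || d_model % num_heads != 0) {
        throw std::invalid_argument(
            "d_model must be a positive multiple of num_heads");
    }
    return d_model / num_heads;
}
```

For example, checked_head_dim(512, 8) returns 64, while checked_head_dim(10, 3) throws instead of silently producing d_k = 3.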
