After implementing individual numerical kernels in C, this module assembles every piece into a complete, working transformer: the encoder-decoder architecture of the original “Attention Is All You Need” paper (GPT-2 is a decoder-only descendant of this design), written entirely in pure C++. There are no frameworks, no autograd graphs, no BLAS calls, and no external dependencies beyond the standard library. Every matrix multiply, every softmax, and every layer normalization is a hand-written loop over a flat float* buffer. The result is a fully operational transformer that you can read line by line, following what each loop computes.

What’s implemented

All twelve components are written from scratch and composed into a single forward pass:
  1. matmul — matrix multiplication with optional bias. Accumulator pattern: local float val, written to memory once per output cell.
  2. layernorm — per-token normalization in two passes: compute mean and variance, then normalize, scale, and shift.
  3. softmax — numerically stable softmax in three passes: find max, compute exp and sum, normalize.
  4. embeddings_forward — token lookup table scaled by sqrt(d_model).
  5. positional_encoding — sinusoidal PE added in-place using sin/cos formulas.
  6. attention_forward — multi-head attention with fused QKV projection, causal masking via a boolean flag, and cross-attention via separate K/V inputs.
  7. feedforward_forward — two matmuls with ReLU activation and a 4× hidden dimension expansion.
  8. residual — element-wise add of input and sublayer output.
  9. projection_forward — final linear projection from d_model to vocab_size.
  10. encoder_block — attention → residual → layernorm → FFN → residual → layernorm.
  11. decoder_block — masked self-attention → residual → layernorm → cross-attention → residual → layernorm → FFN → residual → layernorm.
  12. transformer_block — full forward pass: src/tgt embeddings + PE, N encoder blocks, N decoder blocks, projection, softmax.
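The pass structure described for softmax and layernorm above can be sketched as follows. This is a minimal illustration, not the project's code: the names softmax_row and layernorm_row, the one-row-at-a-time signatures, and the eps default are assumptions made for the example. The layernorm statistics stage uses two loops here (mean, then variance) before the normalization loop.

```cpp
#include <cassert>
#include <cmath>

// Numerically stable softmax over one row of n floats, in place.
// Pass 1: find the max. Pass 2: exponentiate (shifted by the max) and sum.
// Pass 3: divide by the sum. (Hypothetical helper, not the project's signature.)
void softmax_row(float* row, int n) {
    assert(n > 0);
    float max_val = row[0];
    for (int i = 1; i < n; ++i)
        if (row[i] > max_val) max_val = row[i];

    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        row[i] = std::exp(row[i] - max_val);  // largest exponent is 0, so no overflow
        sum += row[i];
    }

    for (int i = 0; i < n; ++i)
        row[i] /= sum;
}

// Layer normalization of one token vector of length d.
// Stage 1: mean and variance. Stage 2: normalize, scale by gamma, shift by beta.
// eps = 1e-5f is an assumed default. (Hypothetical helper.)
void layernorm_row(float* out, const float* x, const float* gamma,
                   const float* beta, int d, float eps = 1e-5f) {
    assert(d > 0);
    float mean = 0.0f;
    for (int i = 0; i < d; ++i) mean += x[i];
    mean /= d;

    float var = 0.0f;
    for (int i = 0; i < d; ++i) {
        float diff = x[i] - mean;
        var += diff * diff;
    }
    var /= d;

    float inv_std = 1.0f / std::sqrt(var + eps);
    for (int i = 0; i < d; ++i)
        out[i] = (x[i] - mean) * inv_std * gamma[i] + beta[i];
}
```

Subtracting the row maximum before exponentiating is what keeps the softmax stable: logits as large as 1000.0f would overflow std::exp directly, but after the shift every exponent is at most zero.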

Architecture data flow

The transformer follows the classic encoder-decoder structure. Data moves in two parallel streams that merge in the decoder:
Source path
───────────
src_tokens (int[])
  → embeddings_forward    (token IDs → dense float vectors, scaled by √d_model)
  → positional_encoding   (sin/cos PE added in-place)
  → encoder_block × N     (self-attention + FFN, repeated N times)
  → enc_out               (final encoder hidden states)

Target path
───────────
tgt_tokens (int[])
  → embeddings_forward
  → positional_encoding
  → decoder_block × N     (masked self-attn + cross-attn(enc_out) + FFN, × N)
  → dec_out

Output
──────
dec_out
  → projection_forward    (d_model → vocab_size linear layer)
  → softmax               (numerically stable, per-token)
  → out                   (probability distribution over vocabulary)
Each encoder block takes the previous block’s output as its input. The first encoder block reads from the embedded source; every subsequent block reads from enc_out. The same pattern applies to decoder blocks. The final encoder output is passed as the cross-attention key and value source into every decoder block.
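One way to picture this chaining is the sketch below. It is an assumption-laden illustration: toy_block is a hypothetical stand-in for encoder_block (it just adds 1 to every element), run_stack is a made-up name, and the two-scratch-buffer swap is one possible way to feed each block the previous block's output, not necessarily how the project does it.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for encoder_block: adds 1 to every element.
void toy_block(float* out, const float* in, int size) {
    for (int i = 0; i < size; ++i) out[i] = in[i] + 1.0f;
}

// Runs N blocks back to back. Block 0 reads the embedded input; every later
// block reads what the previous block wrote. The result lands in enc_out.
void run_stack(float* enc_out, const float* embedded, int size, int N) {
    assert(size > 0 && N > 0);
    float* cur = new float[size]();   // input of the current block
    float* next = new float[size]();  // output of the current block
    std::memcpy(cur, embedded, size * sizeof(float));

    for (int layer = 0; layer < N; ++layer) {
        toy_block(next, cur, size);
        float* tmp = cur;  // swap roles: this block's output is the next block's input
        cur = next;
        next = tmp;
    }

    std::memcpy(enc_out, cur, size * sizeof(float));
    delete[] cur;
    delete[] next;
}
```

In a decoder stack the same loop would apply, with the finished encoder output additionally passed to every block as the cross-attention key and value source.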

How memory works

There is no RAII, no std::vector, and no smart pointers. Every intermediate buffer is heap-allocated with new float[size]() (the trailing () zero-initializes it), holds the result of a single operation, and is freed with delete[] before the function that allocated it returns. Only the final output is written to a caller-provided buffer.
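```cpp
// Typical pattern inside any block function:
float* attn_out = new float[B * T * d_model]();   // zero-initialized scratch buffer
float* residual1 = new float[B * T * d_model]();

// Self-attention: x supplies queries, keys, and values; the last argument is the causal-mask flag.
attention_forward(attn_out, x, x, x, Wq, Wk, Wv, Wo, B, T, num_heads, d_model, false);
residual(residual1, x, attn_out, B, T, d_model);  // residual1 = x + attn_out
// ... more operations ...

// Free every scratch buffer before returning.
delete[] attn_out;
delete[] residual1;
```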
This explicit allocation/free cycle makes the memory ownership model completely transparent: the function that allocates a buffer is always the function that frees it. Shapes are computed from the dimension parameters (B, T, d_model, etc.) passed at call time — there are no global tensors.
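Because every tensor is a flat buffer whose shape lives only in the dimension parameters, each function has to turn (batch, token, feature) coordinates into a single offset. The sketch below assumes a row-major [B, T, d_model] layout, which the B * T * d_model allocation sizes suggest but do not prove; idx3 and residual_add are hypothetical names used only for illustration.

```cpp
#include <cassert>

// Offset of element (b, t, d) in a row-major [B, T, d_model] buffer.
// The layout is an assumption for this sketch.
inline int idx3(int b, int t, int d, int T, int d_model) {
    return (b * T + t) * d_model + d;
}

// Element-wise residual add: out = x + sublayer, all shaped [B, T, d_model].
// Hypothetical helper; the project's residual function may differ.
void residual_add(float* out, const float* x, const float* sublayer,
                  int B, int T, int d_model) {
    assert(B > 0 && T > 0 && d_model > 0);
    for (int b = 0; b < B; ++b)
        for (int t = 0; t < T; ++t)
            for (int d = 0; d < d_model; ++d) {
                int i = idx3(b, t, d, T, d_model);
                out[i] = x[i] + sublayer[i];
            }
}
```

Under this layout, with T = 4 and d_model = 4, token 1 of batch 0 starts at offset 4 and batch 1 starts at offset 16.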
The test in main() uses B=1, T=4, d_model=4, d_ff=16, num_heads=2, vocab_size=5, and N=2 with identity weight matrices. Identity weights make the output easy to verify by hand — any deviation from expected probabilities reveals a bug in the index arithmetic, not in learned parameters.
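The idea behind testing with identity weights can be shown on a single matmul. The following is a hedged sketch: matmul_sketch and fill_identity are hypothetical names, and the row-major [K x N] weight layout is an assumption. It follows the accumulator pattern described earlier, with a local float val written to memory once per output cell.

```cpp
#include <cassert>

// out[M x N] = a[M x K] * w[K x N] (+ bias[N] if bias is not null), row-major.
// Layout and signature are assumptions for this sketch.
void matmul_sketch(float* out, const float* a, const float* w,
                   const float* bias, int M, int K, int N) {
    assert(M > 0 && K > 0 && N > 0);
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float val = bias ? bias[n] : 0.0f;  // local accumulator
            for (int k = 0; k < K; ++k)
                val += a[m * K + k] * w[k * N + n];
            out[m * N + n] = val;  // single write per output cell
        }
}

// Fills a square n x n row-major matrix with the identity.
void fill_identity(float* w, int n) {
    for (int i = 0; i < n * n; ++i) w[i] = 0.0f;
    for (int i = 0; i < n; ++i) w[i * n + i] = 1.0f;
}
```

With an identity weight and no bias, the output must equal the input element for element, so a transposed index or a wrong stride shows up immediately as a mismatched value.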

Pages in this section

Embeddings & Positional Encoding

How token IDs become dense vectors and how sinusoidal position signals are injected in-place.

Multi-Head Attention

QKV projection, flat head-splitting index math, causal masking, and cross-attention in pure C++.

Encoder & Decoder Blocks

How attention, residual connections, layer norm, and FFN are composed into encoder and decoder blocks.

Full Forward Pass

The complete transformer_block function — from integer token IDs to output probabilities — with test output.
