After implementing individual numerical kernels in C, this module assembles every piece into a complete, working transformer written in pure C++. It follows the encoder-decoder architecture of the original “Attention Is All You Need” paper, whose decoder-only variant underlies GPT-2. There are no frameworks, no autograd graphs, no BLAS calls, and no external dependencies beyond the standard library. Every matrix multiply, every softmax, every layer normalization is a hand-written loop over a flat float* buffer. The result is a fully operational transformer that you can read line by line, understanding exactly what every instruction does.
What’s implemented
All twelve components are written from scratch and composed into a single forward pass:
- matmul: matrix multiplication with optional bias. Accumulator pattern: local float val, written to memory once per output cell.
- layernorm: per-token normalization in two passes: compute mean and variance, then normalize, scale, and shift.
- softmax: numerically stable softmax in three passes: find max, compute exp and sum, normalize.
- embeddings_forward: token lookup table scaled by sqrt(d_model).
- positional_encoding: sinusoidal PE added in-place using sin/cos formulas.
- attention_forward: multi-head attention with fused QKV projection, causal masking via a boolean flag, and cross-attention via separate K/V inputs.
- feedforward_forward: two matmuls with ReLU activation and a 4× hidden dimension expansion.
- residual: element-wise add of input and sublayer output.
- projection_forward: final linear projection from d_model to vocab_size.
- encoder_block: attention → residual → layernorm → FFN → residual → layernorm.
- decoder_block: masked self-attention → residual → layernorm → cross-attention → residual → layernorm → FFN → residual → layernorm.
- transformer_block: full forward pass: src/tgt embeddings + PE, N encoder blocks, N decoder blocks, projection, softmax.
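The three lowest-level kernels in the list can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's actual source: the function names come from the list above, but the argument order, the row-major layout, the in-place softmax, and the eps default are guesses made for the example.

```cpp
#include <cassert>
#include <cmath>

// Assumed signature. out[M x N] = in[M x K] * w[K x N], plus bias[N] when
// bias is non-null. All matrices are row-major flat buffers.
void matmul(const float* in, const float* w, const float* bias,
            float* out, int M, int K, int N) {
    assert(M > 0 && K > 0 && N > 0);
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float val = bias ? bias[j] : 0.0f;  // local accumulator
            for (int k = 0; k < K; ++k) {
                val += in[i * K + k] * w[k * N + j];
            }
            out[i * N + j] = val;  // one memory write per output cell
        }
    }
}

// Assumed signature. In-place softmax over each row of x[rows x cols].
void softmax(float* x, int rows, int cols) {
    assert(rows > 0 && cols > 0);
    for (int r = 0; r < rows; ++r) {
        float* row = x + r * cols;
        float max_val = row[0];
        for (int c = 1; c < cols; ++c) {  // pass 1: row maximum
            if (row[c] > max_val) max_val = row[c];
        }
        float sum = 0.0f;
        for (int c = 0; c < cols; ++c) {  // pass 2: exp and running sum
            row[c] = std::exp(row[c] - max_val);
            sum += row[c];
        }
        for (int c = 0; c < cols; ++c) {  // pass 3: normalize
            row[c] /= sum;
        }
    }
}

// Assumed signature. Per-token layer normalization of x[rows x d].
void layernorm(const float* x, const float* gamma, const float* beta,
               float* out, int rows, int d, float eps = 1e-5f) {
    assert(rows > 0 && d > 0);
    for (int r = 0; r < rows; ++r) {
        const float* row = x + r * d;
        // pass 1: mean and biased variance of this token's features
        float mean = 0.0f;
        for (int i = 0; i < d; ++i) mean += row[i];
        mean /= d;
        float var = 0.0f;
        for (int i = 0; i < d; ++i) {
            float diff = row[i] - mean;
            var += diff * diff;
        }
        var /= d;
        // pass 2: normalize, scale, shift
        float inv_std = 1.0f / std::sqrt(var + eps);
        for (int i = 0; i < d; ++i) {
            out[r * d + i] = (row[i] - mean) * inv_std * gamma[i] + beta[i];
        }
    }
}
```

Subtracting the row maximum before calling std::exp keeps every exponent at or below zero, so large logits cannot overflow to infinity. Softmax is unchanged by a constant shift of its inputs, so the result is the same.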
Architecture data flow
The transformer follows the classic encoder-decoder structure. Data moves in two parallel streams that merge in the decoder.
[...] enc_out. The same pattern applies to decoder blocks. The final encoder output is passed as the cross-attention key and value source into every decoder block.
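The control flow described here can be traced with a small skeleton. Everything in it is hypothetical: encoder_block_stub, decoder_block_stub, and forward_sketch are illustrative names, the stubs replace the real blocks with trivial arithmetic, and the in-place buffer handling is a simplification rather than a description of how the module chains its buffers. The sketch only shows the order of operations: all encoder blocks run first, and the final encoder output is then handed to every decoder block.

```cpp
#include <cassert>

// Stand-in for one encoder block. The real block runs attention, residual,
// layernorm, and FFN; this stub only adds 1 so the flow is easy to trace.
void encoder_block_stub(float* x, int n) {
    for (int i = 0; i < n; ++i) x[i] += 1.0f;
}

// Stand-in for one decoder block. It updates the decoder stream using the
// final encoder output, which plays the role of the cross-attention
// key/value source.
void decoder_block_stub(float* x, const float* enc_out, int n) {
    for (int i = 0; i < n; ++i) x[i] += enc_out[i];
}

// Hypothetical driver: n is the flattened activation size, N the number of
// encoder blocks and of decoder blocks.
void forward_sketch(const float* src, const float* tgt, float* out,
                    int n, int N) {
    assert(n > 0 && N > 0);
    float* enc = new float[n]();                  // encoder stream
    for (int i = 0; i < n; ++i) enc[i] = src[i];
    for (int i = 0; i < n; ++i) out[i] = tgt[i];  // decoder stream lives in out

    // Stream 1: the source passes through all N encoder blocks.
    for (int l = 0; l < N; ++l) encoder_block_stub(enc, n);

    // Stream 2: every decoder block receives the same final encoder output.
    for (int l = 0; l < N; ++l) decoder_block_stub(out, enc, n);

    delete[] enc;
}
```

With a one-element source of 0, a one-element target of 1, and N=2, the encoder stream ends at 2 and the decoder stream accumulates it twice, giving 5.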
How memory works
There is no RAII, no std::vector, and no smart pointers. Every intermediate buffer is heap-allocated with new float[size]() (the trailing () zero-initializes it), used for exactly one operation, and then freed with delete[] before the function returns. Only the final output is written to a caller-provided buffer.
Dimensions (B, T, d_model, etc.) are passed at call time; there are no global tensors.
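The allocation pattern can be illustrated with a simplified feed-forward step. This is a sketch under assumptions, not the module's feedforward_forward: the name feedforward_sketch and its signature are made up for the example, biases are omitted, and the batch dimension is folded into T.

```cpp
#include <cassert>

// Hypothetical helper: out[T x d_model] = relu(x * w1) * w2, where
// w1 is [d_model x d_ff] and w2 is [d_ff x d_model], all row-major.
void feedforward_sketch(const float* x, const float* w1, const float* w2,
                        float* out, int T, int d_model, int d_ff) {
    assert(T > 0 && d_model > 0 && d_ff > 0);

    // Scratch buffer: the trailing () value-initializes each float to 0.
    float* hidden = new float[T * d_ff]();

    // hidden = relu(x * w1)
    for (int t = 0; t < T; ++t) {
        for (int j = 0; j < d_ff; ++j) {
            float val = 0.0f;
            for (int k = 0; k < d_model; ++k) {
                val += x[t * d_model + k] * w1[k * d_ff + j];
            }
            hidden[t * d_ff + j] = val > 0.0f ? val : 0.0f;
        }
    }

    // out = hidden * w2, written straight into the caller's buffer
    for (int t = 0; t < T; ++t) {
        for (int j = 0; j < d_model; ++j) {
            float val = 0.0f;
            for (int k = 0; k < d_ff; ++k) {
                val += hidden[t * d_ff + k] * w2[k * d_model + j];
            }
            out[t * d_model + j] = val;
        }
    }

    // The scratch buffer never outlives the function that allocated it.
    delete[] hidden;
}
```

Because every shape arrives as an argument and every scratch buffer is released before returning, the function holds no state between calls. The cost of this style is that an early return added later must also free the buffer by hand.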
The test in main() uses B=1, T=4, d_model=4, d_ff=16, num_heads=2, vocab_size=5, and N=2 with identity weight matrices. Identity weights make the output easy to verify by hand: any deviation from the expected probabilities reveals a bug in the index arithmetic, not in learned parameters.
Pages in this section
Embeddings & Positional Encoding
How token IDs become dense vectors and how sinusoidal position signals are injected in-place.
Multi-Head Attention
QKV projection, flat head-splitting index math, causal masking, and cross-attention in pure C++.
Encoder & Decoder Blocks
How attention, residual connections, layer norm, and FFN are composed into encoder and decoder blocks.
Full Forward Pass
The complete transformer_block function, from integer token IDs to output probabilities, with test output.