Every language model, from a small GPT-2 to a modern large model, is built from the same handful of operations: matrix multiplication, layer normalization, softmax, and token embeddings. This module implements each of those operations from scratch in pure C and C++, with no library dependencies, no autograd, and no abstraction layers. The goal is not performance; it is understanding. Before you can write a CUDA kernel for an operation, you must be able to write the same operation in a plain for loop and know exactly what the memory layout looks like.
Why Start in C?
GPU memory is flat. There is no “2D array” on a GPU, no object, no tensor struct. When you allocate memory on a GPU, you get a pointer to a contiguous block of floats, which is a 1D array. To treat it as a matrix, you compute an integer offset using shape arithmetic. This is identical to how C represents arrays, which means a C implementation of a kernel and its GPU equivalent share the same memory access pattern. Writing the CPU version first makes porting to CUDA mechanical: the loop body becomes the thread body, and the loop indices become thread coordinates.
What This Module Covers
Tensor Indexing
How multi-dimensional tensors are stored as flat 1D arrays, and the index arithmetic to access any element by shape. The prerequisite for every other page.
Matrix Multiplication
The three-loop matmul pattern with flat indexing, optional bias, and the local accumulator that minimises memory writes.
Layer Normalization
Per-token normalization: the four-step mean / variance / rstd / normalize pattern over a (B, T, C) shaped activation tensor.
Softmax
Numerically stable softmax using the three-pass max-subtract / exp-sum / divide algorithm, with a worked example.
GPT-2 Study
Reading and annotating Karpathy’s llm.c: encoder forward/backward, layernorm forward/backward, and the full 124M-parameter layout.
The Study Approach
The study follows two tracks in parallel. The first is reading Karpathy’s train_gpt2.c line by line, adding annotations that explain the tensor shapes, the flat-index formulas, and why each intermediate value is cached for the backward pass. The second is re-implementing each kernel from scratch (matmul.cpp, layernorm.cpp, softmax.cpp), each with a main() function that prints expected outputs so correctness can be verified by eye.
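As an illustration of the second track, a standalone kernel with a check that can be verified by eye might look like the sketch below. The names softmax_row and print_softmax_check are hypothetical and not taken from softmax.cpp; in a standalone file, the body of print_softmax_check would sit in main().

```cpp
#include <cmath>
#include <cstdio>

// Numerically stable softmax over one row of n logits (n >= 1).
void softmax_row(float *out, const float *in, int n) {
    // Pass 1: find the row maximum.
    float maxval = in[0];
    for (int i = 1; i < n; i++) {
        if (in[i] > maxval) maxval = in[i];
    }
    // Pass 2: exponentiate the shifted logits and accumulate their sum.
    // Subtracting the maximum keeps every exponent at or below zero.
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        out[i] = std::exp(in[i] - maxval);
        sum += out[i];
    }
    // Pass 3: divide by the sum so the row adds up to 1.
    for (int i = 0; i < n; i++) {
        out[i] /= sum;
    }
}

// The kind of check a standalone file could run from main().
void print_softmax_check() {
    const float logits[3] = {1.0f, 2.0f, 3.0f};
    float probs[3];
    softmax_row(probs, logits, 3);
    for (int i = 0; i < 3; i++) {
        std::printf("%.4f ", probs[i]);
    }
    std::printf("\n");
}
```

For the logits 1, 2, 3 the printed values should be close to 0.0900, 0.2447, and 0.6652, which can be confirmed with a calculator.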
Read the annotated source
Start with train_gpt2_annotated.c. Every function has a comment block describing its tensor shapes, the math it performs, and how it maps to the GPT-2 architecture diagram.
Implement each kernel standalone
Each of matmul.cpp, layernorm.cpp, and softmax.cpp is a self-contained file with a main() that exercises the kernel and prints results you can check by hand.
Build the pybind11 bridge
Once a kernel is correct in C++, tensor-primitives/primitives.cpp shows how to expose it to Python via pybind11 so it can be called from a training loop or compared against PyTorch.
The pybind11 Bridge
tensor-primitives/primitives.cpp is a starter file that shows how to wrap a C++ function with pybind11 and expose it as a Python module. The CMakeLists.txt in the same directory uses pybind11_add_module to build a shared library called myprims.
This starter currently wraps a trivial add function. The intent is to replace it with real kernel implementations (matmul, layernorm, softmax) so they can be imported and benchmarked directly from Python against PyTorch equivalents.
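As a sketch of the kind of kernel that could eventually take the place of add, here is a layer normalization forward pass over a (B, T, C) tensor using plain pointers. The function name, the signature, and the epsilon value of 1e-5 are assumptions made for illustration; the binding code itself is not shown.

```cpp
#include <cmath>
#include <cstddef>

// Normalizes each of the B*T token vectors of length C independently.
// inp and out are (B, T, C) row-major; weight and bias have length C.
void layernorm_forward(float *out, const float *inp,
                       const float *weight, const float *bias,
                       std::size_t B, std::size_t T, std::size_t C) {
    const float eps = 1e-5f;  // assumed value, guards against division by zero
    for (std::size_t b = 0; b < B; b++) {
        for (std::size_t t = 0; t < T; t++) {
            const float *x = inp + (b * T + t) * C;  // start of this token
            float *y = out + (b * T + t) * C;

            // Step 1: mean over the channel dimension.
            float mean = 0.0f;
            for (std::size_t i = 0; i < C; i++) mean += x[i];
            mean /= static_cast<float>(C);

            // Step 2: variance (biased, divides by C).
            float var = 0.0f;
            for (std::size_t i = 0; i < C; i++) {
                float d = x[i] - mean;
                var += d * d;
            }
            var /= static_cast<float>(C);

            // Step 3: reciprocal standard deviation.
            float rstd = 1.0f / std::sqrt(var + eps);

            // Step 4: normalize, then scale and shift.
            for (std::size_t i = 0; i < C; i++) {
                y[i] = (x[i] - mean) * rstd * weight[i] + bias[i];
            }
        }
    }
}
```

A function of this shape takes only raw pointers and sizes, which is the same flat-memory interface the rest of the module builds on.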