train_gpt2_annotated.c is Karpathy’s GPT-2 training code with line-by-line annotations added to explain the tensor shapes, the flat-index formulas, why each intermediate value is kept, and how each function maps to the GPT-2 architecture. The goal is to understand every forward and backward pass function before writing CUDA versions of them. Reading the C code — where the loops are explicit and the memory layout is undisguised — is a far more direct path to that understanding than reading framework source code.
What llm.c Is
llm.c is Andrej Karpathy’s minimal, CPU-only GPT-2 trainer written in a single C file. There is no framework, no automatic differentiation, no tensor abstraction layer. Every forward pass is a plain C function that takes float* pointers and integers. Every backward pass is the manually derived gradient of the corresponding forward pass, written as another plain C function. The code is readable in a way that framework-backed implementations cannot be: you can see the exact loops, the exact indices, and the exact memory access pattern.
The annotated version adds explanation blocks above every major function describing:
- The shape of every input and output tensor
- The flat-index formula used to address each element
- Where in the GPT-2 architecture the function executes
- What intermediate values the forward pass must cache for the backward pass
Key Functions Annotated
encoder_forward
Combines token and position embeddings. For each (b, t) position, looks up the token row from wte and the position row from wpe, adds them element-wise, and writes the result to out[b, t, :].
encoder_backward
Accumulates gradients back into dwte and dwpe. For each (b, t), adds the upstream gradient dout[b, t, :] to both dwte[token_id, :] and dwpe[t, :].
layernorm_forward
Same four-step mean/variance/rstd/normalize pattern as the standalone layernorm.cpp, but caches mean and rstd in separate (B, T) buffers for use in the backward pass.
layernorm_backward
Two-pass gradient computation. The first pass accumulates dnorm_mean and dnorm_norm_mean across the C dimension. The second pass uses those two scalars to compute the gradient for each channel.
encoder_forward: Token + Position Embeddings
encoder_forward is the first function every token passes through. Its annotation in the source explains the index pattern that every subsequent function also uses.
wte is the token embedding table with shape (V, C): one C-dimensional vector per vocabulary entry. wpe is the position embedding table with shape (maxT, C): one C-dimensional vector per position. The lookup into wte uses the integer token ID ix, giving an offset of ix * C. The lookup into wpe uses the position t, giving an offset of t * C.
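The lookups described above can be sketched in C as follows. This is a minimal reconstruction from the description on this page, not a copy of the annotated file: the const qualifiers and the assert are additions for illustration, and the local names (out_bt, wte_ix, wpe_t) are assumptions.

```c
#include <assert.h>

/* Forward: out[b, t, :] = wte[inp[b, t], :] + wpe[t, :]
 * inp: (B, T) token ids, wte: (V, C), wpe: (maxT, C), out: (B, T, C). */
void encoder_forward(float *out, const int *inp,
                     const float *wte, const float *wpe,
                     int B, int T, int C) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            int ix = inp[b * T + t];             /* token id at (b, t) */
            assert(ix >= 0);                     /* sketch-only sanity check */
            float *out_bt = out + b * T * C + t * C;
            const float *wte_ix = wte + ix * C;  /* row ix of the token table */
            const float *wpe_t = wpe + t * C;    /* row t of the position table */
            for (int i = 0; i < C; i++) {
                out_bt[i] = wte_ix[i] + wpe_t[i];
            }
        }
    }
}

/* Backward: the addition routes dout[b, t, :] unchanged to both rows.
 * Gradients are accumulated with += because the same token id or
 * position can occur at several (b, t) locations. */
void encoder_backward(float *dwte, float *dwpe,
                      const float *dout, const int *inp,
                      int B, int T, int C) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            int ix = inp[b * T + t];
            const float *dout_bt = dout + b * T * C + t * C;
            float *dwte_ix = dwte + ix * C;
            float *dwpe_t = dwpe + t * C;
            for (int i = 0; i < C; i++) {
                dwte_ix[i] += dout_bt[i];
                dwpe_t[i] += dout_bt[i];
            }
        }
    }
}
```

Note that the backward function reads nothing from the forward pass except inp: an element-wise add has no intermediate values worth caching.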
Flat Indexing: The Consistent Pattern
Every function in train_gpt2_annotated.c follows the same indexing rule. For a tensor of shape (D1, D2, D3), element (i1, i2, i3) is at flat offset i1 * (D2 * D3) + i2 * D3 + i3.
This rule appears in encoder_forward, encoder_backward, layernorm_forward, layernorm_backward, matmul_forward, matmul_backward, attention_forward, and every other function in the file.
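As a small illustration of the row-major rule, the sketch below wraps it in two helpers. Both flat_index3 and row_bt are hypothetical names introduced here; the functions in the file are described as computing these offsets inline rather than through helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Flat offset of element (i1, i2, i3) in a row-major tensor
 * of shape (D1, D2, D3). The last index varies fastest. */
size_t flat_index3(size_t i1, size_t i2, size_t i3,
                   size_t D1, size_t D2, size_t D3) {
    assert(i1 < D1 && i2 < D2 && i3 < D3);  /* bounds check for the sketch */
    return i1 * (D2 * D3) + i2 * D3 + i3;
}

/* Pointer to the start of the C-length row at (b, t) in a (B, T, C)
 * activation tensor: the same offset with i3 = 0. */
float *row_bt(float *x, size_t b, size_t t, size_t T, size_t C) {
    return x + b * T * C + t * C;
}
```

For shape (2, 3, 4), element (1, 2, 3) lands at 1 * 12 + 2 * 4 + 3 = 23, the last slot of the 24-element buffer.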
GPT-2 Parameter Layout
temp.c studies the malloc_and_point_parameters function from llm.c by printing the cumulative parameter count as each parameter tensor is added. The param_sizes array holds the size of each of the 16 parameter tensors in GPT-2.
Each parameter is stored as a float (4 bytes), so the full parameter tensor requires roughly 475 MB.
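The size arithmetic can be sketched as below. The tensor order and shapes follow the GPT-2 layout as llm.c is generally understood to define it. The ModelDims struct and the two function names are hypothetical, and the figures quoted after the code assume the unpadded 124M configuration (V = 50257, maxT = 1024, L = 12, C = 768), which this page does not state explicitly.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PARAMETER_TENSORS 16

/* Hypothetical, trimmed-down config: only the fields that affect sizes. */
typedef struct {
    int max_seq_len;  /* maxT */
    int vocab_size;   /* V */
    int num_layers;   /* L */
    int channels;     /* C */
} ModelDims;

/* Fill param_sizes[0..15] with the element count of each tensor. */
void fill_param_sizes(size_t *param_sizes, ModelDims dims) {
    assert(dims.channels > 0 && dims.num_layers > 0);
    size_t maxT = (size_t)dims.max_seq_len;
    size_t V = (size_t)dims.vocab_size;
    size_t L = (size_t)dims.num_layers;
    size_t C = (size_t)dims.channels;
    param_sizes[0]  = V * C;            /* wte */
    param_sizes[1]  = maxT * C;         /* wpe */
    param_sizes[2]  = L * C;            /* ln1w */
    param_sizes[3]  = L * C;            /* ln1b */
    param_sizes[4]  = L * (3 * C) * C;  /* qkvw */
    param_sizes[5]  = L * (3 * C);      /* qkvb */
    param_sizes[6]  = L * C * C;        /* attprojw */
    param_sizes[7]  = L * C;            /* attprojb */
    param_sizes[8]  = L * C;            /* ln2w */
    param_sizes[9]  = L * C;            /* ln2b */
    param_sizes[10] = L * (4 * C) * C;  /* fcw */
    param_sizes[11] = L * (4 * C);      /* fcb */
    param_sizes[12] = L * C * (4 * C);  /* fcprojw */
    param_sizes[13] = L * C;            /* fcprojb */
    param_sizes[14] = C;                /* lnfw */
    param_sizes[15] = C;                /* lnfb */
}

/* Sum of all 16 sizes: the length of the single flat parameter buffer. */
size_t total_params(const size_t *param_sizes) {
    size_t total = 0;
    for (int i = 0; i < NUM_PARAMETER_TENSORS; i++) {
        total += param_sizes[i];
    }
    return total;
}
```

Under the assumed dimensions the total is 124,439,808 parameters. At 4 bytes each that is 497,759,232 bytes, or about 474.7 MiB, which matches the rounded figure above.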
The pybind11 Bridge
tensor-primitives/primitives.cpp and its CMakeLists.txt provide a starting point for exposing C++ kernel implementations to Python via pybind11. The current file wraps a trivial add function, but the structure is identical to what would be needed to wrap matmul, layernorm, or softmax.
The CMakeLists.txt uses pybind11_add_module to build the shared library.
Once built, the module can be loaded with import myprims and the exposed functions called directly, enabling benchmarking against PyTorch equivalents.
Reading each forward and backward pass together is essential. The backward pass reveals exactly what intermediate values the forward pass must cache. For example, layernorm_forward caches mean and rstd in separate (B, T) buffers that are not needed for the output but are required by layernorm_backward to compute the input gradient efficiently. You cannot understand the caching decisions in a forward pass without first reading its backward pass.