train_gpt2_annotated.c is Karpathy’s GPT-2 training code with line-by-line annotations added to explain the tensor shapes, the flat-index formulas, why each intermediate value is kept, and how each function maps to the GPT-2 architecture. The goal is to understand every forward and backward pass function before writing CUDA versions of them. Reading the C code — where the loops are explicit and the memory layout is undisguised — is a far more direct path to that understanding than reading framework source code.

What llm.c Is

llm.c is Andrej Karpathy’s minimal GPT-2 trainer; the version studied here is the CPU-only reference implementation, written in a single C file. There is no framework, no automatic differentiation, and no tensor abstraction layer. Every forward pass is a plain C function that takes float* pointers and integers. Every backward pass is the manually derived gradient of the corresponding forward pass, written as another plain C function. The code is readable in a way that framework-backed implementations cannot be: you can see the exact loops, the exact indices, and the exact memory access pattern.
The annotated version adds an explanation block above every major function describing:
  • The shape of every input and output tensor
  • The flat-index formula used to address each element
  • Where in the GPT-2 architecture the function executes
  • What intermediate values the forward pass must cache for the backward pass

Key Functions Annotated

encoder_forward

Combines token and position embeddings. For each (b, t) position, looks up the token row from wte and the position row from wpe, element-wise adds them, and writes to out[b, t, :].

encoder_backward

Accumulates gradients back into dwte and dwpe. For each (b, t), adds the upstream gradient dout[b, t, :] to both dwte[token_id, :] and dwpe[t, :].

layernorm_forward

Same four-step mean/variance/rstd/normalize pattern as the standalone layernorm.cpp, but caches mean and rstd in separate (B, T) buffers for use in the backward pass.

layernorm_backward

Two-pass gradient computation. First pass accumulates dnorm_mean and dnorm_norm_mean across the C dimension. Second pass uses those two scalars to compute the gradient for each channel.

encoder_forward: Token + Position Embeddings

encoder_forward is the first function every token passes through. Its annotation in the source explains the index pattern that every subsequent function also uses:
// encoder_forward: combines token + position embeddings into the model's initial input.
//
// For each (b, t) position:
//   1. Look up the token ID from inp[b, t]
//   2. Get that token's row from wte (token embedding table)
//   3. Get position t's row from wpe (position embedding table)
//   4. Element-wise add the two C-dim vectors, write to out[b, t, :]
//
// Math:   out[b, t, :] = wte[inp[b, t], :] + wpe[t, :]

void encoder_forward(float* out,
                     int* inp, float* wte, float* wpe,
                     int B, int T, int C) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            // seek to the output position in out[b,t,:]
            float* out_bt = out + b * T * C + t * C;
            // get the index of the token at inp[b, t]
            int ix = inp[b * T + t];
            // seek to the position in wte corresponding to the token
            float* wte_ix = wte + ix * C;
            // seek to the position in wpe corresponding to the position
            float* wpe_t = wpe + t * C;
            // add the two vectors and store the result in out[b,t,:]
            for (int i = 0; i < C; i++) {
                out_bt[i] = wte_ix[i] + wpe_t[i];
            }
        }
    }
}
wte is the token embedding table with shape (V, C) — one C-dimensional vector per vocabulary entry. wpe is the position embedding table with shape (maxT, C) — one C-dimensional vector per position. The lookup into wte uses the integer token ID ix, giving an offset of ix * C. The lookup into wpe uses the position t, giving an offset of t * C.
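The matching backward pass reuses exactly the same offsets, only with the data flowing the other way. The following is a minimal sketch of encoder_backward based on the description given earlier (add dout[b, t, :] into dwte[token_id, :] and dwpe[t, :]); the argument order and const qualifiers are assumptions, not a copy of the annotated file.

```c
#include <assert.h>

// Sketch: scatter-add the upstream gradient into both embedding tables.
// dwte has shape (V, C), dwpe has shape (maxT, C), dout has shape (B, T, C).
// Gradients are accumulated with +=, so the caller is assumed to have
// zeroed dwte and dwpe before the first call.
void encoder_backward(float* dwte, float* dwpe,
                      const float* dout, const int* inp,
                      int B, int T, int C) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            // seek to the upstream gradient dout[b, t, :]
            const float* dout_bt = dout + b * T * C + t * C;
            // token ID at inp[b, t] selects the row of dwte
            int ix = inp[b * T + t];
            assert(ix >= 0);
            float* dwte_ix = dwte + ix * C;
            // position t selects the row of dwpe
            float* dwpe_t = dwpe + t * C;
            for (int i = 0; i < C; i++) {
                float d = dout_bt[i];
                dwte_ix[i] += d;
                dwpe_t[i] += d;
            }
        }
    }
}
```

The += matters: the same token can appear at several (b, t) positions, and every batch row shares position t, so several upstream gradients land on the same table row.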

Flat Indexing: The Consistent Pattern

Every function in train_gpt2_annotated.c follows the same indexing rule. For a tensor of shape (D1, D2, D3), element (i1, i2, i3) is at flat offset:
i1 * (D2 * D3) + i2 * D3 + i3
To get a pointer to a token’s full C-dimensional vector — which is the most common operation in the file — drop the last index:
float* vec = tensor + i1 * D2 * D3 + i2 * D3;
// vec[i] now accesses element (i1, i2, i) for i in 0..D3-1
This pointer-into-flat-array idiom appears without variation in encoder_forward, encoder_backward, layernorm_forward, layernorm_backward, matmul_forward, matmul_backward, attention_forward, and every other function in the file.
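As a quick check of the formula, the two helpers below spell out the offset computation and the drop-the-last-index pointer trick. The names flat_offset3 and row_ptr3 are hypothetical and do not appear in the annotated file, which writes the arithmetic inline.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper: flat offset of element (i1, i2, i3) in a
// row-major tensor of shape (D1, D2, D3).
size_t flat_offset3(size_t i1, size_t i2, size_t i3,
                    size_t D1, size_t D2, size_t D3) {
    assert(i1 < D1 && i2 < D2 && i3 < D3);
    return i1 * (D2 * D3) + i2 * D3 + i3;
}

// Hypothetical helper: pointer to the D3-long vector at (i1, i2, :).
// This is the same formula with the last index dropped.
float* row_ptr3(float* tensor, size_t i1, size_t i2,
                size_t D1, size_t D2, size_t D3) {
    assert(i1 < D1 && i2 < D2);
    return tensor + i1 * (D2 * D3) + i2 * D3;
}
```

For a tensor of shape (2, 3, 4), element (1, 2, 3) sits at offset 1 * 12 + 2 * 4 + 3 = 23, the last of the 24 elements, and row_ptr3 for (1, 2) points at offset 20.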

GPT-2 Parameter Layout

temp.c studies the malloc_and_point_parameters function from llm.c by printing the cumulative parameter count as each parameter tensor is added. The param_sizes array holds the size of each of the 16 parameter tensors in GPT-2:
int param_sizes[16] = {
    38633472,   // wte: token embedding table (V=50304, C=768) → 50304*768 (vocab padded to multiple of 64)
    786432,     // wpe: position embedding table (maxT=1024, C=768) → 1024*768
    9216,       // ln1w: layernorm 1 weights, 12 layers × 768 channels
    9216,       // ln1b: layernorm 1 biases, 12 × 768
    21233664,   // qkvw: QKV projection weights, 12 × (3*768) × 768
    27648,      // qkvb: QKV projection biases, 12 × (3*768)
    7077888,    // attprojw: attention output projection weights, 12 × 768 × 768
    9216,       // attprojb: attention output projection biases, 12 × 768
    9216,       // ln2w: layernorm 2 weights, 12 × 768
    9216,       // ln2b: layernorm 2 biases, 12 × 768
    28311552,   // fcw: feed-forward up-projection weights, 12 × (4*768) × 768
    36864,      // fcb: feed-forward up-projection biases, 12 × (4*768)
    28311552,   // fcprojw: feed-forward down-projection weights, 12 × 768 × (4*768)
    9216,       // fcprojb: feed-forward down-projection biases, 12 × 768
    768,        // lnfw: final layernorm weight, 768
    768         // lnfb: final layernorm bias, 768
};
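Summing all 16 entries gives 124,475,904, approximately 124 million parameters, matching the published GPT-2 small model size (the count is slightly higher than the unpadded model because the vocabulary is padded from 50257 to 50304). Each parameter is a float (4 bytes), so the full parameter buffer requires 497,903,616 bytes, roughly 498 MB (about 475 MiB).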
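// Output from temp.c (each line is printed before tensor i is added,
// so the i=15 line excludes the final 768 values of lnfb):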
// At i=0 , num_parameters = 0
// At i=1 , num_parameters = 38633472
// At i=2 , num_parameters = 39419904
// At i=3 , num_parameters = 39429120
// ...
// At i=15, num_parameters = 124475136
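The sizes and running totals above can be reproduced from the model hyperparameters alone. The sketch below is not the contents of temp.c; fill_param_sizes and count_parameters are hypothetical names, and the print format only approximates the output shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define NUM_PARAM_TENSORS 16

// Hypothetical helper: derive the 16 tensor sizes from the vocabulary
// size V, maximum sequence length maxT, channel count C, and layer count L.
void fill_param_sizes(size_t* sizes, size_t V, size_t maxT, size_t C, size_t L) {
    assert(sizes != NULL);
    sizes[0]  = V * C;             // wte
    sizes[1]  = maxT * C;          // wpe
    sizes[2]  = L * C;             // ln1w
    sizes[3]  = L * C;             // ln1b
    sizes[4]  = L * (3 * C) * C;   // qkvw
    sizes[5]  = L * (3 * C);       // qkvb
    sizes[6]  = L * C * C;         // attprojw
    sizes[7]  = L * C;             // attprojb
    sizes[8]  = L * C;             // ln2w
    sizes[9]  = L * C;             // ln2b
    sizes[10] = L * (4 * C) * C;   // fcw
    sizes[11] = L * (4 * C);       // fcb
    sizes[12] = L * C * (4 * C);   // fcprojw
    sizes[13] = L * C;             // fcprojb
    sizes[14] = C;                 // lnfw
    sizes[15] = C;                 // lnfb
}

// Hypothetical helper: sum the first n sizes. When verbose is nonzero,
// print the running total before each tensor is added.
size_t count_parameters(const size_t* sizes, int n, int verbose) {
    size_t num_parameters = 0;
    for (int i = 0; i < n; i++) {
        if (verbose) {
            printf("At i=%d, num_parameters = %zu\n", i, num_parameters);
        }
        num_parameters += sizes[i];
    }
    return num_parameters;
}
```

Calling fill_param_sizes with V=50304, maxT=1024, C=768, L=12 and then count_parameters over all 16 entries yields the 124,475,904 total; multiplying by sizeof(float) gives the size of the parameter buffer in bytes.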

The pybind11 Bridge

tensor-primitives/primitives.cpp and its CMakeLists.txt provide a starting point for exposing C++ kernel implementations to Python via pybind11. The current file wraps a trivial add function, but the structure is identical to what would be needed to wrap matmul, layernorm, or softmax:
#include <pybind11/pybind11.h>
namespace py = pybind11;

int add(int i, int j) {
    return i + j;
}

PYBIND11_MODULE(myprims, m) {
    m.def("add", &add, "A function that adds two numbers");
}
The CMake configuration uses pybind11_add_module to build the shared library:
cmake_minimum_required(VERSION 3.10)
project(tensor_primitives)
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(PYBIND11_FINDPYTHON ON)
find_package(pybind11 REQUIRED)
pybind11_add_module(myprims primitives.cpp)
Once built, the module can be imported in Python as import myprims and the exposed functions called directly, enabling benchmarking against PyTorch equivalents.
Reading each forward and backward pass together is essential. The backward pass reveals exactly what intermediate values the forward pass must cache. For example, layernorm_forward caches mean and rstd in separate (B, T) buffers that are not needed for the output but are required by layernorm_backward to compute the input gradient efficiently. You cannot understand the caching decisions in a forward pass without first reading its backward pass.
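The two-pass structure described for layernorm_backward shows where the cached values are consumed. The following is a sketch under the assumption that the backward pass follows that description; the signature and local names are illustrative and should be compared with the annotated source.

```c
#include <assert.h>

// Sketch: layernorm backward for a (B, T, C) tensor.
// mean and rstd are the (B, T) buffers cached by the forward pass.
// All gradient buffers are accumulated into with +=.
void layernorm_backward(float* dinp, float* dweight, float* dbias,
                        const float* dout, const float* inp, const float* weight,
                        const float* mean, const float* rstd,
                        int B, int T, int C) {
    assert(C > 0);
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const float* dout_bt = dout + b * T * C + t * C;
            const float* inp_bt = inp + b * T * C + t * C;
            float* dinp_bt = dinp + b * T * C + t * C;
            float mean_bt = mean[b * T + t];
            float rstd_bt = rstd[b * T + t];

            // first pass: two reductions across the C dimension
            float dnorm_mean = 0.0f;
            float dnorm_norm_mean = 0.0f;
            for (int i = 0; i < C; i++) {
                float norm_bti = (inp_bt[i] - mean_bt) * rstd_bt;
                float dnorm_i = weight[i] * dout_bt[i];
                dnorm_mean += dnorm_i;
                dnorm_norm_mean += dnorm_i * norm_bti;
            }
            dnorm_mean /= C;
            dnorm_norm_mean /= C;

            // second pass: per-channel gradients using the two scalars
            for (int i = 0; i < C; i++) {
                float norm_bti = (inp_bt[i] - mean_bt) * rstd_bt;
                float dnorm_i = weight[i] * dout_bt[i];
                dbias[i] += dout_bt[i];               // gradient of the shift
                dweight[i] += norm_bti * dout_bt[i];  // gradient of the scale
                float dval = dnorm_i - dnorm_mean - norm_bti * dnorm_norm_mean;
                dinp_bt[i] += dval * rstd_bt;         // gradient of the input
            }
        }
    }
}
```

Without the cached mean and rstd, each (b, t) position would need two extra reduction loops over C just to recompute them before the gradient could be formed.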
