Every language model, from a small GPT-2 to a modern large model, is built from the same handful of operations: matrix multiplication, layer normalization, softmax, and token embeddings. This module implements each of those operations from scratch in plain C and C++, with no external math libraries, no autograd, and no abstraction layers. The goal is understanding, not performance. Before you can write a CUDA kernel for an operation, you must be able to write the same operation as a plain for loop and know exactly what the memory layout looks like.

Why Start in C?

GPU memory is flat. There is no “2D array” on a GPU, no object, no tensor struct. When you allocate memory on a GPU, you get a pointer to a contiguous block of bytes, which a kernel treats as a 1D array of floats. To treat it as a matrix, you compute an integer offset from the row index, the column index, and the matrix shape. C lays out multi-dimensional arrays the same way, in row-major order, which means a C implementation of a kernel and its GPU equivalent share the same index arithmetic. Writing the CPU version first makes porting to CUDA largely mechanical: the loop body becomes the thread body, and the loop indices become thread coordinates.
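As a minimal sketch of the offset arithmetic described above, the following stores a small matrix in one contiguous buffer and addresses it in row-major order. The helper names `idx2` and `make_matrix` are hypothetical and do not come from the repository.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: offset of element (r, c) in a row-major matrix
// that has `cols` columns and is stored as one contiguous 1D buffer.
inline std::size_t idx2(std::size_t r, std::size_t c, std::size_t cols) {
    return r * cols + c;
}

// Hypothetical helper: build a rows x cols matrix in a flat buffer so that
// element (r, c) holds the value 10 * r + c.
std::vector<float> make_matrix(std::size_t rows, std::size_t cols) {
    std::vector<float> m(rows * cols);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            m[idx2(r, c, cols)] = static_cast<float>(10 * r + c);
        }
    }
    return m;
}
```

For a 2 x 3 matrix, element (1, 2) lives at offset 1 * 3 + 2 = 5, so `make_matrix(2, 3)[5]` holds 12.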

What This Module Covers

Tensor Indexing

How multi-dimensional tensors are stored as flat 1D arrays, and the index arithmetic to access any element by shape. The prerequisite for every other page.
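The same idea extends to three dimensions. The sketch below assumes a row-major (B, T, C) activation tensor, where the last dimension is contiguous; `idx3` and `token_row` are hypothetical names used only for illustration.

```cpp
#include <cstddef>

// Hypothetical helper: flat offset of element (b, t, c) in a row-major
// (B, T, C) tensor. B itself is not needed to compute the offset.
inline std::size_t idx3(std::size_t b, std::size_t t, std::size_t c,
                        std::size_t T, std::size_t C) {
    return b * T * C + t * C + c;
}

// Hypothetical helper: pointer to the C contiguous channels of token (b, t).
inline const float* token_row(const float* x, std::size_t b, std::size_t t,
                              std::size_t T, std::size_t C) {
    return x + idx3(b, t, 0, T, C);
}
```

With T = 4 and C = 8, element (1, 2, 3) sits at offset 1 * 32 + 2 * 8 + 3 = 51. Per-token operations such as layer normalization can take the row pointer once and then loop over `c` alone.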

Matrix Multiplication

The three-loop matmul pattern with flat indexing, optional bias, and the local accumulator that minimises memory writes.
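A minimal sketch of the three-loop pattern, assuming the layout used in llm.c where the input is (N, C), the weight is stored as (OC, C), and the output is (N, OC). The function name and parameter order are assumptions for illustration, not the repository's exact signature.

```cpp
#include <cstddef>

// out[n, o] = bias[o] + sum over c of inp[n, c] * weight[o, c]
// inp:    (N, C)  row-major
// weight: (OC, C) row-major (assumed layout)
// bias:   (OC) or nullptr when there is no bias
// out:    (N, OC) row-major
void matmul_forward(float* out, const float* inp, const float* weight,
                    const float* bias, std::size_t N, std::size_t C,
                    std::size_t OC) {
    for (std::size_t n = 0; n < N; ++n) {
        for (std::size_t o = 0; o < OC; ++o) {
            // Local accumulator: the sum stays in a register and
            // out[] is written exactly once per output element.
            float acc = (bias != nullptr) ? bias[o] : 0.0f;
            for (std::size_t c = 0; c < C; ++c) {
                acc += inp[n * C + c] * weight[o * C + c];
            }
            out[n * OC + o] = acc;
        }
    }
}
```

Accumulating directly into `out[n * OC + o]` inside the inner loop would give the same result but would touch memory C times per element instead of once.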

Layer Normalization

Per-token normalization: the four-step mean / variance / rstd / normalize pattern over a (B, T, C) shaped activation tensor.
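The four steps can be sketched as follows. This assumes a biased variance (dividing by C), an epsilon of 1e-5, and a learned per-channel weight and bias, in the style of llm.c; the exact signature is an assumption.

```cpp
#include <cmath>
#include <cstddef>

// inp, out: (B, T, C) row-major; weight, bias: (C); mean, rstd: (B, T).
// mean and rstd are stored so a backward pass could reuse them.
void layernorm_forward(float* out, float* mean, float* rstd,
                       const float* inp, const float* weight,
                       const float* bias, std::size_t B, std::size_t T,
                       std::size_t C) {
    const float eps = 1e-5f;  // assumed value
    for (std::size_t b = 0; b < B; ++b) {
        for (std::size_t t = 0; t < T; ++t) {
            const float* x = inp + (b * T + t) * C;
            float* y = out + (b * T + t) * C;

            // Step 1: mean over the C channels of this token.
            float m = 0.0f;
            for (std::size_t c = 0; c < C; ++c) m += x[c];
            m /= static_cast<float>(C);

            // Step 2: biased variance around that mean.
            float v = 0.0f;
            for (std::size_t c = 0; c < C; ++c) {
                const float d = x[c] - m;
                v += d * d;
            }
            v /= static_cast<float>(C);

            // Step 3: reciprocal standard deviation.
            const float s = 1.0f / std::sqrt(v + eps);

            // Step 4: normalize, then scale and shift.
            for (std::size_t c = 0; c < C; ++c) {
                y[c] = (x[c] - m) * s * weight[c] + bias[c];
            }
            mean[b * T + t] = m;
            rstd[b * T + t] = s;
        }
    }
}
```

For a single token with channels {1, 2, 3, 4}, unit weight, and zero bias, the mean is 2.5, the variance is 1.25, and the normalized output is symmetric around zero.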

Softmax

Numerically stable softmax using the three-pass max-subtract / exp-sum / divide algorithm, with a worked example.
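A minimal sketch of the three passes over a single row. The name `softmax_row` is hypothetical; the repository's softmax.cpp may organize the passes differently.

```cpp
#include <cmath>
#include <cstddef>

// Numerically stable softmax over one row of n values.
void softmax_row(float* out, const float* in, std::size_t n) {
    if (n == 0) return;

    // Pass 1: find the maximum so every exponent is <= 0.
    float maxv = in[0];
    for (std::size_t i = 1; i < n; ++i) {
        if (in[i] > maxv) maxv = in[i];
    }

    // Pass 2: exponentiate the shifted values and accumulate their sum.
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = std::exp(in[i] - maxv);
        sum += out[i];
    }

    // Pass 3: divide by the sum so the row adds up to 1.
    for (std::size_t i = 0; i < n; ++i) {
        out[i] /= sum;
    }
}
```

For the input {1, 2, 3} the result is approximately {0.090, 0.245, 0.665}. The input {1000, 1001, 1002} gives the same result, because subtracting the maximum leaves {-2, -1, 0} in both cases, whereas a naive `exp(1000.0f)` would overflow to infinity.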

GPT-2 Study

Reading and annotating Karpathy’s llm.c: encoder forward/backward, layernorm forward/backward, and the full 124M-parameter layout.
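As a rough check on the 124M figure, the sketch below adds up the parameter tensors of a GPT-2 style model. It assumes the standard GPT-2 small hyperparameters (vocabulary 50257, maximum sequence length 1024, 12 layers, 768 channels), an output head that shares the token embedding, and an unpadded vocabulary. Versions of llm.c that pad the vocabulary will report a slightly larger total. The function name is hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper: total parameter count for a GPT-2 style model.
// V = vocabulary size, maxT = maximum sequence length,
// L = number of transformer layers, C = channels (embedding width).
std::size_t gpt2_param_count(std::size_t V, std::size_t maxT,
                             std::size_t L, std::size_t C) {
    std::size_t total = 0;
    total += V * C;      // token embedding table (wte)
    total += maxT * C;   // position embedding table (wpe)

    std::size_t per_layer = 0;
    per_layer += 2 * C;              // layernorm 1 weight and bias
    per_layer += 3 * C * C + 3 * C;  // fused QKV projection weight and bias
    per_layer += C * C + C;          // attention output projection
    per_layer += 2 * C;              // layernorm 2 weight and bias
    per_layer += 4 * C * C + 4 * C;  // MLP up-projection
    per_layer += 4 * C * C + C;      // MLP down-projection
    total += L * per_layer;

    total += 2 * C;      // final layernorm weight and bias
    return total;
}
```

Under those assumptions, `gpt2_param_count(50257, 1024, 12, 768)` evaluates to 124,439,808, of which about 38.6M belong to the token embedding table alone.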

The Study Approach

The study follows two tracks in parallel. The first is reading Karpathy’s train_gpt2.c line by line, adding annotations that explain the tensor shapes, the flat-index formulas, and why each intermediate value is cached for the backward pass. The second is re-implementing each kernel from scratch — matmul.cpp, layernorm.cpp, softmax.cpp — with main() functions that print expected outputs so correctness can be verified by eye.

1. Read the annotated source

Start with train_gpt2_annotated.c. Every function has a comment block describing its tensor shapes, the math it performs, and how it maps to the GPT-2 architecture diagram.

2. Implement each kernel standalone

Each of matmul.cpp, layernorm.cpp, and softmax.cpp is a self-contained file with a main() that exercises the kernel and prints results you can check by hand.

3. Build the pybind11 bridge

Once a kernel is correct in C++, tensor-primitives/primitives.cpp shows how to expose it to Python via pybind11 so it can be called from a training loop or compared against PyTorch.

4. Port to CUDA

With a working CPU kernel and a clear understanding of the flat-index math, writing the CUDA version is a direct translation: replace loops with thread indices.
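One way to prepare for that translation while still on the CPU is to factor the loop body into a function of the output coordinates, so the only thing left to change is where those coordinates come from. The sketch below is plain C++, not CUDA, and both function names are hypothetical. It assumes one GPU thread would compute one output element and that the weight is stored as (OC, C).

```cpp
#include <cstddef>

// The work one GPU thread would do: compute a single output element (n, o).
// In a CUDA kernel, n and o would be derived from block and thread indices
// instead of being passed in by a loop.
inline void matmul_element(float* out, const float* inp, const float* weight,
                           std::size_t n, std::size_t o, std::size_t C,
                           std::size_t OC) {
    float acc = 0.0f;
    for (std::size_t c = 0; c < C; ++c) {
        acc += inp[n * C + c] * weight[o * C + c];
    }
    out[n * OC + o] = acc;
}

// CPU stand-in for a kernel launch: the two outer loops enumerate the
// coordinates that a 2D grid of threads would cover in parallel.
void matmul_launch_cpu(float* out, const float* inp, const float* weight,
                       std::size_t N, std::size_t C, std::size_t OC) {
    for (std::size_t n = 0; n < N; ++n) {
        for (std::size_t o = 0; o < OC; ++o) {
            matmul_element(out, inp, weight, n, o, C, OC);
        }
    }
}
```

Because each call writes a distinct output element and reads only from the inputs, the calls are independent of one another, which is the property that lets the outer loops be replaced by parallel threads.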

The pybind11 Bridge

tensor-primitives/primitives.cpp is a starter file that shows how to wrap a C++ function with pybind11 and expose it as a Python module. The CMakeLists.txt in the same directory uses pybind11_add_module to build a shared library called myprims.
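
#include <pybind11/pybind11.h>
namespace py = pybind11;

// Placeholder function; real kernels (matmul, layernorm, softmax) would replace it.
int add(int i, int j) {
    return i + j;
}

// Defines the Python module "myprims"; the name must match the CMake target.
PYBIND11_MODULE(myprims, m) {
    m.def("add", &add, "A function that adds two numbers");
}
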
This starter currently wraps a trivial add function. The intent is to replace it with real kernel implementations — matmul, layernorm, softmax — so they can be imported and benchmarked directly from Python against PyTorch equivalents.
