Most machine learning engineers never see what happens beneath the framework. PyTorch, NumPy, and CUDA libraries abstract away the flat arrays, index arithmetic, memory layout decisions, and kernel scheduling that make neural networks actually run. This project is a structured attempt to remove every one of those abstractions, starting from the very basics of C++ and working up through GPU programming to a complete transformer forward pass written in pure C++, with no frameworks, no autograd, and no hidden allocations. The result is four interconnected learning layers: a C++ foundation that covers everything from primitive types to C++20 concepts and CMake; a hands-on study of Karpathy’s llm.c that reimplements the core LLM kernels in C++; a GPU architecture phase grounded in the PMPP textbook and CUDA Mode lectures; and a final C++ transformer port that reproduces every component of the architecture (embeddings, positional encoding, multi-head attention, feedforward blocks, encoder, decoder, and the full forward pass) in plain code you can read and compile yourself.
Why write a transformer in C++?
When you write nn.MultiheadAttention in PyTorch, the framework handles QKV projection, scaled dot-product attention, causal masking, output projection, and memory management in a single call. When you write it in C++, you are forced to answer questions that frameworks quietly answer for you: How do you index into a flat 1D array as if it were a 3D tensor? Why does matmul use a local accumulator before writing to memory? How do you implement numerically stable softmax without a library? What does a causal mask actually look like in memory?
Working through these questions from scratch builds intuitions that transfer directly to GPU kernel writing, quantization, and inference optimization — the skills that distinguish ML systems engineers from ML practitioners.
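Two of the questions above, flat indexing and the local accumulator, can be sketched in a few lines. This is a minimal illustration and not the project's actual code: the names idx3 and matmul, the row-major layout, and the [B, T, C] shape convention are assumptions made for the example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: flat offset of element (b, t, c) in a row-major
// tensor of shape [B, T, C]. Only the inner extents T and C are needed.
inline std::size_t idx3(std::size_t b, std::size_t t, std::size_t c,
                        std::size_t T, std::size_t C) {
    return (b * T + t) * C + c;
}

// Hypothetical naive matmul: out[M x N] = a[M x K] * b[K x N],
// with all three matrices stored as row-major flat arrays.
void matmul(const std::vector<float>& a, const std::vector<float>& b,
            std::vector<float>& out,
            std::size_t M, std::size_t K, std::size_t N) {
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            // Local accumulator: the inner loop updates a local variable
            // instead of reading and writing out[i * N + j] on each step.
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += a[i * K + k] * b[k * N + j];
            }
            out[i * N + j] = acc;  // single store per output element
        }
    }
}
```

With T = 4 and C = 5, idx3(1, 2, 3, 4, 5) evaluates to (1 * 4 + 2) * 5 + 3 = 33. The accumulator matters because a compiler can usually keep a local float in a register, whereas it cannot always prove that writes to out do not alias the inputs.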
The four learning layers
C++ Core
Language fundamentals from scratch: types, pointers, smart pointers, move semantics, C++20 concepts, STL containers, and CMake for multi-file builds. Weeks 1 and 2 of the roadmap.
LLM Kernels in C++
Studying Karpathy’s llm.c and reimplementing the core kernels — matmul, layernorm, softmax — in standalone C++ files. Includes a fully annotated GPT-2 training loop.
GPU Fundamentals
Reading PMPP chapters 1–3, following CUDA Mode lectures 1–3, setting up nvcc on an RTX 4070, and writing hello-world and vector-add CUDA kernels from scratch.
Transformer in C++
Porting a complete PyTorch transformer to pure C++: 12 components from token embeddings to the full encoder-decoder stack, using only flat arrays and index math.
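Among the kernels named in the layers above, softmax combined with a causal mask is compact enough to sketch here. The following is one possible formulation under stated assumptions, not the project's implementation: the function name causal_softmax and the row-major [T x T] score layout are hypothetical, and masking is done by skipping future columns instead of filling them with negative infinity.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical in-place causal softmax over a row-major [T x T] score matrix.
// Row t may attend only to columns 0..t; later columns end up as exactly 0.
void causal_softmax(std::vector<float>& scores, std::size_t T) {
    for (std::size_t t = 0; t < T; ++t) {
        float* row = scores.data() + t * T;

        // Find the maximum over the visible columns. Subtracting it means
        // exp() only ever sees inputs <= 0, so it cannot overflow.
        float max_val = row[0];
        for (std::size_t j = 1; j <= t; ++j) {
            if (row[j] > max_val) max_val = row[j];
        }

        // Exponentiate the shifted scores and accumulate the normalizer.
        float sum = 0.0f;
        for (std::size_t j = 0; j <= t; ++j) {
            row[j] = std::exp(row[j] - max_val);
            sum += row[j];
        }

        // Normalize the visible part; zero out the masked (future) part.
        for (std::size_t j = 0; j <= t; ++j) row[j] /= sum;
        for (std::size_t j = t + 1; j < T; ++j) row[j] = 0.0f;
    }
}
```

In memory, the mask is simply the lower triangle of each [T x T] block: everything above the diagonal is zero after the call. Scores as large as 1000 are safe here because only differences from the row maximum are exponentiated.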
What each layer builds toward
The four phases are intentionally sequential. The C++ core material provides the memory model and language tools you need to write correct, non-leaking code. The llm.c study phase applies those tools to real ML kernels, so you understand the computational patterns before you parallelize them. The GPU fundamentals phase teaches the CUDA execution model (threads, blocks, grids, and the memory hierarchy) that makes parallelization possible. The transformer phase assembles everything: a working forward pass you can step through in a debugger, where every allocation is explicit and every index is yours to reason about.
Every module in this project is a set of standalone files you compile and run yourself. There are no Jupyter notebooks, no package installs beyond a C++ compiler and optionally nvcc, and no hidden state. Each file is self-contained and focused on one concept. The C++ core files compile with a single g++ -std=c++20 invocation; the transformer components build together into one executable you can inspect with gdb or lldb.
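The thread, block, and grid hierarchy mentioned in the phase overview comes down to index arithmetic that can be tried on the CPU before touching nvcc. The sketch below emulates a 1D vector-add launch in plain C++. It is an illustration only: the function name vector_add_emulated is hypothetical, the loop variables merely mirror CUDA's blockIdx.x, blockDim.x, and threadIdx.x, and the loops run sequentially where a GPU would run the threads in parallel.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU emulation of a 1D CUDA vector-add launch.
// Assumes a, b, and c have the same length and block_dim > 0.
// Returns the number of blocks the emulated grid used.
std::size_t vector_add_emulated(const std::vector<float>& a,
                                const std::vector<float>& b,
                                std::vector<float>& c,
                                std::size_t block_dim) {
    const std::size_t n = c.size();
    // Round up so the grid covers all n elements.
    const std::size_t grid_dim = (n + block_dim - 1) / block_dim;

    for (std::size_t block_idx = 0; block_idx < grid_dim; ++block_idx) {
        for (std::size_t thread_idx = 0; thread_idx < block_dim; ++thread_idx) {
            // Same formula a CUDA kernel uses for its global index:
            // blockIdx.x * blockDim.x + threadIdx.x
            const std::size_t i = block_idx * block_dim + thread_idx;
            // Bounds guard: the last block may have threads past the end.
            if (i < n) {
                c[i] = a[i] + b[i];
            }
        }
    }
    return grid_dim;
}
```

For five elements and a block size of four, the emulated grid has two blocks, and three threads in the second block fail the bounds check and do nothing. In a real kernel the body of the inner loop is the kernel function, and the two outer loops are replaced by the launch configuration.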