C++ GPU Inference: Systems ML from First Principles

Most machine learning engineers never see what happens beneath the framework. PyTorch, NumPy, and the CUDA libraries abstract away the flat arrays, index arithmetic, memory layout decisions, and kernel scheduling that make neural networks actually run. This project is a structured attempt to remove every one of those abstractions. It starts from the basics of C++ and works up through GPU programming to a complete transformer forward pass written in pure C++, with no frameworks, no autograd, and no hidden allocations.

The result is four interconnected learning layers. The first is a C++ foundation that covers everything from primitive types to C++20 concepts and CMake. The second is a hands-on study of Karpathy’s llm.c that reimplements the core LLM kernels in C++. The third is a GPU architecture phase grounded in the PMPP textbook (Programming Massively Parallel Processors) and the CUDA Mode lectures. The fourth is a C++ transformer port that reproduces every component of the architecture (embeddings, positional encoding, multi-head attention, feedforward blocks, encoder, decoder, and the full forward pass) in plain code you can read and compile yourself.

Why write a transformer in C++?

When you call nn.MultiheadAttention in PyTorch, the framework handles QKV projection, scaled dot-product attention, masking (including a causal mask when you supply one), output projection, and memory management in a single call. When you write it in C++, you are forced to answer questions that frameworks quietly answer for you: How do you index into a flat 1D array as if it were a 3D tensor? Why does matmul use a local accumulator before writing to memory? How do you implement numerically stable softmax without a library? What does a causal mask actually look like in memory? Working through these questions from scratch builds intuitions that transfer directly to GPU kernel writing, quantization, and inference optimization: the skills that distinguish ML systems engineers from ML practitioners.
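To make two of those questions concrete, here is a minimal sketch of flat-array indexing and a causally masked, numerically stable softmax. The function names idx3 and causal_softmax, the row-major [B, T, C] layout, and the choice to store masked positions as zeros after normalization are all assumptions made for illustration; they are not taken from the project's source files.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Offset of element (b, t, c) in a row-major [B, T, C] tensor stored as one
// flat array. Only the inner extents T and C are needed to compute it.
// (Hypothetical helper name, assumed layout.)
inline std::size_t idx3(std::size_t b, std::size_t t, std::size_t c,
                        std::size_t T, std::size_t C) {
    assert(t < T && c < C);
    return (b * T + t) * C + c;
}

// In-place causal softmax over a [T, T] score matrix (row = query position,
// column = key position). Columns with col > row are "in the future": they
// are skipped during normalization and written back as probability 0.
inline void causal_softmax(std::vector<float>& scores, std::size_t T) {
    assert(scores.size() == T * T);
    for (std::size_t row = 0; row < T; ++row) {
        float* r = scores.data() + row * T;

        // 1) Maximum over the visible positions. Subtracting it before
        //    exponentiating keeps exp() from overflowing on large scores.
        float max_val = r[0];
        for (std::size_t col = 1; col <= row; ++col) {
            if (r[col] > max_val) max_val = r[col];
        }

        // 2) Exponentiate the visible positions and sum them.
        float sum = 0.0f;
        for (std::size_t col = 0; col <= row; ++col) {
            r[col] = std::exp(r[col] - max_val);
            sum += r[col];
        }

        // 3) Normalize the visible positions and zero the masked ones.
        for (std::size_t col = 0; col < T; ++col) {
            r[col] = (col <= row) ? r[col] / sum : 0.0f;
        }
    }
}
```

In this sketch the causal mask is never materialized as a separate array: it is just the loop bound col <= row, which leaves a lower-triangular matrix of probabilities in memory. An alternative with the same result is to write a very large negative value into the masked positions before a plain softmax.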

The four learning layers

C++ Core

Language fundamentals from scratch: types, pointers, smart pointers, move semantics, C++20 concepts, STL containers, and CMake for multi-file builds. Weeks 1 and 2 of the roadmap.

LLM Kernels in C++

Studying Karpathy’s llm.c and reimplementing the core kernels — matmul, layernorm, softmax — in standalone C++ files. Includes a fully annotated GPT-2 training loop.

GPU Fundamentals

Reading PMPP chapters 1–3, following CUDA Mode lectures 1–3, setting up nvcc on an RTX 4070, and writing hello-world and vector-add CUDA kernels from scratch.

Transformer in C++

Porting a complete PyTorch transformer to pure C++: 12 components from token embeddings to the full encoder-decoder stack, using only flat arrays and index math.
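As a rough illustration of what "flat arrays and index math" means for the kernels named above, the following sketch shows a naive matmul with a local accumulator and a single-row layernorm. The function names, signatures, row-major layouts, and the eps default of 1e-5 are assumptions for this example, not the project's actual code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// out[M, N] = a[M, K] * b[K, N], with every matrix stored row-major in a
// flat array. (Hypothetical signature.)
inline void matmul(const std::vector<float>& a, const std::vector<float>& b,
                   std::vector<float>& out,
                   std::size_t M, std::size_t K, std::size_t N) {
    assert(a.size() == M * K && b.size() == K * N && out.size() == M * N);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            // Local accumulator: the running sum can stay in a register, and
            // the output element is stored exactly once after the inner loop.
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += a[i * K + k] * b[k * N + j];
            }
            out[i * N + j] = acc;
        }
    }
}

// Layer normalization of one row x[0..C): shift to zero mean, scale to unit
// variance, then apply a per-channel gain (gamma) and bias (beta).
// eps guards against division by zero; its default here is an assumption.
inline void layernorm_row(const float* x, const float* gamma, const float* beta,
                          float* out, std::size_t C, float eps = 1e-5f) {
    assert(C > 0);
    const float n = static_cast<float>(C);

    float mean = 0.0f;
    for (std::size_t i = 0; i < C; ++i) mean += x[i];
    mean /= n;

    float var = 0.0f;
    for (std::size_t i = 0; i < C; ++i) {
        const float d = x[i] - mean;
        var += d * d;
    }
    var /= n;

    const float rstd = 1.0f / std::sqrt(var + eps);  // reciprocal std-dev
    for (std::size_t i = 0; i < C; ++i) {
        out[i] = (x[i] - mean) * rstd * gamma[i] + beta[i];
    }
}
```

One common reason for the local accumulator is that accumulating directly into out[i * N + j] can force a load and a store on every inner-loop iteration when the compiler cannot prove the output does not alias the inputs. The same habit carries over to CUDA kernels, where each thread typically accumulates in a register and writes to global memory once.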

What each layer builds toward

The four phases are intentionally sequential. The C++ core material provides the memory model and language tools you need to write correct, non-leaking code. The llm.c study phase applies those tools to real ML kernels, so you understand the computational patterns before you parallelize them. The GPU fundamentals phase teaches the CUDA execution model — threads, blocks, grids, and memory hierarchy — that makes parallelization possible. The transformer phase assembles everything: a working forward pass you can step through in a debugger, where every allocation is explicit and every index is yours to reason about.

Every module in this project is a set of standalone files you compile and run yourself. There are no Jupyter notebooks, no package installs beyond a C++ compiler and optionally nvcc, and no hidden state. Each file is self-contained and focused on one concept. The C++ core files compile with a single g++ -std=c++20 invocation; the transformer components build together into one executable you can inspect with gdb or lldb.

Who this is for

This project is useful if you already write ML code in Python and want to understand what is happening under the hood — not as a theoretical exercise, but through working implementations you can read and modify. It is also useful if you are preparing to work on inference engines, custom CUDA kernels, or quantized model serving, where the ability to reason about memory layout and execution cost is a practical requirement. No prior C++ experience is assumed, but comfort with basic programming concepts and some exposure to transformer architecture will help you move faster.
