C++ GPU Inference: Systems Programming for ML

C++ GPU Inference is a structured engineering study covering the full stack from C++ language fundamentals through GPU-accelerated ML inference. It is written in pure C and C++ with no frameworks, no autograd, and no abstractions hiding the underlying math. The project is organized into four progressively deeper layers: C++ language mastery, LLM kernel implementations in C (studying Karpathy’s llm.c), GPU architecture and CUDA foundations, and a complete transformer built from scratch, including multi-head attention, encoder/decoder blocks, and the full forward pass.

Introduction

Understand what this project covers, the motivation behind it, and how the four learning layers connect.

Roadmap

See the full week-by-week progression from C++ basics to GPU inference.

C++ Core

Start with C++ fundamentals: types, pointers, smart pointers, move semantics, and C++20 concepts.

LLM Kernels

Implement matmul, layernorm, softmax, and embeddings in pure C — the building blocks of every LLM.

GPU Fundamentals

Learn the CUDA execution model: kernels, threads, blocks, grids, and the GPU memory hierarchy.

Transformer in C++

Walk through the complete transformer forward pass — attention, FFN, encoder, decoder — all in C++.

What’s inside

Master C++ Systems Programming

Work through standalone programs covering the core low-level concepts: pointers and references, struct padding and alignment, smart pointers with RAII semantics, move constructors, C++20 concepts, and the STL. Multi-file projects are built with CMake.

Implement LLM Kernels from Scratch

Write the core building blocks that every language model depends on — matrix multiplication with flat array indexing, layer normalization with mean/variance/rstd, numerically stable softmax, token embeddings, and positional encoding — all in pure C/C++.
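The kernels named above can be sketched in a few lines each. This is an illustrative version, not the project's or llm.c's code: the function names and signatures are assumptions, and the matmul uses a plain row-major A (M×K) times B (K×N) layout chosen for clarity.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// out[m, n] = sum over k of a[m, k] * b[k, n], with all matrices stored as
// row-major flat arrays: element (r, c) of an R x C matrix lives at r * C + c.
void matmul(const float* a, const float* b, float* out,
            std::size_t M, std::size_t K, std::size_t N) {
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += a[m * K + k] * b[k * N + n];
            }
            out[m * N + n] = acc;
        }
    }
}

// Layer normalization over one row of C features:
// mean, then variance, then rstd = 1 / sqrt(var + eps).
void layernorm(const float* x, const float* gamma, const float* beta,
               float* out, std::size_t C, float eps = 1e-5f) {
    assert(C > 0);
    float mean = 0.0f;
    for (std::size_t i = 0; i < C; ++i) mean += x[i];
    mean /= static_cast<float>(C);

    float var = 0.0f;
    for (std::size_t i = 0; i < C; ++i) {
        const float d = x[i] - mean;
        var += d * d;
    }
    var /= static_cast<float>(C);

    const float rstd = 1.0f / std::sqrt(var + eps);
    for (std::size_t i = 0; i < C; ++i) {
        out[i] = (x[i] - mean) * rstd * gamma[i] + beta[i];
    }
}

// Numerically stable softmax: subtracting the maximum before exponentiating
// keeps exp() from overflowing without changing the result.
void softmax(const float* x, float* out, std::size_t n) {
    assert(n > 0);
    float maxv = x[0];
    for (std::size_t i = 1; i < n; ++i) {
        if (x[i] > maxv) maxv = x[i];
    }
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = std::exp(x[i] - maxv);
        sum += out[i];
    }
    for (std::size_t i = 0; i < n; ++i) out[i] /= sum;
}
```

Without the max subtraction, inputs such as 1000.0 would overflow `exp` in single precision and produce NaN after the division; with it, the largest exponent is always 0.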

Understand GPU Architecture

Study the CUDA execution model from first principles: why GPUs exist, how kernels map to threads and blocks, and how global/shared/register memory behaves under real workloads.
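Since the code in this project runs on CPU, the way a kernel maps onto threads and blocks can be illustrated with a plain C++ emulation. The sketch below is an assumption-laden teaching aid, not CUDA code: `vec_add_thread` and `launch_vec_add` are hypothetical names, and the nested loops run sequentially where a GPU would run the threads in parallel.

```cpp
#include <cassert>
#include <cstddef>

// Body of a hypothetical vector-add kernel, written as an ordinary function.
// The parameters stand in for CUDA's blockIdx.x, blockDim.x, and threadIdx.x.
void vec_add_thread(std::size_t block_idx, std::size_t block_dim,
                    std::size_t thread_idx,
                    const float* a, const float* b, float* c, std::size_t n) {
    // Each thread derives one global element index from its coordinates.
    const std::size_t i = block_idx * block_dim + thread_idx;
    // The last block may have threads past the end of the data, so guard.
    if (i < n) c[i] = a[i] + b[i];
}

// Emulates a 1D launch: enough blocks to cover n elements, each with
// block_dim threads. Returns the number of blocks used.
std::size_t launch_vec_add(const float* a, const float* b, float* c,
                           std::size_t n, std::size_t block_dim) {
    assert(block_dim > 0);
    const std::size_t grid_dim = (n + block_dim - 1) / block_dim;  // ceil(n / block_dim)
    for (std::size_t block = 0; block < grid_dim; ++block) {
        for (std::size_t thread = 0; thread < block_dim; ++thread) {
            vec_add_thread(block, block_dim, thread, a, b, c, n);
        }
    }
    return grid_dim;
}
```

With 10 elements and 4 threads per block, the launch uses 3 blocks and 12 threads in total; the bounds check makes the 2 surplus threads do nothing.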

Build a Complete Transformer

Assemble every component — embeddings, positional encoding, multi-head attention (with causal masking and cross-attention), feed-forward networks, residual connections, layer norm, encoder and decoder blocks — into a full transformer forward pass verified against a PyTorch reference.
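To make the causal masking step concrete, here is a minimal single-head sketch of scaled dot-product attention. It is an illustration under stated assumptions rather than the project's implementation: `causal_attention` is a hypothetical name, the inputs are already-projected Q, K, and V matrices of shape (T, d) stored row-major, and there is no batching or multi-head splitting.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// q, k, v, out: (T, d) row-major. Position t attends only to positions s <= t.
void causal_attention(const float* q, const float* k, const float* v,
                      float* out, std::size_t T, std::size_t d) {
    assert(T > 0 && d > 0);
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));
    std::vector<float> scores(T);

    for (std::size_t t = 0; t < T; ++t) {
        // Scaled dot products against keys 0..t. Skipping s > t has the same
        // effect as setting those scores to -infinity before the softmax.
        for (std::size_t s = 0; s <= t; ++s) {
            float dot = 0.0f;
            for (std::size_t j = 0; j < d; ++j) {
                dot += q[t * d + j] * k[s * d + j];
            }
            scores[s] = dot * scale;
        }

        // Numerically stable softmax over the unmasked scores.
        float maxv = scores[0];
        for (std::size_t s = 1; s <= t; ++s) {
            if (scores[s] > maxv) maxv = scores[s];
        }
        float sum = 0.0f;
        for (std::size_t s = 0; s <= t; ++s) {
            scores[s] = std::exp(scores[s] - maxv);
            sum += scores[s];
        }

        // Output row t is the attention-weighted sum of value rows 0..t.
        for (std::size_t j = 0; j < d; ++j) {
            float acc = 0.0f;
            for (std::size_t s = 0; s <= t; ++s) {
                acc += (scores[s] / sum) * v[s * d + j];
            }
            out[t * d + j] = acc;
        }
    }
}
```

A quick sanity check is that the first output row always equals the first value row, because position 0 can only attend to itself.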

All code in this project runs on CPU unless noted. GPU kernels are studied conceptually via PMPP (Programming Massively Parallel Processors) and the CUDA Mode lectures; the transformer implementation is a clean CPU reference designed to make the math explicit before moving to CUDA.
