C++ GPU Inference is a structured engineering study covering the full stack, from C++ language fundamentals through GPU-accelerated ML inference. It is written in pure C and C++ with no frameworks, no autograd, and no abstractions hiding the underlying math. The project is organized into four progressively deeper layers: C++ language mastery, LLM kernel implementations in C (studying Karpathy’s llm.c), GPU architecture and CUDA foundations, and finally a complete transformer built from scratch, including multi-head attention, encoder/decoder blocks, and the full forward pass.
Introduction
Understand what this project covers, the motivation behind it, and how the four learning layers connect.
Roadmap
See the full week-by-week progression from C++ basics to GPU inference.
C++ Core
Start with C++ fundamentals: types, pointers, smart pointers, move semantics, and C++20 concepts.
LLM Kernels
Implement matmul, layernorm, softmax, and embeddings in pure C — the building blocks of every LLM.
GPU Fundamentals
Learn the CUDA execution model: kernels, threads, blocks, grids, and the GPU memory hierarchy.
Transformer in C++
Walk through the complete transformer forward pass — attention, FFN, encoder, decoder — all in C++.
What’s inside
Master C++ Systems Programming
Work through standalone programs covering the core low-level concepts: pointers and references, struct padding and alignment, smart pointers with RAII semantics, move constructors, C++20 concepts, and the STL. Multi-file projects are built with CMake.
Implement LLM Kernels from Scratch
Write the core building blocks that every language model depends on — matrix multiplication with flat array indexing, layer normalization with mean/variance/rstd, numerically stable softmax, token embeddings, and positional encoding — all in pure C/C++.
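To make three of these kernels concrete, here is a minimal sketch of row-major matmul, single-row layer norm, and numerically stable softmax. The function names, signatures, and the default `eps` are assumptions chosen for illustration and are not taken from the project or from llm.c.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Illustrative names and signatures (assumed, not the project's API).

// C = A * B with A (m x k), B (k x n), C (m x n), all row-major flat arrays.
// Element (i, j) of a matrix with `cols` columns lives at index i * cols + j.
void matmul(const float* a, const float* b, float* c,
            std::size_t m, std::size_t k, std::size_t n) {
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

// Layer norm over one row of length n:
//   out[i] = (x[i] - mean) * rstd * gamma[i] + beta[i]
// where rstd = 1 / sqrt(variance + eps) and the variance is the biased one.
void layernorm_row(const float* x, const float* gamma, const float* beta,
                   float* out, std::size_t n, float eps = 1e-5f) {
    assert(n > 0);
    float mean = 0.0f;
    for (std::size_t i = 0; i < n; ++i) mean += x[i];
    mean /= static_cast<float>(n);

    float var = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float d = x[i] - mean;
        var += d * d;
    }
    var /= static_cast<float>(n);

    const float rstd = 1.0f / std::sqrt(var + eps);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = (x[i] - mean) * rstd * gamma[i] + beta[i];
    }
}

// In-place softmax over one row. Subtracting the row maximum before exp()
// keeps every exponent <= 0, so large logits cannot overflow.
void softmax_row(float* x, std::size_t n) {
    assert(n > 0);
    float max_val = x[0];
    for (std::size_t i = 1; i < n; ++i) {
        if (x[i] > max_val) max_val = x[i];
    }
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        x[i] = std::exp(x[i] - max_val);
        sum += x[i];
    }
    for (std::size_t i = 0; i < n; ++i) x[i] /= sum;
}
```

Calling `softmax_row` on logits such as `{1000, 1001, 1002}` stays finite because of the max subtraction; a naive `exp(x[i])` would overflow a `float` at those magnitudes.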
Understand GPU Architecture
Study the CUDA execution model from first principles: why GPUs exist, how kernels map to threads and blocks, and how global/shared/register memory behaves under real workloads.
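As a rough illustration of how a 1D launch maps work onto blocks and threads, the following plain C++ sketch emulates the index arithmetic on the CPU. It is not CUDA code and runs sequentially; the function names are hypothetical, and the loop variables stand in for CUDA's built-in `blockIdx.x`, `blockDim.x`, and `threadIdx.x`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of blocks needed to cover n elements (ceiling division),
// mirroring the usual grid-size calculation for a 1D launch.
std::size_t blocks_needed(std::size_t n, std::size_t block_dim) {
    assert(block_dim > 0);
    return (n + block_dim - 1) / block_dim;
}

// Hypothetical CPU stand-in for a 1D kernel launch: every (block, thread)
// pair runs the same body and derives its own global element index.
void vector_add_emulated(const std::vector<float>& a,
                         const std::vector<float>& b,
                         std::vector<float>& out,
                         std::size_t block_dim) {
    assert(a.size() == b.size() && out.size() == a.size());
    const std::size_t n = a.size();
    const std::size_t grid_dim = blocks_needed(n, block_dim);

    for (std::size_t block_idx = 0; block_idx < grid_dim; ++block_idx) {
        for (std::size_t thread_idx = 0; thread_idx < block_dim; ++thread_idx) {
            // Same formula a CUDA thread would use for its global index.
            const std::size_t i = block_idx * block_dim + thread_idx;
            // Bounds guard: the last block may have threads past the end.
            if (i < n) {
                out[i] = a[i] + b[i];
            }
        }
    }
}
```

With 10 elements and a block size of 4, three blocks are launched and the last two threads of the final block fail the bounds check, which is why the guard is needed whenever the element count is not a multiple of the block size.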
Build a Complete Transformer
Assemble every component — embeddings, positional encoding, multi-head attention (with causal masking and cross-attention), feed-forward networks, residual connections, layer norm, encoder and decoder blocks — into a full transformer forward pass verified against a PyTorch reference.
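The causal masking step can be sketched for a single attention head as follows. This is a minimal illustration under assumed names and a row-major `(seq_len x d)` layout, not the project's implementation; it omits the learned projections, multiple heads, and batching.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Single-head scaled dot-product attention with a causal mask.
// q, k, v, out are row-major (seq_len x d) flat arrays (assumed layout).
// Position t attends only to positions 0..t. Skipping later positions is
// equivalent to setting their scores to -infinity before the softmax.
void causal_attention(const float* q, const float* k, const float* v,
                      float* out, std::size_t seq_len, std::size_t d) {
    assert(seq_len > 0 && d > 0);
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));
    std::vector<float> scores(seq_len);

    for (std::size_t t = 0; t < seq_len; ++t) {
        // Scaled dot products between query t and keys 0..t.
        for (std::size_t s = 0; s <= t; ++s) {
            float dot = 0.0f;
            for (std::size_t j = 0; j < d; ++j) {
                dot += q[t * d + j] * k[s * d + j];
            }
            scores[s] = dot * scale;
        }

        // Numerically stable softmax over the unmasked scores.
        float max_score = scores[0];
        for (std::size_t s = 1; s <= t; ++s) {
            if (scores[s] > max_score) max_score = scores[s];
        }
        float sum = 0.0f;
        for (std::size_t s = 0; s <= t; ++s) {
            scores[s] = std::exp(scores[s] - max_score);
            sum += scores[s];
        }

        // Output row t is the attention-weighted sum of value rows 0..t.
        for (std::size_t j = 0; j < d; ++j) {
            float acc = 0.0f;
            for (std::size_t s = 0; s <= t; ++s) {
                acc += (scores[s] / sum) * v[s * d + j];
            }
            out[t * d + j] = acc;
        }
    }
}
```

Because the first position can only attend to itself, `out` row 0 always equals `v` row 0, which makes a convenient sanity check when comparing against a reference implementation.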
All code in this project runs on CPU unless noted. GPU kernels are studied conceptually via PMPP (Programming Massively Parallel Processors) and the CUDA Mode lectures; the transformer implementation is a clean CPU reference designed to make the math explicit before moving to CUDA.