This roadmap lays out the full progression of the project in the order it was built and in the order you should follow it. Each phase has a concrete scope (specific topics, specific files, specific things you can compile and run), so that progress is always measurable and the next step is always clear. The four phases build on each other: the C++ skills from Phase 1 are the tools you use in Phase 2, the kernel patterns from Phase 2 are what you eventually parallelize in Phase 3, and Phase 4 assembles the full architecture using everything that came before.
Phase 1 — C++ Core (Weeks 1 & 2)
The foundation phase covers the C++ language features you need to write safe, efficient systems code. Every topic is its own standalone `.cpp` file that you compile and run independently.

Language fundamentals covered:
- I/O with `cin`, `cout`, `getline`, and `printf`
- Primitive types and their sizes: `int`, `char`, `short`, `long`, `long long`, inspected with `sizeof`
- Arrays, iteration (`for`, range-based `for`, `while`), functions, and header files with `#ifndef` include guards
- Pointers (`*`, `&`), dereferencing, address printing with `%p`, and references as aliases
- Structs with `const` fields; distinction between `const int` and `const char*`
- Smart pointers from `<memory>`: `unique_ptr`, `shared_ptr`, `weak_ptr`, `make_unique`, `make_shared`
- Ownership transfer with `std::move`, reference counting with `use_count()`, scope-based deallocation
- Move semantics: custom `String` class, deep copy vs. move constructor, lvalue vs. rvalue
- Function templates and class templates
- C++20 Concepts: the `requires` expression, `std::integral`, `std::floating_point`
- Custom concepts: `Numeric` and `Addable` built with `requires` expressions
- STL: `vector`, `unordered_map`, `<algorithm>` (`sort`, `count_if`), lambdas as comparators
- Memory layout: struct padding, `alignof`, `alignas`, why member order changes struct size
- CMake for multi-file builds: `CMakeLists.txt` sets the C++20 standard and defines executable targets
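As one illustration of the ownership topics in the list above, here is a minimal sketch (not taken from the project's files) of how `std::move` transfers a `unique_ptr`, how `use_count()` tracks `shared_ptr` owners, and how a `weak_ptr` observes an object without owning it. The `Buffer` type, the `Counts` struct, and the `ownership_demo` function are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical payload type used only for this illustration.
struct Buffer {
    int size;
    explicit Buffer(int n) : size(n) {}
};

// Owner counts observed at three points in time, plus the final weak_ptr state.
struct Counts {
    long one_owner;
    long two_owners;
    long after_scope;
    bool expired;
};

Counts ownership_demo() {
    // unique_ptr: exactly one owner; std::move hands the pointer over.
    std::unique_ptr<Buffer> a = std::make_unique<Buffer>(16);
    std::unique_ptr<Buffer> b = std::move(a);  // a is now null, b owns the Buffer
    assert(a == nullptr && b->size == 16);

    Counts c{};
    std::weak_ptr<Buffer> observer;
    {
        std::shared_ptr<Buffer> s1 = std::make_shared<Buffer>(32);
        c.one_owner = s1.use_count();          // one owner: s1
        {
            std::shared_ptr<Buffer> s2 = s1;   // copying adds a second owner
            c.two_owners = s1.use_count();
        }                                      // s2 leaves scope, count drops
        c.after_scope = s1.use_count();
        observer = s1;                         // weak_ptr does not add an owner
    }                                          // s1 leaves scope: Buffer is freed
    c.expired = observer.expired();
    return c;
}
```

Calling `ownership_demo()` from a `main` function and printing the fields of the returned `Counts` shows the reference count rising and falling with scope, which is the scope-based deallocation the list refers to.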
Phase 2 — LLM Kernels in C (llm.c Study)
This phase studies Karpathy’s `llm.c` and reimplements the core computational kernels that underpin every LLM. Each kernel is a standalone C++ file focused on one operation, written with explicit flat-array indexing and no library dependencies beyond `<math.h>`.

Kernels implemented:
- `matmul.cpp`: matrix multiplication over flat 1D arrays using row-major index math (`A[m*K + k]`, `B[k*N + n]`). Uses a local accumulator (`float val`) to accumulate the dot product before a single write to `out[m*N + n]`. Supports an optional bias vector.
- `layernorm.cpp`: per-token normalization in four passes over a `(B, T, C)` tensor. Pass 1: compute mean. Pass 2: compute variance. Pass 3: compute `rstd = 1 / sqrt(var + eps)`. Pass 4: normalize, scale by `weight[i]`, shift by `bias[i]`. Token pointer computed as `x + b*T*C + t*C`.
- `softmax.cpp`: numerically stable softmax in three passes. Pass 1: find `max_val` over `C` elements. Pass 2: compute `exp(x[i] - max_val)` and accumulate `sum`. Pass 3: divide each element by `sum`. The max subtraction prevents `exp` overflow without changing the output, because the common factor `exp(-max_val)` cancels between numerator and denominator.

`train_gpt2_annotated.c`: a fully annotated reading of Karpathy’s GPT-2 training loop in C. Inline comments explain every non-obvious line: how the weight matrices are laid out, how the forward and backward passes share memory, and how the optimizer step is fused into a single loop over parameters.
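The three kernels described above can be sketched roughly as follows. This is an illustrative reconstruction from the descriptions in this section, not the code from `matmul.cpp`, `layernorm.cpp`, or `softmax.cpp`: the function signatures, the parameter order, the default `eps`, and the convention of passing `nullptr` for a missing bias are all assumptions, and the debug traces are omitted.

```cpp
#include <cassert>
#include <cmath>

// out (M x N) = A (M x K) * B (K x N) [+ bias (N)], all row-major flat arrays.
// Assumed convention: pass bias = nullptr to skip the bias add.
void matmul(float* out, const float* A, const float* B, const float* bias,
            int M, int K, int N) {
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            float val = (bias != nullptr) ? bias[n] : 0.0f;  // local accumulator
            for (int k = 0; k < K; ++k) {
                val += A[m * K + k] * B[k * N + n];
            }
            out[m * N + n] = val;                            // single write
        }
    }
}

// Per-token layer normalization over a (B, T, C) tensor.
void layernorm(float* out, const float* x, const float* weight,
               const float* bias, int B, int T, int C, float eps = 1e-5f) {
    assert(C > 0);
    for (int b = 0; b < B; ++b) {
        for (int t = 0; t < T; ++t) {
            const float* xt = x + b * T * C + t * C;   // pointer to this token
            float* ot = out + b * T * C + t * C;
            float mean = 0.0f;                         // pass 1: mean
            for (int i = 0; i < C; ++i) mean += xt[i];
            mean /= C;
            float var = 0.0f;                          // pass 2: variance
            for (int i = 0; i < C; ++i) {
                const float d = xt[i] - mean;
                var += d * d;
            }
            var /= C;
            const float rstd = 1.0f / std::sqrt(var + eps);  // pass 3: rstd
            for (int i = 0; i < C; ++i) {              // pass 4: normalize, scale, shift
                ot[i] = (xt[i] - mean) * rstd * weight[i] + bias[i];
            }
        }
    }
}

// Numerically stable softmax over C elements, computed in place.
void softmax(float* x, int C) {
    assert(C > 0);
    float max_val = x[0];                              // pass 1: max
    for (int i = 1; i < C; ++i) {
        if (x[i] > max_val) max_val = x[i];
    }
    float sum = 0.0f;                                  // pass 2: exp and sum
    for (int i = 0; i < C; ++i) {
        x[i] = std::exp(x[i] - max_val);
        sum += x[i];
    }
    for (int i = 0; i < C; ++i) x[i] /= sum;           // pass 3: normalize
}
```

A quick way to see why the max subtraction matters is to call `softmax` on two inputs of `1000.0f`: without the subtraction `exp(1000)` overflows to infinity in single precision, while this version returns two equal probabilities.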
The kernel implementations deliberately include the intermediate `cout` debug traces from the learning process. They are left in because they make the execution order visible: you can read what value is being computed at each step by running the file.

Phase 3 — GPU Fundamentals (PMPP + CUDA Mode)

This phase builds the mental model for GPU execution before writing device code. It combines textbook reading, lecture videos, and hands-on CUDA setup on real hardware.

Reading (Programming Massively Parallel Processors, PMPP):
- Chapter 1, Introduction to parallel computing: why CPUs and GPUs have different design philosophies, throughput vs. latency optimization, where GPU acceleration is and isn’t beneficial
- Chapter 2, CUDA execution model: how kernels are launched, the thread → block → grid hierarchy, how the CUDA runtime maps this hierarchy to physical hardware
- Chapter 3, Memory model: global memory, shared memory, and registers; why memory access patterns dominate GPU performance

Lectures (CUDA Mode):
- Lectures 1, 2, and 3: practical CUDA programming patterns, profiling basics, and the relationship between the PMPP execution model and real kernel code

Hands-on:
- CUDA environment setup: `nvcc` compiler, `deviceQuery` utility run on an RTX 4070 to inspect device properties (compute capability, SM count, memory bandwidth)
- Hello-world CUDA kernel: each thread prints its thread ID and block ID to internalize how the execution hierarchy maps to individual work units
- Vector add kernel: first real parallel computation. Each thread computes one output element, and the result is verified against a CPU reference implementation
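A real CUDA kernel needs `nvcc` and a GPU, so the sketch below instead emulates a launch on the CPU in plain C++ to show the index arithmetic behind the vector add exercise. Each (block, thread) pair maps to one global index `i = block * block_dim + thread`, and a bounds check guards the last, partially filled block. The names `vector_add_emulated` and `vector_add_reference` are hypothetical, and the two loops run sequentially here, whereas on a GPU each iteration would be an independent thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference: the result the parallel version is checked against.
std::vector<float> vector_add_reference(const std::vector<float>& a,
                                        const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = a[i] + b[i];
    }
    return out;
}

// Sequential emulation of a 1D grid launch. In CUDA the two loops disappear:
// each thread derives i from blockIdx.x, blockDim.x, and threadIdx.x.
std::vector<float> vector_add_emulated(const std::vector<float>& a,
                                       const std::vector<float>& b,
                                       int block_dim) {
    assert(a.size() == b.size() && block_dim > 0);
    const int n = static_cast<int>(a.size());
    const int grid_dim = (n + block_dim - 1) / block_dim;  // ceil(n / block_dim)
    std::vector<float> out(a.size());
    for (int block = 0; block < grid_dim; ++block) {
        for (int thread = 0; thread < block_dim; ++thread) {
            const int i = block * block_dim + thread;      // global thread index
            if (i < n) {                                   // guard the tail block
                out[i] = a[i] + b[i];
            }
        }
    }
    return out;
}
```

With 10 elements and a block size of 4, the emulated grid has 3 blocks and 12 threads, so the last two threads fall outside the array and are skipped by the `i < n` guard. Comparing the result with `vector_add_reference` mirrors the CPU verification step described above.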
Phase 4 — Transformer in C++ (Full Forward Pass)
The final phase ports a complete transformer from PyTorch to pure C++. Every component is implemented as a standalone function operating on flat float arrays: no matrix class, no autograd, no BLAS. The reference implementation is the author’s own PyTorch transformer at github.com/VrajPatel105/Transformer-Implementation-from-scratch-with-custom-dataset.

Components implemented in `model.cpp`:
- `embeddings_forward`: token lookup table scaled by `sqrt(d_model)`. Given a token ID, fetches its row from the weight matrix: `out[b*T*d_model + t*d_model + row] = weight[curr_token*d_model + row] * scale_factor`
- `positional_encoding`: sinusoidal PE added in-place using the standard formulas `PE[pos][2i] = sin(pos / 10000^(2i/d_model))` and `PE[pos][2i+1] = cos(pos / 10000^(2i/d_model))`
- `matmul`: accumulator-pattern matrix multiplication with optional bias, reused throughout the model
- `layernorm`: per-token normalization (mean, variance, rstd, normalize + scale + shift)
- `softmax`: numerically stable three-pass softmax (max, exp+sum, normalize)
- `attention_forward`: multi-head attention with fused QKV projection. Supports causal masking via a bool flag, and cross-attention with separate key/value inputs for the decoder
- `feedforward_forward`: two matmuls with ReLU activation and 4× hidden dimension expansion
- `residual`: element-wise addition of input and sublayer output
- `projection_forward`: final linear projection from `d_model` to `vocab_size`
- `encoder_block`: self-attention → residual → layernorm → FFN → residual → layernorm
- `decoder_block`: masked self-attention → residual → layernorm → cross-attention → residual → layernorm → FFN → residual → layernorm
- `transformer_block`: full forward pass: source and target embeddings + PE, N encoder blocks, N decoder blocks, final projection, softmax output
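The first two components in the list come with explicit index formulas, so they can be sketched directly. The version below is a hedged illustration rather than the code in `model.cpp`: the function signatures, the `(B, T)` layout of the token array, and the requirement that `d_model` be even are assumptions made for this example, while the index expression and the sine/cosine formulas follow the descriptions above.

```cpp
#include <cassert>
#include <cmath>

// Token embedding lookup scaled by sqrt(d_model).
// tokens: (B, T) token IDs; weight: (vocab_size, d_model); out: (B, T, d_model).
void embeddings_forward(float* out, const int* tokens, const float* weight,
                        int B, int T, int d_model) {
    const float scale_factor = std::sqrt(static_cast<float>(d_model));
    for (int b = 0; b < B; ++b) {
        for (int t = 0; t < T; ++t) {
            const int curr_token = tokens[b * T + t];
            for (int row = 0; row < d_model; ++row) {
                out[b * T * d_model + t * d_model + row] =
                    weight[curr_token * d_model + row] * scale_factor;
            }
        }
    }
}

// Adds sinusoidal positional encodings in place to x of shape (B, T, d_model).
void positional_encoding(float* x, int B, int T, int d_model) {
    assert(d_model % 2 == 0);  // simplifying assumption for this sketch
    for (int b = 0; b < B; ++b) {
        for (int pos = 0; pos < T; ++pos) {
            float* xt = x + b * T * d_model + pos * d_model;
            for (int i = 0; i < d_model / 2; ++i) {
                // Denominator 10000^(2i/d_model), shared by the sin/cos pair.
                const float freq = std::pow(10000.0f, (2.0f * i) / d_model);
                xt[2 * i]     += std::sin(pos / freq);  // even dimension
                xt[2 * i + 1] += std::cos(pos / freq);  // odd dimension
            }
        }
    }
}
```

Running `embeddings_forward` followed by `positional_encoding` on the same buffer produces the input that the first encoder or decoder block would consume. At position 0 the encoding adds 0 to every even dimension and 1 to every odd dimension, which makes a convenient sanity check.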