Learning Roadmap: C++ to GPU-Accelerated Inference

This roadmap lays out the full progression of the project in the order it was built and in the order you should follow it. Each phase has a concrete scope — specific topics, specific files, specific things you can compile and run — so that progress is always measurable and the next step is always clear. The four phases build on each other: the C++ skills from Phase 1 are the tools you use in Phase 2, the kernel patterns from Phase 2 are what you eventually parallelize in Phase 3, and Phase 4 assembles the full architecture using everything that came before.

Phase 1 — C++ Core (Weeks 1 & 2)

The foundation phase covers the C++ language features you need to write safe, efficient systems code. Every topic is its own standalone .cpp file that you compile and run independently.

Language fundamentals covered:

I/O with cin, cout, getline, and printf
Primitive types and their sizes: int, char, short, long, long long — inspected with sizeof
Arrays, iteration (for, range-based for, while), functions, and header files with #ifndef include guards
Pointers (*, &), dereferencing, address printing with %p, and references as aliases
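
As a minimal sketch of the pointer and reference items above: the helper names here (set_via_pointer, set_via_reference, print_address) are illustrative and are not taken from the project's files.

```cpp
#include <cassert>
#include <cstdio>

// Writes through a pointer; the caller passes an address obtained with &.
void set_via_pointer(int* p, int value) {
    assert(p != nullptr);  // a pointer may be null, so check before dereferencing
    *p = value;            // dereference: write to the int that p points at
}

// Writes through a reference; r is an alias for the caller's variable.
void set_via_reference(int& r, int value) {
    r = value;
}

// Prints the address and the value of x (hypothetical helper for illustration).
void print_address(const int& x) {
    std::printf("address=%p value=%d\n", static_cast<const void*>(&x), x);
}
```

Calling set_via_pointer(&x, 5) and set_via_reference(x, 9) both modify the caller's x; the difference is that the pointer version can be handed a null address, while the reference is always bound to an existing object.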

Memory management and ownership:

Structs with const fields; distinction between const int and const char*
Smart pointers from <memory>: unique_ptr, shared_ptr, weak_ptr, make_unique, make_shared
Ownership transfer with std::move, reference counting with use_count(), scope-based deallocation
Move semantics: custom String class, deep copy vs. move constructor, lvalue vs. rvalue
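
The ownership rules above can be sketched roughly as follows. Buffer, consume, and count_inside_scope are hypothetical names introduced for this example only; the project's own files may be organized differently.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical type used only for this sketch.
struct Buffer {
    int size;
    explicit Buffer(int n) : size(n) {}
};

// Takes ownership by value: the caller must std::move its unique_ptr in.
int consume(std::unique_ptr<Buffer> b) {
    assert(b != nullptr);
    return b->size;  // the Buffer is destroyed when b goes out of scope here
}

// Returns the reference count observed while a second owner is alive.
long count_inside_scope(const std::shared_ptr<Buffer>& owner) {
    std::shared_ptr<Buffer> second = owner;  // copying bumps the count
    return owner.use_count();
}  // second is released here, so the count drops back
```

After consume(std::move(u)), the moved-from unique_ptr u is null, which is how ownership transfer becomes visible to the caller. A weak_ptr created from a shared_ptr does not change use_count() and reports expired() once the last shared owner is gone.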

Templates and modern C++:

Function templates and class templates
C++20 Concepts — the requires expression, std::integral, std::floating_point
Custom concepts: Numeric and Addable built with requires expressions
STL: vector, unordered_map, <algorithm> (sort, count_if), lambdas as comparators
Memory layout: struct padding, alignof, alignas, why member order changes struct size

Build system:

CMake for multi-file builds — CMakeLists.txt sets C++20 standard and defines executable targets

# Single file
g++ -std=c++20 filename.cpp -o filename && ./filename

# Multi-file with CMake
mkdir build && cd build
cmake .. -G "MinGW Makefiles"
cmake --build .

Phase 2 — LLM Kernels in C (llm.c Study)

This phase studies Karpathy’s llm.c and reimplements the core computational kernels that underpin every LLM. Each kernel is a standalone C++ file focused on one operation, written with explicit flat-array indexing and no library dependencies beyond <math.h> (apart from <iostream> for the cout debug traces described below).

Kernels implemented:

matmul.cpp — matrix multiplication over flat 1D arrays using row-major index math (A[m*K + k], B[k*N + n]). Uses a local accumulator (float val) to accumulate the dot product before a single write to out[m*N + n]. Supports an optional bias vector.
layernorm.cpp — per-token normalization in four passes over a (B, T, C) tensor. Pass 1: compute mean. Pass 2: compute variance. Pass 3: compute rstd = 1 / sqrt(var + eps). Pass 4: normalize, scale by weight[i], shift by bias[i]. Token pointer computed as x + b*T*C + t*C.
softmax.cpp — numerically stable softmax in three passes. Pass 1: find max_val over C elements. Pass 2: compute exp(x[i] - max_val) and accumulate sum. Pass 3: divide each element by sum. The max subtraction prevents exp overflow without changing the output.
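
The three kernels described above can be sketched as follows. This is a condensed illustration, not a copy of the project's files: the debug traces are omitted, the eps value of 1e-5 is an assumption, and indexing the bias as bias[n] (one value per output column) is also an assumption.

```cpp
#include <cassert>
#include <cmath>

// out[m*N + n] = sum over k of A[m*K + k] * B[k*N + n], plus bias[n] if bias is non-null.
void matmul(float* out, const float* A, const float* B, const float* bias,
            int M, int K, int N) {
    for (int m = 0; m < M; m++) {
        for (int n = 0; n < N; n++) {
            float val = (bias != nullptr) ? bias[n] : 0.0f;  // local accumulator
            for (int k = 0; k < K; k++) {
                val += A[m * K + k] * B[k * N + n];
            }
            out[m * N + n] = val;  // single write per output element
        }
    }
}

// Per-token layernorm over a (B, T, C) tensor stored as a flat array.
void layernorm(float* out, const float* x, const float* weight, const float* bias,
               int B, int T, int C) {
    const float eps = 1e-5f;  // assumed value
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const float* tok = x + b * T * C + t * C;  // start of this token's C values
            float* o = out + b * T * C + t * C;
            float mean = 0.0f;                         // pass 1: mean
            for (int i = 0; i < C; i++) mean += tok[i];
            mean /= C;
            float var = 0.0f;                          // pass 2: variance
            for (int i = 0; i < C; i++) {
                float d = tok[i] - mean;
                var += d * d;
            }
            var /= C;
            float rstd = 1.0f / std::sqrt(var + eps);  // pass 3: reciprocal std
            for (int i = 0; i < C; i++) {              // pass 4: normalize, scale, shift
                o[i] = (tok[i] - mean) * rstd * weight[i] + bias[i];
            }
        }
    }
}

// Numerically stable softmax over C elements.
void softmax(float* out, const float* x, int C) {
    assert(C > 0);
    float max_val = x[0];                              // pass 1: max
    for (int i = 1; i < C; i++) {
        if (x[i] > max_val) max_val = x[i];
    }
    float sum = 0.0f;                                  // pass 2: exp and sum
    for (int i = 0; i < C; i++) {
        out[i] = std::exp(x[i] - max_val);
        sum += out[i];
    }
    for (int i = 0; i < C; i++) out[i] /= sum;         // pass 3: normalize
}
```

Subtracting max_val is safe because softmax is unchanged by adding a constant to every input: the factor exp(-max_val) appears in both the numerator and the sum and cancels. With inputs such as 1000, 1000, 1000, a naive exp(1000) overflows a float, while the shifted version evaluates exp(0) and returns 1/3 for each element.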

GPT-2 training code:

train_gpt2_annotated.c — a fully annotated reading of Karpathy’s GPT-2 training loop in C. Inline comments explain every non-obvious line: how the weight matrices are laid out, how the forward and backward passes share memory, and how the optimizer step is fused into a single loop over parameters.

The kernel implementations deliberately include the intermediate cout debug traces from the learning process. They are left in because they make the execution order visible — you can read what value is being computed at each step by running the file.

Phase 3 — GPU Fundamentals (PMPP + CUDA Mode)

This phase builds the mental model for GPU execution before writing device code. It combines textbook reading, lecture videos, and hands-on CUDA setup on real hardware.

Reading — Programming Massively Parallel Processors (PMPP):

Chapter 1 — Introduction to parallel computing: why CPUs and GPUs have different design philosophies, throughput vs. latency optimization, where GPU acceleration is and isn’t beneficial
Chapter 2 — CUDA execution model: how kernels are launched, the thread → block → grid hierarchy, how the CUDA runtime maps this hierarchy to physical hardware
Chapter 3 — Memory model: global memory, shared memory, and registers; why memory access patterns dominate GPU performance

Lectures — CUDA Mode:

Lectures 1, 2, and 3: practical CUDA programming patterns, profiling basics, and the relationship between the PMPP execution model and real kernel code

Hands-on work:

CUDA environment setup: nvcc compiler, deviceQuery utility run on an RTX 4070 to inspect device properties (compute capability, SM count, memory clock rate and bus width, which together determine memory bandwidth)
Hello-world CUDA kernel: each thread prints its thread ID and block ID to internalize how the execution hierarchy maps to individual work units
Vector add kernel: first real parallel computation — each thread computes one output element, result verified against a CPU reference implementation
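
To make the index math behind the vector add item concrete, here is a CPU-only emulation in plain C++. It is not CUDA code and is not taken from the project: the two outer loops stand in for the blocks and threads that the GPU would run in parallel, and the function names are hypothetical. In a real kernel the global index is computed as blockIdx.x * blockDim.x + threadIdx.x.

```cpp
#include <cassert>

// CPU emulation of a 1D kernel launch with grid_dim blocks of block_dim threads.
// On a GPU every (block_idx, thread_idx) pair would run concurrently.
void vector_add_emulated(const float* a, const float* b, float* out,
                         int n, int block_dim) {
    assert(block_dim > 0);
    int grid_dim = (n + block_dim - 1) / block_dim;  // ceil(n / block_dim) blocks
    for (int block_idx = 0; block_idx < grid_dim; block_idx++) {
        for (int thread_idx = 0; thread_idx < block_dim; thread_idx++) {
            int i = block_idx * block_dim + thread_idx;  // global element index
            if (i < n) {  // bounds guard: the last block may have idle threads
                out[i] = a[i] + b[i];
            }
        }
    }
}

// Plain sequential reference used to verify the result.
void vector_add_reference(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
```

With n = 10 and block_dim = 4, the launch needs 3 blocks, so 12 threads exist and the last 2 fail the bounds check and do nothing. Comparing the emulated output against the reference mirrors the verification step described above.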

Phase 3 requires a CUDA-capable GPU and an installed CUDA toolkit. The RTX 4070 used here has compute capability 8.9. If you are running on a different GPU, verify your device’s compute capability with deviceQuery before compiling kernels: the nvcc -arch flag must match it (for example, -arch=sm_89 for compute capability 8.9).

Phase 4 — Transformer in C++ (Full Forward Pass)

The final phase ports a complete transformer from PyTorch to pure C++. Every component is implemented as a standalone function operating on flat float arrays — no matrix class, no autograd, no BLAS. The reference implementation is the author’s own PyTorch transformer at github.com/VrajPatel105/Transformer-Implementation-from-scratch-with-custom-dataset.

Components implemented in model.cpp:

embeddings_forward — token lookup table scaled by sqrt(d_model). Given a token ID, fetches its row from the weight matrix: out[b*T*d_model + t*d_model + row] = weight[curr_token*d_model + row] * scale_factor
positional_encoding — sinusoidal PE added in-place using the standard formulas:
- PE[pos][2i] = sin(pos / 10000^(2i/d_model))
- PE[pos][2i+1] = cos(pos / 10000^(2i/d_model))
matmul — accumulator-pattern matrix multiplication with optional bias, reused throughout the model
layernorm — per-token normalization (mean, variance, rstd, normalize + scale + shift)
softmax — numerically stable three-pass softmax (max, exp+sum, normalize)
attention_forward — multi-head attention with fused QKV projection. Supports causal masking via a bool flag, and cross-attention with separate key/value inputs for the decoder
feedforward_forward — two matmuls with ReLU activation and 4× hidden dimension expansion
residual — element-wise addition of input and sublayer output
projection_forward — final linear projection from d_model to vocab_size
encoder_block — self-attention → residual → layernorm → FFN → residual → layernorm
decoder_block — masked self-attention → residual → layernorm → cross-attention → residual → layernorm → FFN → residual → layernorm
transformer_block — full forward pass: source and target embeddings + PE, N encoder blocks, N decoder blocks, final projection, softmax output
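
The first two components in the list, the scaled embedding lookup and the sinusoidal positional encoding, might look roughly like this. The function signatures are assumptions made for the sketch (model.cpp may use different parameter orders), and the code assumes d_model is even so that each sine is paired with a cosine.

```cpp
#include <cassert>
#include <cmath>

// out[b][t][:] = weight[token][:] * sqrt(d_model), with all tensors as flat arrays.
void embeddings_forward(float* out, const int* tokens, const float* weight,
                        int B, int T, int d_model) {
    const float scale_factor = std::sqrt(static_cast<float>(d_model));
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            int curr_token = tokens[b * T + t];
            for (int col = 0; col < d_model; col++) {
                out[b * T * d_model + t * d_model + col] =
                    weight[curr_token * d_model + col] * scale_factor;
            }
        }
    }
}

// Adds PE[pos][2i] = sin(angle) and PE[pos][2i+1] = cos(angle) in place,
// where angle = pos / 10000^(2i/d_model).
void positional_encoding(float* x, int B, int T, int d_model) {
    assert(d_model % 2 == 0);  // assumption: dimensions come in sin/cos pairs
    for (int b = 0; b < B; b++) {
        for (int pos = 0; pos < T; pos++) {
            float* tok = x + b * T * d_model + pos * d_model;
            for (int i = 0; i < d_model / 2; i++) {
                float angle = pos / std::pow(10000.0f, (2.0f * i) / d_model);
                tok[2 * i] += std::sin(angle);
                tok[2 * i + 1] += std::cos(angle);
            }
        }
    }
}
```

At position 0 every angle is 0, so the encoding adds 0 to the even dimensions and 1 to the odd ones. Because the encoding depends only on pos and the dimension index, a table of T * d_model values could be computed once and reused across the batch; the sketch recomputes it per batch element for simplicity.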

If you are following along, the recommended path through these docs mirrors the phase order above. Start with the C++ Core section to establish the language foundation, then move to LLM Kernels to see how those tools apply to real ML operations, then GPU Fundamentals for the CUDA execution model, and finally the Transformer section to see everything assembled. Each section links to the relevant source files so you can read the actual code alongside the documentation.
