GPUs were built for graphics: every pixel on your screen is an independent colour calculation, and a scene might contain millions of them that all need to be resolved before the next frame. The hardware answer was to stop making one core faster and instead build thousands of simpler cores that all work simultaneously. That same design choice, massive parallelism over single-thread speed, turns out to be exactly what matrix multiplication needs, which is why GPUs became the dominant substrate for deep-learning inference.
CPU vs. GPU Design Philosophy
The two architectures start from opposite assumptions about what “fast” means.
CPU: Latency-Optimized
A modern CPU has a small number of large, complex cores (typically 8–32). Each core is equipped with deep pipelines, out-of-order execution, branch predictors, and multiple layers of large cache. The entire design is aimed at finishing one sequential task as quickly as possible. A single CPU core can execute a chain of dependent instructions in nanoseconds — but it can only work on a handful of things at once.
GPU: Throughput-Optimized
A GPU like the RTX 4070 packs thousands of CUDA cores. Each core is far simpler — no branch predictor, tiny per-core cache — but because there are so many of them, the total work done per second is enormous as long as the work is independent and uniform. Latency per thread is high; aggregate throughput is unmatched.
Why ML Inference Fits the GPU Model
Every forward pass through a transformer layer is dominated by matrix multiplications. Consider an output matrix C = A × B of shape (M, N). Each element C[m][n] is the dot product of row m of A with column n of B. Critically, every output element is completely independent of every other output element. There are no data dependencies across the (m, n) pairs.
This property — where a problem can be divided into independent sub-problems — is called embarrassingly parallel. A GPU can assign one thread to each (m, n) pair and compute thousands of dot products simultaneously, collapsing what would be nested loops on a CPU into a single parallel kernel launch.
The flat index math from the C kernel module makes this concrete. In matmul.cpp, the CPU loops write each result C[m][n] to the flat offset m * N + n. On the GPU, the thread assigned to (m, n) computes its dot product and stores the result at that same offset m * N + n. The math is identical; only the execution model changes from sequential to parallel.
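As a minimal sketch of the flat indexing described above, the C++ below computes each output element in a helper that reads only A and B, then stores it at offset m * N + n. It is not the contents of matmul.cpp: the names dot_row_col and matmul_cpu, the inner dimension K, and the row-major layout are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: computes one output element C[m][n] for row-major
// A (M x K) and B (K x N). It reads A and B but never reads C, so no
// (m, n) pair depends on the result of any other pair.
float dot_row_col(const std::vector<float>& A, const std::vector<float>& B,
                  std::size_t K, std::size_t N, std::size_t m, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t k = 0; k < K; ++k) {
        acc += A[m * K + k] * B[k * N + n];
    }
    return acc;
}

// Sequential CPU version: visits the (m, n) pairs one at a time.
std::vector<float> matmul_cpu(const std::vector<float>& A, const std::vector<float>& B,
                              std::size_t M, std::size_t K, std::size_t N) {
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            C[m * N + n] = dot_row_col(A, B, K, N, m, n);  // flat offset m * N + n
        }
    }
    return C;
}
```

Because dot_row_col never touches C, the two outer loops could run in any order, or all at once, without changing the result. That is the property a GPU exploits when it hands each (m, n) pair to its own thread.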
Amdahl’s Law and the Case for Parallelism
PMPP Chapter 1 introduces the theoretical backbone: Amdahl’s Law. If a fraction s of a program is inherently sequential (cannot be parallelised), then no matter how many parallel processors you add, the maximum speedup is bounded by 1 / s. A program that is 10% sequential (s = 0.1) can never be more than 10× faster, regardless of how many cores you throw at it.
The practical implication: to get large speedups you must make the parallel fraction of your program as large as possible — and eliminate bottlenecks in the sequential sections (data loading, host-to-device transfers, tokenisation). Inference workloads spend the overwhelming majority of their time in matrix multiplications and activation functions, both of which are highly parallel. That is the regime where GPU acceleration pays off.
The CUDA Programming Model
NVIDIA’s CUDA framework exposes the GPU as a large collection of threads organised into a two-level hierarchy, in which threads are grouped into blocks and blocks are grouped into a grid:
- Threads are the individual workers. Each thread runs the same kernel function but with a unique ID it uses to select its piece of data.
- Blocks group threads that can cooperate — they share a fast on-chip memory region and can synchronise with each other.
- Grids are collections of blocks that together cover the entire problem.
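The hierarchy above can be sketched on the CPU. The C++ below is an emulation, not CUDA code: two nested loops stand in for the grid and the block, and each iteration plays the role of one thread. The index expression mirrors the usual one-dimensional CUDA form blockIdx.x * blockDim.x + threadIdx.x. The names blocks_for, scale_kernel, and launch_scale are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Number of blocks needed so that blocks * threads_per_block >= n
// (ceiling division).
std::size_t blocks_for(std::size_t n, std::size_t threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// Stand-in for a kernel body: every "thread" runs this same function and
// uses its block index, block size, and thread index to pick one element.
void scale_kernel(std::size_t block_idx, std::size_t block_dim, std::size_t thread_idx,
                  std::vector<float>& data, float factor) {
    std::size_t i = block_idx * block_dim + thread_idx;  // global index
    if (i < data.size()) {  // threads in the last block may fall past the end
        data[i] *= factor;
    }
}

// CPU emulation of a launch: the outer loop walks the grid, the inner loop
// walks one block. On a GPU these iterations would run in parallel.
void launch_scale(std::vector<float>& data, float factor, std::size_t threads_per_block) {
    std::size_t num_blocks = blocks_for(data.size(), threads_per_block);
    for (std::size_t b = 0; b < num_blocks; ++b) {
        for (std::size_t t = 0; t < threads_per_block; ++t) {
            scale_kernel(b, threads_per_block, t, data, factor);
        }
    }
}
```

The bounds check matters whenever the problem size is not a multiple of the block size: 1000 elements with 256 threads per block need 4 blocks, which gives 1024 threads, so the last 24 must do nothing.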
The hands-on work for this module uses an RTX 4070. Running deviceQuery after setting up the CUDA environment reveals the hardware spec: number of SMs, max threads per block, memory bandwidth, and compute capability. These numbers directly constrain how you design kernels.
Reading Map
The three PMPP chapters that underpin this module follow a natural progression:
Chapter 1: Why Parallelism?
Motivation for parallel computing, Amdahl’s Law, and the architectural divergence between CPUs and GPUs.
Chapter 2: CUDA Execution Model
How kernels are launched, how threads are organised into blocks and grids, and how each thread computes its global index.
CUDA Mode Lectures 1–3 accompany PMPP and are part of the reading for this module. They are listed in the hands-on section alongside the textbook chapters and provide a complementary view of the material covered in the following pages.