GPU Architecture: Why GPUs Exist for Parallel Computation

GPUs were built for graphics: every pixel on your screen is an independent colour calculation, and a scene might contain millions of them that all need to be resolved before the next frame. The hardware answer was to stop making one core faster and instead build thousands of simpler cores that all work simultaneously. That same design choice — massive parallelism over single-thread speed — turns out to be exactly what matrix multiplication needs, which is why GPUs became the dominant substrate for deep-learning inference.

CPU vs. GPU Design Philosophy

The two architectures start from opposite assumptions about what “fast” means.

CPU: Latency-Optimized

A modern CPU has a small number of large, complex cores (typically 8–32). Each core is equipped with deep pipelines, out-of-order execution, branch predictors, and multiple layers of large cache. The entire design is aimed at finishing one sequential task as quickly as possible. A single CPU core can execute a chain of dependent instructions in nanoseconds — but it can only work on a handful of things at once.

GPU: Throughput-Optimized

A GPU like the RTX 4070 packs thousands of CUDA cores. Each core is far simpler — no branch predictor, tiny per-core cache — but because there are so many of them, the total work done per second is enormous as long as the work is independent and uniform. Latency per thread is high; aggregate throughput is unmatched.

This trade-off is deliberate. CPUs excel at workloads with complex control flow and data dependencies. GPUs excel at workloads that apply the same operation to a massive dataset — which is exactly the pattern of neural network inference.

Why ML Inference Fits the GPU Model

Every forward pass through a transformer layer is dominated by matrix multiplications. Consider an output matrix C = A × B of shape (M, N), where A has shape (M, K) and B has shape (K, N). Each element C[m][n] is the dot product of row m of A with column n of B. Critically, every output element is completely independent of every other output element: there are no data dependencies across the (m, n) pairs. A problem that divides into independent sub-problems like this is called embarrassingly parallel.

A GPU can assign one thread to each (m, n) pair and compute thousands of dot products simultaneously, collapsing what would be nested loops on a CPU into a single parallel kernel launch.

The flat index math from the C kernel module makes this concrete. In matmul.cpp, the CPU loops write results using:

out[m * N + n] = val;

On a GPU, that same formula becomes the address each thread independently writes to — thread (m, n) computes its dot product and stores the result at offset m * N + n. The math is identical; only the execution model changes from sequential to parallel.
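The per-thread view can be sketched on the CPU in plain C++. This is a minimal illustration, not the contents of matmul.cpp: the function names `matmul_element` and `matmul`, the row-major layout, and the inner-dimension name `K` are illustrative choices. Each call to `matmul_element` does the work a single GPU thread would do for one (m, n) pair.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Work done by one hypothetical GPU thread: compute the single output
// element C[m][n] and store it at the flat offset m * N + n.
// A is (M x K), B is (K x N), both stored row-major.
void matmul_element(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& out,
                    std::size_t m, std::size_t n,
                    std::size_t K, std::size_t N) {
    float val = 0.0f;
    for (std::size_t k = 0; k < K; ++k) {
        val += A[m * K + k] * B[k * N + n];  // row m of A times column n of B
    }
    out[m * N + n] = val;  // same flat index as the CPU loop in the text
}

// CPU stand-in for a kernel launch: visits every (m, n) pair one at a time.
// No call reads another call's output, so the visiting order is irrelevant;
// that independence is what lets a GPU run the calls concurrently.
std::vector<float> matmul(const std::vector<float>& A, const std::vector<float>& B,
                          std::size_t M, std::size_t K, std::size_t N) {
    assert(A.size() == M * K && B.size() == K * N);
    std::vector<float> out(M * N, 0.0f);
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            matmul_element(A, B, out, m, n, K, N);
        }
    }
    return out;
}
```

Calling `matmul(A, B, M, K, N)` returns the flat (M × N) result. Reversing or shuffling the two loops would give the same output, which is a quick way to convince yourself that the iterations carry no dependencies.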

Amdahl’s Law and the Case for Parallelism

Chapter 1 of PMPP (Programming Massively Parallel Processors) introduces the theoretical backbone: Amdahl’s Law. If a fraction s of a program is inherently sequential (cannot be parallelised), then no matter how many parallel processors you add, the maximum speedup is bounded by 1 / s. A program that is 10 % sequential can never run more than 10× faster, regardless of how many cores you throw at it.

The practical implication: to get large speedups you must make the parallel fraction of your program as large as possible and eliminate bottlenecks in the sequential sections (data loading, host-to-device transfers, tokenisation). Inference workloads spend the overwhelming majority of their time in matrix multiplications and activation functions, both of which are highly parallel. That is the regime where GPU acceleration pays off.
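The 1 / s bound is the limit of the general form of Amdahl's Law, speedup(p) = 1 / (s + (1 - s) / p) for p processors. A small sketch of both, assuming the sequential fraction s is known and the parallel part scales perfectly (the function names are hypothetical):

```cpp
#include <cassert>

// Amdahl's Law: speedup on p processors when a fraction s of the
// single-processor runtime is inherently sequential.
//   speedup(p) = 1 / (s + (1 - s) / p)
double amdahl_speedup(double s, double p) {
    assert(s >= 0.0 && s <= 1.0 && p >= 1.0);
    return 1.0 / (s + (1.0 - s) / p);
}

// Upper bound as p grows without limit: 1 / s (requires s > 0).
double amdahl_limit(double s) {
    assert(s > 0.0 && s <= 1.0);
    return 1.0 / s;
}
```

With s = 0.1, ten processors already give only about 5.26× rather than 10×, and adding more processors approaches but never reaches `amdahl_limit(0.1)`, which is 10.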

The CUDA Programming Model

NVIDIA’s CUDA framework exposes the GPU as a large pool of threads organised into a two-level hierarchy: threads are grouped into blocks, and blocks are grouped into a grid.

Threads are the individual workers. Each thread runs the same kernel function but with a unique ID it uses to select its piece of data.
Blocks group threads that can cooperate — they share a fast on-chip memory region and can synchronise with each other.
Grids are collections of blocks that together cover the entire problem.

This hierarchy maps directly onto the hardware. Each block is assigned to one Streaming Multiprocessor (SM), the GPU’s compute unit, and an SM can hold several blocks at once. The SM executes threads in groups of 32 called warps, and it hides memory latency by rapidly switching between whichever warps are ready to run.
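The arithmetic behind the hierarchy can be sketched in plain C++ for a one-dimensional launch. The helper names are hypothetical; `global_index` mirrors the CUDA expression `blockIdx.x * blockDim.x + threadIdx.x` that a thread uses to find its piece of data, and the block size of 256 used in the note below is only an illustrative assumption.

```cpp
#include <cstddef>

constexpr std::size_t kWarpSize = 32;  // threads per warp

// Global index of a thread in a 1D launch; mirrors the CUDA expression
// blockIdx.x * blockDim.x + threadIdx.x.
constexpr std::size_t global_index(std::size_t block_idx, std::size_t block_dim,
                                   std::size_t thread_idx) {
    return block_idx * block_dim + thread_idx;
}

// Number of blocks needed so that blocks * block_dim >= n (ceiling division).
constexpr std::size_t blocks_for(std::size_t n, std::size_t block_dim) {
    return (n + block_dim - 1) / block_dim;
}

// Number of warps that a block of block_dim threads is split into.
constexpr std::size_t warps_per_block(std::size_t block_dim) {
    return (block_dim + kWarpSize - 1) / kWarpSize;
}
```

For 1000 elements and 256 threads per block, `blocks_for(1000, 256)` is 4, which launches 1024 threads. The last 24 have a global index of 1000 or more, so a real kernel needs a bounds check before touching memory. Each such block is made up of `warps_per_block(256)`, that is 8, warps.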

The hands-on work for this module uses an RTX 4070. Running deviceQuery after setting up the CUDA environment reveals the hardware spec: number of SMs, max threads per block, memory clock rate and bus width (which together determine peak memory bandwidth), and compute capability. These numbers directly constrain how you design kernels.

Reading Map

The three PMPP chapters that underpin this module follow a natural progression:

Chapter 1 — Why Parallelism?

Motivation for parallel computing, Amdahl’s Law, and the architectural divergence between CPUs and GPUs.

Chapter 2 — CUDA Execution Model

How kernels are launched, how threads are organised into blocks and grids, and how each thread computes its global index.

Chapter 3 — Memory Hierarchy

Global, shared, and register memory — their sizes, latencies, and the tiling strategies that extract maximum performance from the hardware.

CUDA Mode Lectures 1–3 accompany PMPP and are part of the reading for this module. They are listed in the hands-on section alongside the textbook chapters and provide a complementary view of the material covered in the following pages.

The next page covers the CUDA execution model in depth: how kernels are written, how thread IDs work, and how the hello-world and vector-add exercises build the intuition you need before writing a GPU matmul kernel.
