CUDA programs don’t loop — they launch. Instead of writing aDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/VrajPatel105/cpp-gpu-inference/llms.txt
Use this file to discover all available pages before exploring further.
for loop that iterates over every element of an array, you write a kernel: a function that describes the work for a single element, and then ask the GPU to run that function on thousands of elements at once. Every thread executes the same code, but each one carries a unique identity — its thread and block indices — that it uses to select exactly which piece of data to operate on. Internalising this shift from sequential iteration to parallel dispatch is the first and most important conceptual step in GPU programming.
The Thread Hierarchy
CUDA organises threads into a three-level structure that maps onto the physical hardware.Thread
The atomic unit of execution. Each thread runs one copy of the kernel function. It has its own registers and a private
threadIdx variable telling it where it sits within its block.Block
A group of up to 1,024 threads that are guaranteed to run on the same Streaming Multiprocessor. Threads in the same block can communicate via shared memory and synchronise with
__syncthreads(). Each block knows its position in the grid via blockIdx.Grid
All the blocks launched by a single kernel call. The grid can be 1D, 2D, or 3D. Together the blocks cover the entire problem — every input element gets exactly one thread assigned to it.
Computing a Global Thread Index
Within a block,threadIdx.x runs from 0 to blockDim.x - 1. Across the grid, blockIdx.x identifies the block. To get the unique global index of a thread — the position in the overall array it should process — you combine them:
blockIdx.x is the building number, blockDim.x is how many apartments are in each building, and threadIdx.x is the apartment number within that building. Every thread in the entire grid gets a unique idx.
This is the single formula that drives almost every 1D CUDA kernel. You will write it hundreds of times.
Hands-On Exercise 1 — Hello-World Kernel
The first hands-on task in PMPP Chapter 2 is deliberately trivial: launch a kernel where every thread prints its own identity. There is no real computation — the goal is to see the execution model in action and confirm that the thread and block IDs match expectations.<<<gridDim, blockDim>>> syntax is CUDA’s kernel launch configuration. Here 2 means 2 blocks and 4 means 4 threads per block. The output will show 8 lines, one per thread, with thread IDs 0–3 appearing twice (once per block). The order of lines is non-deterministic — blocks and warps are scheduled independently, which is a direct consequence of the parallel execution model.
Hands-On Exercise 2 — Vector Add Kernel
Vector addition is the canonical first real CUDA kernel. Each thread reads one element fromA and one from B, adds them, and writes the result to C. With N threads launched, all N additions happen in parallel.
N = 1 << 20 (one million elements):
(N + threads_per_block - 1) / threads_per_block ensures the last partial block is included even if N is not a multiple of threads_per_block.
After running the kernel, copy d_C back to the host and compare element-by-element against a CPU reference to verify correctness. This verify-against-CPU pattern is essential discipline — GPU bugs are subtle and a wrong result often looks plausible.
Connection to Matmul: The Same Index Math
The flat index formulam * N + n from matmul.cpp is not a coincidence — it is exactly the pattern every CUDA thread uses to locate its output element. The comment in that file captures it precisely:
A GPU also has flat memory. Each CUDA thread gets assigned one output positionIn a 2D matmul kernel, the launch configuration would use a 2D grid so that each thread naturally maps to one(m, n)and usesm*N + nto find where to write its result. Same formula, millions of threads running in parallel. The math never changes. Only the execution model does.
(m, n) output position:
k, with the m and n loops replaced by the thread grid.
PMPP Chapter 2 in Context
PMPP Chapter 2 covers the execution model in full detail: warp scheduling, SIMD execution, divergence (what happens when threads in the same warp take different branches), and occupancy (how many warps can be live on an SM simultaneously). The hands-on exercises above build the concrete intuition that makes those abstract concepts legible.CUDA Mode Lectures 1–3 are listed alongside PMPP as part of the reading for this module. They supplement the textbook and are worth working through after completing the hands-on exercises here.