Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/VrajPatel105/cpp-gpu-inference/llms.txt

Use this file to discover all available pages before exploring further.

CUDA programs don’t loop — they launch. Instead of writing a for loop that iterates over every element of an array, you write a kernel: a function that describes the work for a single element, and then ask the GPU to run that function on thousands of elements at once. Every thread executes the same code, but each one carries a unique identity — its thread and block indices — that it uses to select exactly which piece of data to operate on. Internalising this shift from sequential iteration to parallel dispatch is the first and most important conceptual step in GPU programming.

The Thread Hierarchy

CUDA organises threads into a three-level structure that maps onto the physical hardware.

Thread

The atomic unit of execution. Each thread runs one copy of the kernel function. It has its own registers and a private threadIdx variable telling it where it sits within its block.

Block

A group of up to 1,024 threads that are guaranteed to run on the same Streaming Multiprocessor. Threads in the same block can communicate via shared memory and synchronise with __syncthreads(). Each block knows its position in the grid via blockIdx.

Grid

All the blocks launched by a single kernel call. The grid can be 1D, 2D, or 3D. Together the blocks cover the entire problem — every input element gets exactly one thread assigned to it.
The hardware analogue: each block is scheduled onto a Streaming Multiprocessor (SM). The SM partitions the block into warps of 32 threads and executes them in lock-step on its SIMD units. This is why thread counts are typically chosen as multiples of 32.

Computing a Global Thread Index

Within a block, threadIdx.x runs from 0 to blockDim.x - 1. Across the grid, blockIdx.x identifies the block. To get the unique global index of a thread — the position in the overall array it should process — you combine them:
int idx = blockIdx.x * blockDim.x + threadIdx.x;
Think of it like a street address: blockIdx.x is the building number, blockDim.x is how many apartments are in each building, and threadIdx.x is the apartment number within that building. Every thread in the entire grid gets a unique idx. This is the single formula that drives almost every 1D CUDA kernel. You will write it hundreds of times.

Hands-On Exercise 1 — Hello-World Kernel

The first hands-on task in PMPP Chapter 2 is deliberately trivial: launch a kernel where every thread prints its own identity. There is no real computation — the goal is to see the execution model in action and confirm that the thread and block IDs match expectations.
#include <cstdio>

__global__ void hello_kernel() {
    printf("Thread %d in block %d\n", threadIdx.x, blockIdx.x);
}

int main() {
    // Launch 2 blocks of 4 threads each → 8 threads total
    hello_kernel<<<2, 4>>>();
    cudaDeviceSynchronize();
    return 0;
}
The <<<gridDim, blockDim>>> syntax is CUDA’s kernel launch configuration. Here 2 means 2 blocks and 4 means 4 threads per block. The output will show 8 lines, one per thread, with thread IDs 0–3 appearing twice (once per block). The order of lines is non-deterministic — blocks and warps are scheduled independently, which is a direct consequence of the parallel execution model.
Run this with nvcc hello.cu -o hello && ./hello after setting up your CUDA environment. Redirect output to a file (./hello > out.txt) then sort it — you’ll see all 8 threads accounted for even though their print order was unpredictable.

Hands-On Exercise 2 — Vector Add Kernel

Vector addition is the canonical first real CUDA kernel. Each thread reads one element from A and one from B, adds them, and writes the result to C. With N threads launched, all N additions happen in parallel.
__global__ void vector_add(float* A, float* B, float* C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}
A typical host-side launch for an array of length N = 1 << 20 (one million elements):
int N = 1 << 20;
int threads_per_block = 256;
int num_blocks = (N + threads_per_block - 1) / threads_per_block;

vector_add<<<num_blocks, threads_per_block>>>(d_A, d_B, d_C, N);
The ceiling division (N + threads_per_block - 1) / threads_per_block ensures the last partial block is included even if N is not a multiple of threads_per_block.
Always include the bounds check if (idx < N). The last block will almost always contain more threads than there are remaining data elements. Without the guard, those out-of-bounds threads will read and write arbitrary memory, producing silent corruption or a segfault.
After running the kernel, copy d_C back to the host and compare element-by-element against a CPU reference to verify correctness. This verify-against-CPU pattern is essential discipline — GPU bugs are subtle and a wrong result often looks plausible.

Connection to Matmul: The Same Index Math

The flat index formula m * N + n from matmul.cpp is not a coincidence — it is exactly the pattern every CUDA thread uses to locate its output element. The comment in that file captures it precisely:
A GPU also has flat memory. Each CUDA thread gets assigned one output position (m, n) and uses m*N + n to find where to write its result. Same formula, millions of threads running in parallel. The math never changes. Only the execution model does.
In a 2D matmul kernel, the launch configuration would use a 2D grid so that each thread naturally maps to one (m, n) output position:
__global__ void matmul_kernel(float* A, float* B, float* C,
                               int M, int K, int N) {
    int m = blockIdx.y * blockDim.y + threadIdx.y;
    int n = blockIdx.x * blockDim.x + threadIdx.x;

    if (m < M && n < N) {
        float val = 0.0f;
        for (int k = 0; k < K; k++) {
            val += A[m * K + k] * B[k * N + n];
        }
        C[m * N + n] = val;
    }
}
Each thread independently computes the full dot product for its assigned output cell — the three nested loops from the CPU version collapse to a single loop over k, with the m and n loops replaced by the thread grid.

PMPP Chapter 2 in Context

PMPP Chapter 2 covers the execution model in full detail: warp scheduling, SIMD execution, divergence (what happens when threads in the same warp take different branches), and occupancy (how many warps can be live on an SM simultaneously). The hands-on exercises above build the concrete intuition that makes those abstract concepts legible.
CUDA Mode Lectures 1–3 are listed alongside PMPP as part of the reading for this module. They supplement the textbook and are worth working through after completing the hands-on exercises here.
The next page moves from execution to memory: once you know which thread does which work, the bottleneck almost always becomes how fast that thread can read and write data.

Build docs developers (and LLMs) love