Writing a correct CUDA kernel is the easy part. Writing a fast one almost always comes down to memory. The GPU has vast computational throughput, but that throughput is wasted the moment threads are sitting idle waiting for data to arrive from slow memory. Understanding which memory tier to use — and designing your access patterns accordingly — is the difference between a kernel that is compute-bound (doing real work) and one that is bandwidth-bound (waiting on memory transfers). PMPP Chapter 3 builds this understanding from the ground up.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/VrajPatel105/cpp-gpu-inference/llms.txt
Use this file to discover all available pages before exploring further.
The Three Memory Tiers
CUDA exposes three distinct memory spaces to a kernel. Each has a different scope, size, and latency profile.- Global Memory
- Registers
Global memory is the large DRAM pool on the GPU board — several gigabytes on a card like the RTX 4070. It is accessible by every thread in every block across the entire grid, and it persists for the lifetime of the CUDA context. All
cudaMalloc allocations and host-to-device transfers land here.The cost: global memory latency is very high — PMPP Chapter 3 characterises it as hundreds of cycles. A thread that loads a value from global memory and immediately uses it will stall waiting for the data to arrive. The GPU hides some of this latency by switching to another ready warp, but when all warps are waiting on memory, the compute units go idle.The Accumulator Pattern — Registers Doing Real Work
Thematmul.cpp inner loop from the C kernel module uses exactly this pattern:
val is a local scalar that accumulates the dot product across all K steps. On a CPU this lives on the stack; in a CUDA kernel it lives in a register. The critical observation: global memory is touched exactly twice per output element — K reads from A and K reads from B, with one write to C — while the accumulation itself happens entirely in the register file at zero memory cost. Moving the write to C outside the inner loop is not just style; it is a deliberate choice to keep the hot path in fast registers.
Tiled Matrix Multiplication with Shared Memory
The naive matmul kernel shown in the execution model page readsA and B directly from global memory on every iteration of the k loop. For a matrix of size M × K × N, this means:
- Each element of
Ais readNtimes (once per output column). - Each element of
Bis readMtimes (once per output row).
O(M × K × N) — far more than the data actually present. Tiling eliminates this redundancy by loading a small block (tile) of A and B into shared memory once and reusing it across all the threads in a block.
Partition the output matrix into tiles
Divide
C into TILE_SIZE × TILE_SIZE sub-matrices. Assign one block of threads to compute each tile.Iterate over tiles along the K dimension
For each tile step along
K, all threads in the block cooperatively load a TILE_SIZE × TILE_SIZE patch of A and B from global memory into shared memory. Each thread loads exactly one element — a single global memory read.Synchronise before computing
Call
__syncthreads() to ensure every thread has finished writing to shared memory before any thread begins reading from it.Accumulate the partial dot product
Each thread iterates over the
TILE_SIZE elements in its tile, reading from shared memory (fast on-chip access). It accumulates into a register variable.Connection to the CPU Transformer Code
Understanding where GPU memory concepts map onto the CPU transformer kernels studied earlier helps build a mental bridge between the two execution models. The indexing math stays the same — only the physical location of that memory, and the cost of accessing it, changes.In the CPU transformer (
llm.c study), matrices are allocated with new float[] — plain heap memory that the CPU accesses through its cache hierarchy. When these kernels are ported to the GPU, those allocations become cudaMalloc calls and the pointers point into global memory. The indexing math (m * N + n, b * T * C + t * C + c, etc.) is unchanged — what changes is where that memory physically lives and how fast it is to access.| Memory | Scope | Size | Latency | Declared with |
|---|---|---|---|---|
| Register | Per-thread | Small, finite per SM | Lowest (~1 cycle) | Automatic (local vars) |
| Shared | Per-block | On-chip SRAM, limited per block | Low (on-chip) | __shared__ |
| Global | All threads | GBs (device DRAM) | High (hundreds of cycles) | cudaMalloc |
CUDA Mode Lectures 1–3 are part of the reading for this module alongside PMPP Chapters 1–3. Work through them after the hands-on exercises — they provide practical context for the memory hierarchy and execution model concepts covered across this GPU fundamentals section.