Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/VrajPatel105/cpp-gpu-inference/llms.txt

Use this file to discover all available pages before exploring further.

Writing a correct CUDA kernel is the easy part. Writing a fast one almost always comes down to memory. The GPU has vast computational throughput, but that throughput is wasted the moment threads are sitting idle waiting for data to arrive from slow memory. Understanding which memory tier to use — and designing your access patterns accordingly — is the difference between a kernel that is compute-bound (doing real work) and one that is bandwidth-bound (waiting on memory transfers). PMPP Chapter 3 builds this understanding from the ground up.

The Three Memory Tiers

CUDA exposes three distinct memory spaces to a kernel. Each has a different scope, size, and latency profile.
Global memory is the large DRAM pool on the GPU board — several gigabytes on a card like the RTX 4070. It is accessible by every thread in every block across the entire grid, and it persists for the lifetime of the CUDA context. All cudaMalloc allocations and host-to-device transfers land here.The cost: global memory latency is very high — PMPP Chapter 3 characterises it as hundreds of cycles. A thread that loads a value from global memory and immediately uses it will stall waiting for the data to arrive. The GPU hides some of this latency by switching to another ready warp, but when all warps are waiting on memory, the compute units go idle.
// Global memory access — potentially hundreds of cycles of latency
float val = A[m * K + k];  // A lives in global memory (cudaMalloc'd)

The Accumulator Pattern — Registers Doing Real Work

The matmul.cpp inner loop from the C kernel module uses exactly this pattern:
float val = (bias != nullptr) ? bias[n] : 0.0f;
for (int k = 0; k < K; k++) {
    val += A[m * K + k] * B[k * N + n];
}
out[m * N + n] = val;
val is a local scalar that accumulates the dot product across all K steps. On a CPU this lives on the stack; in a CUDA kernel it lives in a register. The critical observation: global memory is touched exactly twice per output element — K reads from A and K reads from B, with one write to C — while the accumulation itself happens entirely in the register file at zero memory cost. Moving the write to C outside the inner loop is not just style; it is a deliberate choice to keep the hot path in fast registers.

Tiled Matrix Multiplication with Shared Memory

The naive matmul kernel shown in the execution model page reads A and B directly from global memory on every iteration of the k loop. For a matrix of size M × K × N, this means:
  • Each element of A is read N times (once per output column).
  • Each element of B is read M times (once per output row).
The total global memory traffic is O(M × K × N) — far more than the data actually present. Tiling eliminates this redundancy by loading a small block (tile) of A and B into shared memory once and reusing it across all the threads in a block.
1

Partition the output matrix into tiles

Divide C into TILE_SIZE × TILE_SIZE sub-matrices. Assign one block of threads to compute each tile.
2

Iterate over tiles along the K dimension

For each tile step along K, all threads in the block cooperatively load a TILE_SIZE × TILE_SIZE patch of A and B from global memory into shared memory. Each thread loads exactly one element — a single global memory read.
3

Synchronise before computing

Call __syncthreads() to ensure every thread has finished writing to shared memory before any thread begins reading from it.
4

Accumulate the partial dot product

Each thread iterates over the TILE_SIZE elements in its tile, reading from shared memory (fast on-chip access). It accumulates into a register variable.
5

Repeat for the next tile, then write output

After all tile steps are complete, each thread writes its accumulated result to global memory once. This is the only global memory write.
The conceptual kernel structure looks like this:
#define TILE_SIZE 16

__global__ void tiled_matmul(float* A, float* B, float* C,
                              int M, int K, int N) {
    __shared__ float tile_A[TILE_SIZE][TILE_SIZE];
    __shared__ float tile_B[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float val = 0.0f;  // register accumulator

    for (int t = 0; t < (K + TILE_SIZE - 1) / TILE_SIZE; t++) {
        // Cooperatively load tile into shared memory
        if (row < M && t * TILE_SIZE + threadIdx.x < K)
            tile_A[threadIdx.y][threadIdx.x] = A[row * K + t * TILE_SIZE + threadIdx.x];
        else
            tile_A[threadIdx.y][threadIdx.x] = 0.0f;

        if (col < N && t * TILE_SIZE + threadIdx.y < K)
            tile_B[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
        else
            tile_B[threadIdx.y][threadIdx.x] = 0.0f;

        __syncthreads();

        // Accumulate from shared memory (fast)
        for (int k = 0; k < TILE_SIZE; k++)
            val += tile_A[threadIdx.y][k] * tile_B[k][threadIdx.x];

        __syncthreads();  // Prevent next tile load from overwriting current tile
    }

    if (row < M && col < N)
        C[row * N + col] = val;  // One global write per output element
}
The second __syncthreads() at the end of the tile loop is easy to forget but critical. Without it, fast threads could begin loading the next tile into shared memory before slow threads have finished reading from the current tile — producing silent data corruption.

Connection to the CPU Transformer Code

Understanding where GPU memory concepts map onto the CPU transformer kernels studied earlier helps build a mental bridge between the two execution models. The indexing math stays the same — only the physical location of that memory, and the cost of accessing it, changes.
In the CPU transformer (llm.c study), matrices are allocated with new float[] — plain heap memory that the CPU accesses through its cache hierarchy. When these kernels are ported to the GPU, those allocations become cudaMalloc calls and the pointers point into global memory. The indexing math (m * N + n, b * T * C + t * C + c, etc.) is unchanged — what changes is where that memory physically lives and how fast it is to access.
The full memory hierarchy, summarised:
MemoryScopeSizeLatencyDeclared with
RegisterPer-threadSmall, finite per SMLowest (~1 cycle)Automatic (local vars)
SharedPer-blockOn-chip SRAM, limited per blockLow (on-chip)__shared__
GlobalAll threadsGBs (device DRAM)High (hundreds of cycles)cudaMalloc
The golden rule: minimise global memory reads. Every element you load from global memory should ideally be reused as many times as possible — first by staging it in shared memory, then by keeping running totals in registers. The tiled matmul above demonstrates both: one global load per element per tile, then TILE_SIZE reuses entirely from shared memory.
CUDA Mode Lectures 1–3 are part of the reading for this module alongside PMPP Chapters 1–3. Work through them after the hands-on exercises — they provide practical context for the memory hierarchy and execution model concepts covered across this GPU fundamentals section.

Build docs developers (and LLMs) love