GPU Memory Hierarchy: Global, Shared, and Register Memory

Writing a correct CUDA kernel is the easy part. Writing a fast one almost always comes down to memory. The GPU has vast computational throughput, but that throughput is wasted the moment threads are sitting idle waiting for data to arrive from slow memory. Understanding which memory tier to use — and designing your access patterns accordingly — is the difference between a kernel that is compute-bound (doing real work) and one that is bandwidth-bound (waiting on memory transfers). PMPP Chapter 3 builds this understanding from the ground up.

The Three Memory Tiers

CUDA exposes three distinct memory spaces to a kernel. Each has a different scope, size, and latency profile.

Global Memory
Shared Memory
Registers

Global memory is the large DRAM pool on the GPU board — several gigabytes on a card like the RTX 4070. It is accessible by every thread in every block across the entire grid, and it persists for the lifetime of the CUDA context. All cudaMalloc allocations and host-to-device transfers land here.The cost: global memory latency is very high — PMPP Chapter 3 characterises it as hundreds of cycles. A thread that loads a value from global memory and immediately uses it will stall waiting for the data to arrive. The GPU hides some of this latency by switching to another ready warp, but when all warps are waiting on memory, the compute units go idle.

// Global memory access — potentially hundreds of cycles of latency
float val = A[m * K + k];  // A lives in global memory (cudaMalloc'd)

Shared memory is an on-chip SRAM bank associated with each Streaming Multiprocessor. It is shared by all threads in a block (not the entire grid) and is dramatically faster than global memory — PMPP Chapter 3 describes the latency gap as roughly two orders of magnitude.Shared memory is declared with the __shared__ qualifier and is typically used as a staging buffer: threads cooperatively load a tile of data from global memory into shared memory, synchronise, and then perform their computation reading from the fast on-chip buffer instead of slow DRAM.The constraint: each block has a fixed shared memory budget. Tile sizes must be chosen to stay within the limit reported by deviceQuery for your specific GPU.

__shared__ float tile_A[TILE_SIZE][TILE_SIZE];
__shared__ float tile_B[TILE_SIZE][TILE_SIZE];

// Load from global into shared — done once per tile
tile_A[threadIdx.y][threadIdx.x] = A[row * K + (tile * TILE_SIZE + threadIdx.x)];
tile_B[threadIdx.y][threadIdx.x] = B[(tile * TILE_SIZE + threadIdx.y) * N + col];

__syncthreads();  // Ensure all threads have loaded before any compute

Registers are the fastest storage available. Each thread has its own private set — there is no sharing across threads. Access is essentially zero-latency: a value in a register can be consumed on the very next instruction.The catch: registers are scarce. Each SM has a finite register file divided among all threads currently resident. Using too many registers per thread reduces occupancy — fewer warps can live on the SM simultaneously, leaving less work available to hide memory latency.Ordinary local variables in a kernel automatically live in registers:

float val = 0.0f;  // Lives in a register — free to read/write
for (int k = 0; k < K; k++) {
    val += tile_A[threadIdx.y][k] * tile_B[k][threadIdx.x];
}
// val is written to global memory exactly once, after all K iterations
C[row * N + col] = val;

The Accumulator Pattern — Registers Doing Real Work

The matmul.cpp inner loop from the C kernel module uses exactly this pattern:

float val = (bias != nullptr) ? bias[n] : 0.0f;
for (int k = 0; k < K; k++) {
    val += A[m * K + k] * B[k * N + n];
}
out[m * N + n] = val;

val is a local scalar that accumulates the dot product across all K steps. On a CPU this lives on the stack; in a CUDA kernel it lives in a register. The critical observation: global memory is touched exactly twice per output element — K reads from A and K reads from B, with one write to C — while the accumulation itself happens entirely in the register file at zero memory cost. Moving the write to C outside the inner loop is not just style; it is a deliberate choice to keep the hot path in fast registers.

Tiled Matrix Multiplication with Shared Memory

The naive matmul kernel shown in the execution model page reads A and B directly from global memory on every iteration of the k loop. For a matrix of size M × K × N, this means:

Each element of A is read N times (once per output column).
Each element of B is read M times (once per output row).

The total global memory traffic is O(M × K × N) — far more than the data actually present. Tiling eliminates this redundancy by loading a small block (tile) of A and B into shared memory once and reusing it across all the threads in a block.

Partition the output matrix into tiles

Divide C into TILE_SIZE × TILE_SIZE sub-matrices. Assign one block of threads to compute each tile.

Iterate over tiles along the K dimension

For each tile step along K, all threads in the block cooperatively load a TILE_SIZE × TILE_SIZE patch of A and B from global memory into shared memory. Each thread loads exactly one element — a single global memory read.

Synchronise before computing

Call __syncthreads() to ensure every thread has finished writing to shared memory before any thread begins reading from it.

Accumulate the partial dot product

Each thread iterates over the TILE_SIZE elements in its tile, reading from shared memory (fast on-chip access). It accumulates into a register variable.

Repeat for the next tile, then write output

After all tile steps are complete, each thread writes its accumulated result to global memory once. This is the only global memory write.

The conceptual kernel structure looks like this:

#define TILE_SIZE 16

__global__ void tiled_matmul(float* A, float* B, float* C,
                              int M, int K, int N) {
    __shared__ float tile_A[TILE_SIZE][TILE_SIZE];
    __shared__ float tile_B[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float val = 0.0f;  // register accumulator

    for (int t = 0; t < (K + TILE_SIZE - 1) / TILE_SIZE; t++) {
        // Cooperatively load tile into shared memory
        if (row < M && t * TILE_SIZE + threadIdx.x < K)
            tile_A[threadIdx.y][threadIdx.x] = A[row * K + t * TILE_SIZE + threadIdx.x];
        else
            tile_A[threadIdx.y][threadIdx.x] = 0.0f;

        if (col < N && t * TILE_SIZE + threadIdx.y < K)
            tile_B[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
        else
            tile_B[threadIdx.y][threadIdx.x] = 0.0f;

        __syncthreads();

        // Accumulate from shared memory (fast)
        for (int k = 0; k < TILE_SIZE; k++)
            val += tile_A[threadIdx.y][k] * tile_B[k][threadIdx.x];

        __syncthreads();  // Prevent next tile load from overwriting current tile
    }

    if (row < M && col < N)
        C[row * N + col] = val;  // One global write per output element
}

The second __syncthreads() at the end of the tile loop is easy to forget but critical. Without it, fast threads could begin loading the next tile into shared memory before slow threads have finished reading from the current tile — producing silent data corruption.

Connection to the CPU Transformer Code

Understanding where GPU memory concepts map onto the CPU transformer kernels studied earlier helps build a mental bridge between the two execution models. The indexing math stays the same — only the physical location of that memory, and the cost of accessing it, changes.

In the CPU transformer (llm.c study), matrices are allocated with new float[] — plain heap memory that the CPU accesses through its cache hierarchy. When these kernels are ported to the GPU, those allocations become cudaMalloc calls and the pointers point into global memory. The indexing math (m * N + n, b * T * C + t * C + c, etc.) is unchanged — what changes is where that memory physically lives and how fast it is to access.

The full memory hierarchy, summarised:

Memory	Scope	Size	Latency	Declared with
Register	Per-thread	Small, finite per SM	Lowest (~1 cycle)	Automatic (local vars)
Shared	Per-block	On-chip SRAM, limited per block	Low (on-chip)	`__shared__`
Global	All threads	GBs (device DRAM)	High (hundreds of cycles)	`cudaMalloc`

The golden rule: minimise global memory reads. Every element you load from global memory should ideally be reused as many times as possible — first by staging it in shared memory, then by keeping running totals in registers. The tiled matmul above demonstrates both: one global load per element per tile, then TILE_SIZE reuses entirely from shared memory.

CUDA Mode Lectures 1–3 are part of the reading for this module alongside PMPP Chapters 1–3. Work through them after the hands-on exercises — they provide practical context for the memory hierarchy and execution model concepts covered across this GPU fundamentals section.

Overview

C++ Core

LLM Kernels in C

GPU Fundamentals

Transformer in C++

GPU Memory Hierarchy: Global, Shared, and Register Memory

The Three Memory Tiers

The Accumulator Pattern — Registers Doing Real Work

Tiled Matrix Multiplication with Shared Memory

Connection to the CPU Transformer Code

Build docs developers (and LLMs) love

Overview

C++ Core

LLM Kernels in C

GPU Fundamentals

Transformer in C++

Documentation Index

​The Three Memory Tiers

​The Accumulator Pattern — Registers Doing Real Work

​Tiled Matrix Multiplication with Shared Memory

​Connection to the CPU Transformer Code

Build docs developers (and LLMs) love

The Three Memory Tiers

The Accumulator Pattern — Registers Doing Real Work

Tiled Matrix Multiplication with Shared Memory

Connection to the CPU Transformer Code