GPU Compute and Memory Architecture Explained

Lecture 4 by Thomas Viehmann covers the hardware reality that underpins every performance decision you make in CUDA. Most kernel optimization work comes down to one thing: getting data to the compute units fast enough. This page covers the GPU memory hierarchy, bandwidth vs. compute throughput, and the practical patterns that exploit memory locality.

The GPU memory hierarchy

Modern GPUs expose a multi-level memory hierarchy. Each level is faster, smaller, and closer to the compute units than the one below it.

Registers

Per-thread. ~0 cycles access latency. Finite: typically 255 registers per thread, ~65,536 per SM. Spilling to local memory maps back to global memory.

Shared memory / L1 cache

Per-block. ~5 ns access. 32–96 KB per SM (configurable split between L1 and shared). Programmer-managed. Used for tiling and reductions.

L2 cache

Per-GPU (shared across all SMs). ~100–200 ns. Several MB (e.g., 40 MB on H100). Managed by hardware. Reduces repeated global memory traffic.

Global memory (HBM/GDDR)

The main GPU DRAM. 16–80+ GB. ~400–800 cycles latency. 1–3 TB/s bandwidth (HBM). All threads in all blocks can access it. The bottleneck for most kernels.

Constant & texture memory

Specialized read-only caches. Constant memory is optimized for broadcast (all threads read the same address). Texture memory has a spatial locality cache.

Memory bandwidth vs. compute throughput

Most machine learning kernels are memory-bound: the GPU spends more time waiting for data than computing. The ratio of arithmetic operations to bytes loaded from memory is called arithmetic intensity.

Arithmetic intensity = FLOPs / bytes transferred

A simple element-wise operation like output = input * 2.0 loads 4 bytes and performs 1 FLOP — arithmetic intensity of 0.25 FLOP/byte. A matrix multiplication of large matrices approaches hundreds of FLOPs per byte and is compute-bound.

The roofline model plots achievable performance as a function of arithmetic intensity. Below the “ridge point” you are memory-bound; above it you are compute-bound. Understanding where your kernel sits on the roofline tells you which resource to optimize.

Roofline intuition

Peak performance (TFLOP/s)
        |              /‾‾‾‾‾‾‾‾‾‾‾ compute bound
        |            /
        |          /
        |        /   ← ridge point
        |      /
        |    /  memory bound
        |  /
        |/________________________
              arithmetic intensity (FLOP/byte)

For an A100:

Peak FP32 throughput: ~19.5 TFLOP/s
Peak memory bandwidth: ~2 TB/s
Ridge point: ~9.75 FLOP/byte

If your kernel’s arithmetic intensity is below ~9.75 FLOP/byte, you cannot saturate the compute units — more memory bandwidth or better locality is the fix.

Coalesced memory access

Global memory is accessed in 128-byte cache lines. When threads in a warp access memory, the hardware coalesces adjacent accesses into as few cache line transactions as possible. Non-coalesced access wastes bandwidth.

__global__ void copyDataCoalesced(float *in, float *out, int n) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) {
        out[index] = in[index];  // thread 0 → in[0], thread 1 → in[1], etc.
    }
}

In the coalesced version, 32 consecutive threads access 32 consecutive floats (128 bytes) — exactly one cache line per warp. In the non-coalesced version, the strided pattern touches twice as many cache lines for the same amount of useful data.

For 2D data like matrices stored in row-major order, column-major access patterns (iterating down a column) are non-coalesced. Transposing a matrix naively is a classic example — shared memory tiling is the standard fix.

Shared memory: on-chip scratch space

Shared memory is the key tool for avoiding repeated global memory loads. The pattern is:

Cooperatively load a tile of data from global memory into shared memory
Synchronize with __syncthreads()
Compute using the fast on-chip copy
Write results back to global memory

Shared memory reduction example

#define BLOCK_DIM 1024

__global__ void SharedMemoryReduction(float* input, float* output) {
    __shared__ float input_s[BLOCK_DIM];
    unsigned int t = threadIdx.x;

    // Load from global into shared memory
    input_s[t] = input[t] + input[t + BLOCK_DIM];

    // Tree reduction entirely in shared memory
    for (unsigned int stride = blockDim.x / 2; stride >= 1; stride /= 2) {
        __syncthreads();
        if (threadIdx.x < stride) {
            input_s[t] += input_s[t + stride];
        }
    }

    if (threadIdx.x == 0) {
        *output = input_s[0];
    }
}

This kernel loads data once from global memory into input_s, then performs the entire reduction in shared memory — far faster than going back to global memory at every step.

Shared memory bank conflicts

Shared memory is divided into 32 banks (one per warp lane). Simultaneous accesses to the same bank by different threads in a warp are serialized — this is a bank conflict.

No bank conflict (fastest)

Each thread in the warp accesses a different bank. E.g., thread t accesses shared[t] — adjacent threads hit adjacent banks. Full bandwidth.

2-way bank conflict

Two threads access the same bank. The access is serialized into 2 sub-transactions — half bandwidth.

Broadcast (free)

All threads access the same address. The hardware broadcasts the value to all threads in a single transaction — no penalty.

The typical fix for bank conflicts in matrix transpose and reduction kernels is to add a padding column: __shared__ float tile[BLOCK][BLOCK + 1]. The +1 shifts each row to a different set of banks.

Occupancy and register pressure

Occupancy is the ratio of active warps on an SM to the maximum number of warps the SM can support. Higher occupancy gives the scheduler more warps to hide latency with. Occupancy is limited by three resources, whichever is most constraining:

Registers per thread — more registers per kernel = fewer threads can be in flight
Shared memory per block — larger shared memory allocations = fewer blocks per SM
Block size — blocks must be a multiple of 32 (warp size); small blocks reduce occupancy

// Query occupancy for a kernel at runtime
int minGridSize, blockSize;
cudaOccupancyMaxPotentialBlockSize(
    &minGridSize,
    &blockSize,
    myKernel,
    0,   // dynamic shared memory per block
    0    // block size limit (0 = no limit)
);

Use NVIDIA’s Nsight Compute (ncu) to see the occupancy of your kernel and exactly which resource is the bottleneck. The “Occupancy” section will show register usage, shared memory usage, and the theoretical vs. achieved occupancy.

Practical memory optimization checklist

Ensure coalesced global memory access

Check that consecutive threads in a warp access consecutive memory addresses. Use ncu --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum to count load sectors and compare against ideal.

Use shared memory for reuse

If the same global memory location is read more than once (by different threads, or in a loop), load it into shared memory first. The break-even point is roughly when each byte is read 2+ times.

Avoid shared memory bank conflicts

Check your indexing patterns. Add padding (+1 to the second dimension) if threads in a warp access the same bank.

Tune block size for occupancy

Use cudaOccupancyMaxPotentialBlockSize or the CUDA Occupancy Calculator to find the block size that maximizes occupancy for your kernel’s register and shared memory usage.

Minimize host↔device transfers

Data transfers over PCIe are ~10–50× slower than global memory bandwidth. Batch transfers, use pinned memory (cudaMallocHost), or keep data on the GPU between kernel calls.

Getting Started

CUDA Fundamentals

Advanced GPU Programming

Triton & High-Level Frameworks

Quantization & Optimization

Multi-GPU & Systems

Hardware Targets

ScaleML Series

GPU Compute and Memory Architecture Explained

The GPU memory hierarchy

Registers

Shared memory / L1 cache

L2 cache

Global memory (HBM/GDDR)

Constant & texture memory

Memory bandwidth vs. compute throughput

Roofline intuition

Coalesced memory access

Shared memory: on-chip scratch space

Shared memory reduction example

Shared memory bank conflicts

Occupancy and register pressure

Practical memory optimization checklist

Lecture references

Lecture 4 materials

Reductions deep-dive

Build docs developers (and LLMs) love

Getting Started

CUDA Fundamentals

Advanced GPU Programming

Triton & High-Level Frameworks

Quantization & Optimization

Multi-GPU & Systems

Hardware Targets

ScaleML Series

Documentation Index

​The GPU memory hierarchy

Registers

Shared memory / L1 cache

L2 cache

Global memory (HBM/GDDR)

Constant & texture memory

​Memory bandwidth vs. compute throughput

​Roofline intuition

​Coalesced memory access

​Shared memory: on-chip scratch space

​Shared memory reduction example

​Shared memory bank conflicts

​Occupancy and register pressure

​Practical memory optimization checklist

​Lecture references

Lecture 4 materials

Reductions deep-dive

Build docs developers (and LLMs) love

The GPU memory hierarchy

Memory bandwidth vs. compute throughput

Roofline intuition

Coalesced memory access

Shared memory: on-chip scratch space

Shared memory reduction example

Shared memory bank conflicts

Occupancy and register pressure

Practical memory optimization checklist

Lecture references