CUDA Performance Checklist for Kernel Optimization

Lecture 8 by Mark Saroufim distills GPU kernel optimization into a concrete checklist. The core principle is simple: always profile first, never guess. Every item on this checklist has a measurable signal — a tool output, a counter, or a metric that tells you whether you have a problem and how severe it is. The code for this lecture lives in the lecture_008/ folder of the GPU Mode lectures repository.

Understand your hardware first

Before you can reason about performance, you need to know what your GPU is capable of. The print_cuda_info() function from utils.py gives you the baseline:

import torch
import subprocess

def print_cuda_info():
    print("=== PyTorch CUDA Info ===")
    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"cuDNN version: {torch.backends.cudnn.version()}")
    print(f"Number of GPUs: {torch.cuda.device_count()}")

    for i in range(torch.cuda.device_count()):
        print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
        print(f"    Current device: {torch.cuda.current_device()}")
        print(f"    Memory allocated: {torch.cuda.memory_allocated(i)/1e6:.2f} MB")
        print(f"    Memory cached   : {torch.cuda.memory_reserved(i)/1e6:.2f} MB")

    print("\n=== nvidia-smi Info (if available) ===")
    try:
        subprocess.run(["nvidia-smi"], check=True)
    except Exception as e:
        print(f"nvidia-smi not available: {e}")

Call this before profiling any kernel. You need to know your GPU’s peak FLOP/s and memory bandwidth to interpret profiler outputs — the roofline model requires both numbers.

Compiling and profiling

The lecture’s README.md shows the minimal workflow:

nvcc -o benchmark file.cu
ncu benchmark

ncu (Nsight Compute) gives you per-kernel metrics: achieved bandwidth, achieved throughput, occupancy, stall reasons, and more. Always use ncu or nsys (Nsight Systems) before drawing any conclusions about a kernel’s performance.

Never optimize without profiling. Intuition about GPU bottlenecks is frequently wrong. A kernel that “looks” slow because of a nested loop may actually be memory-bound, and restructuring the loop will change nothing.

The performance checklist

1. Profile first — identify the actual bottleneck

Run nsys for a system-level timeline (which kernels take the most time, PCIe transfers, synchronization gaps) and ncu for per-kernel hardware metrics.

# System profiling — timeline view
nsys profile --stats=true python train.py

# Kernel profiling — hardware counters
ncu --set full python train.py

Key questions to answer before touching any code:

Which kernel takes the most wall-clock time?
Is it memory-bound or compute-bound?
What is the achieved vs. theoretical bandwidth/throughput?

2. Check memory access patterns (coalescing)

Non-coalesced global memory access is one of the most common and impactful bottlenecks. From lecture_008/coalesce.cu:

// Coalesced: thread i reads in[i] — adjacent threads, adjacent addresses
__global__ void copyDataCoalesced(float *in, float *out, int n) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) {
        out[index] = in[index];
    }
}

// Non-coalesced: thread i reads in[(i*2) % n] — strided access pattern
__global__ void copyDataNonCoalesced(float *in, float *out, int n) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) {
        out[index] = in[(index * 2) % n];
    }
}

In ncu, look at l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum — a high sector count relative to the number of requests indicates poor coalescing.

3. Check occupancy

Low occupancy means the GPU doesn’t have enough warps to hide memory latency. From lecture_008/occupancy.cu, the CUDA API can suggest an optimal block size:

int minGridSize, blockSize;
cudaOccupancyMaxPotentialBlockSize(
    &minGridSize, &blockSize, copyDataCoalesced, 0, 0
);
// Prints recommended block size and minimum grid size
std::cout << "Recommended block size: " << blockSize
          << ", Minimum grid size: " << minGridSize << std::endl;

Common causes of low occupancy:

Too many registers per thread — reduce register usage or use __launch_bounds__
Too much shared memory per block — fewer blocks can fit per SM
Block size too small — use at least 128 threads per block (4 warps)

4. Check arithmetic intensity vs. roofline

Compute your kernel’s arithmetic intensity:

arithmetic intensity = FLOPs / bytes_transferred

Compare it to your GPU’s ridge point (peak FLOPs ÷ peak bandwidth). If you are below the ridge point, more compute won’t help — you need to reduce memory traffic or increase reuse via tiling and shared memory.For an A100 SXM4:

FP32 peak: 19.5 TFLOP/s
Memory bandwidth: 2.0 TB/s
Ridge point: ~9.75 FLOP/byte

5. Check warp divergence

Branches where threads in the same warp take different paths are serialized. From lecture_008/divergence.cu:

// Divergent: half the warp takes each branch
__global__ void processArrayWithDivergence(int *data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        if (data[idx] % 2 == 0) {
            data[idx] = data[idx] * 2;       // even branch
        } else {
            data[idx] = data[idx] + 1;       // odd branch
        }
    }
}

// No divergence: branchless equivalent
__global__ void processArrayWithoutDivergence(int *data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        int isEven = !(data[idx] % 2);
        data[idx] = isEven * (data[idx] * 2) + (!isEven) * (data[idx] + 1);
    }
}

In ncu, look at smsp__thread_inst_executed_pred_on.avg.pct_of_peak_sustained_active — low values indicate divergence.

6. Check launch overhead and kernel fusion

Each kernel launch has fixed overhead (~5–10 µs on CPU side). If your pipeline launches many small kernels, the launch overhead accumulates. Look for opportunities to fuse kernels: combine two or more sequential operations into one kernel, sharing the data load.Common fusion opportunities:

Activation function + bias add after a linear layer
Layer norm + dropout
Any element-wise operation after a compute-heavy kernel

Use nsys to identify sequences of short kernels with large CPU gaps between them.

7. Check host↔device data transfers

PCIe bandwidth (~32 GB/s) is ~60–90× slower than GPU memory bandwidth (~2 TB/s on HBM). Every unnecessary cudaMemcpy is expensive.Strategies to minimize transfers:

Keep tensors on the GPU between operations
Use torch.cuda.Stream to overlap computation and transfers
Use pinned (page-locked) memory for faster transfers: cudaMallocHost
Profile with nsys — PCIe transfers show as DMA events in the timeline

# Bad: unnecessary round-trip
x = model(x.cuda()).cpu()
y = another_model(x.cuda())

# Good: stay on GPU
x = x.cuda()
x = model(x)
y = another_model(x)

8. Check thread coarsening opportunities

Thread coarsening lets each thread do more work, reducing the overhead of block/thread management and improving arithmetic intensity. From lecture_008/coarsening.cu:

// Original: one thread per element
__global__ void VecAdd(float* A, float* B, float* C) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) C[i] = A[i] + B[i];
}

// Coarsened: one thread handles two elements
__global__ void VecAddCoarsened(float* A, float* B, float* C) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 2;
    if (i < N)     C[i]     = A[i]     + B[i];
    if (i + 1 < N) C[i + 1] = A[i + 1] + B[i + 1];
}

Coarsening is most effective when a kernel is latency-bound due to too many threads competing for shared resources.

Quick reference table

Issue	Signal	Fix
Memory bandwidth bottleneck	`ncu` achieved BW < theoretical	Improve coalescing, use shared memory tiling
Low occupancy	`ncu` occupancy < 50%	Reduce registers, reduce shared memory, increase block size
Warp divergence	`ncu` low thread utilization	Rewrite branches as branchless arithmetic
Kernel launch overhead	`nsys` many short kernels	Fuse kernels
PCIe transfer overhead	`nsys` DMA events between kernels	Keep data on GPU, use async transfers
Compute-bound	`ncu` SM utilization near 100%	Use tensor cores, reduce FLOPs, use lower precision

Getting Started

CUDA Fundamentals

Advanced GPU Programming

Triton & High-Level Frameworks

Quantization & Optimization

Multi-GPU & Systems

Hardware Targets

ScaleML Series

CUDA Performance Checklist for Kernel Optimization

Understand your hardware first

Compiling and profiling

The performance checklist

Quick reference table

Lecture references

Lecture 8 code

Lecture 8 slides

Build docs developers (and LLMs) love

Getting Started

CUDA Fundamentals

Advanced GPU Programming

Triton & High-Level Frameworks

Quantization & Optimization

Multi-GPU & Systems

Hardware Targets

ScaleML Series

Documentation Index

​Understand your hardware first

​Compiling and profiling

​The performance checklist

​Quick reference table

​Lecture references

Lecture 8 code

Lecture 8 slides

Build docs developers (and LLMs) love

Understand your hardware first

Compiling and profiling

The performance checklist

Quick reference table

Lecture references