Lecture 8 by Mark Saroufim distills GPU kernel optimization into a concrete checklist. The core principle is simple: always profile first, never guess. Every item on this checklist has a measurable signal — a tool output, a counter, or a metric that tells you whether you have a problem and how severe it is. The code for this lecture lives in theDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/gpu-mode/lectures/llms.txt
Use this file to discover all available pages before exploring further.
lecture_008/ folder of the GPU Mode lectures repository.
Understand your hardware first
Before you can reason about performance, you need to know what your GPU is capable of. Theprint_cuda_info() function from utils.py gives you the baseline:
Compiling and profiling
The lecture’sREADME.md shows the minimal workflow:
ncu (Nsight Compute) gives you per-kernel metrics: achieved bandwidth, achieved throughput, occupancy, stall reasons, and more. Always use ncu or nsys (Nsight Systems) before drawing any conclusions about a kernel’s performance.
The performance checklist
1. Profile first — identify the actual bottleneck
1. Profile first — identify the actual bottleneck
Run Key questions to answer before touching any code:
nsys for a system-level timeline (which kernels take the most time, PCIe transfers, synchronization gaps) and ncu for per-kernel hardware metrics.- Which kernel takes the most wall-clock time?
- Is it memory-bound or compute-bound?
- What is the achieved vs. theoretical bandwidth/throughput?
2. Check memory access patterns (coalescing)
2. Check memory access patterns (coalescing)
Non-coalesced global memory access is one of the most common and impactful bottlenecks. From In
lecture_008/coalesce.cu:ncu, look at l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum — a high sector count relative to the number of requests indicates poor coalescing.3. Check occupancy
3. Check occupancy
Low occupancy means the GPU doesn’t have enough warps to hide memory latency. From Common causes of low occupancy:
lecture_008/occupancy.cu, the CUDA API can suggest an optimal block size:- Too many registers per thread — reduce register usage or use
__launch_bounds__ - Too much shared memory per block — fewer blocks can fit per SM
- Block size too small — use at least 128 threads per block (4 warps)
4. Check arithmetic intensity vs. roofline
4. Check arithmetic intensity vs. roofline
Compute your kernel’s arithmetic intensity:Compare it to your GPU’s ridge point (peak FLOPs ÷ peak bandwidth). If you are below the ridge point, more compute won’t help — you need to reduce memory traffic or increase reuse via tiling and shared memory.For an A100 SXM4:
- FP32 peak: 19.5 TFLOP/s
- Memory bandwidth: 2.0 TB/s
- Ridge point: ~9.75 FLOP/byte
5. Check warp divergence
5. Check warp divergence
Branches where threads in the same warp take different paths are serialized. From In
lecture_008/divergence.cu:ncu, look at smsp__thread_inst_executed_pred_on.avg.pct_of_peak_sustained_active — low values indicate divergence.6. Check launch overhead and kernel fusion
6. Check launch overhead and kernel fusion
Each kernel launch has fixed overhead (~5–10 µs on CPU side). If your pipeline launches many small kernels, the launch overhead accumulates. Look for opportunities to fuse kernels: combine two or more sequential operations into one kernel, sharing the data load.Common fusion opportunities:
- Activation function + bias add after a linear layer
- Layer norm + dropout
- Any element-wise operation after a compute-heavy kernel
nsys to identify sequences of short kernels with large CPU gaps between them.7. Check host↔device data transfers
7. Check host↔device data transfers
PCIe bandwidth (~32 GB/s) is ~60–90× slower than GPU memory bandwidth (~2 TB/s on HBM). Every unnecessary
cudaMemcpy is expensive.Strategies to minimize transfers:- Keep tensors on the GPU between operations
- Use
torch.cuda.Streamto overlap computation and transfers - Use pinned (page-locked) memory for faster transfers:
cudaMallocHost - Profile with
nsys— PCIe transfers show as DMA events in the timeline
8. Check thread coarsening opportunities
8. Check thread coarsening opportunities
Thread coarsening lets each thread do more work, reducing the overhead of block/thread management and improving arithmetic intensity. From Coarsening is most effective when a kernel is latency-bound due to too many threads competing for shared resources.
lecture_008/coarsening.cu:Quick reference table
| Issue | Signal | Fix |
|---|---|---|
| Memory bandwidth bottleneck | ncu achieved BW < theoretical | Improve coalescing, use shared memory tiling |
| Low occupancy | ncu occupancy < 50% | Reduce registers, reduce shared memory, increase block size |
| Warp divergence | ncu low thread utilization | Rewrite branches as branchless arithmetic |
| Kernel launch overhead | nsys many short kernels | Fuse kernels |
| PCIe transfer overhead | nsys DMA events between kernels | Keep data on GPU, use async transfers |
| Compute-bound | ncu SM utilization near 100% | Use tensor cores, reduce FLOPs, use lower precision |
Lecture references
Lecture 8 code
coalesce.cu, divergence.cu, occupancy.cu, coarsening.cu, and more
Lecture 8 slides
CUDA Performance Checklist slides by Mark Saroufim