Writing high-performance GPU kernels — the kind that saturate hardware and outperform hand-tuned libraries — requires deep knowledge of GPU architecture and a significant amount of low-level code. ScaleML Lecture 75 is split between two speakers: William Brandon delivers a GPU programming fundamentals refresher, and Simran Arora introduces ThunderKittens, a domain-specific language (DSL) for writing efficient attention kernels at a higher level of abstraction. This lecture is part of the GPU Mode ScaleML Series.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/gpu-mode/lectures/llms.txt
Use this file to discover all available pages before exploring further.
Part 1: GPU programming fundamentals (William Brandon)
Before writing any kernel, you need a working mental model of GPU hardware. This section covers the hierarchy from threads to hardware, and how attention kernels map onto it.GPU memory hierarchy
Modern NVIDIA GPUs expose three levels of memory that a kernel programmer must understand:Global (HBM)
High-bandwidth memory on the GPU die. Terabytes per second of bandwidth, but high latency (~400–800 cycles). Every tensor you allocate with
torch.empty(..., device='cuda') lives here.Shared memory (SRAM)
On-chip scratchpad shared by all threads in a block. Very low latency (~5 ns) and very high bandwidth, but tiny — typically 48 KB to 228 KB per SM depending on configuration. The key resource for tiling algorithms.
Registers
Per-thread private storage. The fastest possible memory — register reads have no latency overhead. Limited to a few hundred per thread; spilling to global memory is catastrophically expensive.
FlashAttention achieves its speedup primarily by tiling the attention computation so that the Q, K, V tiles fit in shared memory. This avoids repeated round-trips to global memory for the intermediate attention matrix, which is the bottleneck in naive attention implementations.
Warps, blocks, and the execution model
Threads are the basic unit of execution, but they execute in lockstep groups of 32 called warps:- A warp is 32 threads that execute the same instruction simultaneously (SIMT)
- A block (also called a cooperative thread array, CTA) is a group of up to 1024 threads that share an SM and its shared memory
- A grid is the full set of blocks launched by a single kernel call
Writing an attention kernel from scratch
A naive attention kernel suffers from reading and writing the full attention matrix to global memory. Here is the structure of a tiled implementation that avoids this:flash_attention_inner function implements the online softmax trick from FlashAttention: it maintains running statistics (max and denominator ) that let you update the output accumulator incrementally without materializing the full attention row.
Part 2: ThunderKittens (Simran Arora)
Writing the kernel above correctly — with correct memory access patterns, tensor core utilization, and pipeline overlap — takes thousands of lines of careful CUDA C++. ThunderKittens is a DSL that makes this tractable.What ThunderKittens is
ThunderKittens (TK) is a C++ library developed at Stanford and Together AI that provides tile-level abstractions for writing GPU kernels. Instead of reasoning about individual threads and bytes, you write operations on tiles — rectangular blocks of data that map directly onto warp-level hardware primitives (tensor core operations, shared memory loads, register files).Tile types
ThunderKittens has three fundamental tile types, each mapping to a different level of the memory hierarchy:rt (register tile)
A tile of values distributed across the registers of threads in a warp. This is the primary compute tile — tensor core MMA operations consume and produce register tiles.
st (shared tile)
A tile of values in shared memory, accessible by all threads in a block. Used for staging data between global memory and registers.
gl (global layout)
A descriptor for data in global memory, parameterized by shape and stride. Used for structured loads and stores.
Core operations: load, store, MMA
TK provides three categories of operations that map directly to hardware: Load and store move data between the memory hierarchy levels:Writing Flash Attention with ThunderKittens
Here is a simplified but representative TK kernel for Flash Attention forward pass:Performance: how TK achieves near-cuBLAS speeds
ThunderKittens achieves high performance through several mechanisms:Tensor core alignment
All register tiles are sized and aligned to match the hardware’s MMA instruction dimensions (16×8×16 for bfloat16 on Ampere/Hopper). No reshape or copy is needed before issuing tensor core instructions.
Warp-level abstraction
TK operations map to single warps. This makes it easy to pipeline warps using
warpgroup abstractions, overlapping memory loads from one warp with compute in another.Asynchronous copies
Shared memory loads use
cp.async instructions (exposed via TK’s load for shared tiles), overlapping memory transfers with compute from previous tiles.When to use ThunderKittens vs. raw CUDA vs. Triton
Use ThunderKittens when...
Use ThunderKittens when...
- You are writing a new attention variant or custom matmul-style kernel
- You need tensor core utilization and correct warp-level data layout
- You want near-hardware-peak performance without writing raw PTX or assembly
- You are doing kernel research and need to iterate quickly on new designs
Use raw CUDA when...
Use raw CUDA when...
- You need fine-grained control over instruction scheduling or PTX-level optimizations
- You are writing non-attention kernels (reductions, scans, custom elementwise ops)
- You need to interface directly with CUTLASS or cuBLAS primitives
- You are targeting architectures where TK does not yet have support
Use Triton when...
Use Triton when...
- You are a Python programmer who wants GPU performance without C++
- Your kernel is embarrassingly parallel and fits the Triton tile model
- You want easy multi-backend support (NVIDIA, AMD, Intel via the Triton backend ecosystem)
- You do not need the final ~10% of performance that raw CUDA/TK can provide
ThunderKittens is actively developed and is specifically designed for the post-Hopper GPU generation where tensor core throughput is the dominant performance lever. If you are writing custom attention mechanisms for research, TK is currently one of the most practical ways to get hardware-competitive implementations without a team of CUDA experts.
Lecture references
GPU Programming Fundamentals slides
William Brandon’s slides on GPU programming fundamentals (Lecture 75, Part 1)
ThunderKittens slides
Simran Arora’s ThunderKittens slides (ThunderKittens.pdf in the lecture_075 folder)
Simran Arora
Speaker homepage — research on efficient ML systems and custom hardware kernels
GPU Mode YouTube
Full lecture recordings on the GPU Mode YouTube channel