Use this file to discover all available pages before exploring further.
Tensor Cores are NVIDIA’s specialized matrix-multiply-accumulate (MMA) units that deliver an order-of-magnitude higher throughput than regular CUDA cores for matrix operations. Unlocking their full performance requires going beyond torch.matmul and working with CUTLASS and CuTE — NVIDIA’s composable building blocks for high-performance GEMM. This page covers Lectures 15 (Eric Auld), 23 (Vijay Thakkar and Pradeep Ramani), 57 (Cris Cecka), and related lectures 86 and 103–104.
Introduced in Volta (2017), Tensor Cores are fixed-function hardware units that compute a small matrix multiply-accumulate (MMA) in a single clock cycle. Each generation adds new precision support and tile sizes:
Generation
Architecture
Tile size
Precisions
1st gen
Volta (V100)
4×4×4
FP16
2nd gen
Turing (T4)
16×16×16, 8×16×16
FP16, INT8, INT4
3rd gen
Ampere (A100)
16×16×16
FP16, BF16, TF32, FP64, INT8
4th gen
Hopper (H100)
64×16×16 (WGMMA)
FP16, BF16, FP8
The practical implication: an A100 delivers 77.6 TFLOPS for FP16 using CUDA cores, but 312 TFLOPS for FP16 using Tensor Cores — a 4× multiplier for the same watt budget.
Tensor Cores compute exact results (no approximation) on the input precision. The output accumulator is typically FP32 even when inputs are FP16, preventing precision loss during summation.
The WMMA (Warp Matrix Multiply-Accumulate) API is the CUDA C++ interface for Tensor Cores, available since CUDA 9.0. It operates at the warp level — all 32 threads in a warp cooperate on a single MMA tile.
#include <mma.h>using namespace nvcuda::wmma;// Define fragments for a 16x16x16 FP16 MMAfragment<matrix_a, 16, 16, 16, half, row_major> frag_a;fragment<matrix_b, 16, 16, 16, half, col_major> frag_b;fragment<accumulator, 16, 16, 16, float> frag_c;// Initialize accumulator to zerofill_fragment(frag_c, 0.0f);// Load tiles from shared memory (all 32 threads cooperate)load_matrix_sync(frag_a, smem_a_ptr, lda);load_matrix_sync(frag_b, smem_b_ptr, ldb);// Execute MMA: frag_c += frag_a * frag_bmma_sync(frag_c, frag_a, frag_b, frag_c);// Store result to global memorystore_matrix_sync(output_ptr, frag_c, ldc, mem_row_major);
The WMMA API distributes fragment data across the 32 threads in the warp in an implementation-defined way. You cannot directly index into a fragment — use load_matrix_sync / store_matrix_sync only. Accessing fragment elements directly is non-portable.
CUTLASS (CUDA Templates for Linear Algebra Subroutines and Solvers) is NVIDIA’s open-source C++ template library for high-performance matrix operations. It provides a layered abstraction:
Device-level GEMM (handles dispatch, problem decomposition) ↓Threadblock-level MMA (schedules tiles across the SM) ↓Warp-level MMA (coordinates warp-level Tensor Core operations) ↓Thread-level MMA (WMMA or PTX mma instructions)
Each layer is independently composable. You can mix and match tile sizes, pipeline depths, epilogues, and data layouts without rewriting the full kernel.A minimal CUTLASS 2.x GEMM:
CuTE (covered in Lecture 57 by Cris Cecka) is the abstraction layer introduced with CUTLASS 3.x that unifies how tensors, layouts, and tiling are expressed in GPU code.The core insight: almost all GEMM complexity comes from managing tensor layouts, tile boundaries, and index arithmetic. CuTE provides a small algebra to express all of this uniformly.
CUTLASS 3.x restructures the GEMM around CuTE abstractions. The kernel is expressed as a sequence of tiled MMA operations with explicit pipeline stages:
CUTLASS 3.x with Hopper (H100) uses WGMMA (warpgroup MMA) instructions and TMA (Tensor Memory Accelerator) for asynchronous data movement. The pipeline abstraction in CollectiveMma handles the overlap between data loading and computation automatically.
Lecture 86 by Vicki Wang introduces CuTeDSL — a Python-level domain-specific language built on CuTE that lets you write and prototype CUTLASS 3.x kernels in Python, with JIT compilation to GPU code. This is particularly useful for rapid experimentation with tile configurations and MMA shapes.
# CuTeDSL (Python DSL for CuTE)from cutedsl import ...@cutedsl.kerneldef gemm_kernel(A, B, C, alpha, beta): # Express tiling and MMA in Python with CuTE semantics mma = TiledMMA(shape=(16, 8, 16), dtype=float16) tile_a = partition_src(A, mma) tile_b = partition_src(B, mma) tile_c = partition_dst(C, mma) for k in range(K // 16): gemm(mma, tile_a[k], tile_b[k], tile_c)