Tensor Cores, CUTLASS, and CuTE Layout Algebra

Tensor Cores are NVIDIA’s specialized matrix-multiply-accumulate (MMA) units. For matrix operations they deliver several times, and at some precisions an order of magnitude, more throughput than regular CUDA cores. Writing custom kernels that reach this throughput means going beyond torch.matmul and working with CUTLASS and CuTE, NVIDIA’s composable building blocks for high-performance GEMM. This page covers Lectures 15 (Eric Auld), 23 (Vijay Thakkar and Pradeep Ramani), 57 (Cris Cecka), and related lectures 86 and 103–104.

What Tensor Cores are

Introduced in Volta (2017), Tensor Cores are fixed-function hardware units that compute a small matrix multiply-accumulate (MMA), D = A×B + C, as a single hardware operation. On Volta, each Tensor Core completes one 4×4×4 FP16 MMA per clock. Each generation adds new precision support and tile sizes:

Generation	Architecture	Tile size (M×N×K)	Precisions
1st gen	Volta (V100)	4×4×4	FP16
2nd gen	Turing (T4)	16×16×16, 32×8×16, 8×32×16 (WMMA)	FP16, INT8, INT4
3rd gen	Ampere (A100)	16×16×16	FP16, BF16, TF32, FP64, INT8
4th gen	Hopper (H100)	64×N×16 for FP16/BF16 (WGMMA, N up to 256)	FP16, BF16, FP8

The practical implication: an A100 peaks at about 78 TFLOPS for FP16 on CUDA cores but 312 TFLOPS for dense FP16 on Tensor Cores, a 4× multiplier on the same chip and power budget.
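The 312 TFLOPS figure and the 4× ratio above can be reproduced from per-SM rates. The sketch below is plain host C++; the SM count, boost clock, and per-clock FMA rates are assumptions taken from NVIDIA's published A100 (SXM) specifications, and the helper name `peak_tflops` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Peak dense throughput in TFLOPS from the per-SM FMA rate and the clock.
// One fused multiply-add (FMA) counts as 2 floating-point operations.
constexpr double peak_tflops(std::int64_t sm_count,
                             std::int64_t fma_per_clock_per_sm,
                             std::int64_t boost_clock_mhz) {
    // FMA/clock * 2 FLOP/FMA * MHz gives units of 1e6 FLOP/s;
    // dividing by 1e6 converts that to 1e12 FLOP/s (TFLOPS).
    return static_cast<double>(sm_count * fma_per_clock_per_sm * 2 * boost_clock_mhz) / 1.0e6;
}

// Assumed A100 (SXM) figures from NVIDIA's public specifications.
constexpr std::int64_t kA100SmCount       = 108;
constexpr std::int64_t kA100BoostClockMhz = 1410;
constexpr std::int64_t kTensorFmaPerClock = 4 * 256;  // 4 Tensor Cores per SM, 256 FP16 FMA each
constexpr std::int64_t kCudaFmaPerClock   = 256;      // non-Tensor FP16 FMA per SM

constexpr double kA100TensorPeak = peak_tflops(kA100SmCount, kTensorFmaPerClock, kA100BoostClockMhz);
constexpr double kA100CudaPeak   = peak_tflops(kA100SmCount, kCudaFmaPerClock, kA100BoostClockMhz);
```

These are theoretical peaks at boost clock; sustained throughput in a real kernel is lower and depends on how well the kernel keeps the Tensor Cores fed.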

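Tensor Cores are not an approximate unit: they multiply the inputs at the stated input precision and sum the products into an accumulator. The accumulator is typically FP32 even when inputs are FP16, which limits precision loss during summation. Results can still differ slightly from a pure FP32 computation, because the inputs are already rounded to low precision and the order of accumulation differs.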
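To see why the accumulator precision matters, the following host-side sketch emulates an FP16 accumulator by rounding the running sum to an 11-bit significand after every addition. It is a simplified model (overflow and subnormals are ignored) and the function names are hypothetical; it does not use any Tensor Core API.

```cpp
#include <cassert>
#include <cmath>

// Round a finite float to the nearest value with an 11-bit significand,
// which is the precision of IEEE FP16. Overflow and subnormals are ignored.
inline float round_to_fp16_precision(float x) {
    if (x == 0.0f) return 0.0f;
    int exponent = 0;
    // x = mantissa * 2^exponent with |mantissa| in [0.5, 1)
    const float mantissa = std::frexp(x, &exponent);
    // Keep 11 significant bits; nearbyint rounds ties to even by default.
    const float scaled = std::nearbyint(std::ldexp(mantissa, 11));
    return std::ldexp(scaled, exponent - 11);
}

// Sum `count` ones, rounding the running sum to FP16 precision after each add.
inline float sum_ones_fp16_accumulator(int count) {
    float acc = 0.0f;
    for (int i = 0; i < count; ++i) {
        acc = round_to_fp16_precision(acc + 1.0f);
    }
    return acc;
}

// Same sum with an ordinary FP32 accumulator.
inline float sum_ones_fp32_accumulator(int count) {
    float acc = 0.0f;
    for (int i = 0; i < count; ++i) {
        acc += 1.0f;
    }
    return acc;
}
```

Summing 4096 ones stalls at 2048 with the emulated FP16 accumulator, because 2049 is not representable at that precision and rounds back to 2048, while the FP32 accumulator reaches 4096.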

WMMA API: low-level Tensor Core access

The WMMA (Warp Matrix Multiply-Accumulate) API is the CUDA C++ interface for Tensor Cores, available since CUDA 9.0. It operates at the warp level — all 32 threads in a warp cooperate on a single MMA tile.

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda::wmma;

// One warp computes a 16x16 output tile C = A * B (requires sm_70 or newer).
// Must be called by all 32 threads of the warp; lda/ldb/ldc are in elements.
__device__ void warp_mma_16x16x16(const half* smem_a_ptr, int lda,
                                  const half* smem_b_ptr, int ldb,
                                  float* output_ptr, int ldc) {
    // Define fragments for a 16x16x16 FP16 MMA
    fragment<matrix_a, 16, 16, 16, half, row_major>    frag_a;
    fragment<matrix_b, 16, 16, 16, half, col_major>    frag_b;
    fragment<accumulator, 16, 16, 16, float>            frag_c;

    // Initialize accumulator to zero
    fill_fragment(frag_c, 0.0f);

    // Load tiles from shared memory (all 32 threads cooperate)
    load_matrix_sync(frag_a, smem_a_ptr, lda);
    load_matrix_sync(frag_b, smem_b_ptr, ldb);

    // Execute MMA: frag_c += frag_a * frag_b
    mma_sync(frag_c, frag_a, frag_b, frag_c);

    // Store result to global memory
    store_matrix_sync(output_ptr, frag_c, ldc, mem_row_major);
}

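The WMMA API distributes fragment data across the 32 threads in the warp in an implementation-defined way. A fragment does expose num_elements and x[] for uniform element-wise operations such as scaling, but the mapping from x[i] to a matrix coordinate is unspecified. Move tiles in and out with load_matrix_sync / store_matrix_sync; code that assumes a particular element position is non-portable.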
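For checking a WMMA kernel, it helps to have a scalar reference for what one 16x16x16 MMA computes. The following is a minimal CPU sketch using the same layouts as the fragments above (A row-major, B column-major, C row-major); it uses float throughout rather than half, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

constexpr int kTile = 16;

// CPU reference for one warp-level tile: C(16x16) += A(16x16) * B(16x16).
// A is row-major (leading dimension lda), B is column-major (ldb),
// C is row-major (ldc). Leading dimensions are in elements.
inline void reference_mma_16x16x16(const float* a, std::size_t lda,
                                   const float* b, std::size_t ldb,
                                   float* c, std::size_t ldc) {
    for (int m = 0; m < kTile; ++m) {
        for (int n = 0; n < kTile; ++n) {
            float acc = c[m * ldc + n];
            for (int k = 0; k < kTile; ++k) {
                // A(m, k) is at a[m * lda + k]; B(k, n) is at b[k + n * ldb].
                acc += a[m * lda + k] * b[k + n * ldb];
            }
            c[m * ldc + n] = acc;
        }
    }
}
```

Comparing a GPU result against this reference needs a tolerance, since the hardware rounds inputs to FP16 and may accumulate in a different order.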

CUTLASS library overview

CUTLASS (CUDA Templates for Linear Algebra Subroutines and Solvers) is NVIDIA’s open-source C++ template library for high-performance matrix operations. It provides a layered abstraction:

Device-level GEMM       (handles dispatch, problem decomposition)
    ↓
Threadblock-level MMA   (schedules tiles across the SM)
    ↓
Warp-level MMA          (coordinates warp-level Tensor Core operations)
    ↓
Instruction-level MMA   (WMMA or PTX mma instructions, issued per warp)

Each layer is independently composable. You can mix and match tile sizes, pipeline depths, epilogues, and data layouts without rewriting the full kernel. A minimal CUTLASS 2.x GEMM:

#include <cutlass/gemm/device/gemm.h>

using Gemm = cutlass::gemm::device::Gemm<
    cutlass::half_t,                        // ElementA
    cutlass::layout::RowMajor,              // LayoutA
    cutlass::half_t,                        // ElementB
    cutlass::layout::ColumnMajor,           // LayoutB
    float,                                  // ElementOutput
    cutlass::layout::RowMajor,              // LayoutOutput
    float,                                  // ElementAccumulator
    cutlass::arch::OpClassTensorOp,         // Use Tensor Cores
    cutlass::arch::Sm80                     // Target A100
>;

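Gemm gemm_op;
Gemm::Arguments args{
    {M, N, K},                              // Problem size
    {A, lda}, {B, ldb}, {C, ldc}, {D, ldd}, // Device pointers and leading dimensions
    {alpha, beta}                           // Epilogue: D = alpha * A*B + beta * C
};
cutlass::Status status = gemm_op(args);     // Launches the kernel
if (status != cutlass::Status::kSuccess) {
    // Handle the error, e.g. report cutlassGetStatusString(status)
}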
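The layered decomposition can be pictured as a loop nest. The sketch below is a host-side model, not CUTLASS code: the outer loops play the role of threadblock tiles, the loop over k0 is the mainloop, and the final loop is the alpha/beta epilogue. The function name and the default tile sizes are hypothetical, chosen only for illustration; the layouts match the example above (A row-major, B column-major, C and D row-major).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Host-side model of a tiled GEMM: D = alpha * (A * B) + beta * C.
inline void tiled_gemm_reference(int M, int N, int K, float alpha,
                                 const std::vector<float>& A, int lda,
                                 const std::vector<float>& B, int ldb,
                                 float beta,
                                 const std::vector<float>& C, int ldc,
                                 std::vector<float>& D, int ldd,
                                 int tile_m = 4, int tile_n = 4, int tile_k = 2) {
    assert(tile_m > 0 && tile_n > 0 && tile_k > 0);
    for (int m0 = 0; m0 < M; m0 += tile_m) {
        for (int n0 = 0; n0 < N; n0 += tile_n) {
            const int m_end = std::min(m0 + tile_m, M);
            const int n_end = std::min(n0 + tile_n, N);
            // Accumulators for one output tile (the role of one threadblock).
            std::vector<float> acc(static_cast<std::size_t>(tile_m) * tile_n, 0.0f);

            // Mainloop: walk the K dimension one tile at a time.
            for (int k0 = 0; k0 < K; k0 += tile_k) {
                const int k_end = std::min(k0 + tile_k, K);
                for (int m = m0; m < m_end; ++m)
                    for (int n = n0; n < n_end; ++n)
                        for (int k = k0; k < k_end; ++k)
                            acc[(m - m0) * tile_n + (n - n0)] += A[m * lda + k] * B[k + n * ldb];
            }

            // Epilogue: scale the accumulators and blend with C.
            for (int m = m0; m < m_end; ++m)
                for (int n = n0; n < n_end; ++n)
                    D[m * ldd + n] = alpha * acc[(m - m0) * tile_n + (n - n0)] + beta * C[m * ldc + n];
        }
    }
}
```

Changing the tile sizes does not change the result, only the order in which work is done, which is the property that lets a library tune tile shapes independently of the epilogue.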

CuTE: Composable Universal Tensor Extensions

CuTE (covered in Lecture 57 by Cris Cecka) is the abstraction layer introduced with CUTLASS 3.x that unifies how tensors, layouts, and tiling are expressed in GPU code. The core insight: almost all GEMM complexity comes from managing tensor layouts, tile boundaries, and index arithmetic. CuTE provides a small algebra to express all of this uniformly.

Layout algebra basics

A CuTE layout is a pair (Shape, Stride) that maps multi-dimensional indices to a flat memory offset:

offset = Σ (index_i * stride_i)
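As a worked instance of the formula above, here is a plain C++ model of a rank-2 (Shape, Stride) pair. It is not the CuTE API; `Layout2D` and its members are hypothetical names for illustration.

```cpp
#include <cassert>

// Minimal model of a rank-2 layout: a shape plus one stride per mode.
struct Layout2D {
    int rows, cols;              // shape
    int stride_row, stride_col;  // stride

    // offset = index_0 * stride_0 + index_1 * stride_1
    constexpr int operator()(int i, int j) const {
        return i * stride_row + j * stride_col;
    }
};

// 4x8 row-major: shape (4,8), stride (8,1)
constexpr Layout2D row_major_4x8{4, 8, 8, 1};
// 4x8 column-major: shape (4,8), stride (1,4)
constexpr Layout2D col_major_4x8{4, 8, 1, 4};

// The same logical coordinate (2,3) lands at different memory offsets.
static_assert(row_major_4x8(2, 3) == 19, "2*8 + 3*1");
static_assert(col_major_4x8(2, 3) == 14, "2*1 + 3*4");
```

The coordinate is the same in both cases; only the strides decide where it lands in memory, which is why a layout change does not require touching the indexing code.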

Layouts compose: you can tile a layout, permute its dimensions, or slice it, and the result is still a valid layout.

#include <cute/layout.hpp>
using namespace cute;

// A 4x8 matrix stored row-major: shape=(4,8), stride=(8,1)
auto layout = make_layout(make_shape(4, 8), make_stride(8, 1));

// The same matrix, column-major: shape=(4,8), stride=(1,4)
auto layout_cm = make_layout(make_shape(4, 8), make_stride(1, 4));

// Tile the layout into 2x2 tiles
auto tiled = zipped_divide(layout, make_shape(2, 2));
// tiled has shape ((2,2),(2,4)) — tile shape × number of tiles
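The tiled shape ((2,2),(2,4)) carries strides ((8,1),(16,2)) for the row-major 4x8 case: stepping inside a tile uses the original strides, and stepping between tiles uses the original stride times the tile extent. The sketch below models that in plain C++; it is an illustration of the arithmetic rather than CuTE code, and `TiledLayout2D` is a hypothetical name.

```cpp
#include <cassert>

// Model of a 2-D layout divided into fixed-size tiles.
// (i, j) is the coordinate inside a tile; (tile_i, tile_j) selects the tile.
struct TiledLayout2D {
    int tile_rows, tile_cols;    // tile shape
    int stride_row, stride_col;  // strides of the untiled layout

    constexpr int operator()(int i, int j, int tile_i, int tile_j) const {
        return i * stride_row + j * stride_col              // within the tile
             + tile_i * (tile_rows * stride_row)            // which tile row
             + tile_j * (tile_cols * stride_col);           // which tile column
    }
};

// Row-major 4x8 (stride (8,1)) split into 2x2 tiles:
// in-tile strides (8,1), tile strides (16,2).
constexpr TiledLayout2D tiled_4x8{2, 2, 8, 1};

// Element (1,1) of tile (1,3) is logical element (3,7), at offset 3*8 + 7.
static_assert(tiled_4x8(1, 1, 1, 3) == 3 * 8 + 7, "tiled and flat offsets agree");
```

The tiled view addresses exactly the same memory as the flat layout; it only regroups the coordinates so that a tile can be selected by fixing the second pair.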

Tiling and partitioning

The power of CuTE’s layout algebra is that tiling becomes a pure layout transformation — no explicit index arithmetic needed:

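// Partition a 128x64 shared memory tile across a 2x4 warp grid
// Each warp gets a 64x16 subtile
auto smem_layout = make_layout(make_shape(128, 64));  // column-major by default
auto warp_shape  = make_shape(64, 16);                // subtile owned by one warp

auto warp_tiles = zipped_divide(smem_layout, warp_shape);
// warp_tiles has shape ((64,16),(2,4)): mode 0 indexes inside a subtile,
// mode 1 selects the warp in the 2x4 grid.
// warp_tiles(make_coord(make_coord(0, 0), make_coord(wm, wn))) is the
// starting offset in smem of warp (wm, wn)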
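For comparison, this is the index arithmetic that the layout transformation replaces, written out by hand in plain C++. It assumes a column-major 128x64 tile (strides (1,128)), a 64x16 subtile per warp, and warps numbered column-major over the 2x4 grid (warp_id = warp_m + 2 * warp_n); that numbering and the names below are assumptions for illustration.

```cpp
#include <cassert>

struct WarpTile {
    int row_begin;    // first row of the warp's subtile
    int col_begin;    // first column of the warp's subtile
    int smem_offset;  // element offset of (row_begin, col_begin) in smem
};

// Hand-written partitioning of a column-major 128x64 tile over a 2x4 warp grid.
constexpr WarpTile warp_subtile(int warp_id) {
    // warp_m = warp_id % 2 selects one of two 64-row bands;
    // warp_n = warp_id / 2 selects one of four 16-column bands.
    // Column-major strides are (1, 128), so offset = row + col * 128.
    return WarpTile{(warp_id % 2) * 64,
                    (warp_id / 2) * 16,
                    (warp_id % 2) * 64 + (warp_id / 2) * 16 * 128};
}

static_assert(warp_subtile(0).smem_offset == 0, "warp 0 starts at the origin");
static_assert(warp_subtile(7).smem_offset == 64 + 48 * 128, "warp 7 starts at (64, 48)");
```

Every constant here (64, 16, 128, the warp numbering) is baked into the arithmetic, whereas in the layout formulation they are data that can be changed without rewriting the indexing.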

Building a GEMM with CUTLASS 3.x

CUTLASS 3.x restructures the GEMM around CuTE abstractions. The kernel is expressed as a sequence of tiled MMA operations with explicit pipeline stages:

#include <cute/tensor.hpp>
#include <cutlass/gemm/collective/collective_mma.hpp>
#include <cutlass/epilogue/collective/collective_epilogue.hpp>
#include <cutlass/gemm/kernel/gemm_universal.hpp>

// Schematic only: the real CollectiveMma takes further template parameters
// (strides, copy atoms, shared-memory layouts) that are elided here.

// Define the collective MMA operation (mainloop)
using CollectiveMma = cutlass::gemm::collective::CollectiveMma<
    cutlass::gemm::MainloopSm80CpAsync<3>,           // 3-stage cp.async pipeline
    cute::Shape<cute::_128, cute::_128, cute::_32>,  // Threadblock tile (M, N, K)
    cutlass::half_t, cutlass::layout::RowMajor,
    cutlass::half_t, cutlass::layout::ColumnMajor,
    float,
    cute::TiledMMA<...>                              // Warp-level MMA config
>;

// Define the epilogue (applies alpha/beta scaling, writes output)
using CollectiveEpilogue = cutlass::epilogue::collective::DefaultEpilogue<...>;

// Assemble the full kernel
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
    cute::Shape<int, int, int, int>,                 // Problem shape (M, N, K, L), set at runtime
    CollectiveMma,
    CollectiveEpilogue
>;

CUTLASS 3.x with Hopper (H100) uses WGMMA (warpgroup MMA) instructions and TMA (Tensor Memory Accelerator) for asynchronous data movement. The pipeline abstraction in CollectiveMma handles the overlap between data loading and computation automatically.
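The overlap that a multi-stage pipeline provides can be modelled on the host as an event schedule. The sketch below is not CUTLASS code: it assumes a simple scheme in which stages - 1 tiles are prefetched up front and each mainloop iteration issues one further load before computing, with shared-memory slots reused round-robin. The type and function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// One step in the schedule: 'L' = issue load of a K tile, 'C' = compute on it.
struct PipelineEvent {
    char op;
    int k_tile;  // which K tile the event refers to
    int slot;    // which shared-memory buffer (0 .. stages-1) it uses
};

// Issue order for a software pipeline with `stages` buffers over `k_tiles` tiles.
inline std::vector<PipelineEvent> multistage_schedule(int k_tiles, int stages) {
    assert(k_tiles >= 0 && stages >= 1);
    std::vector<PipelineEvent> events;

    // Prologue: prefetch up to stages-1 tiles before any compute starts.
    int next_load = 0;
    for (; next_load < stages - 1 && next_load < k_tiles; ++next_load) {
        events.push_back({'L', next_load, next_load % stages});
    }

    // Mainloop: issue the next load (if any), then compute the current tile.
    // The load targets the slot whose tile was consumed in the previous iteration.
    for (int k = 0; k < k_tiles; ++k) {
        if (next_load < k_tiles) {
            events.push_back({'L', next_load, next_load % stages});
            ++next_load;
        }
        events.push_back({'C', k, k % stages});
    }
    return events;
}
```

With 3 stages and 5 tiles the order is L0, L1, L2, C0, L3, C1, L4, C2, C3, C4: after the prologue there are loads in flight while each tile is computed, until the loads run out near the end.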

CuTeDSL

Lecture 86 by Vicki Wang introduces CuTeDSL — a Python-level domain-specific language built on CuTE that lets you write and prototype CUTLASS 3.x kernels in Python, with JIT compilation to GPU code. This is particularly useful for rapid experimentation with tile configurations and MMA shapes.

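# CuTeDSL-style pseudocode (Python DSL for CuTE)
# Schematic only: the module, decorator, and helper names below are
# illustrative placeholders, not the actual CuTeDSL API
from cutedsl import ...

@cutedsl.kernel
def gemm_kernel(A, B, C, alpha, beta):
    # Express tiling and MMA in Python with CuTE semantics
    mma = TiledMMA(shape=(16, 8, 16), dtype=float16)

    tile_a = partition_src(A, mma)
    tile_b = partition_src(B, mma)
    tile_c = partition_dst(C, mma)

    # K is the reduction extent; each step consumes one 16-wide K slice
    for k in range(K // 16):
        gemm(mma, tile_a[k], tile_b[k], tile_c)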

Related lectures

Lecture 15: CUTLASS

Eric Auld’s introduction to CUTLASS concepts and layout algebra

Lecture 23: Tensor Cores

Vijay Thakkar and Pradeep Ramani on WMMA, WGMMA, and Hopper MMA

Lecture 36: Flash Attention 3

Jay Shah on using CUTLASS + Tensor Cores for FA3 on H100

Lecture 57: CuTE

Cris Cecka on CuTE layout algebra and composable tensor abstractions

Lecture 86: CuTeDSL

Vicki Wang on the Python DSL for writing CuTE kernels

Lectures 103–104: Layout algebra

Jack Carlisle and Jay Shah on the category-theoretic foundations of CuTE layouts
