Triton lets you write GPU kernels in Python-like syntax that compile down to PTX, the same intermediate representation produced by CUDA C++. This guide is based on Lecture 14 by Umer Adil, which walks through the full practitioner workflow: understanding the programming model, writing real kernels, masking out-of-bounds memory, tiling 2-D computations, and automating tile-size selection with @triton.autotune.
Why Triton?
CUDA gives you full control over every SM register, warp shuffle, and shared-memory bank, but that power comes at a steep cost in complexity. Triton occupies the sweet spot between PyTorch convenience and hand-tuned CUDA performance.

| | CUDA | Triton | torch.compile |
|---|---|---|---|
| Control | Complete | Block-level | Almost none |
| Performance ceiling | Absolute max | Near-optimal | Good baseline |
| Lines of code | High | Medium | None |
| Shared-memory management | Manual | Automatic | Automatic |
| Debuggable in Python | No | Yes (simulate) | No |
Profile first
If your model is too slow, run it through torch.compile. It often generates Triton kernels automatically and can be a free win.
Reshape for the compiler
Restructure your PyTorch code to be more torch.compile-friendly (contiguous tensors, fewer dynamic shapes).
Write a Triton kernel
Identify the bottleneck operation and replace it with a hand-written Triton kernel. This page shows you how.
torch.compile actually generates Triton kernels internally, so they make an excellent starting point for your own customizations. See Lecture 1 for how to extract them.
Core Concepts: Programs, Blocks, and Pointer Arithmetic
The CUDA vs. Triton programming model
In CUDA you decompose work at two levels: blocks (running on an SM) and threads (scalars within a block). Each thread operates on a single value. Triton removes the thread level entirely. Each kernel invocation, called a program, operates on a block of values simultaneously. Triton handles the thread-level decomposition internally, freeing you from manual shared-memory management.
Pointer arithmetic
Triton receives raw pointers to the first element of each tensor. To address element i, you write ptr + i. For a block, tl.arange(0, BLOCK_SIZE) produces the range [0, 1, ..., BLOCK_SIZE-1], so ptr + tl.arange(0, BLOCK_SIZE) addresses a whole block of consecutive elements at once.
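The addressing rule above can be modeled in plain Python, with a list standing in for GPU memory and an integer index standing in for a pointer. This is only a CPU sketch of the semantics, not Triton code. `block_offsets` is a hypothetical helper name, `pid` plays the role of `tl.program_id(0)`, and shifting each program by `pid * BLOCK_SIZE` is the usual convention, assumed here.

```python
def block_offsets(pid: int, block_size: int) -> list[int]:
    """Offsets handled by program `pid`: pid * BLOCK_SIZE + arange(0, BLOCK_SIZE)."""
    start = pid * block_size
    return [start + i for i in range(block_size)]


# A flat list stands in for device memory; base_ptr is the "pointer" to element 0.
memory = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]
base_ptr = 0

# Program 1 with BLOCK_SIZE = 4 covers elements 4..7.
offsets = block_offsets(pid=1, block_size=4)
block = [memory[base_ptr + off] for off in offsets]

print(offsets)  # [4, 5, 6, 7]
print(block)    # [14.0, 15.0, 16.0, 17.0]
```

In a real kernel the list comprehension is replaced by a single vectorized expression over the whole block.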
tl.load and tl.store with Masking
Because block sizes must be powers of two and tensors may not be perfectly divisible, you almost always need a mask to guard against out-of-bounds accesses.
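Lanes where the mask is False are not read from memory; tl.load fills them with the other value when one is supplied, rather than reading garbage memory. Masked-out stores are silently skipped.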
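A small CPU model may make the masking behavior concrete. The two functions below are hypothetical stand-ins that mimic the described semantics of `tl.load(..., mask=..., other=...)` and `tl.store(..., mask=...)`; they are not the Triton API, and the 6-element tensor with `BLOCK_SIZE = 4` is an assumed example.

```python
def masked_load(memory, offsets, mask, other=0.0):
    """Model of a masked load: masked-out lanes never touch memory and yield `other`."""
    return [memory[o] if m else other for o, m in zip(offsets, mask)]


def masked_store(memory, offsets, values, mask):
    """Model of a masked store: masked-out lanes are skipped."""
    for o, v, m in zip(offsets, values, mask):
        if m:
            memory[o] = v


n_elements = 6          # tensor length, not a multiple of BLOCK_SIZE
BLOCK_SIZE = 4
memory = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

pid = 1                                              # the last, partially filled block
offsets = [pid * BLOCK_SIZE + i for i in range(BLOCK_SIZE)]
mask = [o < n_elements for o in offsets]             # guards offsets 6 and 7

loaded = masked_load(memory, offsets, mask, other=0.0)
masked_store(memory, offsets, [2 * v for v in loaded], mask)

print(mask)    # [True, True, False, False]
print(loaded)  # [5.0, 6.0, 0.0, 0.0]
print(memory)  # [1.0, 2.0, 3.0, 4.0, 10.0, 12.0]
```

Without the mask, the last two lanes would index past the end of the list, which is the CPU analogue of an out-of-bounds GPU access.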
Writing a Vector Addition Kernel
The reference implementation for this section is the real add_kernel from Lecture 29 (lecture_029/vector_add.py), which Umer Adil also uses as the canonical starter example. Points to note about it:
- BLOCK_SIZE: tl.constexpr tells the compiler this parameter is a compile-time constant, enabling shape inference and loop unrolling.
- The grid lambda returns the number of programs to launch.
- triton.cdiv is ceiling division: (a + b - 1) // b.
- compiled_kernel.asm exposes every stage of the compilation pipeline.
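The launch logic described in these points can be sketched on the CPU. The code below is a plain-Python model, not the lecture's `add_kernel`: `add_program` and `add` are hypothetical names, the `for pid` loop stands in for the GPU running all programs in parallel, and `cdiv` reimplements the ceiling-division formula quoted above.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, the same formula as triton.cdiv."""
    return (a + b - 1) // b


def add_program(pid, x, y, out, n_elements, BLOCK_SIZE):
    """Work done by one program: add one block of x and y, guarded by a bounds check."""
    for i in range(BLOCK_SIZE):
        off = pid * BLOCK_SIZE + i
        if off < n_elements:          # plays the role of the mask
            out[off] = x[off] + y[off]


def add(x, y, BLOCK_SIZE=4):
    n_elements = len(x)
    out = [0.0] * n_elements
    grid = (cdiv(n_elements, BLOCK_SIZE),)   # number of programs to launch
    for pid in range(grid[0]):               # sequential stand-in for a parallel launch
        add_program(pid, x, y, out, n_elements, BLOCK_SIZE)
    return out


x = [float(i) for i in range(10)]
y = [10.0] * 10
result = add(x, y)

print(cdiv(10, 4))  # 3 programs cover 10 elements in blocks of 4
print(result)
```

Ten elements with a block size of 4 need three programs; the third one handles only two valid elements, which is why the bounds check is required.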
Blocked / Tiled Computation for 2-D Problems
For matrix operations (matmul, softmax, etc.) you need 2-D tiling. The idea is to assign each program a 2-D tile (bm × bn) of the output and accumulate partial results over the shared k-dimension.
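The tiling idea can be illustrated with a pure-Python model in which nested lists stand in for matrices. This is a sketch of the decomposition only, not a Triton kernel: `matmul_program` and `tiled_matmul` are hypothetical names, `bk` is an assumed chunk size along k, and the two outer loops stand in for a 2-D launch grid.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, the same formula as triton.cdiv."""
    return (a + b - 1) // b


def matmul_program(pid_m, pid_n, A, B, C, m, n, k, bm, bn, bk):
    """One program: computes the (bm x bn) output tile at tile coordinates (pid_m, pid_n)."""
    # Row and column indices of this tile; the bounds checks play the role of the mask.
    rows = [r for r in range(pid_m * bm, (pid_m + 1) * bm) if r < m]
    cols = [c for c in range(pid_n * bn, (pid_n + 1) * bn) if c < n]
    acc = [[0.0 for _ in cols] for _ in rows]
    # Walk the shared k-dimension in chunks of bk, accumulating partial products.
    for k0 in range(0, k, bk):
        k_hi = min(k0 + bk, k)
        for i, r in enumerate(rows):
            for j, c in enumerate(cols):
                acc[i][j] += sum(A[r][kk] * B[kk][c] for kk in range(k0, k_hi))
    # Write the finished tile back to the output.
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            C[r][c] = acc[i][j]


def tiled_matmul(A, B, bm=2, bn=2, bk=2):
    m, k, n = len(A), len(A[0]), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for pid_m in range(cdiv(m, bm)):          # 2-D grid of programs
        for pid_n in range(cdiv(n, bn)):
            matmul_program(pid_m, pid_n, A, B, C, m, n, k, bm, bn, bk)
    return C


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[1, 0], [0, 1], [2, 3]]
C = tiled_matmul(A, B)
print(C)  # [[7.0, 11.0], [16.0, 23.0], [25.0, 35.0]]
```

The result is independent of the tile sizes; only the amount of work per program and the memory-access pattern change.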
Reordering program IDs (often called swizzling) lets programs that read the same tiles of A run close together in time, improving L2 cache reuse.
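One common way to get this locality is a grouped program ordering, sketched below as a pure index mapping. The formula follows the grouped ordering used in the official Triton matmul tutorial, which is an assumption here; the lecture's own code may use a different scheme. `grouped_pid`, `row_major_pid`, and `tiles_touched` are hypothetical names.

```python
def grouped_pid(pid, num_pid_m, num_pid_n, group_m):
    """Map a linear program id to (pid_m, pid_n), visiting group_m tile-rows at a time."""
    num_pid_in_group = group_m * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * group_m
    group_size_m = min(num_pid_m - first_pid_m, group_m)  # last group may be smaller
    pid_m = first_pid_m + (pid % num_pid_in_group) % group_size_m
    pid_n = (pid % num_pid_in_group) // group_size_m
    return pid_m, pid_n


def row_major_pid(pid, num_pid_n):
    """Baseline ordering: sweep a full tile-row before moving to the next one."""
    return pid // num_pid_n, pid % num_pid_n


def tiles_touched(order):
    """Distinct tile-rows of A plus distinct tile-columns of B read by these programs."""
    return len({pm for pm, _ in order}) + len({pn for _, pn in order})


num_pid_m, num_pid_n = 4, 4
grouped = [grouped_pid(p, num_pid_m, num_pid_n, group_m=2) for p in range(16)]
row_major = [row_major_pid(p, num_pid_n) for p in range(16)]

print(grouped[:4])                   # [(0, 0), (1, 0), (0, 1), (1, 1)]
print(tiles_touched(row_major[:4]))  # 5 input bands for the first four programs
print(tiles_touched(grouped[:4]))    # 4 input bands for the first four programs
```

Both orderings cover the same set of output tiles; the grouped one makes consecutive programs share more of their inputs, so fewer distinct bands have to be resident in cache at once.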
Fused Elementwise Kernels
Fusion eliminates intermediate global-memory round-trips. A fused softmax computes exp(x - max(x)) / sum(exp(x - max(x))) in a single pass over each row.
A single-pass kernel of this kind assumes n_cols fits in a single tile; for wide matrices you would accumulate the max and sum over multiple passes.
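Both cases can be sketched in plain Python. `softmax_row` models the single-tile case with one program per row; `softmax_row_tiled` shows one possible multi-pass variant that keeps a running max and a rescaled running sum. The function names and the `tile` parameter are hypothetical, and the running-max rescaling is one option among several, not necessarily what the lecture uses.

```python
import math


def softmax_row(row):
    """Single-tile case: the whole row is loaded once and reduced in place."""
    row_max = max(row)
    numer = [math.exp(v - row_max) for v in row]   # subtracting the max avoids overflow
    denom = sum(numer)
    return [v / denom for v in numer]


def softmax_row_tiled(row, tile=2):
    """Wide-row case: pass 1 accumulates max and sum tile by tile, pass 2 normalizes."""
    running_max = float("-inf")
    running_sum = 0.0
    for start in range(0, len(row), tile):
        chunk = row[start:start + tile]
        new_max = max(running_max, max(chunk))
        # Rescale the old sum to the new max before adding this tile's contribution.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(v - new_max) for v in chunk)
        running_max = new_max
    return [math.exp(v - running_max) / running_sum for v in row]


row = [1.0, 2.0, 3.0, 1000.0, 1001.0]
single = softmax_row(row)
tiled = softmax_row_tiled(row, tile=2)

print(sum(single))
print(max(abs(a - b) for a, b in zip(single, tiled)))
```

The large entries in `row` would overflow a naive `exp(x)`; both versions stay finite because every exponent is taken relative to the row maximum.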
@triton.autotune for Automatic Tile-Size Tuning
Picking the right block size manually is tedious and hardware-dependent. @triton.autotune benchmarks each candidate configuration you supply and caches the winner keyed to the problem dimensions.
The key list tells Triton which arguments determine the “problem size”. When the values of m, n, or k change, Triton re-benchmarks all configs and picks the fastest one for the new shape.
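The caching behavior can be illustrated with a toy model. `ToyAutotuner` is a hypothetical class, not Triton's implementation, and `fake_cost` is a made-up cost function that replaces real GPU timing so the example is deterministic. Only the shape of the logic (benchmark every config once per distinct key, then reuse the winner) is meant to carry over.

```python
def cdiv(a: int, b: int) -> int:
    return (a + b - 1) // b


def fake_cost(cfg, m, n, k):
    """Made-up cost: padded work plus a fixed overhead per program (not a real timing)."""
    programs = cdiv(m, cfg["bm"]) * cdiv(n, cfg["bn"])
    padded_elems = programs * cfg["bm"] * cfg["bn"]
    return padded_elems * k + 1000 * programs


class ToyAutotuner:
    def __init__(self, configs, key, bench):
        self.configs = configs
        self.key = key            # argument names that define the "problem size"
        self.bench = bench
        self.cache = {}
        self.bench_calls = 0

    def best_config(self, **args):
        cache_key = tuple(args[name] for name in self.key)
        if cache_key not in self.cache:          # new shape: benchmark every config
            timings = []
            for cfg in self.configs:
                timings.append(self.bench(cfg, **args))
                self.bench_calls += 1
            self.cache[cache_key] = self.configs[timings.index(min(timings))]
        return self.cache[cache_key]


configs = [{"bm": 16, "bn": 16}, {"bm": 64, "bn": 64}, {"bm": 128, "bn": 128}]
tuner = ToyAutotuner(configs, key=["m", "n", "k"], bench=fake_cost)

first = tuner.best_config(m=64, n=64, k=64)   # benchmarks 3 configs
again = tuner.best_config(m=64, n=64, k=64)   # cache hit, no new benchmarks
small = tuner.best_config(m=16, n=16, k=16)   # new key, benchmarks 3 more

print(first, small, tuner.bench_calls)
```

The second call with the same (m, n, k) is free, while the change of shape triggers a fresh round of benchmarking, which is the behavior the key list controls.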
Debugging Triton Kernels
Interpreter mode
Set TRITON_INTERPRET=1 before importing Triton to run kernels entirely on the CPU. This lets you use standard Python debuggers.
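If you prefer to enable interpreter mode from inside a script rather than from the shell, a minimal sketch looks like the following. The only point it illustrates is ordering: the environment variable is set before Triton is imported. The `find_spec` guard is an addition so the snippet also runs on machines without Triton installed.

```python
import importlib.util
import os

# Set the variable first, as described above, so Triton sees it at import time.
os.environ["TRITON_INTERPRET"] = "1"

# Import Triton only afterwards (and only if it is installed on this machine).
if importlib.util.find_spec("triton") is not None:
    import triton
    import triton.language as tl

print(os.environ["TRITON_INTERPRET"])
```

Setting the variable in the shell (for example as a prefix to the `python` command) achieves the same thing without touching the script.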
Utility functions from triton_util.py (Lecture 14)
Umer Adil ships a small utility module, triton_util.py, in the lecture repo that makes debugging far easier. Its debugging helpers are meant to be used in interpreter mode (TRITON_INTERPRET=1).
Profiling with Nsight Compute
Inspecting compiled artifacts
Built-in Benchmarking
Triton ships triton.testing for reproducible micro-benchmarks.
Lecture 14 Notebook
Full worked examples including copy, greyscale, and matrix multiply kernels.
Triton Official Tutorials
The vector-add, fused softmax, and matmul tutorials from the Triton team.
Triton Internals (Lecture 29)
Understand what happens after @triton.jit: AST → MLIR → PTX.
Iris: Multi-GPU Triton (Lecture 78)
Extend Triton kernels across multiple GPUs with the Iris programming model.