Training and serving large AI models increasingly requires spreading computation across multiple GPUs, but coordinating that work has historically meant wrestling with NCCL collectives and PyTorch distributed APIs that operate at a very different abstraction level from Triton kernels. Iris bridges that gap by extending Triton’s programming model to span multiple devices natively. This guide is based on Lecture 78 by Muhammad Awad, Muhammad Osama, and Brandon Potter.
The Challenge of Multi-GPU Triton Programming
Standard Triton gives you fine-grained control over a single GPU’s SMs, shared memory, and warp scheduling. But when a tensor no longer fits in one GPU’s HBM, or when compute demand exceeds one device, you need to partition both data and computation. Today, the typical stack pairs NCCL collectives with separately launched Triton kernels, which causes several problems:
- Kernel launch overhead: collective + compute are separate kernels, each with its own launch cost and synchronisation barrier.
- Impedance mismatch: Triton reasons about tiles and programs; NCCL reasons about full tensors and MPI-style ranks. Writing custom overlapping logic (compute while communicating) is extremely difficult.
- No shared abstraction: improving communication patterns requires changes at both the NCCL level and the Triton level with no unified view.
What Iris Adds to Triton
Iris introduces a small set of multi-GPU abstractions directly inside the Triton programming model, letting a single kernel description span multiple devices.
Device grid
Extends Triton’s program grid with a device dimension. A program can be assigned to a specific GPU by its device index, just as it is assigned to an SM by its program ID.
Partitioned tensors
Tensors can be declared as partitioned across devices along one or more dimensions. Iris tracks which slice lives on which device.
Cross-device loads/stores
tl.load and tl.store are extended to transparently handle inter-device transfers when a program needs data that lives on a different GPU.
Collective primitives
AllReduce, AllGather, ReduceScatter, and point-to-point send/receive are exposed as first-class Triton operations, interleaved naturally with compute.
Partitioning Computation Across Devices
In standard Triton, the launch grid determines how many program instances run; the hardware scheduler then places them on SMs. Iris adds a device dimension to that grid.
Single-GPU baseline
Multi-GPU with Iris
With Iris, the output matrix C is sharded across num_gpus devices along the row dimension. Each device owns M // num_gpus rows.
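The row ownership rule can be illustrated without any GPU at all. The following is a minimal plain-Python simulation, not Iris or Triton code: the helper names (`row_shard_bounds`, `sharded_matmul`) are hypothetical, and it assumes M divides evenly by the device count.

```python
def row_shard_bounds(M, num_gpus, device):
    """Half-open row range [start, stop) of C owned by `device`."""
    assert M % num_gpus == 0, "this sketch assumes M divides evenly"
    rows_per_device = M // num_gpus
    start = device * rows_per_device
    return start, start + rows_per_device


def matmul(A, B):
    """Plain-Python reference matmul on lists of lists."""
    K, N = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(K)) for j in range(N)]
            for row in A]


def sharded_matmul(A, B, num_gpus):
    """Each simulated device computes only its own rows of C = A @ B."""
    C = []
    for device in range(num_gpus):
        start, stop = row_shard_bounds(len(A), num_gpus, device)
        C.extend(matmul(A[start:stop], B))  # this device's shard of C
    return C


A = [[i + j for j in range(3)] for i in range(8)]      # 8 x 3
B = [[i * j + 1 for j in range(2)] for i in range(3)]  # 3 x 2
assert sharded_matmul(A, B, num_gpus=4) == matmul(A, B)
print(row_shard_bounds(8, 4, 2))  # (4, 6)
```

Because the shards are disjoint row ranges, no communication is needed to assemble C in this layout; communication only appears when a device needs rows it does not own.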
Communication Primitives (Collective Ops in Triton)
Iris exposes collective operations as iris.* intrinsics that can be called from within a kernel body. Because they live at the same level as tl.load and tl.dot, the compiler can overlap them with compute and avoid separate kernel launches.
AllReduce
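As a reminder of what the operation guarantees, here is a reference definition of a sum AllReduce in plain Python. It is only a semantic sketch over simulated ranks; `all_reduce_sum` is a hypothetical name and says nothing about how Iris implements or spells the operation.

```python
def all_reduce_sum(per_rank_buffers):
    """Return what every rank holds after a sum AllReduce."""
    length = len(per_rank_buffers[0])
    total = [sum(buf[i] for buf in per_rank_buffers) for i in range(length)]
    # Every rank ends up with its own copy of the same reduced buffer.
    return [list(total) for _ in per_rank_buffers]


ranks = [[1, 2], [10, 20], [100, 200]]
print(all_reduce_sum(ranks))  # [[111, 222], [111, 222], [111, 222]]
```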
ReduceScatter + AllGather (ring-style)
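The ring decomposition can be sketched as a host-side simulation. This is the textbook ring algorithm written in plain Python, not Iris code; `ring_all_reduce` is a hypothetical helper, and it assumes the buffer length divides evenly by the number of ranks.

```python
def ring_all_reduce(buffers):
    """Sum-AllReduce over simulated ranks: ReduceScatter, then AllGather."""
    p = len(buffers)
    n = len(buffers[0])
    assert n % p == 0, "this sketch assumes the buffer splits into p equal chunks"
    size = n // p
    data = [list(b) for b in buffers]  # work on copies

    def chunk(c):
        return slice(c * size, (c + 1) * size)

    # Phase 1: ReduceScatter. At each step rank r sends chunk (r - step) to
    # rank r + 1, which adds it to its own copy. Afterwards rank r holds the
    # fully reduced chunk (r + 1) % p.
    for step in range(p - 1):
        msgs = [(r, (r - step) % p, data[r][chunk((r - step) % p)])
                for r in range(p)]  # snapshot so all sends happen "at once"
        for src, c, payload in msgs:
            dst = (src + 1) % p
            data[dst][chunk(c)] = [x + y for x, y in zip(data[dst][chunk(c)], payload)]

    # Phase 2: AllGather. Reduced chunks travel around the ring and overwrite
    # the stale copies on each rank.
    for step in range(p - 1):
        msgs = [(r, (r + 1 - step) % p, data[r][chunk((r + 1 - step) % p)])
                for r in range(p)]
        for src, c, payload in msgs:
            data[(src + 1) % p][chunk(c)] = payload

    return data


out = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
assert out == [[111, 222, 333]] * 3
```

Each rank sends 2 * (p - 1) chunks of size n / p in total, which is why the per-rank traffic stays close to twice the buffer size regardless of the rank count.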
Iris uses NVLink or PCIe for data movement depending on the hardware topology. On systems with NVLink (e.g., H100 NVL), the cross-device bandwidth is high enough that fusing communication with compute is almost always beneficial.
Programming Model Comparison: Iris vs. NCCL + PyTorch
| Aspect | NCCL + PyTorch distributed | Iris |
|---|---|---|
| Abstraction level | Whole-tensor, MPI-style ranks | Tile/block, program-level |
| Compute/comm overlap | Manual, with async_op=True collectives | Automatic via compiler scheduling |
| Kernel boundaries | One kernel per op, separate comm kernel | Single fused kernel possible |
| Custom partitioning | Via tensor slicing in Python | Expressed inside the kernel |
| Debugging | torch.distributed primitives | Extended Triton interpreter |
| Maturity | Production-ready, widely deployed | Research/preview stage |
Use Cases
Large Model Inference
When a single model layer’s weight matrix does not fit in one GPU’s HBM, tensor parallelism is required. Iris allows the attention and FFN kernels to operate on their local shards and perform the AllReduce/AllGather in the same kernel invocation:
Shard the weight matrix
Split the weight matrix column-wise (or row-wise) across GPUs. With a column-wise split, each GPU holds a (d_model, d_model // num_gpus) slice.
Local matmul
Each GPU computes output_shard = input @ weight_shard, a standard Triton matmul on its local data.
Distributed Training
In data-parallel training, each device processes a different mini-batch and gradients must be averaged before the optimizer step.
Pipeline Parallelism
For very deep models, pipeline parallelism assigns different layers to different GPUs. The inter-stage activations become point-to-point send/receive operations that Iris can express as iris.send / iris.recv within the forward-pass kernel.
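The stage-to-stage hand-off can be pictured with ordinary queues. The sketch below is a plain-Python simulation with one simulated device per stage; `run_pipeline` is a hypothetical helper, and the queue operations merely stand in for the send and receive roles described above rather than showing any Iris call.

```python
from collections import deque


def run_pipeline(stages, micro_batches):
    """Push micro-batches through `stages`, one simulated device per stage.

    channels[i] feeds stage i. Appending to channels[i + 1] plays the role
    of a send; popleft() plays the role of the matching receive.
    """
    channels = [deque() for _ in range(len(stages) + 1)]
    channels[0].extend(micro_batches)
    total_ticks = len(micro_batches) + len(stages) - 1
    for _ in range(total_ticks):
        # Walk stages back to front so an activation advances one stage per tick.
        for i in reversed(range(len(stages))):
            if channels[i]:
                activation = channels[i].popleft()             # "recv"
                channels[i + 1].append(stages[i](activation))  # "send"
    return list(channels[-1])


stages = [lambda x: x + 1, lambda x: x * 2]  # two toy "layers"
print(run_pipeline(stages, [1, 2, 3]))  # [4, 6, 8]
```

With S stages and n micro-batches the loop needs n + S - 1 ticks, which makes the pipeline fill and drain cost visible: the more micro-batches, the smaller the share of idle ticks.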
Performance Considerations
Communication-to-compute ratio
Iris is most beneficial when the communication volume is small relative to compute. For small batch sizes where AllReduce dominates, the overhead may not be worth the complexity.
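A back-of-the-envelope estimate makes this trade-off concrete. The sketch below uses the standard latency-bandwidth model for a ring AllReduce and assumes each GPU computes an (M x K) @ (K x N) partial product that is then AllReduced. The function names are hypothetical, and the throughput, bandwidth, and latency values are illustrative placeholders, not measurements of any real system.

```python
def ring_allreduce_bytes(num_elements, bytes_per_element, num_gpus):
    """Bytes each GPU sends in a ring AllReduce: 2 * (p - 1) / p * buffer size."""
    return 2 * (num_gpus - 1) / num_gpus * num_elements * bytes_per_element


def comm_to_compute_ratio(M, N, K, num_gpus, flops_per_s, link_bytes_per_s,
                          step_latency_s, bytes_per_element=2):
    """Estimated AllReduce time divided by estimated local matmul time."""
    compute_s = 2 * M * N * K / flops_per_s  # a matmul costs 2*M*N*K FLOPs
    transfer_s = ring_allreduce_bytes(M * N, bytes_per_element, num_gpus) / link_bytes_per_s
    latency_s = 2 * (num_gpus - 1) * step_latency_s  # one latency per ring step
    return (transfer_s + latency_s) / compute_s


# Placeholder hardware numbers, chosen only for illustration.
hw = dict(num_gpus=8, flops_per_s=100e12, link_bytes_per_s=300e9, step_latency_s=5e-6)
small = comm_to_compute_ratio(M=8, N=4096, K=4096, **hw)
large = comm_to_compute_ratio(M=4096, N=4096, K=4096, **hw)
print(f"small batch ratio: {small:.2f}, large batch ratio: {large:.4f}")
assert small > large  # fixed per-step latency dominates when M is small
```

In this model the bandwidth term scales with M just like the compute term, so it is the fixed per-step latency that makes small batches communication-bound.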
NVLink vs. PCIe
On NVLink-connected GPUs (A100/H100 HGX), per-GPU bandwidth is roughly 600 GB/s (A100) to 900 GB/s (H100), versus roughly 32 GB/s per direction over PCIe Gen4 x16. Fusing communication with compute is far more impactful on NVLink systems.
Tile size and alignment
The communication granularity is determined by Triton tile sizes. Misaligned tiles can cause unnecessary data movement. Match your tile size to the shard boundaries.
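One way to check alignment before launching anything is to list the tiles that straddle a shard boundary. This is a minimal sketch assuming row-wise sharding with M divisible by the device count; `straddling_tiles` and `block_m` are hypothetical names, not part of Iris or Triton.

```python
def straddling_tiles(M, num_gpus, block_m):
    """Start rows of tiles whose rows span more than one device's shard."""
    assert M % num_gpus == 0, "this sketch assumes M divides evenly"
    rows_per_device = M // num_gpus
    bad = []
    for start in range(0, M, block_m):
        last = min(start + block_m, M) - 1  # last row covered by this tile
        if start // rows_per_device != last // rows_per_device:
            bad.append(start)
    return bad


print(straddling_tiles(1024, 4, 128))  # [] since 256 rows per shard is a multiple of 128
print(straddling_tiles(1024, 4, 96))   # [192, 480]
```

Choosing a tile height that divides the shard height (here 256) keeps every tile on a single device, so no tile needs rows from two shards.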
Software pipelining
The Triton compiler’s num_stages parameter controls how many tiles are in flight. With Iris, stages can overlap compute on one tile with communication of the next.
Measuring Efficiency
Use Nsight Systems to verify that compute and communication actually overlap: look for cuLaunchKernel and NCCL/NVLink transfer events on the same timeline. With Iris, you should see them interleaved rather than sequential.
Further Reading
Lecture 17: GPU Collective Communication (NCCL)
Background on AllReduce, rings, and tree-reduction algorithms.
Practitioner's Guide to Triton (Lecture 14)
Single-GPU Triton kernel writing — the prerequisite for Iris.
Triton Compiler Internals (Lecture 29)
How Triton lowers Python to PTX; relevant to understanding Iris’s compiler extensions.
Lecture 67: NCCL & NVSHMEM
Deep dive into NVSHMEM for PGAS-style GPU programming — a related approach.