Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/gpu-mode/lectures/llms.txt

Use this file to discover all available pages before exploring further.

Training and serving large AI models increasingly requires spreading computation across multiple GPUs — but coordinating that work has historically meant wrestling with NCCL collectives and PyTorch distributed APIs that operate at a very different abstraction level from Triton kernels. Iris bridges that gap by extending Triton’s programming model to span multiple devices natively. This guide is based on Lecture 78 by Muhammad Awad, Muhammad Osama, and Brandon Potter.

The Challenge of Multi-GPU Triton Programming

Standard Triton gives you fine-grained control over a single GPU’s SMs, shared memory, and warp scheduling. But when a tensor no longer fits in one GPU’s HBM — or when compute demand exceeds one device — you need to partition both data and computation. Today, the typical stack looks like this:
PyTorch model code
    │  torch.distributed / FSDP / Tensor Parallel

NCCL collectives  (AllReduce, AllGather, ReduceScatter …)
    │  launches separate kernels per device

Per-device Triton kernel
The problems with this approach are:
  • Kernel launch overhead: collective + compute are separate kernels, each with their own launch cost and synchronisation barrier.
  • Impedance mismatch: Triton reasons about tiles and programs; NCCL reasons about full tensors and MPI-style ranks. Writing custom overlapping logic (compute while communicating) is extremely difficult.
  • No shared abstraction: improving communication patterns requires changes at both the NCCL level and the Triton level with no unified view.

What Iris Adds to Triton

Iris introduces a small set of multi-GPU abstractions directly inside the Triton programming model, letting a single kernel description span multiple devices.

Device grid

Extends Triton’s program grid with a device dimension. A program can be assigned to a specific GPU by its device index, just as it is assigned to an SM by its program ID.

Partitioned tensors

Tensors can be declared as partitioned across devices along one or more dimensions. Iris tracks which slice lives on which device.

Cross-device loads/stores

tl.load and tl.store are extended to transparently handle inter-device transfers when a program needs data that lives on a different GPU.

Collective primitives

AllReduce, AllGather, ReduceScatter, and point-to-point send/receive are exposed as first-class Triton operations, interleaved naturally with compute.

Partitioning Computation Across Devices

In standard Triton, the grid determines how many programs run and on which SM. Iris adds a device dimension to that grid.

Single-GPU baseline

@triton.jit
def matmul_k(a_ptr, b_ptr, c_ptr, m, n, k, ...):
    pid_m = tl.program_id(0)   # row tile
    pid_n = tl.program_id(1)   # col tile
    # ... compute one (bm × bn) tile of C

grid = (triton.cdiv(M, BM), triton.cdiv(N, BN))
matmul_k[grid](A, B, C, M, N, K, ...)

Multi-GPU with Iris

With Iris, the output matrix C is sharded across num_gpus along the row dimension. Each device owns M // num_gpus rows:
@iris.jit
def matmul_k(a_ptr, b_ptr, c_ptr, m, n, k, num_gpus, ...):
    device_id = iris.device_id()             # which GPU am I on?
    pid_m     = tl.program_id(0)
    pid_n     = tl.program_id(1)

    # Each device handles a horizontal shard of the output
    rows_per_device = m // num_gpus
    global_pid_m    = device_id * (rows_per_device // BM) + pid_m

    # Compute the tile as before, using global_pid_m for row addressing
    ...

# Iris launches the grid on all devices simultaneously
grid = (triton.cdiv(M // num_gpus, BM), triton.cdiv(N, BN))
iris_grid = iris.Grid(grid, devices=list(range(num_gpus)))
matmul_k[iris_grid](A_shards, B, C_shards, M, N, K, num_gpus, ...)
The key insight is that the partition strategy is expressed inside the kernel, not in a separate distributed wrapper, so the compiler can reason about communication needs holistically.

Communication Primitives (Collective Ops in Triton)

Iris exposes collective operations as iris.* intrinsics that can be called from within a kernel body. Because they live at the same level as tl.load and tl.dot, the compiler can overlap them with compute and avoid separate kernel launches.

AllReduce

@iris.jit
def allreduce_add_k(x_ptr, out_ptr, n, BLOCK_SIZE: tl.constexpr):
    pid     = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask    = offsets < n

    x = tl.load(x_ptr + offsets, mask=mask)

    # Sum x across all devices, result broadcast to every device
    x_reduced = iris.all_reduce(x, op='sum')

    tl.store(out_ptr + offsets, x_reduced, mask=mask)

ReduceScatter + AllGather (ring-style)

@iris.jit
def fused_reduce_scatter_k(x_ptr, out_ptr, n, num_gpus,
                            BLOCK_SIZE: tl.constexpr):
    pid       = tl.program_id(0)
    device_id = iris.device_id()
    offsets   = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask      = offsets < (n // num_gpus)

    x = tl.load(x_ptr + offsets, mask=mask)

    # Each device receives the reduced sum for its shard
    x_scattered = iris.reduce_scatter(x, op='sum', axis=0)

    tl.store(out_ptr + offsets, x_scattered, mask=mask)
Iris uses NVLink or PCIe for data movement depending on the hardware topology. On systems with NVLink (e.g., H100 NVL), the cross-device bandwidth is high enough that fusing communication with compute is almost always beneficial.

Programming Model Comparison: Iris vs. NCCL + PyTorch

AspectNCCL + PyTorch distributedIris
Abstraction levelWhole-tensor, MPI-style ranksTile/block, program-level
Compute/comm overlapManual, with is_async=True groupsAutomatic via compiler scheduling
Kernel boundariesOne kernel per op, separate comm kernelSingle fused kernel possible
Custom partitioningVia tensor slicing in PythonExpressed inside the kernel
Debuggingtorch.distributed primitivesExtended Triton interpreter
MaturityProduction-ready, widely deployedResearch/preview stage
Iris is a research project at the time of Lecture 78. Its API and stability guarantees are experimental. For production multi-GPU workloads, NCCL + PyTorch Distributed or FSDP remains the recommended path.

Use Cases

Large Model Inference

When a single model layer’s weight matrix does not fit in one GPU’s HBM, tensor parallelism is required. Iris allows the attention and FFN kernels to operate on their local shards and perform the AllReduce/AllGather in the same kernel invocation:
1

Shard the weight matrix

Split the weight matrix column-wise (or row-wise) across GPUs. Each GPU holds a (d_model, d_model // num_gpus) slice.
2

Local matmul

Each GPU computes output_shard = input @ weight_shard — a standard Triton matmul on its local data.
3

AllReduce in-kernel

Iris performs the sum reduction across devices to produce the final output, without returning to Python.

Distributed Training

In data-parallel training, each device processes a different mini-batch and gradients must be averaged before the optimizer step:
Forward pass   (per device, local data)

Backward pass  (per device, local gradients)

AllReduce      (average gradients across devices)  ←-- Iris fuses this

Optimizer step (per device, averaged gradients)
Iris can fuse the AllReduce into the gradient computation kernel itself, hiding communication latency behind arithmetic.

Pipeline Parallelism

For very deep models, pipeline parallelism assigns different layers to different GPUs. The inter-stage activations become point-to-point send/receive operations that Iris can express as iris.send / iris.recv within the forward-pass kernel.

Performance Considerations

Communication-to-compute ratio

Iris is most beneficial when the communication volume is small relative to compute. For small batch sizes where AllReduce dominates, the overhead may not be worth the complexity.

NVLink vs. PCIe

On NVLink-connected GPUs (A100/H100 HGX), bandwidth is ~600 GB/s vs. ~32 GB/s over PCIe. Fusing communication with compute is far more impactful on NVLink systems.

Tile size and alignment

The communication granularity is determined by Triton tile sizes. Misaligned tiles can cause unnecessary data movement. Match your tile size to the shard boundaries.

Software pipelining

The Triton compiler’s num_stages parameter controls how many tiles are in flight. With Iris, stages can overlap compute on one tile with communication of the next.

Measuring Efficiency

Use Nsight Systems to verify that compute and communication actually overlap:
nsys profile --trace=cuda,nvtx python multi_gpu_kernel.py
Look for cuLaunchKernel and NCCL/NVLink transfer events on the same timeline. With Iris, you should see them interleaved rather than sequential.

Further Reading

Lecture 17: GPU Collective Communication (NCCL)

Background on AllReduce, rings, and tree-reduction algorithms.

Practitioner's Guide to Triton (Lecture 14)

Single-GPU Triton kernel writing — the prerequisite for Iris.

Triton Compiler Internals (Lecture 29)

How Triton lowers Python to PTX; relevant to understanding Iris’s compiler extensions.

Lecture 67: NCCL & NVSHMEM

Deep dive into NVSHMEM for PGAS-style GPU programming — a related approach.

Build docs developers (and LLMs) love