Iris: Multi-GPU Programming in Triton

Training and serving large AI models increasingly requires spreading computation across multiple GPUs — but coordinating that work has historically meant wrestling with NCCL collectives and PyTorch distributed APIs that operate at a very different abstraction level from Triton kernels. Iris bridges that gap by extending Triton’s programming model to span multiple devices natively. This guide is based on Lecture 78 by Muhammad Awad, Muhammad Osama, and Brandon Potter.

The Challenge of Multi-GPU Triton Programming

Standard Triton gives you fine-grained control over a single GPU’s SMs, shared memory, and warp scheduling. But when a tensor no longer fits in one GPU’s HBM — or when compute demand exceeds one device — you need to partition both data and computation. Today, the typical stack looks like this:

PyTorch model code
    │  torch.distributed / FSDP / Tensor Parallel
    ▼
NCCL collectives  (AllReduce, AllGather, ReduceScatter …)
    │  launches separate kernels per device
    ▼
Per-device Triton kernel

The problems with this approach are:

Kernel launch overhead: collective + compute are separate kernels, each with their own launch cost and synchronisation barrier.
Impedance mismatch: Triton reasons about tiles and programs; NCCL reasons about full tensors and MPI-style ranks. Writing custom overlapping logic (compute while communicating) is extremely difficult.
No shared abstraction: improving communication patterns requires changes at both the NCCL level and the Triton level with no unified view.

What Iris Adds to Triton

Iris introduces a small set of multi-GPU abstractions directly inside the Triton programming model, letting a single kernel description span multiple devices.

Device grid

Extends Triton’s program grid with a device dimension. A program can be assigned to a specific GPU by its device index, just as it is assigned to an SM by its program ID.

Partitioned tensors

Tensors can be declared as partitioned across devices along one or more dimensions. Iris tracks which slice lives on which device.

Cross-device loads/stores

tl.load and tl.store are extended to transparently handle inter-device transfers when a program needs data that lives on a different GPU.

Collective primitives

AllReduce, AllGather, ReduceScatter, and point-to-point send/receive are exposed as first-class Triton operations, interleaved naturally with compute.

Partitioning Computation Across Devices

In standard Triton, the grid determines how many programs run and on which SM. Iris adds a device dimension to that grid.

Single-GPU baseline

@triton.jit
def matmul_k(a_ptr, b_ptr, c_ptr, m, n, k, ...):
    pid_m = tl.program_id(0)   # row tile
    pid_n = tl.program_id(1)   # col tile
    # ... compute one (bm × bn) tile of C

grid = (triton.cdiv(M, BM), triton.cdiv(N, BN))
matmul_k[grid](A, B, C, M, N, K, ...)

Multi-GPU with Iris

With Iris, the output matrix C is sharded across num_gpus along the row dimension. Each device owns M // num_gpus rows:

@iris.jit
def matmul_k(a_ptr, b_ptr, c_ptr, m, n, k, num_gpus, ...):
    device_id = iris.device_id()             # which GPU am I on?
    pid_m     = tl.program_id(0)
    pid_n     = tl.program_id(1)

    # Each device handles a horizontal shard of the output
    rows_per_device = m // num_gpus
    global_pid_m    = device_id * (rows_per_device // BM) + pid_m

    # Compute the tile as before, using global_pid_m for row addressing
    ...

# Iris launches the grid on all devices simultaneously
grid = (triton.cdiv(M // num_gpus, BM), triton.cdiv(N, BN))
iris_grid = iris.Grid(grid, devices=list(range(num_gpus)))
matmul_k[iris_grid](A_shards, B, C_shards, M, N, K, num_gpus, ...)

The key insight is that the partition strategy is expressed inside the kernel, not in a separate distributed wrapper, so the compiler can reason about communication needs holistically.

Communication Primitives (Collective Ops in Triton)

Iris exposes collective operations as iris.* intrinsics that can be called from within a kernel body. Because they live at the same level as tl.load and tl.dot, the compiler can overlap them with compute and avoid separate kernel launches.

AllReduce

@iris.jit
def allreduce_add_k(x_ptr, out_ptr, n, BLOCK_SIZE: tl.constexpr):
    pid     = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask    = offsets < n

    x = tl.load(x_ptr + offsets, mask=mask)

    # Sum x across all devices, result broadcast to every device
    x_reduced = iris.all_reduce(x, op='sum')

    tl.store(out_ptr + offsets, x_reduced, mask=mask)

ReduceScatter + AllGather (ring-style)

@iris.jit
def fused_reduce_scatter_k(x_ptr, out_ptr, n, num_gpus,
                            BLOCK_SIZE: tl.constexpr):
    pid       = tl.program_id(0)
    device_id = iris.device_id()
    offsets   = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask      = offsets < (n // num_gpus)

    x = tl.load(x_ptr + offsets, mask=mask)

    # Each device receives the reduced sum for its shard
    x_scattered = iris.reduce_scatter(x, op='sum', axis=0)

    tl.store(out_ptr + offsets, x_scattered, mask=mask)

Iris uses NVLink or PCIe for data movement depending on the hardware topology. On systems with NVLink (e.g., H100 NVL), the cross-device bandwidth is high enough that fusing communication with compute is almost always beneficial.

Programming Model Comparison: Iris vs. NCCL + PyTorch

Aspect	NCCL + PyTorch distributed	Iris
Abstraction level	Whole-tensor, MPI-style ranks	Tile/block, program-level
Compute/comm overlap	Manual, with `is_async=True` groups	Automatic via compiler scheduling
Kernel boundaries	One kernel per op, separate comm kernel	Single fused kernel possible
Custom partitioning	Via tensor slicing in Python	Expressed inside the kernel
Debugging	`torch.distributed` primitives	Extended Triton interpreter
Maturity	Production-ready, widely deployed	Research/preview stage

Iris is a research project at the time of Lecture 78. Its API and stability guarantees are experimental. For production multi-GPU workloads, NCCL + PyTorch Distributed or FSDP remains the recommended path.

Use Cases

Large Model Inference

When a single model layer’s weight matrix does not fit in one GPU’s HBM, tensor parallelism is required. Iris allows the attention and FFN kernels to operate on their local shards and perform the AllReduce/AllGather in the same kernel invocation:

Shard the weight matrix

Split the weight matrix column-wise (or row-wise) across GPUs. Each GPU holds a (d_model, d_model // num_gpus) slice.

Local matmul

Each GPU computes output_shard = input @ weight_shard — a standard Triton matmul on its local data.

AllReduce in-kernel

Iris performs the sum reduction across devices to produce the final output, without returning to Python.

Distributed Training

In data-parallel training, each device processes a different mini-batch and gradients must be averaged before the optimizer step:

Forward pass   (per device, local data)
    │
Backward pass  (per device, local gradients)
    │
AllReduce      (average gradients across devices)  ←-- Iris fuses this
    │
Optimizer step (per device, averaged gradients)

Iris can fuse the AllReduce into the gradient computation kernel itself, hiding communication latency behind arithmetic.

Pipeline Parallelism

For very deep models, pipeline parallelism assigns different layers to different GPUs. The inter-stage activations become point-to-point send/receive operations that Iris can express as iris.send / iris.recv within the forward-pass kernel.

Performance Considerations

Communication-to-compute ratio

Iris is most beneficial when the communication volume is small relative to compute. For small batch sizes where AllReduce dominates, the overhead may not be worth the complexity.

NVLink vs. PCIe

On NVLink-connected GPUs (A100/H100 HGX), bandwidth is ~600 GB/s vs. ~32 GB/s over PCIe. Fusing communication with compute is far more impactful on NVLink systems.

Tile size and alignment

The communication granularity is determined by Triton tile sizes. Misaligned tiles can cause unnecessary data movement. Match your tile size to the shard boundaries.

Software pipelining

The Triton compiler’s num_stages parameter controls how many tiles are in flight. With Iris, stages can overlap compute on one tile with communication of the next.

Measuring Efficiency

Use Nsight Systems to verify that compute and communication actually overlap:

nsys profile --trace=cuda,nvtx python multi_gpu_kernel.py

Look for cuLaunchKernel and NCCL/NVLink transfer events on the same timeline. With Iris, you should see them interleaved rather than sequential.

Lecture 17: GPU Collective Communication (NCCL)

Background on AllReduce, rings, and tree-reduction algorithms.

Practitioner's Guide to Triton (Lecture 14)

Single-GPU Triton kernel writing — the prerequisite for Iris.

Triton Compiler Internals (Lecture 29)

How Triton lowers Python to PTX; relevant to understanding Iris’s compiler extensions.

Lecture 67: NCCL & NVSHMEM

Deep dive into NVSHMEM for PGAS-style GPU programming — a related approach.

Getting Started

CUDA Fundamentals

Advanced GPU Programming

Triton & High-Level Frameworks

Quantization & Optimization

Multi-GPU & Systems

Hardware Targets

ScaleML Series

Iris: Multi-GPU Programming in Triton

The Challenge of Multi-GPU Triton Programming

What Iris Adds to Triton

Device grid

Partitioned tensors

Cross-device loads/stores

Collective primitives

Partitioning Computation Across Devices

Single-GPU baseline

Multi-GPU with Iris

Communication Primitives (Collective Ops in Triton)

AllReduce

ReduceScatter + AllGather (ring-style)

Programming Model Comparison: Iris vs. NCCL + PyTorch

Use Cases

Large Model Inference

Distributed Training

Pipeline Parallelism

Performance Considerations

Communication-to-compute ratio

NVLink vs. PCIe

Tile size and alignment

Software pipelining

Measuring Efficiency

Further Reading

Lecture 17: GPU Collective Communication (NCCL)

Practitioner's Guide to Triton (Lecture 14)

Triton Compiler Internals (Lecture 29)

Lecture 67: NCCL & NVSHMEM

Build docs developers (and LLMs) love

Getting Started

CUDA Fundamentals

Advanced GPU Programming

Triton & High-Level Frameworks

Quantization & Optimization

Multi-GPU & Systems

Hardware Targets

ScaleML Series

Documentation Index

​The Challenge of Multi-GPU Triton Programming

​What Iris Adds to Triton

Device grid

Partitioned tensors

Cross-device loads/stores

Collective primitives

​Partitioning Computation Across Devices

​Single-GPU baseline

​Multi-GPU with Iris

​Communication Primitives (Collective Ops in Triton)

​AllReduce

​ReduceScatter + AllGather (ring-style)

​Programming Model Comparison: Iris vs. NCCL + PyTorch

​Use Cases

​Large Model Inference

​Distributed Training

​Pipeline Parallelism

​Performance Considerations

Communication-to-compute ratio

NVLink vs. PCIe

Tile size and alignment

Software pipelining

​Measuring Efficiency

​Further Reading

Lecture 17: GPU Collective Communication (NCCL)

Practitioner's Guide to Triton (Lecture 14)

Triton Compiler Internals (Lecture 29)

Lecture 67: NCCL & NVSHMEM

Build docs developers (and LLMs) love

The Challenge of Multi-GPU Triton Programming

What Iris Adds to Triton

Partitioning Computation Across Devices

Single-GPU baseline

Multi-GPU with Iris

Communication Primitives (Collective Ops in Triton)

AllReduce

ReduceScatter + AllGather (ring-style)

Programming Model Comparison: Iris vs. NCCL + PyTorch

Use Cases

Large Model Inference

Distributed Training

Pipeline Parallelism

Performance Considerations

Measuring Efficiency

Further Reading