GPU Collective Communication with NCCL

When you scale training beyond a single GPU, every backward pass requires synchronizing gradients across devices. NCCL — NVIDIA’s Collective Communications Library — is the engine that makes this fast. Lecture 17, presented by Dan Johnson, explains what NCCL is, why distributed training needs it, and how it works under the hood. This page walks through the core concepts and the real code from the lecture.

Why multi-GPU communication is needed

Modern deep learning models and datasets are too large to fit on a single GPU. Two primary strategies split the work:

Data parallelism — each GPU holds a full copy of the model and processes a different mini-batch. Gradients must be averaged across all GPUs before the optimizer step.
Model parallelism — layers or tensor dimensions are split across GPUs. Activations must be passed between devices during the forward and backward passes.

Both strategies require fast, reliable inter-GPU communication. On a single node, GPUs communicate over NVLink or PCIe. Across nodes, they use InfiniBand or Ethernet. NCCL abstracts all of this behind a unified API.

Collective operations

A collective is an operation where every process in a group participates — both sending and receiving data. NCCL implements the following primitives:

AllReduce

Every GPU contributes a tensor. The result (sum, max, etc.) is returned to all GPUs. Used for gradient averaging in DDP.

Broadcast

One GPU sends a tensor to all other GPUs. Used to synchronize model weights at initialization.

Reduce

Every GPU contributes a tensor. The result is sent to one root GPU only.

Scatter

One GPU splits a tensor into equal chunks and sends one chunk to each GPU.

Gather

Each GPU sends its tensor to one root GPU, which concatenates them.

AllGather

Each GPU sends its tensor and all GPUs receive the full concatenated result.

AllReduce is by far the most common collective in data-parallel training. It is equivalent to a Reduce followed by a Broadcast, but NCCL implements it more efficiently using ring or tree algorithms.

NCCL: NVIDIA Collective Communications Library

NCCL (pronounced “nickel”) is a purpose-built library for multi-GPU and multi-node collective communication. Key properties:

Topology-aware: automatically detects NVLink, NVSwitch, PCIe, and network fabrics and selects the best communication path.
Asynchronous: operations are enqueued on CUDA streams so they can overlap with computation.
Backend for PyTorch: when you call dist.init_process_group("nccl"), PyTorch uses NCCL for all GPU tensors.

NCCL is separate from CUDA but ships with most PyTorch distributions. You can query its version with:

import torch
print(torch.cuda.nccl.version())

Ring-AllReduce algorithm

NCCL implements AllReduce using a ring topology. Given N GPUs arranged in a ring:

Scatter-Reduce phase

Each GPU splits its tensor into N chunks. Over N-1 steps, each GPU sends one chunk to its right neighbor and receives one chunk from its left neighbor, accumulating a partial reduction. After this phase, each GPU holds one fully-reduced chunk.

AllGather phase

Over another N-1 steps, each GPU propagates its fully-reduced chunk around the ring. After this phase, every GPU has the complete reduced tensor.

The total data sent per GPU is 2 * (N-1)/N * tensor_size, which approaches 2 * tensor_size as N grows. This makes Ring-AllReduce bandwidth-optimal — the per-GPU communication cost is nearly independent of the number of GPUs.

Tree-based algorithms (recursive halving/doubling) have lower latency for small tensors. NCCL switches between ring and tree strategies automatically based on message size and topology.

PyTorch DDP

DistributedDataParallel (DDP) is PyTorch’s high-level wrapper around NCCL. It handles:

Replicating the model to each GPU rank.
Scattering mini-batches across ranks.
Triggering AllReduce on gradients during the backward pass.
Averaging gradients so every rank applies an identical optimizer step.

Minimal example (`ddp_simple.py`)

This is the exact code from Lecture 17. It demonstrates the core DDP setup with a toy single-parameter model:

# modified from https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.profiler import profile

from torch.nn.parallel import DistributedDataParallel as DDP


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.w = nn.Parameter(torch.tensor(5.0))

    def forward(self, x):
        return self.w * 7.0 * x


def demo_basic():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    print(f"Start running basic DDP example on rank {rank}.")

    model = ToyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    with profile() as prof:
        x = torch.tensor(dist.get_rank(), dtype=torch.float)
        y = ddp_model(x)
        print(f"rank {rank}: y=w*7*x: {y.item()}={ddp_model.module.w.item()}*7*{x.item()}")
        print(f"rank {rank}: dy/dw=7*x: {7.0*x.item()}")
        y.backward()
        print(f"rank {rank}: reduced dy/dw: {ddp_model.module.w.grad.item()}")
    if rank == 0:
        print("exporting trace")
        prof.export_chrome_trace("trace_ddp_simple.json")
    dist.destroy_process_group()


if __name__ == "__main__":
    print("Running")
    demo_basic()

Run it with torchrun:

torchrun --nnodes=1 --nproc_per_node=2 \
  --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=localhost:29400 \
  ddp_simple.py

When rank 0 and rank 1 compute gradients, they get dy/dw = 7*0 = 0 and dy/dw = 7*1 = 7 respectively. After AllReduce (sum + divide by 2), both ranks receive the averaged gradient 3.5.

Full training loop example (`ddp_example.py`)

The second file from the lecture adds an optimizer, a loss function, and a 10-step training loop on a realistic model size (4000-dimensional linear layers):

# modified from https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

import torch
import torch.distributed as dist
import torch.nn as nn

from torch.nn.parallel import DistributedDataParallel as DDP
from torch.profiler import profile
import torch.optim as optim

SIZE = 4000


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(SIZE, SIZE)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(SIZE, SIZE)
        self.net3 = nn.Linear(SIZE, SIZE)

    def forward(self, x):
        return self.net3(self.relu(self.net2(self.relu(self.net1(x)))))


def demo_basic():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    print(f"Start running basic DDP example on rank {rank}.")

    model = ToyModel().to(rank)
    ddp_model = DDP(model, bucket_cap_mb=25, device_ids=[rank])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    with profile(
        record_shapes=True,
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
    ) as prof:
        for i in range(10):
            optimizer.zero_grad()
            outputs = ddp_model(torch.randn(1000, SIZE, device=rank))
            labels = torch.randn(1000, SIZE, device=rank)
            loss_fn(outputs, labels).backward()
            optimizer.step()
    if rank == 0:
        prof.export_chrome_trace("trace_ddp_example.json")


if __name__ == "__main__":
    demo_basic()

torchrun --nnodes=1 --nproc_per_node=2 \
  --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=localhost:29400 \
  ddp_example.py

bucket_cap_mb=25 controls how DDP groups gradients into communication buckets. Larger buckets reduce NCCL call overhead at the cost of delaying gradient overlap. Tune this for your model size and network bandwidth.

Communication/computation overlap

DDP does not wait for the full backward pass to finish before starting AllReduce. Instead, it fires AllReduce on each gradient bucket as soon as the last gradient in that bucket is computed. This overlaps communication with the remaining backward computation. The PyTorch profiler trace exported by both example scripts (e.g. trace_ddp_example.json) makes this overlap visible. Open it in chrome://tracing or Perfetto to see NCCL AllReduce ops running alongside CUDA backward kernels.

If you see AllReduce serialized after the backward pass in your profile, your buckets may be too large. Reduce bucket_cap_mb to allow earlier gradient communication.

NCCL environment variables for tuning

NCCL exposes its behavior through environment variables. The most useful ones:

Variable	Default	Effect
`NCCL_DEBUG`	`""`	Set to `INFO` or `WARN` to enable NCCL logging
`NCCL_DEBUG_SUBSYS`	`""`	Filter logs to a subsystem, e.g. `GRAPH,COLL`
`NCCL_SOCKET_IFNAME`	auto	Force a specific network interface (e.g. `eth0`)
`NCCL_IB_DISABLE`	`0`	Set to `1` to disable InfiniBand and fall back to IP
`NCCL_P2P_DISABLE`	`0`	Set to `1` to disable peer-to-peer GPU transfers
`NCCL_ALGO`	auto	Force an algorithm: `Ring`, `Tree`, or `CollNet`
`NCCL_PROTO`	auto	Force a protocol: `Simple`, `LL`, or `LL128`
`NCCL_BUFFSIZE`	auto	Internal buffer size in bytes

# Example: enable verbose NCCL logging for debugging
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=GRAPH,COLL
torchrun --nproc_per_node=4 train.py

Forcing NCCL_ALGO or NCCL_PROTO without profiling first can hurt performance. NCCL’s auto-selection is generally well-tuned for your hardware topology.

Lecture 67: NCCL & NVSHMEM (Jeff Hammond)

Lecture 67 by Jeff Hammond goes deeper into both NCCL and NVSHMEM (NVIDIA’s shared-memory model for multi-GPU programming). NVSHMEM enables GPU threads to directly read and write memory on remote GPUs using a partitioned global address space (PGAS), avoiding explicit send/receive calls.

Lecture 70: Fault-tolerant collectives (mike64_t)

Production distributed training jobs fail — a GPU goes down, a node is preempted, a network link flaps. Lecture 70 covers how to build collective operations that tolerate these failures without restarting the entire job.

Getting Started

CUDA Fundamentals

Advanced GPU Programming

Triton & High-Level Frameworks

Quantization & Optimization

Multi-GPU & Systems

Hardware Targets

ScaleML Series

GPU Collective Communication with NCCL

Why multi-GPU communication is needed

Collective operations

AllReduce

Broadcast

Reduce

Scatter

Gather

AllGather

NCCL: NVIDIA Collective Communications Library

Ring-AllReduce algorithm

PyTorch DDP

Minimal example (`ddp_simple.py`)

Full training loop example (`ddp_example.py`)

Communication/computation overlap

NCCL environment variables for tuning

Lecture 67: NCCL & NVSHMEM (Jeff Hammond)

Lecture 70: Fault-tolerant collectives (mike64_t)

Lecture 17 slides

GPU Mode Discord

Build docs developers (and LLMs) love

Getting Started

CUDA Fundamentals

Advanced GPU Programming

Triton & High-Level Frameworks

Quantization & Optimization

Multi-GPU & Systems

Hardware Targets

ScaleML Series

Documentation Index

​Why multi-GPU communication is needed

​Collective operations

AllReduce

Broadcast

Reduce

Scatter

Gather

AllGather

​NCCL: NVIDIA Collective Communications Library

​Ring-AllReduce algorithm

​PyTorch DDP

​Minimal example (ddp_simple.py)

​Full training loop example (ddp_example.py)

​Communication/computation overlap

​NCCL environment variables for tuning

​Related lectures

​Lecture 67: NCCL & NVSHMEM (Jeff Hammond)

​Lecture 70: Fault-tolerant collectives (mike64_t)

Lecture 17 slides

GPU Mode Discord

Build docs developers (and LLMs) love

Why multi-GPU communication is needed

Collective operations

NCCL: NVIDIA Collective Communications Library

Ring-AllReduce algorithm

PyTorch DDP

Minimal example (`ddp_simple.py`)

Full training loop example (`ddp_example.py`)

Communication/computation overlap

NCCL environment variables for tuning

Related lectures

Lecture 67: NCCL & NVSHMEM (Jeff Hammond)

Lecture 70: Fault-tolerant collectives (mike64_t)