When you scale training beyond a single GPU, every backward pass requires synchronizing gradients across devices. NCCL, NVIDIA’s Collective Communications Library, is the engine that makes this fast. Lecture 17, presented by Dan Johnson, explains what NCCL is, why distributed training needs it, and how it works under the hood. This page walks through the core concepts and the real code from the lecture.
Why multi-GPU communication is needed
Modern deep learning models and datasets are often too large to train on a single GPU in reasonable time, or to fit in its memory at all. Two primary strategies split the work:
- Data parallelism: each GPU holds a full copy of the model and processes a different mini-batch. Gradients must be averaged across all GPUs before the optimizer step.
- Model parallelism: layers or tensor dimensions are split across GPUs. Activations must be passed between devices during the forward and backward passes.
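To see why averaging gradients is the right synchronization step for data parallelism, here is a minimal pure-Python sketch. It assumes a hypothetical one-parameter model `y_hat = w * x` with a mean-squared-error loss and equal-sized shards; none of these names come from the lecture. Under those assumptions, the average of the per-shard gradients matches the gradient over the full batch.

```python
# Pure-Python sketch: data-parallel gradient averaging for a 1-parameter model.
# Assumed model (illustrative only): y_hat = w * x, loss = mean((y_hat - y)^2).

def mse_grad(w, batch):
    """Gradient d(loss)/dw of the mean squared error over one batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def data_parallel_grad(w, batch, world_size):
    """Split the batch into equal shards, compute one gradient per simulated GPU, then average."""
    shard = len(batch) // world_size
    assert shard * world_size == len(batch), "sketch assumes equal-sized shards"
    local = [mse_grad(w, batch[r * shard:(r + 1) * shard]) for r in range(world_size)]
    return sum(local) / world_size

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.5
full = mse_grad(w, batch)                               # gradient on one "big GPU"
averaged = data_parallel_grad(w, batch, world_size=2)   # gradient after averaging two shards
print(full, averaged)
```

With unequal shard sizes a plain average would no longer match the full-batch gradient, which is one reason data loaders usually give each rank the same number of samples.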
Collective operations
A collective is an operation in which every process in a group participates, both sending and receiving data. NCCL implements the following primitives:
AllReduce
Every GPU contributes a tensor. The result (sum, max, etc.) is returned to all GPUs. Used for gradient averaging in DDP.
Broadcast
One GPU sends a tensor to all other GPUs. Used to synchronize model weights at initialization.
Reduce
Every GPU contributes a tensor. The result is sent to one root GPU only.
Scatter
One GPU splits a tensor into equal chunks and sends one chunk to each GPU.
Gather
Each GPU sends its tensor to one root GPU, which concatenates them.
AllGather
Each GPU sends its tensor and all GPUs receive the full concatenated result.
AllReduce is by far the most common collective in data-parallel training. It is equivalent to a Reduce followed by a Broadcast, but NCCL implements it more efficiently using ring or tree algorithms.
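The semantics of these collectives can be modeled without any GPU. The sketch below is a plain-Python illustration, not NCCL's API: each "rank" is a list, the function names are hypothetical, and `None` marks a rank that receives nothing. It also checks that AllReduce gives the same result as Reduce followed by Broadcast.

```python
# Pure-Python model of collective semantics. ranks[i] is the data held by rank i.
# Function names are illustrative and do not correspond to NCCL or PyTorch calls.

def all_reduce(ranks):
    total = [sum(vals) for vals in zip(*ranks)]
    return [list(total) for _ in ranks]            # every rank gets the sum

def broadcast(ranks, root=0):
    return [list(ranks[root]) for _ in ranks]      # every rank gets root's data

def reduce_to_root(ranks, root=0):
    total = [sum(vals) for vals in zip(*ranks)]
    return [total if r == root else None for r in range(len(ranks))]

def scatter(ranks, root=0):
    n = len(ranks)
    chunk = len(ranks[root]) // n                  # assumes an even split
    return [ranks[root][r * chunk:(r + 1) * chunk] for r in range(n)]

def gather(ranks, root=0):
    joined = [x for data in ranks for x in data]
    return [joined if r == root else None for r in range(len(ranks))]

def all_gather(ranks):
    joined = [x for data in ranks for x in data]
    return [list(joined) for _ in ranks]

ranks = [[1, 2, 3, 4], [10, 20, 30, 40]]
summed = all_reduce(ranks)
via_two_steps = broadcast(reduce_to_root(ranks, root=0), root=0)
print(summed, scatter(ranks), all_gather(ranks))
```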
NCCL: NVIDIA Collective Communications Library
NCCL (pronounced “nickel”) is a purpose-built library for multi-GPU and multi-node collective communication. Key properties:
- Topology-aware: automatically detects NVLink, NVSwitch, PCIe, and network fabrics and selects the best communication path.
- Asynchronous: operations are enqueued on CUDA streams so they can overlap with computation.
- Backend for PyTorch: when you call dist.init_process_group("nccl"), PyTorch uses NCCL for all GPU tensors.
Ring-AllReduce algorithm
One of the algorithms NCCL uses for AllReduce is based on a ring topology. Given N GPUs arranged in a ring:
Scatter-Reduce phase
Each GPU splits its tensor into N chunks. Over N-1 steps, each GPU sends one chunk to its right neighbor and receives one chunk from its left neighbor, accumulating a partial reduction. After this phase, each GPU holds one fully reduced chunk.
AllGather phase
Over another N-1 steps, each GPU forwards the fully reduced chunks around the ring until every GPU holds the complete result.
Across both phases, each GPU sends a total of 2 * (N-1)/N * tensor_size, which approaches 2 * tensor_size as N grows. This makes Ring-AllReduce bandwidth-optimal: the per-GPU communication cost is nearly independent of the number of GPUs.
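The ring algorithm can be simulated in a few lines of plain Python. This is a sketch of the textbook ring AllReduce (sum), not NCCL's actual implementation: ranks are lists, the chunk schedule (rank r sends chunk (r - s) mod N at step s) is one common convention, and the tensor length is assumed to be divisible by N. It also counts how many elements each rank sends, so the 2 * (N-1)/N * tensor_size figure can be checked.

```python
def ring_all_reduce(tensors):
    """Simulate ring AllReduce (sum) over equal-length lists, one list per rank.

    Returns (per-rank results, elements sent by each rank).
    """
    n = len(tensors)
    size = len(tensors[0])
    assert size % n == 0, "sketch assumes the tensor splits evenly into N chunks"
    c = size // n
    # data[r][k] is rank r's local copy of chunk k.
    data = [[list(t[k * c:(k + 1) * c]) for k in range(n)] for t in tensors]
    sent = [0] * n

    # Phase 1: scatter-reduce. At step s, rank r sends chunk (r - s) % n to its
    # right neighbor, which adds the payload into its own copy of that chunk.
    for s in range(n - 1):
        msgs = [(r, (r - s) % n, list(data[r][(r - s) % n])) for r in range(n)]
        for r, k, payload in msgs:
            dst = (r + 1) % n
            data[dst][k] = [a + b for a, b in zip(data[dst][k], payload)]
            sent[r] += c
    # Rank r now holds the fully reduced chunk (r + 1) % n.

    # Phase 2: all-gather. At step s, rank r forwards chunk (r + 1 - s) % n and
    # the receiver overwrites its stale copy with the reduced values.
    for s in range(n - 1):
        msgs = [(r, (r + 1 - s) % n, list(data[r][(r + 1 - s) % n])) for r in range(n)]
        for r, k, payload in msgs:
            data[(r + 1) % n][k] = payload
            sent[r] += c

    result = [[x for chunk in data[r] for x in chunk] for r in range(n)]
    return result, sent

tensors = [[10 * r + i for i in range(8)] for r in range(4)]   # 4 ranks, 8 elements each
expected = [sum(col) for col in zip(*tensors)]
result, sent = ring_all_reduce(tensors)
print(result[0], sent)
```

With N = 4 and 8 elements per rank, each rank sends 2 * 3 * 2 = 12 elements, which equals 2 * (N-1)/N * tensor_size.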
PyTorch DDP
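DistributedDataParallel (DDP) is PyTorch’s high-level data-parallel wrapper, which uses NCCL for communication when training on GPUs. It handles:
- Replicating the model to each GPU rank by broadcasting the initial weights from rank 0.
- Triggering AllReduce on gradients during the backward pass.
- Averaging gradients so every rank applies an identical optimizer step.
Splitting mini-batches across ranks is not done by DDP itself. That is usually handled by the data loader, for example with DistributedSampler.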
Minimal example (ddp_simple.py)
The first script from Lecture 17 demonstrates the core DDP setup with a toy single-parameter model and is launched with torchrun. With two ranks, rank 0 and rank 1 compute the local gradients dy/dw = 7*0 = 0 and dy/dw = 7*1 = 7 respectively. After AllReduce (sum + divide by 2), both ranks receive the averaged gradient 3.5.
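The arithmetic can be reproduced without PyTorch. The sketch below is not the lecture's script: it assumes a hypothetical stand-in model where rank r computes y = w * 7 * r, so that the local gradient is 7 * r, and it models AllReduce as a sum followed by a division by the world size. The learning rate and starting weight are also illustrative values.

```python
# Hypothetical stand-in for the toy model: on rank r, y = w * 7 * r, so dy/dw = 7 * r.
def local_grad(rank):
    return 7.0 * rank

world_size = 2
grads = [local_grad(r) for r in range(world_size)]   # per-rank gradients before sync
summed = sum(grads)                                   # AllReduce with SUM
averaged = [summed / world_size] * world_size         # every rank ends up with the same value

# Because every rank sees the same averaged gradient, an SGD step keeps weights in sync.
lr = 0.1                                              # illustrative learning rate
weights = [1.0] * world_size                          # identical starting weights
new_weights = [w - lr * g for w, g in zip(weights, averaged)]
print(grads, averaged, new_weights)
```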
Full training loop example (ddp_example.py)
The second file from the lecture adds an optimizer, a loss function, and a 10-step training loop on a more realistic model size (4000-dimensional linear layers).
The bucket_cap_mb=25 argument sets the size cap, in megabytes, that DDP uses when grouping gradients into communication buckets. Larger buckets reduce NCCL call overhead, but communication starts later, which leaves less of the backward pass to overlap with. Tune this for your model size and network bandwidth.
Communication/computation overlap
DDP does not wait for the full backward pass to finish before starting AllReduce. Instead, it fires AllReduce on each gradient bucket as soon as the last gradient in that bucket is computed. This overlaps communication with the remaining backward computation. The PyTorch profiler trace exported by both example scripts (e.g. trace_ddp_example.json) makes this overlap visible. Open it in chrome://tracing or Perfetto to see NCCL AllReduce ops running alongside CUDA backward kernels.
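The effect of the bucket size cap can be illustrated with a simplified greedy grouping. This is a sketch only: DDP's real bucketing has additional rules, and the function name, parameter names, and sizes below are hypothetical. It walks parameters in reverse order because gradients become available roughly in the reverse of the forward order, so the first bucket fills early in the backward pass.

```python
def assign_buckets(param_sizes_mb, bucket_cap_mb=25):
    """Greedily group gradients into buckets, walking parameters in reverse order.

    Simplified illustration of a size cap; not DDP's actual bucketing logic.
    """
    buckets, current, current_mb = [], [], 0.0
    for name, size_mb in reversed(param_sizes_mb):
        # Close the current bucket if adding this gradient would exceed the cap.
        if current and current_mb + size_mb > bucket_cap_mb:
            buckets.append(current)
            current, current_mb = [], 0.0
        current.append(name)
        current_mb += size_mb
    if current:
        buckets.append(current)
    return buckets

# Hypothetical three-layer model: (parameter name, gradient size in MB).
params = [
    ("fc1.weight", 10), ("fc1.bias", 0.01),
    ("fc2.weight", 10), ("fc2.bias", 0.01),
    ("fc3.weight", 10), ("fc3.bias", 0.01),
]
large_cap = assign_buckets(params, bucket_cap_mb=25)
small_cap = assign_buckets(params, bucket_cap_mb=12)
print(len(large_cap), len(small_cap))
```

Each closed bucket corresponds to one AllReduce that can start while earlier layers are still computing gradients. A smaller cap produces more, earlier collectives, and a larger cap produces fewer, later ones.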
NCCL environment variables for tuning
NCCL exposes its behavior through environment variables. The most useful ones:
| Variable | Default | Effect |
|---|---|---|
| NCCL_DEBUG | "" | Set to INFO or WARN to enable NCCL logging |
| NCCL_DEBUG_SUBSYS | "" | Filter logs to a subsystem, e.g. GRAPH,COLL |
| NCCL_SOCKET_IFNAME | auto | Force a specific network interface (e.g. eth0) |
| NCCL_IB_DISABLE | 0 | Set to 1 to disable InfiniBand and fall back to IP |
| NCCL_P2P_DISABLE | 0 | Set to 1 to disable peer-to-peer GPU transfers |
| NCCL_ALGO | auto | Force an algorithm: Ring, Tree, or CollNet |
| NCCL_PROTO | auto | Force a protocol: Simple, LL, or LL128 |
| NCCL_BUFFSIZE | auto | Internal buffer size in bytes |
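These variables are usually exported in the shell, but they can also be set from Python as long as that happens before the process group is created, since NCCL reads them at initialization. The sketch below uses a hypothetical helper, `nccl_debug_env`, which is not part of PyTorch or NCCL, and the interface name `eth0` is only an example value.

```python
import os

def nccl_debug_env(subsystems="GRAPH,COLL", interface=None):
    """Build a dict of NCCL debug settings. Hypothetical helper, not a library function."""
    env = {"NCCL_DEBUG": "INFO", "NCCL_DEBUG_SUBSYS": subsystems}
    if interface is not None:
        env["NCCL_SOCKET_IFNAME"] = interface   # pin NCCL to one network interface
    return env

# Apply the settings before any call that initializes NCCL
# (for example dist.init_process_group("nccl")).
settings = nccl_debug_env(interface="eth0")
os.environ.update(settings)
print({k: os.environ[k] for k in settings})
```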
Related lectures
Lecture 67: NCCL & NVSHMEM (Jeff Hammond)
Lecture 67 by Jeff Hammond goes deeper into both NCCL and NVSHMEM (NVIDIA’s shared-memory model for multi-GPU programming). NVSHMEM enables GPU threads to directly read and write memory on remote GPUs using a partitioned global address space (PGAS), avoiding explicit send/receive calls.
Lecture 70: Fault-tolerant collectives (mike64_t)
Production distributed training jobs fail: a GPU goes down, a node is preempted, a network link flaps. Lecture 70 covers how to build collective operations that tolerate these failures without restarting the entire job.
Lecture 17 slides
Dan Johnson’s original lecture slides on GPU collective communication
GPU Mode Discord
Ask questions and discuss NCCL, DDP, and distributed training