Ring Attention solves one of the fundamental constraints of transformer scaling: as sequence length grows, even Flash Attention’s O(N) memory per GPU becomes too large to fit on a single device. Ring Attention distributes the sequence across a ring of GPUs, allowing each device to hold only a fraction of the sequence while collectively computing exact attention. This page is based on Lecture 13 by Andreas Koepf.
The long-sequence problem
Flash Attention reduces attention’s memory from O(N²) to O(N), a massive improvement. But O(N) still grows linearly with sequence length. At sequence length 128K with head dimension 128, each of Q, K, V, and the output holds 128K × 128 elements, which is 32 MB at 16-bit precision, so a single attention head requires roughly 128 MB of activations. For a 70-billion-parameter model with 64 heads, that is about 8 GB per layer, per forward pass. GPU memory therefore puts a hard upper bound on sequence length for single-GPU attention. Ring Attention breaks that ceiling.

Ring Attention computes exact attention, not an approximation. The key insight is that Flash Attention’s tiling approach generalizes naturally to a distributed setting where different GPUs hold different tiles of K and V.
How Ring Attention distributes the sequence
Ring Attention assigns each GPU a contiguous chunk of the sequence:

- GPU 0 holds tokens [0, N/P)
- GPU 1 holds tokens [N/P, 2N/P)
- …
- GPU P-1 holds tokens [(P-1)N/P, N)
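The assignment above can be sketched in a few lines. This is only an illustration: `chunk_bounds` is a hypothetical helper name, and it assumes the sequence length divides evenly by the number of GPUs (real implementations typically pad).

```python
def chunk_bounds(seq_len: int, num_gpus: int) -> list[tuple[int, int]]:
    """Return the half-open token range [start, end) owned by each GPU."""
    if seq_len % num_gpus != 0:
        # Assumption of this sketch: no padding logic, so require an even split.
        raise ValueError("seq_len must be divisible by num_gpus")
    chunk = seq_len // num_gpus
    return [(rank * chunk, (rank + 1) * chunk) for rank in range(num_gpus)]


print(chunk_bounds(seq_len=16, num_gpus=4))
# [(0, 4), (4, 8), (8, 12), (12, 16)]
```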
The ring communication pattern
The algorithm runs for P rounds. In each round, every GPU:

- Computes attention between its local Q chunk and the current K/V chunk it holds
- Updates its local Flash Attention accumulators (running max, normalizer, output)
- Sends the current K/V chunk to the next GPU in the ring
- Receives a new K/V chunk from the previous GPU
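To make the rotation concrete, the following sketch tracks which K/V chunk each GPU holds in each round. It simulates only the bookkeeping, with no real communication, and `ring_schedule` is a hypothetical name used for illustration.

```python
def ring_schedule(num_gpus: int) -> list[list[int]]:
    """schedule[r][g] is the index of the K/V chunk GPU g holds in round r."""
    held = list(range(num_gpus))  # round 0: every GPU holds its own chunk
    schedule = []
    for _ in range(num_gpus):
        schedule.append(held)
        # GPU g sends to (g + 1) % P, so it receives from (g - 1) % P.
        held = [held[(g - 1) % num_gpus] for g in range(num_gpus)]
    return schedule


for r, held in enumerate(ring_schedule(4)):
    print(f"round {r}: {held}")
# round 0: [0, 1, 2, 3]
# round 1: [3, 0, 1, 2]
# round 2: [2, 3, 0, 1]
# round 3: [1, 2, 3, 0]
```

Reading down any column shows that after P rounds each GPU has seen every K/V chunk exactly once, which is why the result matches full attention.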
Combining with Flash Attention tiling
Ring Attention works at the level of Flash Attention’s outer loop. Each GPU runs Flash Attention for its Q chunk, but instead of iterating only over local K/V tiles in the inner loop, it iterates over K/V chunks arriving from the ring. Whenever a new chunk raises the running max from m_old to m_new, the accumulated output and normalizer are rescaled by exp(m_old - m_new) before adding the new partial result. This is the same update rule used within a single GPU’s tile loop.
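The rescaling rule can be checked with a small pure-Python simulation for a single query row. This is a sketch under simplifying assumptions: no 1/sqrt(d) scaling, no masking, plain lists instead of tensors, and hypothetical helper names (`fold_chunk`, `full_attention`). It folds K/V chunks in the order one GPU would receive them from the ring and compares the result with ordinary softmax attention over the whole sequence.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fold_chunk(q, k_chunk, v_chunk, state):
    """Fold one K/V chunk into the running (max, normalizer, output) of one query."""
    m_old, l_old, o_old = state
    scores = [dot(q, k) for k in k_chunk]
    m_new = max(m_old, max(scores))
    rescale = math.exp(m_old - m_new)  # exp(-inf) == 0.0 on the first chunk
    weights = [math.exp(s - m_new) for s in scores]
    l_new = l_old * rescale + sum(weights)
    o_new = [o * rescale + sum(w * v[j] for w, v in zip(weights, v_chunk))
             for j, o in enumerate(o_old)]
    return m_new, l_new, o_new


def full_attention(q, keys, values):
    """Reference: ordinary softmax attention over the whole sequence."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


random.seed(0)
d, chunk_len, num_chunks = 4, 3, 4


def rand_vec():
    return [random.gauss(0, 1) for _ in range(d)]


q = rand_vec()
k_chunks = [[rand_vec() for _ in range(chunk_len)] for _ in range(num_chunks)]
v_chunks = [[rand_vec() for _ in range(chunk_len)] for _ in range(num_chunks)]

# Chunks in the order "GPU 2" would see them on a 4-GPU ring: 2, 1, 0, 3.
state = (float("-inf"), 0.0, [0.0] * d)
for step in range(num_chunks):
    idx = (2 - step) % num_chunks
    state = fold_chunk(q, k_chunks[idx], v_chunks[idx], state)
_, l, o = state
ring_out = [x / l for x in o]  # final normalization happens once, at the end

reference = full_attention(q, [k for c in k_chunks for k in c],
                           [v for c in v_chunks for v in c])
max_err = max(abs(a - b) for a, b in zip(ring_out, reference))
print(f"max abs difference: {max_err:.2e}")
```

The difference should be at floating-point rounding level, regardless of the order in which chunks arrive.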
Memory scaling
The memory cost on each GPU scales as:

| Quantity | Memory per GPU |
|---|---|
| Q, K, V projections | O(N/P · d) |
| Flash Attention accumulators | O(N/P · d) for the output, O(N/P) for the running max and normalizer |
| K/V communication buffer | O(N/P · d) (one chunk in flight) |
| Total | O(N/P · d) |
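As a rough illustration of the table, the sketch below estimates the per-head activation footprint on one GPU for a few ring sizes. The helper name `per_gpu_bytes` is hypothetical, and the numbers rest on stated assumptions: 2 bytes per element for Q, K, V, and the output, fp32 softmax statistics, a single receive buffer for one K chunk and one V chunk, and no other activations counted.

```python
def per_gpu_bytes(seq_len, num_gpus, head_dim, bytes_per_elem=2):
    """Rough per-head, per-GPU activation footprint in bytes (illustrative only)."""
    tokens = seq_len // num_gpus                 # N/P tokens per GPU
    tensor = tokens * head_dim * bytes_per_elem  # one (N/P x d) tensor
    return {
        "q_k_v_out": 4 * tensor,          # local Q, K, V and the output accumulator
        "kv_recv_buffer": 2 * tensor,     # one incoming K chunk and one V chunk
        "softmax_stats": 2 * tokens * 4,  # running max and normalizer, assumed fp32
    }


for p in (1, 2, 4, 8):
    total = sum(per_gpu_bytes(128 * 1024, p, 128).values())
    print(f"P={p}: {total / 2**20:.1f} MiB per head")
```

Every term shrinks in proportion to 1/P, so doubling the ring size halves the per-GPU footprint at a fixed sequence length.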
Implementation with NCCL
Ring Attention uses NCCL’s point-to-point primitives (ncclSend / ncclRecv) to pass K/V chunks between adjacent GPUs. The critical optimization is overlapping communication with computation: while one GPU computes attention for the current K/V chunk, it simultaneously sends that chunk to the next GPU.
True overlap between compute and communication requires CUDA streams. The K/V send should be launched on a separate CUDA stream from the Flash Attention kernel so both operations proceed concurrently.
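To see why overlap matters, here is a back-of-the-envelope timing model rather than real NCCL code. It assumes every round has the same compute time and the same transfer time, and that a separate stream hides whichever of the two is shorter; `ring_time` and its parameters are hypothetical names for illustration.

```python
def ring_time(num_gpus, compute_s, transfer_s, overlap):
    """Estimated wall time for P rounds under a simple per-round cost model."""
    if overlap:
        # Compute and send/recv run concurrently: the slower one sets the pace.
        per_round = max(compute_s, transfer_s)
    else:
        # Same stream: the transfer waits for the kernel, then the next kernel waits.
        per_round = compute_s + transfer_s
    return num_gpus * per_round


serial = ring_time(num_gpus=8, compute_s=1.0, transfer_s=0.5, overlap=False)
hidden = ring_time(num_gpus=8, compute_s=1.0, transfer_s=0.5, overlap=True)
print(serial, hidden)  # 12.0 8.0
```

Under this model, communication is free as long as transferring a K/V chunk takes no longer than computing attention on it. Once the transfer becomes the slower of the two, the ring is bandwidth-bound.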
Practical use cases
Ring Attention is well-suited to tasks that require long context at training or inference time:

- Long-context LLMs: models trained on books, codebases, or long documents (e.g., 128K–1M token contexts)
- Multi-modal inputs: combining high-resolution image tokens with text in a single sequence
- Scientific sequences: genomic data, protein structure prediction, or long time-series
- Sliding-window hybrid: Ring Attention can be combined with local/sparse attention patterns to reduce communication while keeping global context
Causal masking considerations
For decoder-only models with causal masking, tokens in a chunk only attend to tokens in the same or earlier chunks. This means GPUs holding earlier chunks do less work: most of the K/V chunks they receive lie entirely in the future and are fully masked out, while the GPU holding the last chunk attends to every chunk. Load across the ring therefore becomes uneven for causal attention. A common fix is to assign chunks in an interleaved or zigzag pattern so that every GPU handles both early and late tokens, balancing the masked-out fraction. For example, the sequence can be split into 2P chunks, with GPU i holding chunks i and 2P-1-i.
Further reading
- Lecture 13 slides: Andreas Koepf’s Ring Attention slides
- Flash Attention: the single-GPU tiling algorithm that Ring Attention builds on
- GPU Collectives & NCCL: Lecture 17, collective communication primitives used for ring passes
- Ring Attention paper: Liu et al., 2023, Ring Attention with Blockwise Transformers