Transformers have no built-in notion of order: without explicit positional information, the model treats “the cat sat on the mat” and “the mat sat on the cat” identically. Positional encodings solve this, but the choice of encoding has deep consequences for how well a model generalizes to sequences longer than it was trained on. Songlin Yang presents ScaleML Lecture 74, covering the landscape of positional encoding schemes and introducing PaTH Attention as a new approach for long-sequence modeling. This lecture is part of the GPU Mode ScaleML Series.
Why positional encodings matter
Self-attention computes pairwise interactions between all tokens using queries, keys, and values. The score $q_i^\top k_j$ that determines how much token $i$ attends to token $j$ depends only on the content of the two tokens, not on where they sit in the sequence. Attention is therefore permutation-equivariant by default: shuffling the input tokens merely shuffles the outputs in the same way. Positional encodings inject position information so the model can distinguish tokens at different positions.
The ideal positional encoding has several properties:
- Unique representation per position: two positions should never collide
- Relative distances should be representable: the model should know how far apart two tokens are
- Length generalization: the encoding should work for sequences longer than those seen during training
- Efficiency: it should not add significant compute or memory overhead
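The order-blindness of attention can be checked directly. The sketch below is a toy single-head attention in plain Python, assuming identity projections (queries, keys, and values are all the raw token vectors) and no positional signal; the function names and the random inputs are illustrative, not taken from the lecture.

```python
import math
import random

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head attention with Q = K = V = x and no positional information."""
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, x)) for d in range(len(q))])
    return out

random.seed(0)
tokens = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]
perm = [3, 0, 4, 1, 2]
shuffled = [tokens[p] for p in perm]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# Row r of the shuffled output should match row perm[r] of the original output.
max_diff = max(
    abs(a - b)
    for row, p in zip(out_shuffled, perm)
    for a, b in zip(row, out[p])
)
print(max_diff)  # tiny: only floating-point rounding separates the two
```

Each token receives the same output vector regardless of where it appears, which is exactly the information a positional encoding has to add.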
Absolute positional encodings
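The original Transformer used sinusoidal absolute encodings added to the token embeddings before the attention layers. For position $pos$ and dimension index $i$ in a model of width $d_{\text{model}}$:
$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
Relative positional encodings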
Rather than encoding the absolute index of each token, relative positional encodings encode the distance between two tokens. The attention score between token $i$ and token $j$ depends on their relative offset $i - j$ rather than on their absolute positions. Two influential designs:
- Shaw et al. (2018): learned relative position representations that add a term to the attention logits, with distances clipped at a maximum value
- T5 relative biases: bucketed relative position biases shared across layers, with wider buckets for more distant positions
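The clipping idea can be sketched in a few lines. This is a deliberately simplified version that stores one scalar bias per clipped offset; the table values are made up for illustration (real values would be learned), and the helper names are hypothetical.

```python
def clipped_relative_offsets(seq_len, max_distance):
    """Matrix of offsets j - i, clipped to [-max_distance, max_distance]."""
    return [
        [max(-max_distance, min(max_distance, j - i)) for j in range(seq_len)]
        for i in range(seq_len)
    ]

def relative_bias(seq_len, max_distance, table):
    """Look up one bias per (i, j) pair from a table of 2 * max_distance + 1 entries."""
    offsets = clipped_relative_offsets(seq_len, max_distance)
    return [[table[o + max_distance] for o in row] for row in offsets]

# Hypothetical "learned" table for offsets -2, -1, 0, +1, +2.
table = [-0.4, -0.2, 0.0, 0.3, 0.1]
bias = relative_bias(seq_len=5, max_distance=2, table=table)

for row in bias:
    print(row)
```

Every pair with the same offset shares a bias, and all pairs farther apart than `max_distance` fall back to the edge entries of the table, so the number of parameters does not grow with sequence length.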
RoPE: rotary position embedding
RoPE (Su et al., 2021) is the positional encoding used in LLaMA, Mistral, Falcon, and most modern decoder-only LLMs. It encodes position by rotating the query and key vectors in 2D subspaces before computing attention.
The math
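For a query vector $q$ at position $m$ and a key vector $k$ at position $n$, RoPE applies a rotation matrix parameterized by the position:
$$\tilde{q}_m = R_m q, \qquad \tilde{k}_n = R_n k$$
Here $R_m$ is block-diagonal and rotates the $i$-th pair of dimensions by the angle $m\theta_i$, with $\theta_i = 10000^{-2i/d}$. The dot product then becomes:
$$\tilde{q}_m^\top \tilde{k}_n = q^\top R_m^\top R_n k = q^\top R_{n-m} k$$
The rotation matrices combine so that only the relative offset $n - m$ remains in the final dot product. This means RoPE encodes relative position while being applied to each token independently: no cross-token interaction is needed.
Why RoPE generalizes well (up to a point)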
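The relative-offset property that this argument rests on can be verified numerically. The following is a minimal pure-Python sketch of the rotation, assuming the common convention of pairing adjacent dimensions and a base of 10000; the vectors and positions are arbitrary illustrative values.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i + 1]), i even, by the angle position * base**(-i / d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def score(q, k, m, n):
    """Attention logit between a query at position m and a key at position n."""
    return sum(a * b for a, b in zip(rope(q, m), rope(k, n)))

q = [0.5, -1.0, 0.3, 0.8]
k = [1.2, 0.1, -0.7, 0.4]

# Same offset (n - m = -3) at very different absolute positions.
s_near = score(q, k, m=5, n=2)
s_far = score(q, k, m=105, n=102)
# A different offset (n - m = -1) gives a different logit.
s_other = score(q, k, m=5, n=4)

print(s_near, s_far, s_other)
```

The first two logits agree up to floating-point rounding even though the absolute positions differ by 100, while changing the offset changes the logit.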
Because RoPE encodes relative distance via rotation angles, and longer sequences simply have larger rotation angles, the model can in principle extrapolate. In practice, extrapolation breaks down beyond ~2× the training length because the rotation angles fall outside the distribution seen during training.
ALiBi: attention with linear biases
ALiBi (Press et al., 2022) takes a different approach: instead of modifying the query/key vectors, it adds a fixed, non-learned bias to the attention logits that decreases linearly with the distance between tokens, using a different slope for each head.
ALiBi models tend to generalize well to lengths beyond their training context, often with graceful degradation rather than catastrophic failure. However, they may lag behind RoPE models on tasks requiring precise position tracking, because the linear bias is a coarser positional signal.
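As a rough illustration, the bias matrix for one head can be built as follows. The slope schedule here is the geometric sequence the ALiBi paper describes for a power-of-two number of heads; the function names are hypothetical and the causal mask is included only to show where the bias sits relative to it.

```python
def alibi_slopes(num_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... for a power-of-two head count n."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Causal bias matrix: -slope * (i - j) for keys j <= i, -inf for future keys."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])  # the head with the steepest slope

for row in bias:
    print(row)
```

The matrix is added to the query-key logits before the softmax. Because the penalty is defined for any distance, nothing in the construction depends on the training length, which is the intuition behind ALiBi's graceful behavior on longer inputs.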
YaRN and other RoPE extensions
Several methods extend RoPE to handle sequences much longer than the training length:
Position interpolation (PI)
Scales down the position indices so that a model trained on 2K tokens can handle 8K by mapping 8K positions into the 2K range. Requires a short fine-tuning run to adapt. Simple and effective.
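The 2K-to-8K example amounts to a single rescaling of the position index before the RoPE angles are computed. A minimal sketch, with hypothetical helper names and the usual base of 10000 assumed:

```python
def interpolated_position(position, train_len, target_len):
    """Linearly squeeze positions in [0, target_len) into [0, train_len)."""
    return position * train_len / target_len

def rope_angles(position, dim, base=10000.0):
    """Rotation angle for each 2D pair at the given (possibly fractional) position."""
    return [position * base ** (-i / dim) for i in range(0, dim, 2)]

train_len, target_len = 2048, 8192

last_pos = target_len - 1
scaled_pos = interpolated_position(last_pos, train_len, target_len)
angles = rope_angles(scaled_pos, dim=8)

print(scaled_pos)  # stays below train_len
print(angles)
```

Positions become fractional, so the model sees angles inside the trained range but at a finer spacing than before, which is why a short fine-tuning run is needed.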
YaRN (Yet another RoPE extensioN)
Applies a non-uniform scaling: low-frequency dimensions are scaled more aggressively than high-frequency ones. This preserves the fine-grained local positional signal while extending the long-range signal. YaRN achieves better quality than position interpolation with the same fine-tuning budget.
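One way to picture the non-uniform scaling is the per-frequency ramp sketched below. It follows the "by parts" interpolation described in the YaRN paper as far as I understand it: pairs that complete many rotations within the training context are left alone, pairs that complete less than one are interpolated fully, and the rest are blended. The thresholds `alpha=1` and `beta=32` are the values reported for LLaMA-family models and should be treated as assumptions; the attention temperature adjustment that YaRN also applies is omitted.

```python
import math

def yarn_inv_freq(dim, scale, train_len, base=10000.0, alpha=1.0, beta=32.0):
    """Per-pair RoPE frequencies after a YaRN-style 'by parts' interpolation."""
    freqs = []
    for i in range(0, dim, 2):
        theta = base ** (-i / dim)
        # Number of full rotations this pair completes over the training context.
        rotations = train_len * theta / (2 * math.pi)
        # ramp = 0: interpolate fully (divide by scale); ramp = 1: leave unchanged.
        ramp = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        freqs.append((1 - ramp) * theta / scale + ramp * theta)
    return freqs

dim, scale, train_len = 64, 4.0, 2048
original = [10000.0 ** (-i / dim) for i in range(0, dim, 2)]
scaled = yarn_inv_freq(dim, scale, train_len)

print(scaled[0] / original[0])    # highest frequency: ratio 1, untouched
print(scaled[-1] / original[-1])  # lowest frequency: ratio 1 / scale
```

Plain position interpolation would divide every frequency by `scale`; here only the slow-rotating pairs are compressed, so nearby tokens keep the angular resolution the model was trained with.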
LongRoPE
A search-based method that finds the optimal per-dimension scaling factors for extending RoPE, rather than using a fixed formula. Achieves state-of-the-art length generalization but requires an optimization step.
Code RoPE / NTK-aware scaling
Modifies the base frequency (the 10000 constant in RoPE) rather than the position indices. Changes the wavelength of each dimension to accommodate longer sequences. Used in several community fine-tunes of LLaMA.
PaTH Attention: the contribution of Lecture 74
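PaTH (Position encoding via Accumulating Householder Transformations) is the new mechanism introduced in this lecture. It is a positional encoding scheme rather than a sparse or hierarchical attention pattern. In RoPE, the transformation applied between a query at position $i$ and a key at position $j$ is a fixed rotation that depends only on their offset. PaTH replaces that rotation with an accumulated product of data-dependent, Householder-like matrices, one for each position between the key and the query:
$$\text{logit}_{ij} = q_i^\top \Big( \prod_{t=j+1}^{i} H_t \Big) k_j, \qquad H_t = I - \beta_t w_t w_t^\top$$
where $w_t$ and $\beta_t$ are computed from the input at position $t$. Two properties follow:
- Data-dependent: the positional transformation between two tokens depends on the content of the tokens in between, not only on how far apart they are
- Trained end-to-end: the parameters that produce $w_t$ and $\beta_t$ are learned jointly with the rest of the model
PaTH remains softmax attention. Efficient training relies on a compact representation of products of Householder matrices, which permits a FlashAttention-style blockwise algorithm.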
Linear attention alternatives
The lecture situates PaTH within the broader landscape of efficient sequence models. Several architectures avoid the quadratic attention bottleneck entirely:
RetNet
Replaces softmax attention with a retention mechanism: exponential decay as a function of distance, equivalent to a recurrent model during inference. Supports parallel training and recurrent inference.
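The equivalence between the parallel and recurrent views can be shown with a stripped-down retention layer. This sketch keeps only the exponential decay and assumes a single head with scalar decay `gamma`; the rotation applied to queries and keys and the normalization used in the RetNet paper are left out, and all names and inputs are illustrative.

```python
import random

def retention_parallel(q, k, v, gamma):
    """o_n = sum over m <= n of gamma^(n - m) * (q_n . k_m) * v_m."""
    out = []
    for n in range(len(q)):
        o = [0.0] * len(v[0])
        for m in range(n + 1):
            w = gamma ** (n - m) * sum(a * b for a, b in zip(q[n], k[m]))
            o = [oi + w * vi for oi, vi in zip(o, v[m])]
        out.append(o)
    return out

def retention_recurrent(q, k, v, gamma):
    """Same outputs from a fixed-size state: S_n = gamma * S_{n-1} + k_n^T v_n, o_n = q_n S_n."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for qn, kn, vn in zip(q, k, v):
        state = [[gamma * state[a][b] + kn[a] * vn[b] for b in range(d_v)] for a in range(d_k)]
        out.append([sum(qn[a] * state[a][b] for a in range(d_k)) for b in range(d_v)])
    return out

random.seed(0)
T, d_k, d_v = 6, 3, 2
q = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(T)]
k = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(T)]
v = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(T)]

par = retention_parallel(q, k, v, gamma=0.9)
rec = retention_recurrent(q, k, v, gamma=0.9)
max_diff = max(abs(a - b) for ro, rr in zip(par, rec) for a, b in zip(ro, rr))
print(max_diff)  # rounding error only
```

The recurrent form touches a `d_k` by `d_v` state per step and never revisits earlier tokens, which is what makes inference cost independent of how long the sequence has become.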
Mamba / S4
Structured state space models (SSMs) that process sequences as linear recurrences with learned transition matrices. Selective SSMs (Mamba) add input-dependent gating. Inference takes constant time and memory per token, independent of sequence length.
RWKV
A hybrid of transformer-style attention and RNN-style recurrence. Uses a linear attention formulation with an exponential decay bias. Purely recurrent at inference, making it highly memory-efficient.
Lecture references
Lecture 74 slides
ScaleML Lecture 74 slides by Songlin Yang (path_talk.pdf in the lecture_074 folder)
Songlin Yang
Speaker homepage — research on efficient sequence models and attention
RoPE paper
“RoFormer: Enhanced Transformer with Rotary Position Embedding” (Su et al., 2021)
GPU Mode YouTube
Full lecture recordings on the GPU Mode YouTube channel