Transformers have no built-in notion of order — without explicit positional information, the model treats “the cat sat on the mat” and “the mat sat on the cat” identically. Positional encodings solve this, but the choice of encoding has deep consequences for how well a model generalizes to sequences longer than it was trained on. Songlin Yang presents ScaleML Lecture 74, covering the landscape of positional encoding schemes and introducing PaTH Attention as a new approach for long-sequence modeling. This lecture is part of the GPU Mode ScaleML Series.

Why positional encodings matter

Self-attention computes pairwise interactions between all tokens using queries, keys, and values. The dot product $q_i \cdot k_j$ that determines how much token $i$ attends to token $j$ depends only on the content of the two tokens, not on where they sit in the sequence. As a result, self-attention without a causal mask is permutation-equivariant: shuffling the input tokens shuffles the outputs in the same way and changes nothing else. Positional encodings inject position information so the model can distinguish tokens at different positions. The ideal positional encoding has several properties:
  • Unique representation per position — two positions should never collide
  • Relative distances should be representable — the model should know how far apart two tokens are
  • Length generalization — the encoding should work for sequences longer than seen during training
  • Efficiency — it should not add significant compute or memory overhead
No existing scheme fully satisfies all four. The tension between representing relative distances and generalizing to longer sequences is particularly sharp, and it drives most of the research this lecture covers.
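The order-blindness described above can be checked numerically. The following is a minimal sketch, assuming a single attention head, no causal mask, and random projection weights (`self_attention`, `w_q`, `w_k`, and `w_v` are illustrative names, not taken from the lecture). It shows that shuffling the input rows only shuffles the output rows.

```python
import torch

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with no positional information.
    x: [seq_len, d_model]
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
seq_len, d_model = 6, 16
x = torch.randn(seq_len, d_model, dtype=torch.float64)
w_q, w_k, w_v = (torch.randn(d_model, d_model, dtype=torch.float64) for _ in range(3))

perm = torch.randperm(seq_len)
out = self_attention(x, w_q, w_k, w_v)
out_perm = self_attention(x[perm], w_q, w_k, w_v)

# Permuting the inputs permutes the outputs in exactly the same way,
# so the layer cannot tell one token ordering from another.
is_equivariant = torch.allclose(out[perm], out_perm)
```

Adding any position-dependent term to `x`, the queries and keys, or the scores breaks this symmetry, which is what every scheme discussed on this page does in its own way.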

Absolute positional encodings

The original Transformer used sinusoidal absolute encodings added to the token embeddings before the attention layers:
import torch
import math

def sinusoidal_encoding(seq_len, d_model):
    """
    Returns sinusoidal positional encodings.
    Shape: [seq_len, d_model]
    Assumes d_model is even (sin and cos fill alternating columns).
    """
    position = torch.arange(seq_len).unsqueeze(1).float()
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe
Learned absolute encodings (used in BERT, GPT-2) replace the sinusoid formula with trainable vectors — one per position. Both approaches embed position directly into the token representation before attention.
Learned absolute encodings cannot generalize beyond the maximum position seen during training. A model trained with sequence length 512 has no embedding for position 513. Sinusoidal encodings can extrapolate in principle, but in practice they also degrade significantly at out-of-distribution lengths.
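The hard limit of a learned position table can be seen directly. This is a small sketch, assuming the table is a plain `torch.nn.Embedding` with 512 rows (the names `learned_pe` and `max_positions` are illustrative); with zero-based indexing the rows are 0 to 511, so the first missing position is index 512.

```python
import torch
import torch.nn as nn

max_positions, d_model = 512, 64
learned_pe = nn.Embedding(max_positions, d_model)  # one trainable row per position 0..511

in_range = learned_pe(torch.tensor([511]))  # last position that has a row

try:
    learned_pe(torch.tensor([512]))  # first position past the table
    lookup_failed = False
except IndexError:
    # On CPU the out-of-range lookup raises instead of extrapolating
    lookup_failed = True
```

A sinusoidal table has no such hard failure, since the formula can be evaluated at any index, but the model still sees inputs it was never trained on.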

Relative positional encodings

Rather than encoding the absolute index of each token, relative positional encodings encode the distance between two tokens. The attention score between token $i$ and token $j$ depends on their relative offset $(i - j)$ rather than their absolute positions. Two influential designs:
  • Shaw et al. (2018): learned relative position representations that contribute an extra term to the attention logits, with offsets clipped at a maximum distance
  • T5 relative biases: bucketed relative position biases shared across layers, with a larger bucket for distant positions
Relative encodings give better length generalization than absolute encodings because the model is never asked to represent an unseen absolute position — only offsets, which may have been seen during training.
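A clipped relative bias can be sketched in a few lines. This is a simplified scalar-bias variant, assuming one learnable scalar per clipped offset; it is not the exact formulation of Shaw et al. or T5, and `clipped_relative_bias`, `bias_table`, and `max_distance` are illustrative names.

```python
import torch

def clipped_relative_bias(bias_table, seq_len, max_distance):
    """
    bias_table: [2 * max_distance + 1] scalars, one per clipped offset.
    Returns a [seq_len, seq_len] bias whose entry (i, j) depends only on clip(j - i).
    """
    positions = torch.arange(seq_len)
    offsets = positions[None, :] - positions[:, None]      # offsets[i, j] = j - i
    offsets = offsets.clamp(-max_distance, max_distance)   # far offsets share one entry
    return bias_table[offsets + max_distance]              # shift into 0..2*max_distance

max_distance = 4
bias_table = torch.randn(2 * max_distance + 1)  # would be an nn.Parameter inside a model

bias_short = clipped_relative_bias(bias_table, 8, max_distance)
bias_long = clipped_relative_bias(bias_table, 32, max_distance)
```

Because the table is indexed by offset, the same nine scalars serve both the 8-token and the 32-token sequence; no new parameters are needed at the longer length.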

RoPE: rotary position embedding

RoPE (Su et al., 2021) is the positional encoding used in LLaMA, Mistral, Falcon, and most modern decoder-only LLMs. It encodes position by rotating the query and key vectors in 2D subspaces before computing attention.

The math

For a query vector $q$ and key vector $k$ at positions $m$ and $n$ respectively, RoPE applies a rotation matrix $R_\theta$ parameterized by the position:

$$q_m = R_{\theta,m} \, q, \quad k_n = R_{\theta,n} \, k$$

The dot product then becomes:

$$q_m \cdot k_n = q^T R_{\theta,m}^T R_{\theta,n} k = q^T R_{\theta,n-m} k$$

The rotation matrices cancel out in a way that leaves only the relative offset $(n - m)$ in the final dot product. This means RoPE naturally encodes relative position while applying independently to each token: no cross-token interaction is needed.
import torch

def apply_rope(x, cos, sin):
    """
    Apply RoPE to query or key tensor.
    x:   [batch, heads, seq_len, head_dim]
    cos: [seq_len, head_dim/2]
    sin: [seq_len, head_dim/2]
    """
    # Split into pairs of dimensions
    x1, x2 = x[..., ::2], x[..., 1::2]

    # Rotate: (x1, x2) -> (x1*cos - x2*sin, x1*sin + x2*cos)
    rotated = torch.stack([
        x1 * cos - x2 * sin,
        x1 * sin + x2 * cos,
    ], dim=-1)

    return rotated.flatten(-2)

def build_rope_cache(seq_len, head_dim, base=10000, device="cpu"):
    """Build RoPE cos/sin cache."""
    theta = 1.0 / (base ** (torch.arange(0, head_dim, 2, device=device).float() / head_dim))
    t = torch.arange(seq_len, device=device).float()
    freqs = torch.outer(t, theta)
    return freqs.cos(), freqs.sin()
RoPE is applied in the attention computation, not to the token embeddings. This means positional information is injected at every layer, giving the model more opportunities to use it. It also means the embeddings themselves are position-agnostic, which is useful for caching.
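The relative-offset property from the derivation can be verified numerically. The sketch below assumes a single vector per call and the same interleaved pairing of dimensions as `apply_rope`; `rope_rotate` is an illustrative helper, not part of the lecture code.

```python
import torch

def rope_rotate(x, position, base=10000.0):
    """Rotate a single vector x of shape [head_dim] as if it sat at `position`."""
    head_dim = x.shape[-1]
    theta = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=x.dtype) / head_dim))
    angle = position * theta
    cos, sin = angle.cos(), angle.sin()
    x1, x2 = x[0::2], x[1::2]
    # Rotate each (x1, x2) pair, then interleave the pairs back into one vector
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten()

torch.manual_seed(0)
q = torch.randn(64, dtype=torch.float64)
k = torch.randn(64, dtype=torch.float64)

# Same offset n - m = 4 at two very different absolute positions
score_near = rope_rotate(q, 3) @ rope_rotate(k, 7)
score_far = rope_rotate(q, 1003) @ rope_rotate(k, 1007)

# A different offset (n - m = 6) for comparison
score_other = rope_rotate(q, 3) @ rope_rotate(k, 9)
```

`score_near` and `score_far` agree up to floating-point error because only the offset $n - m$ survives in the dot product, while `score_other` differs because the offset changed.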

Why RoPE generalizes well (up to a point)

Because RoPE encodes relative distance via rotation angles, and longer sequences simply have larger rotation angles, the model can in principle extrapolate. In practice, extrapolation breaks down beyond ~2× the training length because the rotation angles fall outside the distribution seen during training.

ALiBi: attention with linear biases

ALiBi (Press et al., 2022) takes a different approach: instead of modifying the query/key vectors, it adds a fixed linear bias to the attention logits based on the distance between tokens.
import torch

def build_alibi_bias(num_heads, seq_len, device="cpu"):
    """
    Returns ALiBi bias matrix: [num_heads, seq_len, seq_len]
    Each head uses a different slope m_i.
    """
    # Slopes form the geometric sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8),
    # the schedule from the ALiBi paper when num_heads is a power of two
    slopes = torch.pow(2.0, -8.0 * torch.arange(1, num_heads + 1).float() / num_heads)
    slopes = slopes.to(device)

    # distances[i, j] = j - i for query position i and key position j
    positions = torch.arange(seq_len, device=device)
    distances = positions.unsqueeze(0) - positions.unsqueeze(1)  # [seq, seq]
    # Zero out future keys (j > i); they get no bias here and must still be
    # removed by a separate causal mask
    distances = distances.clamp(max=0)

    # bias[h, i, j] = -slope[h] * (i - j) for j <= i
    bias = slopes[:, None, None] * distances[None, :, :]
    return bias
ALiBi does not encode position in the representations at all — position is a pure attention-score modifier. This makes ALiBi models naturally length-generalizable: you can apply the same bias formula at any sequence length.
ALiBi models tend to generalize well to lengths beyond their training context, often with graceful degradation rather than catastrophic failure. However, they may lag behind RoPE models on tasks requiring precise position tracking, because the linear bias is a coarser positional signal.
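To show where the bias enters the computation, here is a minimal sketch of causal attention with an ALiBi term. It assumes the slope schedule $2^{-8i/n}$ from the ALiBi paper and an explicit causal mask; `alibi_bias` and `alibi_attention` are illustrative names, and the tensors carry no batch dimension for brevity.

```python
import torch

def alibi_bias(num_heads, seq_len):
    """[num_heads, seq_len, seq_len] bias equal to -slope * (i - j) for keys j <= i."""
    slopes = 2.0 ** (-8.0 * torch.arange(1, num_heads + 1) / num_heads)
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).clamp(max=0)  # j - i, zero for future keys
    return slopes[:, None, None] * distance[None, :, :]

def alibi_attention(q, k, v):
    """q, k, v: [num_heads, seq_len, head_dim]; causal attention with ALiBi."""
    num_heads, seq_len, head_dim = q.shape
    scores = q @ k.transpose(-1, -2) / head_dim ** 0.5
    scores = scores + alibi_bias(num_heads, seq_len)       # position enters only here
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))     # separate causal mask
    return torch.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
q, k, v = (torch.randn(4, 16, 8) for _ in range(3))
out_16 = alibi_attention(q, k, v)
out_8 = alibi_attention(q[:, :8], k[:, :8], v[:, :8])
```

The bias for a longer sequence is an extension of the bias for a shorter one, so nothing has to be retrained or resized when the context grows; the queries, keys, and values never carry positional information themselves.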

YaRN and other RoPE extensions

Several methods extend RoPE to handle sequences much longer than the training length:
  • Position interpolation: scales down the position indices so that a model trained on 2K tokens can handle 8K by mapping 8K positions into the 2K range. Requires a short fine-tuning run to adapt. Simple and effective.
  • YaRN: applies a non-uniform scaling in which low-frequency dimensions are scaled more aggressively than high-frequency ones. This preserves the fine-grained local positional signal while extending the long-range signal. YaRN achieves better quality than position interpolation with the same fine-tuning budget.
  • Search-based scaling (for example, LongRoPE): finds per-dimension scaling factors for extending RoPE through a search procedure rather than a fixed formula. Achieves strong length generalization but requires an optimization step.
  • Base-frequency scaling (as in NTK-aware scaling): modifies the base frequency (the 10000 constant in RoPE) rather than the position indices. This changes the wavelength of each dimension to accommodate longer sequences. Used in several community fine-tunes of LLaMA.
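Two of these ideas, position interpolation and base-frequency scaling, can be compared on the rotation angles alone. The sketch below assumes a 2K to 8K extension; `rope_angles` is an illustrative helper, and multiplying the base by the extension factor is a placeholder choice rather than the formula of any particular method.

```python
import torch

def rope_angles(positions, head_dim, base=10000.0):
    """Rotation angles of shape [len(positions), head_dim / 2]."""
    theta = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return torch.outer(positions.float(), theta)

train_len, target_len, head_dim = 2048, 8192, 64
scale = target_len / train_len  # 4x extension

positions = torch.arange(target_len)

# Unmodified RoPE: angles keep growing past anything seen in training
plain = rope_angles(positions, head_dim)

# Position interpolation: squeeze the 8K positions back into the 2K range
interpolated = rope_angles(positions / scale, head_dim)

# Base-frequency scaling: keep positions, enlarge the base (illustrative factor)
rebased = rope_angles(positions, head_dim, base=10000.0 * scale)
```

Interpolation shrinks every dimension's angles by the same factor, which also blurs the fast-rotating dimensions that carry local order. Enlarging the base leaves the fastest dimension untouched and slows the others progressively, which is the intuition the non-uniform methods build on.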

PaTH Attention: the contribution of Lecture 74

PaTH (Parallel Tokenization with Hierarchical attention) is the new mechanism introduced in this lecture. The core idea is to separate attention into two regimes operating in parallel:
  • Local attention: dense attention over a short sliding window (captures syntax and local semantics)
  • Global attention: sparse attention over a hierarchically summarized representation of the full context (captures long-range dependencies)
Input tokens: [t_1, t_2, ..., t_N]
                 |                  |
          Local attention     Hierarchical summary
          (window size w)     (log N levels)
                 |                  |
            Local output      Global output
                 \                  /
                  \                /
                   [ Combine via gate ]
                          |
                     Final output
The hierarchical summary is built by progressively pooling token representations at increasing granularities (similar to a pyramid or segment tree), giving each token access to $O(\log N)$ context summaries that together cover the full sequence.
PaTH Attention achieves $O(N \log N)$ complexity rather than $O(N^2)$, while maintaining better quality than pure sliding-window approaches on tasks requiring long-range dependencies. The hierarchical structure is differentiable and trained end-to-end.
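The pooling pyramid described above can be illustrated with a toy example. This is only one possible reading of the text, not the lecture's construction: it assumes mean pooling, a sequence length that is a power of two, and a segment-tree style selection of sibling blocks, and it ignores causality. `build_pyramid` and `summaries_for_token` are hypothetical helpers.

```python
import torch

def build_pyramid(x):
    """
    x: [seq_len, d_model], seq_len a power of two (assumption for simplicity).
    Returns a list where level l holds the mean of each block of 2**l tokens.
    """
    levels = [x]
    while levels[-1].shape[0] > 1:
        prev = levels[-1]
        # Merge neighbouring pairs of blocks by averaging them
        levels.append(prev.reshape(prev.shape[0] // 2, 2, -1).mean(dim=1))
    return levels

def summaries_for_token(levels, i):
    """Pick the sibling block of token i at every level below the root."""
    picked = []
    for level in range(len(levels) - 1):
        sibling = (i >> level) ^ 1  # the neighbouring block that does not contain i
        picked.append(levels[level][sibling])
    return torch.stack(picked)

torch.manual_seed(0)
x = torch.randn(16, 8, dtype=torch.float64)
levels = build_pyramid(x)
context = summaries_for_token(levels, i=5)  # [4, 8]: log2(16) summaries for token 5
```

In this reading, the sibling blocks have sizes 1, 2, 4, and 8 and together cover every token except token 5 itself, so each token consults $\log_2 N$ summaries and the whole sequence costs $O(N \log N)$ lookups.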

Linear attention alternatives

The lecture situates PaTH within the broader landscape of efficient sequence models. Several architectures avoid the quadratic attention bottleneck entirely:

RetNet

Replaces softmax attention with a retention mechanism: exponential decay as a function of distance, equivalent to a recurrent model during inference. Supports parallel training and recurrent inference.
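The equivalence between the parallel and recurrent views can be checked on a stripped-down version of retention. This sketch assumes a single head with a scalar decay `gamma` and omits the normalization, rotation, and multi-scale details of the full RetNet layer; the function names are illustrative.

```python
import torch

def retention_parallel(q, k, v, gamma):
    """Training-style form: (Q K^T * D) V with D[i, j] = gamma**(i - j) for j <= i."""
    n = q.shape[0]
    idx = torch.arange(n)
    exponent = (idx[:, None] - idx[None, :]).clamp(min=0)
    decay = torch.tril(gamma ** exponent.to(q.dtype))  # zero above the diagonal
    return (q @ k.T * decay) @ v

def retention_recurrent(q, k, v, gamma):
    """Inference-style form: a fixed-size state updated once per token."""
    state = torch.zeros(k.shape[1], v.shape[1], dtype=q.dtype)
    outputs = []
    for q_t, k_t, v_t in zip(q, k, v):
        state = gamma * state + torch.outer(k_t, v_t)  # decay old content, add new
        outputs.append(q_t @ state)
    return torch.stack(outputs)

torch.manual_seed(0)
q, k, v = (torch.randn(10, 4, dtype=torch.float64) for _ in range(3))
out_parallel = retention_parallel(q, k, v, gamma=0.9)
out_recurrent = retention_recurrent(q, k, v, gamma=0.9)
```

Both functions compute the same outputs; the parallel form suits training on whole sequences, while the recurrent form needs only the `state` matrix at inference time regardless of how many tokens came before.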

Mamba / S4

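Structured state space models (SSMs) that process sequences as linear recurrences with learned transition matrices. Selective SSMs (Mamba) add input-dependent gating. Inference costs $O(1)$ time and memory per token with respect to sequence length, because the model carries a fixed-size state instead of a growing cache.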

RWKV

A hybrid of transformer-style attention and RNN-style recurrence. Uses a linear attention formulation with an exponential decay bias. Purely recurrent at inference, making it highly memory-efficient.
The choice between full attention, efficient attention (PaTH, Longformer), and linear models (Mamba, RWKV) depends heavily on the task. Full attention is still the best on tasks requiring precise retrieval from distant positions; linear models win on speed for streaming generation; hybrid approaches like PaTH try to occupy the middle ground.

Lecture references

Lecture 74 slides

ScaleML Lecture 74 slides by Songlin Yang (path_talk.pdf in the lecture_074 folder)

Songlin Yang

Speaker homepage — research on efficient sequence models and attention

RoPE paper

“RoFormer: Enhanced Transformer with Rotary Position Embedding” (Su et al., 2021)

GPU Mode YouTube

Full lecture recordings on the GPU Mode YouTube channel
