Transformers process context through attention over a key-value (KV) cache, and that cache grows linearly with sequence length. At 100K tokens, a single KV cache for a 7B-parameter model can consume tens of gigabytes of memory, making long-context inference both slow and expensive. Guangxuan Xiao presents ScaleML Lecture 72, covering the techniques his research group has developed to extend context windows without paying the full linear cost. This lecture is part of the GPU Mode ScaleML Series.
The long-context challenge
Standard attention has two costs that both scale with sequence length n:
- Memory: the KV cache stores key and value tensors for every token in every layer, so it grows as O(n)
- Compute: attention scores are O(n²) in the sequence length
KV cache size
For LLaMA-2-7B (32 layers, 32 KV heads, head dimension 128) with a 128K-token context in fp16: ~64 GB just for the KV cache, more than the VRAM of any consumer GPU.
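The size of a KV cache follows from simple arithmetic. The sketch below is illustrative only: the function name `kv_cache_bytes` and the two configurations are assumptions (a 7B-class shape with 32 layers and head dimension 128, 2-byte fp16 entries), chosen to show how strongly the result depends on the number of KV heads.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one key vector and one value vector per token, head, and layer.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

GIB = 1024 ** 3
seq_len = 128 * 1024  # 128K tokens

# Full multi-head attention: every query head has its own KV head.
mha = kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128)
# Grouped-query attention with 8 KV heads (assumed configuration).
gqa = kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128)

print(f"32 KV heads: {mha / GIB:.0f} GiB")
print(f" 8 KV heads: {gqa / GIB:.0f} GiB")
```

With these assumed shapes, sharing KV heads across query heads cuts the cache by 4x, but the cache still grows linearly with sequence length.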
Attention compute
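Quadratic attention complexity means doubling the context quadruples the attention FLOPs. FlashAttention reduces memory traffic and avoids materializing the full score matrix, but it does not change the FLOP count, so the compute cost remains.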
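The quadratic scaling can be checked with a rough FLOP count. This is a back-of-the-envelope sketch, not a profiler: `attention_flops` is a hypothetical helper that counts only the two large matrix multiplications in prefill attention and ignores projections, softmax, and causal-mask savings.

```python
def attention_flops(seq_len, d_model, n_layers):
    """Approximate prefill FLOPs for the score and weighted-sum matmuls."""
    # Q @ K^T costs about 2 * n^2 * d FLOPs; scores @ V costs another 2 * n^2 * d.
    return n_layers * 4 * seq_len ** 2 * d_model

base = attention_flops(32_768, d_model=4096, n_layers=32)
doubled = attention_flops(65_536, d_model=4096, n_layers=32)
print(doubled / base)  # 4.0
```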
StreamingLLM: attention sinks and sliding window
StreamingLLM is Guangxuan Xiao’s work on enabling LLMs to operate on infinite-length streams without retraining or fine-tuning. The key insight comes from analyzing where attention mass concentrates.
Attention sinks
When you examine attention patterns in trained LLMs, something surprising emerges: a disproportionate fraction of attention weight is concentrated on the first few tokens of the sequence, regardless of their semantic content. These initial tokens act as “attention sinks”: they absorb attention probability that would otherwise be spread across the full context.
The attention sink phenomenon is not specific to any one model family. It appears in LLaMA, Mistral, Falcon, and other decoder-only transformers. The first token is particularly affected because it is always visible (never masked) to every subsequent token during causal pretraining.
Sliding window with sink tokens
StreamingLLM exploits this observation to build a streaming-compatible attention mechanism:
- Always keep the first few tokens (the sinks) in the KV cache
- Keep the most recent tokens in a sliding window
- Evict everything in between
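The eviction rule in the list above can be sketched as a function that returns which token positions stay in the cache. The function name and the window size are illustrative assumptions; the default of four sink tokens follows the setting reported in the StreamingLLM paper.

```python
def streaming_keep_indices(seq_len, n_sink=4, window=8):
    """Return the token positions kept in the KV cache under sink + sliding window."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))  # cache not full yet, nothing to evict
    sinks = list(range(n_sink))                       # always-kept initial tokens
    recent = list(range(seq_len - window, seq_len))   # sliding window of newest tokens
    return sinks + recent

kept = streaming_keep_indices(seq_len=20, n_sink=4, window=8)
print(kept)  # [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
```

The cache never holds more than `n_sink + window` entries, no matter how long the stream runs. The paper also assigns positional encodings by position within the cache rather than in the original text; that detail is omitted from this sketch.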
SnapKV: selective KV cache compression
While StreamingLLM drops tokens based on recency, SnapKV takes a quality-aware approach: it identifies which KV entries are actually important for answering a query and keeps only those. The core observation is that attention patterns during prompt processing are predictive of which KV entries will matter during generation. SnapKV compresses the prompt’s KV cache before generation begins by:
Process the prompt normally
Run the full prefill pass to compute attention over the complete prompt. This produces standard KV entries for every prompt token.
Measure observation frequency
For each key position in the prompt, count how often it receives high attention weight across the last few query positions (a proxy for the query tokens that matter most).
Select important positions
Keep only the top-k positions by observation frequency, plus a local window around the most recent tokens. Evict the rest.
SnapKV is most effective for long-prompt, short-generation tasks (e.g., document QA, summarization). For tasks where generation itself is long, the growing generation cache eventually dominates memory anyway.
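The three SnapKV steps described above can be sketched for a single attention head. This is a simplified illustration, not the published implementation: `snapkv_select` and its parameters are hypothetical names, the score is a plain sum of attention weights from the last few queries (a soft version of counting), and refinements such as per-head selection are left out.

```python
def snapkv_select(attn, obs_window, top_k, local_window):
    """Pick prompt key positions to keep, given prefill attention weights.

    attn[q][k] is the attention weight from query position q to key position k
    for one head (causal, so attn[q][k] == 0 for k > q).
    """
    n = len(attn)
    local_start = max(0, n - local_window)
    obs_start = max(0, n - obs_window)
    # Step 2: score each older key by the attention it receives from the
    # last `obs_window` query positions.
    scores = [sum(attn[q][k] for q in range(obs_start, n))
              for k in range(local_start)]
    # Step 3: keep the top-k scoring older keys plus the whole local window.
    top = sorted(range(local_start), key=lambda k: scores[k], reverse=True)[:top_k]
    return sorted(top) + list(range(local_start, n))

# Toy prefill attention (step 1 is assumed done): uniform causal weights ...
n = 8
attn = [[1.0 / (q + 1) if k <= q else 0.0 for k in range(n)] for q in range(n)]
# ... except that the last two queries focus strongly on key 2.
for q in (6, 7):
    attn[q] = [0.05] * (q + 1) + [0.0] * (n - q - 1)
    attn[q][2] = 1.0 - 0.05 * q

kept = snapkv_select(attn, obs_window=2, top_k=1, local_window=2)
print(kept)  # [2, 6, 7]
```

Key 2 survives even though it is far from the end of the prompt, which is exactly the case a purely recency-based policy would miss.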
Sparse attention patterns
Both StreamingLLM and SnapKV are instances of a broader family of sparse attention methods. Instead of computing full attention, sparse methods restrict each query to attending over a structured subset of keys. Common patterns include:

| Pattern | Description | Good for |
|---|---|---|
| Local / sliding window | Each token attends to its nearest neighbors | Capturing syntactic and local semantic structure |
| Global / landmark | A small set of special tokens attends to and is attended to by all positions | Propagating long-range information |
| Strided | Every k-th position is included in the attention set | Efficient long-range coverage |
| Sink + local | StreamingLLM’s pattern: fixed sinks + sliding window | Streaming inference |
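
Three of the patterns in the table can be written as causal boolean masks. This is a minimal sketch with assumed parameter names (`window`, `stride`, `n_sink`); the global/landmark pattern is omitted because its bidirectional form does not map directly onto a causal decoder mask.

```python
def sparse_mask(n, pattern, window=2, stride=4, n_sink=1):
    """Return an n x n causal boolean mask; mask[q][k] is True if q may attend to k."""
    def allowed(q, k):
        if k > q:
            return False                  # causal: never attend to the future
        local = q - k < window            # the `window` newest keys, including q itself
        if pattern == "local":
            return local
        if pattern == "strided":
            return k % stride == 0        # every `stride`-th key position
        if pattern == "sink_local":
            return local or k < n_sink    # fixed sinks plus sliding window
        raise ValueError(f"unknown pattern: {pattern}")
    return [[allowed(q, k) for k in range(n)] for q in range(n)]

mask = sparse_mask(6, "sink_local", window=2, n_sink=1)
for row in mask:
    print("".join("x" if ok else "." for ok in row))
```

Each row of the sink + local mask has at most `n_sink + window` allowed keys, which is why that pattern supports a fixed-size cache.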
KV cache eviction strategies
When operating under a memory budget, you must decide which KV entries to evict. The lecture surveys four strategies:
Recency-based (StreamingLLM)
Keep the most recent tokens. Simple and predictable. Works well when relevant context is local. Fails for retrieval-heavy tasks where the answer is in the distant past.
Attention-score-based (H2O, SnapKV)
Keep tokens that historically received high attention. Tracks which keys are “heavy hitters.” More expensive to maintain (requires running statistics) but significantly better for long-document tasks.
Learned importance scoring
Train a small auxiliary network to predict which KV entries will be needed. Highest quality but requires training and adds inference overhead.
Hybrid (local + important)
Always keep a recent window (recency) plus a fixed budget for globally important tokens (score-based). Combines the benefits of both. Used by SnapKV and several follow-up works.
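A hybrid policy during decoding can be sketched as follows: keep running attention totals per cached token, protect the newest tokens, and evict the lowest-scoring older token when the cache exceeds its budget. The function `decode_step` and its arguments are hypothetical, and the bookkeeping is a simplification in the spirit of heavy-hitter methods rather than any paper's exact algorithm.

```python
def decode_step(cache, scores, new_pos, attn_row, budget, recent_window):
    """Append one token to the cache, update running scores, evict if over budget.

    cache: list of cached token positions in increasing order (mutated in place).
    scores: dict position -> attention weight accumulated over past queries.
    attn_row: dict position -> attention weight from the new query token.
    """
    cache.append(new_pos)
    scores[new_pos] = 0.0
    for pos, weight in attn_row.items():
        if pos in scores:                  # ignore positions already evicted
            scores[pos] += weight
    while len(cache) > budget:
        # The newest `recent_window` positions are protected from eviction.
        candidates = cache[:len(cache) - recent_window]
        if not candidates:
            break                          # budget is smaller than the protected window
        victim = min(candidates, key=lambda p: scores[p])
        cache.remove(victim)
        del scores[victim]
    return cache

cache = [0, 1, 2, 3]
scores = {0: 0.9, 1: 0.1, 2: 0.4, 3: 0.2}
attn_row = {0: 0.5, 1: 0.1, 2: 0.1, 3: 0.2, 4: 0.1}
decode_step(cache, scores, new_pos=4, attn_row=attn_row, budget=4, recent_window=2)
print(cache)  # [0, 2, 3, 4]
```

Position 4 has the lowest accumulated score but is protected by the recent window, so position 1 is evicted instead.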
Evaluation metrics for long-context
Evaluating long-context models calls for benchmarks that genuinely depend on long-range reasoning, not ones that can be answered from a short local window.
RULER
A synthetic benchmark with controlled needle-in-a-haystack, variable tracking, and aggregation tasks at specified context lengths (4K to 128K). Tests whether a model can actually use distant context.
LongBench
A multi-task benchmark covering single-document QA, multi-document QA, summarization, few-shot learning, and code tasks. Real-world, not synthetic.
HELMET
A more recent benchmark emphasizing recall and multi-hop reasoning at very long contexts (32K–512K tokens).
SCROLLS
Summarization and QA over long documents, with average lengths in the tens of thousands of tokens.
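A needle-in-a-haystack probe of the kind used in synthetic benchmarks can be sketched in a few lines. This is a toy generator, not the code of any benchmark listed above: the filler vocabulary, the needle sentence, and the function names are all made up for illustration.

```python
import random

def make_needle_prompt(n_filler, depth, secret, seed=0):
    """Build a synthetic retrieval prompt with one 'needle' sentence hidden in filler.

    depth is in [0, 1]: 0 places the needle at the start, 1 at the end.
    """
    rng = random.Random(seed)
    words = ["alpha", "beta", "gamma", "delta", "epsilon"]
    lines = [f"Note {i}: " + " ".join(rng.choices(words, k=6)) + "."
             for i in range(n_filler)]
    lines.insert(round(depth * n_filler), f"The secret number is {secret}.")
    return "\n".join(lines) + "\n\nWhat is the secret number?"

def score(model_answer, secret):
    """Exact-substring scoring: 1 if the answer contains the secret, else 0."""
    return int(str(secret) in model_answer)

prompt = make_needle_prompt(n_filler=100, depth=0.5, secret=7421)
print(prompt.split("\n")[50])  # The secret number is 7421.
```

Sweeping `n_filler` and `depth` shows whether retrieval accuracy depends on context length and on where the needle sits, which is what a short local window cannot fake.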
Memory-compute tradeoffs
The lecture concludes with a unified view of the tradeoff space.
Lecture references
Lecture 72 slides
ScaleML Lecture 72 slides by Guangxuan Xiao (StreamingLLM.pdf in the lecture_072 folder)
Guangxuan Xiao
Speaker homepage — research on efficient LLM inference
StreamingLLM paper
“Efficient Streaming Language Models with Attention Sinks” (Xiao et al., 2023)
GPU Mode YouTube
Full lecture recordings on the GPU Mode YouTube channel