Overview
The Attention class implements multi-head attention with support for Grouped-Query Attention (GQA), rotary position embeddings (RoPE), and key-value caching for efficient autoregressive generation.
Definition
Initialization
Parameters
Attributes
The __init__ method creates the following attributes:
- Number of key and value heads. Defaults to args.n_heads if args.n_kv_heads is None.
- Number of local query heads after model parallelism partitioning. Computed as args.n_heads // model_parallel_size.
- Number of local key and value heads after model parallelism partitioning. Computed as n_kv_heads // model_parallel_size.
- Number of repetitions for local heads in Grouped-Query Attention. Computed as n_local_heads // n_local_kv_heads.
- Dimension size of each attention head. Computed as args.dim // args.n_heads.
- Linear transformation for queries. Projects from args.dim to args.n_heads * head_dim without bias.
- Linear transformation for keys. Projects from args.dim to n_kv_heads * head_dim without bias.
- Linear transformation for values. Projects from args.dim to n_kv_heads * head_dim without bias.
- Linear transformation for output. Projects from args.n_heads * head_dim to args.dim without bias.
- Cached keys for attention with shape (max_batch_size, max_seq_len, n_local_kv_heads, head_dim). Pre-allocated on CUDA.
- Cached values for attention with shape (max_batch_size, max_seq_len, n_local_kv_heads, head_dim). Pre-allocated on CUDA.
Forward Pass
Parameters
- Input tensor with shape (batch_size, seq_len, dim).
- Starting position for caching. Used to index into the KV cache for autoregressive generation.
- Precomputed frequency tensor for rotary position embeddings (complex exponentials).
- Attention mask tensor. When provided, added to attention scores before softmax to prevent attending to certain positions.
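To illustrate how an additive mask of this kind can be built, here is a minimal sketch in NumPy (used instead of PyTorch only to keep the example self-contained). The helper name causal_mask and its layout, zero columns for already-cached positions followed by a strictly upper-triangular block of -inf, are assumptions made for illustration and are not taken from this page.

```python
import numpy as np

def causal_mask(seq_len: int, start_pos: int) -> np.ndarray:
    """Additive mask of shape (seq_len, start_pos + seq_len).

    Entry [i, j] is 0 where new token i may attend to key j, and -inf otherwise.
    Hypothetical helper, not part of the documented module.
    """
    # A new token must not attend to later new tokens (strict upper triangle).
    future = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
    # Every new token may attend to all positions already in the cache.
    cached = np.zeros((seq_len, start_pos))
    return np.hstack([cached, future])

# Three new tokens, two positions already cached.
mask = causal_mask(seq_len=3, start_pos=2)
```

Because the mask is added to the scores before softmax, the -inf entries become zero attention weights.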
Returns
Output tensor after attention with shape (batch_size, seq_len, dim).
Grouped-Query Attention (GQA)
The module implements GQA by allowing different numbers of query heads (n_heads) and key-value heads (n_kv_heads):
- When n_kv_heads == n_heads: Standard multi-head attention
- When n_kv_heads < n_heads: Grouped-Query Attention (more efficient, because fewer key and value heads are projected and cached)
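The two cases can be checked with the head-count formulas listed under Attributes. The sketch below is plain Python; the function name local_head_counts, the dictionary key n_rep, and the configuration values are all chosen for illustration and do not describe any particular model.

```python
def local_head_counts(n_heads, n_kv_heads, dim, model_parallel_size):
    """Derive per-device head counts from the formulas under Attributes.

    Hypothetical helper for illustration only.
    """
    if n_kv_heads is None:
        n_kv_heads = n_heads  # falls back to standard multi-head attention
    n_local_heads = n_heads // model_parallel_size
    n_local_kv_heads = n_kv_heads // model_parallel_size
    return {
        "n_local_heads": n_local_heads,
        "n_local_kv_heads": n_local_kv_heads,
        "n_rep": n_local_heads // n_local_kv_heads,  # query heads per KV head
        "head_dim": dim // n_heads,
    }

# Illustrative values: 32 query heads sharing 8 KV heads, split over 2 devices.
gqa = local_head_counts(n_heads=32, n_kv_heads=8, dim=4096, model_parallel_size=2)
# Same setup without n_kv_heads: every query head has its own KV head.
mha = local_head_counts(n_heads=32, n_kv_heads=None, dim=4096, model_parallel_size=2)
```

With these inputs, each KV head serves four query heads in the GQA case and exactly one in the multi-head case.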
Key and value heads are repeated with the repeat_kv function (model.py:164-173) to match the number of query heads.
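A minimal NumPy sketch of this repetition follows. It is not the code at model.py:164-173; it only assumes that each key-value head is repeated consecutively along the head axis, and it uses NumPy rather than PyTorch so the example runs on its own.

```python
import numpy as np

def repeat_kv(x: np.ndarray, n_rep: int) -> np.ndarray:
    """Repeat each KV head n_rep times along the head axis.

    x has shape (batch_size, seq_len, n_kv_heads, head_dim); the result has
    shape (batch_size, seq_len, n_kv_heads * n_rep, head_dim).
    """
    if n_rep == 1:
        return x  # standard multi-head attention: nothing to expand
    # np.repeat keeps the copies of one head adjacent: h0, h0, ..., h1, h1, ...
    return np.repeat(x, n_rep, axis=2)

# Two KV heads expanded to serve six query heads.
kv = np.arange(2 * 3 * 2 * 4, dtype=np.float64).reshape(2, 3, 2, 4)
expanded = repeat_kv(kv, n_rep=3)
```

Keeping the copies adjacent means query heads 0 to 2 read KV head 0 and query heads 3 to 5 read KV head 1.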
Attention Mechanism
The forward pass implements scaled dot-product attention with caching:
- Project inputs: Apply linear transformations wq, wk, wv (model.py:274)
- Reshape: Split into multiple heads (model.py:276-278)
- Apply RoPE: Apply rotary position embeddings to queries and keys (model.py:280)
- Update cache: Store current keys and values in cache (model.py:285-286)
- Retrieve from cache: Get all keys and values up to current position (model.py:288-289)
- Repeat KV: Expand key-value heads for GQA (model.py:292-293)
- Compute scores: Calculate attention scores with scaling (model.py:298)
- Apply mask: Add attention mask if provided (model.py:299-300)
- Softmax: Normalize scores (model.py:301)
- Apply attention: Multiply scores with values (model.py:302)
- Output projection: Apply wo transformation (model.py:304)
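The steps above can be sketched end to end in NumPy. This is an illustrative reimplementation under several assumptions, not the code in model.py: model parallelism is ignored, weights are plain matrices held in a dict w, the caches are passed in as arrays named cache_k and cache_v, and RoPE is assumed to rotate adjacent feature pairs as complex numbers. All function and variable names here are chosen for the example.

```python
import numpy as np

def apply_rope(x, freqs_cis):
    """Rotate adjacent feature pairs. x: (B, S, H, D); freqs_cis: (S, D // 2) complex."""
    b, s, h, d = x.shape
    pairs = x.reshape(b, s, h, d // 2, 2)
    xc = (pairs[..., 0] + 1j * pairs[..., 1]) * freqs_cis[None, :, None, :]
    return np.stack([xc.real, xc.imag], axis=-1).reshape(b, s, h, d)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_forward(x, start_pos, freqs_cis, mask, w, cache_k, cache_v,
                      n_heads, n_kv_heads):
    bsz, seq_len, dim = x.shape
    head_dim, n_rep = dim // n_heads, n_heads // n_kv_heads
    # Project inputs and split the feature axis into heads.
    xq = (x @ w["wq"]).reshape(bsz, seq_len, n_heads, head_dim)
    xk = (x @ w["wk"]).reshape(bsz, seq_len, n_kv_heads, head_dim)
    xv = (x @ w["wv"]).reshape(bsz, seq_len, n_kv_heads, head_dim)
    # Apply RoPE to queries and keys only.
    xq, xk = apply_rope(xq, freqs_cis), apply_rope(xk, freqs_cis)
    # Update the cache, then read back everything up to the current position.
    end = start_pos + seq_len
    cache_k[:bsz, start_pos:end] = xk
    cache_v[:bsz, start_pos:end] = xv
    keys = np.repeat(cache_k[:bsz, :end], n_rep, axis=2)    # repeat KV for GQA
    values = np.repeat(cache_v[:bsz, :end], n_rep, axis=2)
    # Move heads ahead of the sequence axis: (B, H, S, D).
    q, k, v = (t.transpose(0, 2, 1, 3) for t in (xq, keys, values))
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim)
    if mask is not None:
        scores = scores + mask                               # additive mask
    out = softmax(scores) @ v                                # (B, H, S, D)
    out = out.transpose(0, 2, 1, 3).reshape(bsz, seq_len, -1)
    return out @ w["wo"]                                     # output projection

# Demo: a full pass over 5 tokens versus 4 tokens followed by 1 cached step.
rng = np.random.default_rng(0)
dim, n_heads, n_kv_heads, max_seq = 16, 4, 2, 8
head_dim = dim // n_heads
w = {"wq": rng.normal(size=(dim, n_heads * head_dim)),
     "wk": rng.normal(size=(dim, n_kv_heads * head_dim)),
     "wv": rng.normal(size=(dim, n_kv_heads * head_dim)),
     "wo": rng.normal(size=(n_heads * head_dim, dim))}
inv_freq = 1.0 / (10000.0 ** (np.arange(0, head_dim, 2) / head_dim))
freqs_cis = np.exp(1j * np.outer(np.arange(max_seq), inv_freq))
x = rng.normal(size=(1, 5, dim))

def new_cache():
    return np.zeros((1, max_seq, n_kv_heads, head_dim))

def causal(n):
    return np.triu(np.full((n, n), -np.inf), k=1)

full = attention_forward(x, 0, freqs_cis[:5], causal(5), w,
                         new_cache(), new_cache(), n_heads, n_kv_heads)
ck, cv = new_cache(), new_cache()
attention_forward(x[:, :4], 0, freqs_cis[:4], causal(4), w, ck, cv, n_heads, n_kv_heads)
step = attention_forward(x[:, 4:], 4, freqs_cis[4:5], None, w, ck, cv, n_heads, n_kv_heads)
```

The demo compares the two ways of reaching the fifth token: in this sketch, the single cached step should reproduce the last row of the full pass, which is the property that makes the KV cache useful for autoregressive generation.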
Usage in TransformerBlock
The Attention module is instantiated in TransformerBlock and applied with pre-normalization.
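Pre-normalization means the input is normalized before it enters attention, and the unnormalized input is added back as a residual. The sketch below shows that pattern in NumPy under two assumptions: the normalization is an RMS norm, and attention is any callable that maps a tensor to one of the same shape. The names rms_norm and block_attention_step are hypothetical.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """Scale each feature vector to unit root-mean-square, then apply a gain."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def block_attention_step(x, attention, norm_weight):
    """Pre-normalization: normalize, attend, then add the residual input."""
    return x + attention(rms_norm(x, norm_weight))

# Stand-in attention callables, used only to make the data flow visible.
x = np.random.default_rng(0).normal(size=(2, 3, 8))
gain = np.ones(8)
passthrough = block_attention_step(x, lambda h: np.zeros_like(h), gain)
residual_plus_norm = block_attention_step(x, lambda h: h, gain)
```

Since normalization sits inside the residual branch, the skip path carries x unchanged, which is why a zero-output attention stand-in returns the input as is.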