TheDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/Wenyueh/MinivLLM/llms.txt
Use this file to discover all available pages before exploring further.
myvllm.layers.attention module implements two core attention strategies:
- Prefill — flash attention over variable-length sequences using a Triton kernel.
- Decode — paged attention over a block-structured KV cache using a separate Triton kernel.
Attention class acts as a unified interface and selects the correct path at runtime via the shared request context.
Attention
myvllm.layers.attention.Attention
A PyTorch nn.Module that wraps both attention kernels and owns the paged KV cache buffers. On each forward pass it reads the global Context object to decide whether the engine is in prefill or decode mode, and dispatches accordingly.
When the KV cache is allocated and slot_mapping is present in the context, the module calls store_kvcache before computing attention so that new key/value pairs are written to the cache automatically.
Constructor
Total number of query heads per GPU.
Dimension of each attention head.
Multiplied with
1 / sqrt(head_dim) to produce the final attention scale factor.Number of key/value heads. Defaults to
num_heads (multi-head attention). Set to a smaller value for grouped-query attention (GQA).Number of token slots per block in the paged KV cache. Must match the block size used when allocating
k_cache and v_cache.forward
context.is_prefill.
Returns a tensor of shape (total_tokens, num_heads * head_dim) in prefill mode, or (batch_size, num_heads * head_dim) in decode mode.
The
k_cache and v_cache attributes start as empty tensors. The engine’s memory manager must allocate and assign them before the first decode step.flash_attention_prefill
Parameters
Query tensor of shape
(total_tokens, num_heads, head_dim).Key tensor of shape
(total_tokens, num_kv_heads, head_dim).Value tensor of shape
(total_tokens, num_kv_heads, head_dim).Cumulative sequence lengths tensor of shape
(num_seqs + 1,). cu_seqlens[i] is the start token index of sequence i; cu_seqlens[-1] equals total_tokens. For example, two sequences of lengths 5 and 7 would give [0, 5, 12].Attention scale factor, typically
1.0 / sqrt(head_dim).Number of query heads.
Number of key/value heads. When
num_kv_heads < num_heads, each KV head is shared across num_heads / num_kv_heads query heads (GQA).Dimension of each head.
Returns
torch.Tensor — shape (total_tokens, num_heads, head_dim).
Block sizing
Tile sizesBLOCK_M and BLOCK_N are chosen automatically to stay within the ~48 KB shared memory budget of most GPUs:
head_dim | BLOCK_M | BLOCK_N |
|---|---|---|
| ≤ 64 | 64 | 64 |
| ≤ 128 | 32 | 32 |
| > 128 | 16 | 16 |
Kernel grid
The Triton kernel is launched with grid(ceil(max_seq_len / BLOCK_M), num_heads, num_seqs). Each program processes one tile of queries for one head of one sequence.
paged_attention_decode
Parameters
Query tensor of shape
(batch_size, num_heads, head_dim). One query vector per sequence.Paged key cache of shape
(num_blocks, block_size, num_kv_heads, head_dim).Paged value cache of shape
(num_blocks, block_size, num_kv_heads, head_dim).Physical block index lookup table of shape
(batch_size, max_num_blocks). block_tables[i, j] is the physical block index for logical block j of sequence i. Use -1 for unallocated slots.1-D tensor of shape
(batch_size,) containing the number of valid KV tokens for each sequence in the cache.Attention scale factor, typically
1.0 / sqrt(head_dim).Number of query heads.
Number of key/value heads.
Dimension of each head.
Number of token slots per physical block. Must match the block size used when the cache was allocated.
Returns
torch.Tensor — shape (batch_size, num_heads, head_dim).
Kernel grid
The kernel is launched with grid(batch_size, num_heads). Each program is responsible for one query head of one sequence and iterates over the full context in chunks of BLOCK_N (64 when head_dim ≤ 128, otherwise 32).
store_kvcache
slot_mapping. Tokens whose slot is -1 are silently skipped (padding tokens).
The kernel is launched with grid (num_tokens, num_kv_heads) so that each thread processes one token for one KV head.
Parameters
Keys to store, shape
(num_tokens, num_kv_heads, head_dim).Values to store, shape
(num_tokens, num_kv_heads, head_dim).Destination key cache, shape
(num_blocks, block_size, num_kv_heads, head_dim).Destination value cache, shape
(num_blocks, block_size, num_kv_heads, head_dim).1-D integer tensor of shape
(num_tokens,). Each element is a flat cache slot index in the range [0, num_blocks * block_size), or -1 to skip the token.The slot maps to a physical location as:Number of token slots per physical block.