The KV cache stores the key and value tensors produced by every transformer layer for every past token. Without it, autoregressive generation would recompute the entire context at every step. With it, each decode step processes only the single new token and reads the history from cache.
What the cache stores
For each transformer layer and each token, the model computes a key vector and a value vector of shape (num_kv_heads, head_dim). The KV cache holds these tensors so they can be retrieved on future steps without re-running the model on past tokens.
Cache layout
The shape of the physical cache tensors is determined by:
- num_blocks: total number of physical pages allocated at startup
- block_size: number of token slots per block (e.g. 16)
- num_kv_heads: number of key/value attention heads
- head_dim: dimension of each head vector
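For example, with num_kv_heads=8, head_dim=128, block_size=16, and 1000 allocated blocks, each cache tensor (K or V) occupies 1000 × 16 × 8 × 128 × 2 bytes = 32,768,000 bytes ≈ 32.8 MB, where the final factor of 2 is the size of one element in bytes.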
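The arithmetic can be checked with a small helper. This is only a sketch: the function name `cache_tensor_bytes` is hypothetical, and the default of 2 bytes per element assumes a 16-bit dtype such as fp16 or bf16.

```python
def cache_tensor_bytes(num_blocks, block_size, num_kv_heads, head_dim,
                       bytes_per_element=2):
    """Size in bytes of one cache tensor (K or V).

    bytes_per_element=2 is an assumption (16-bit elements).
    """
    return num_blocks * block_size * num_kv_heads * head_dim * bytes_per_element


size = cache_tensor_bytes(num_blocks=1000, block_size=16,
                          num_kv_heads=8, head_dim=128)
print(size)          # 32768000
print(size / 1e6)    # 32.768  (decimal megabytes)
print(size / 2**20)  # 31.25   (mebibytes)
```

Doubling `block_size` or `num_blocks` doubles the footprint, so the number of blocks that fit is what the remaining GPU memory ultimately bounds.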
Writing to the cache: store_kvcache
After every forward pass (both prefill and decode), the new K and V tensors are written into the cache. The store_kvcache function in layers/attention.py launches a Triton kernel with one thread per (token, kv_head) pair:
Slot mapping
Each token in the current batch is assigned a cache slot: a flat integer index into the physical cache. The kernel converts a slot index into a (block_idx, block_offset) pair:
-1 is the sentinel for tokens whose KV values were already cached (prefix cache hit). The kernel silently skips those tokens.
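The slot arithmetic and the skip-on-sentinel behavior can be illustrated in plain Python. This is a reference sketch, not the Triton kernel: `slot_to_block`, `store_kv_reference`, and `empty_cache` are hypothetical names, and nested lists stand in for GPU tensors.

```python
def slot_to_block(slot, block_size):
    """Split a flat slot index into (block_idx, block_offset)."""
    return divmod(slot, block_size)


def store_kv_reference(key, value, k_cache, v_cache, slot_mapping, block_size):
    """Plain-Python stand-in for the scatter a store kernel performs.

    key, value:        indexed [token][kv_head] -> head vector
    k_cache, v_cache:  indexed [block][offset][kv_head] -> head vector
    """
    for token_idx, slot in enumerate(slot_mapping):
        if slot == -1:  # sentinel: KV already cached, nothing to write
            continue
        block_idx, block_offset = slot_to_block(slot, block_size)
        for head in range(len(key[token_idx])):
            k_cache[block_idx][block_offset][head] = list(key[token_idx][head])
            v_cache[block_idx][block_offset][head] = list(value[token_idx][head])


def empty_cache(num_blocks, block_size, num_kv_heads, head_dim):
    """Zero-filled cache with layout [block][offset][kv_head][dim]."""
    return [[[[0.0] * head_dim for _ in range(num_kv_heads)]
             for _ in range(block_size)] for _ in range(num_blocks)]


block_size = 2
k_cache = empty_cache(4, block_size, 1, 2)
v_cache = empty_cache(4, block_size, 1, 2)
key = [[[1.0, 1.0]], [[2.0, 2.0]], [[3.0, 3.0]]]
value = [[[-1.0, -1.0]], [[-2.0, -2.0]], [[-3.0, -3.0]]]
slot_mapping = [-1, 5, 6]  # token 0 is a prefix-cache hit and is skipped

store_kv_reference(key, value, k_cache, v_cache, slot_mapping, block_size)
print(slot_to_block(5, block_size))  # (2, 1)
print(k_cache[2][1][0])              # [2.0, 2.0]
```

Here slot 5 lands in block 2 at offset 1 and slot 6 in block 3 at offset 0, while the token mapped to -1 leaves the cache untouched.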
Slot assignment during decode
During each decode step, model_runner.prepare_decode computes the new slot for the token about to be written:
Because BlockManager.append guarantees a fresh block exists before the forward pass, this slot is always valid and never overlaps with an existing token.
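One way to picture this computation is as a lookup through the sequence's block table. The sketch below is an assumption about how such a slot could be derived, not the project's code: `physical_slot` and `decode_slot` are hypothetical helpers, and `num_tokens` is assumed to already count the token being written.

```python
def physical_slot(block_table, position, block_size):
    """Map a logical token position to a flat physical slot index."""
    logical_block, offset = divmod(position, block_size)
    return block_table[logical_block] * block_size + offset


def decode_slot(block_table, num_tokens, block_size):
    """Slot for the newest token, i.e. logical position num_tokens - 1."""
    return physical_slot(block_table, num_tokens - 1, block_size)


block_table = [7, 2]  # logical block 0 -> physical 7, logical block 1 -> physical 2
print(decode_slot(block_table, num_tokens=16, block_size=16))  # 127
print(decode_slot(block_table, num_tokens=17, block_size=16))  # 32

# Every logical position of the sequence maps to a distinct slot.
slots = {physical_slot(block_table, p, 16) for p in range(17)}
print(len(slots))  # 17
```

The 17th token starts a new logical block, so its slot is the first slot of physical block 2. This only works if the block table already contains that second entry, which is the guarantee the text attributes to BlockManager.append.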
Reading from the cache: paged attention decode
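During decode, the model reads the full KV history from the paged cache using the block table. Each thread follows the block-table indirection: the logical block index selects a physical block from the block table, and the offset within the block selects the token slot.
Prefix caching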
When multiple requests share a common prefix (e.g., the same system prompt), the KV values for those tokens only need to be computed and stored once.
Content-based hashing
BlockManager.compute_hash produces a context-sensitive fingerprint for each full block using xxhash:
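A minimal sketch of a context-sensitive block hash follows. It makes two assumptions: that "context-sensitive" means each block's hash is chained with the hash of the preceding block, and that `hashlib.blake2b` from the standard library is an acceptable stand-in for xxhash so the example runs without extra dependencies. The signatures of `compute_hash` and `block_hashes` here are illustrative, not the project's.

```python
import hashlib
import struct


def compute_hash(token_ids, prefix_hash=-1):
    """64-bit fingerprint of one full block, chained with the previous block's hash."""
    h = hashlib.blake2b(digest_size=8)  # stand-in for xxhash
    if prefix_hash != -1:
        h.update(prefix_hash.to_bytes(8, "little"))
    h.update(struct.pack(f"<{len(token_ids)}q", *token_ids))
    return int.from_bytes(h.digest(), "little")


def block_hashes(token_ids, block_size):
    """Hashes for the full blocks only; a partial trailing block gets none."""
    hashes, prefix = [], -1
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        prefix = compute_hash(token_ids[start:start + block_size], prefix)
        hashes.append(prefix)
    return hashes


a = block_hashes([1, 2, 3, 4, 5, 6, 7, 8, 9], block_size=4)
b = block_hashes([1, 2, 3, 4, 5, 6, 7, 8], block_size=4)
c = block_hashes([9, 9, 9, 9, 5, 6, 7, 8], block_size=4)

print(a == b)        # True: same full blocks, the trailing token 9 is not hashed
print(a[1] == c[1])  # False: same second block, different preceding context
```

Chaining is what keeps two blocks with identical tokens but different preceding context from being treated as the same cached block.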
Hashes are computed only for full blocks. The partial trailing block of an in-progress sequence always has hash = -1 and is never shared.
Cache hit vs cache miss
When a new sequence is allocated, BlockManager.allocate looks up each block's hash:
- Cache hit: The existing physical block is added to the sequence's block table. The model runner skips writing new KV values for those tokens (their slot_mapping entries are set to -1).
- Cache miss: No cached block matches the hash, so the tokens are treated as uncached and their K and V are computed and written to the cache as usual.
Block lifecycle
Allocation
At startup, ModelRunner.allocate_kv_cache measures peak model memory (weights + activations) and allocates as many blocks as the remaining GPU memory allows. All blocks start in free_block_ids.
Prefix hit (optional)
When a sequence is scheduled, full prefix blocks that match a cached hash are reused and their reference counts are incremented. The sequence's num_cached_tokens records how many tokens do not need to be recomputed.
Active writes
During prefill and each decode step, the Triton store_kvcache kernel writes K and V for uncached tokens into the assigned physical slots.
Block completion
When a block fills up (num_tokens % block_size == 0), BlockManager.append records its hash so subsequent sequences can reuse it as a prefix.
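The stages above can be sketched as a toy block manager. Everything here is illustrative: `ToyBlockManager`, `hash_to_block_id`, `ref_count`, and `complete_block` are hypothetical names, hashes are passed in as plain integers, full blocks are assumed to be registered at allocation time, and freeing or evicting blocks is left out.

```python
from collections import deque


class ToyBlockManager:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_block_ids = deque(range(num_blocks))  # all blocks start free
        self.ref_count = [0] * num_blocks
        self.hash_to_block_id = {}

    def allocate(self, block_hashes):
        """Build a block table; a hash of -1 marks a partial, unshareable block.

        Returns (block_table, num_cached_tokens).
        """
        block_table, num_cached_tokens = [], 0
        for h in block_hashes:
            block_id = self.hash_to_block_id.get(h, -1) if h != -1 else -1
            if block_id == -1:
                # Cache miss: take a fresh block from the free list.
                block_id = self.free_block_ids.popleft()
                if h != -1:
                    self.hash_to_block_id[h] = block_id
            else:
                # Cache hit: reuse the block, its tokens need no recomputation.
                num_cached_tokens += self.block_size
            self.ref_count[block_id] += 1
            block_table.append(block_id)
        return block_table, num_cached_tokens

    def complete_block(self, block_id, block_hash):
        """Record the hash of a block that has just filled up during decode."""
        self.hash_to_block_id[block_hash] = block_id


manager = ToyBlockManager(num_blocks=4, block_size=16)
table_a, cached_a = manager.allocate([111, 222, -1])  # two full blocks + partial
table_b, cached_b = manager.allocate([111, 333])      # shares the first block

print(table_a, cached_a)     # [0, 1, 2] 0
print(table_b, cached_b)     # [0, 3] 16
print(manager.ref_count[0])  # 2
```

The second sequence reuses physical block 0, so only its second block consumes new memory, and the reference count on block 0 records that two sequences depend on it.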