PagedAttention is the core memory management innovation introduced by vLLM. miniVLLM implements it end-to-end: from theDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/Wenyueh/MinivLLM/llms.txt
Use this file to discover all available pages before exploring further.
BlockManager that allocates physical pages, to the Triton kernel that reads through the block table during decode.
The problem: fragmentation
In a naive implementation, each sequence pre-allocates a contiguous slab of GPU memory large enough for its maximum KV cache. This causes two kinds of waste:- Internal fragmentation — a sequence that generates 10 tokens wastes the memory reserved for the remaining 990.
- External fragmentation — after some sequences finish, the freed slabs are scattered across memory and cannot be combined to serve a new long sequence.
The solution: fixed-size pages
PagedAttention borrows the virtual memory idea from operating systems. KV cache is divided into fixed-size blocks (pages), each holdingblock_size tokens. A sequence’s KV cache is spread across as many blocks as it needs — they do not need to be contiguous in physical memory.
Each sequence maintains a block table: a list that maps a logical block index to a physical block ID.
block_tables to translate each logical token position into a physical cache slot:
The BlockManager class
BlockManager (engine/block_manager.py) owns all physical blocks and exposes the methods the scheduler calls at each step.
can_allocate
can_allocate
Checks whether enough free blocks exist to hold all of a new sequence’s tokens before it is moved from the waiting queue to running.
allocate
allocate
Called once when a sequence is first scheduled (prefill). Iterates over each logical block, looks for a prefix-cache hit, and allocates a fresh physical block on miss.
can_append
can_append
Called every decode step to confirm there is room to write one more token. A new physical block is only needed when
num_tokens % block_size == 0.append
append
Called after a token has been appended to the sequence object but before the model run writes its KV values. Allocates a new physical block when the last block is full, and records the content hash once a block becomes complete.
deallocate
deallocate
Decrements the reference count of every block in a sequence’s block table. A block is returned to the free pool only when its reference count reaches zero, which means no other sequence is sharing it (prefix caching can cause sharing).
Block lifecycle
New request arrives
The scheduler calls
can_allocate(seq). If sufficient free blocks exist, the sequence moves from the waiting queue to running.Prefill allocation
allocate(seq) walks the logical blocks. Full blocks that match a cached hash are reused immediately; partial or uncached blocks receive a fresh physical block from free_block_ids.Decode steps
Each step the scheduler calls
can_append(seq). If a new physical block is needed (num_tokens % block_size == 0), append(seq) allocates one. When the block is later completed, its hash is recorded for future prefix reuse.Prefix caching
Many requests share a common prefix — a system prompt, few-shot examples, or a document. PagedAttention can skip recomputing the KV values for those tokens entirely.Content-based hashing
A block’s hash is computed from its token IDs and the hash of the preceding block, making it context-sensitive:Hashes are only computed for full blocks. The partial (last) block of a sequence always has
hash = -1 and is never shared.Cache hit detection
Duringallocate, a hash lookup alone is not sufficient — hash collisions are possible. miniVLLM validates the stored token IDs too:
seq.num_cached_tokens is incremented by block_size, and the model runner will skip writing those tokens to the cache (their values already exist in the physical block).
Reference counting
When two sequences share a prefix block, both point to the same physical block ID and the block’sref_count is incremented. The block is only freed when every sharing sequence has been deallocated.