Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/Wenyueh/MinivLLM/llms.txt

Use this file to discover all available pages before exploring further.

PagedAttention is the core memory management innovation introduced by vLLM. miniVLLM implements it end-to-end: from the BlockManager that allocates physical pages, to the Triton kernel that reads through the block table during decode.

The problem: fragmentation

In a naive implementation, each sequence pre-allocates a contiguous slab of GPU memory large enough for its maximum KV cache. This causes two kinds of waste:
  • Internal fragmentation — a sequence that generates 10 tokens wastes the memory reserved for the remaining 990.
  • External fragmentation — after some sequences finish, the freed slabs are scattered across memory and cannot be combined to serve a new long sequence.
Because sequence lengths are unknown at request time, a serving system cannot reliably pre-size allocations without either wasting large amounts of memory or frequently running out.

The solution: fixed-size pages

PagedAttention borrows the virtual memory idea from operating systems. KV cache is divided into fixed-size blocks (pages), each holding block_size tokens. A sequence’s KV cache is spread across as many blocks as it needs — they do not need to be contiguous in physical memory. Each sequence maintains a block table: a list that maps a logical block index to a physical block ID.
Logical view (sequence)     Physical memory
──────────────────────      ──────────────────────
Block 0 (tokens 0–15)  →   Physical block 7
Block 1 (tokens 16–31) →   Physical block 2
Block 2 (tokens 32–47) →   Physical block 14
During decode, the Triton kernel consults block_tables to translate each logical token position into a physical cache slot:
# paged_attention_decode_kernel — attention.py
block_num    = token_idx // block_size
block_offset = token_idx % block_size
block_table_offset   = batch_idx * max_num_blocks + block_num
physical_block_idx   = tl.load(block_tables_ptr + block_table_offset)

k_offset = (physical_block_idx * block_size * num_kv_heads * head_dim
            + block_offset * num_kv_heads * head_dim
            + kv_head_idx * head_dim + offs_d)
k_vec = tl.load(k_cache_ptr + k_offset)

The BlockManager class

BlockManager (engine/block_manager.py) owns all physical blocks and exposes the methods the scheduler calls at each step.
class BlockManager:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.blocks = [Block(i) for i in range(num_blocks)]
        self.hash_to_block_id: dict[int, int] = {}
        self.free_block_ids: deque[int] = deque(range(num_blocks))
        self.used_block_ids: set[int] = set()
Checks whether enough free blocks exist to hold all of a new sequence’s tokens before it is moved from the waiting queue to running.
def can_allocate(self, seq: Sequence) -> bool:
    return len(self.free_block_ids) >= seq.num_blocks
Called once when a sequence is first scheduled (prefill). Iterates over each logical block, looks for a prefix-cache hit, and allocates a fresh physical block on miss.
def allocate(self, seq: Sequence) -> None:
    h = -1
    for i in range(seq.num_blocks):
        token_ids = seq.block(i)
        h = self.compute_hash(token_ids, h) if len(token_ids) == self.block_size else -1
        block_id = self.hash_to_block_id.get(h, -1)

        if block_id == -1 or self.blocks[block_id].token_ids != token_ids:
            # cache miss — allocate fresh block
            block = self._allocate_block(self.free_block_ids[0])
            block.update(h=h, token_ids=token_ids)
            if h != -1:
                self.hash_to_block_id[h] = block.block_id
        else:
            # cache hit — reuse existing block
            seq.num_cached_tokens += self.block_size
            if block_id not in self.used_block_ids:
                block = self._allocate_block(block_id)
            else:
                block = self.blocks[block_id]
                block.ref_count += 1

        seq.block_table.append(block.block_id)
Called every decode step to confirm there is room to write one more token. A new physical block is only needed when num_tokens % block_size == 0.
def can_append(self, seq: Sequence) -> bool:
    if seq.num_tokens % self.block_size == 0:
        return len(self.free_block_ids) > 0
    return True
Called after a token has been appended to the sequence object but before the model run writes its KV values. Allocates a new physical block when the last block is full, and records the content hash once a block becomes complete.
def append(self, seq: Sequence) -> None:
    block_tables = seq.block_table
    last_block_id = block_tables[-1]

    if seq.num_tokens % self.block_size == 0:
        # last block just became full — compute and record its hash
        h = self.compute_hash(
            token_ids=seq.block(seq.num_blocks - 1),
            prefix_hash_value=(
                -1 if len(block_tables) == 1
                else self.blocks[block_tables[-2]].hash
            ),
        )
        block = self.blocks[last_block_id]
        block.update(h=h, token_ids=seq.block(seq.num_blocks - 1))
        self.hash_to_block_id[h] = block.block_id
    elif seq.num_tokens % self.block_size == 1:
        # first token of a new block — allocate it
        block = self._allocate_block(self.free_block_ids[0])
        block_tables.append(block.block_id)
Decrements the reference count of every block in a sequence’s block table. A block is returned to the free pool only when its reference count reaches zero, which means no other sequence is sharing it (prefix caching can cause sharing).
def deallocate(self, seq: Sequence) -> None:
    for block_id in seq.block_table:
        block = self.blocks[block_id]
        block.ref_count -= 1
        if block.ref_count == 0:
            self._deallocate_block(block_id)
    seq.block_table = []
    seq.num_cached_tokens = 0

Block lifecycle

1

New request arrives

The scheduler calls can_allocate(seq). If sufficient free blocks exist, the sequence moves from the waiting queue to running.
2

Prefill allocation

allocate(seq) walks the logical blocks. Full blocks that match a cached hash are reused immediately; partial or uncached blocks receive a fresh physical block from free_block_ids.
3

Decode steps

Each step the scheduler calls can_append(seq). If a new physical block is needed (num_tokens % block_size == 0), append(seq) allocates one. When the block is later completed, its hash is recorded for future prefix reuse.
4

Sequence finishes

deallocate(seq) decrements all reference counts. Blocks that reach zero are pushed back onto free_block_ids and can be reused immediately.

Prefix caching

Many requests share a common prefix — a system prompt, few-shot examples, or a document. PagedAttention can skip recomputing the KV values for those tokens entirely.

Content-based hashing

A block’s hash is computed from its token IDs and the hash of the preceding block, making it context-sensitive:
def compute_hash(self, token_ids: list[int], prefix_hash_value: int) -> int:
    h = xxhash.xxh64()
    if prefix_hash_value != -1:
        h.update(prefix_hash_value.to_bytes(8, 'little'))
    h.update(np.array(token_ids, dtype=np.int32).tobytes())
    return h.intdigest()
This means two blocks containing identical token IDs but following different prefixes get different hashes, preventing false sharing.
Hashes are only computed for full blocks. The partial (last) block of a sequence always has hash = -1 and is never shared.

Cache hit detection

During allocate, a hash lookup alone is not sufficient — hash collisions are possible. miniVLLM validates the stored token IDs too:
if block_id == -1 or self.blocks[block_id].token_ids != token_ids:
    # definitive cache miss
On a confirmed hit, seq.num_cached_tokens is incremented by block_size, and the model runner will skip writing those tokens to the cache (their values already exist in the physical block).

Reference counting

When two sequences share a prefix block, both point to the same physical block ID and the block’s ref_count is incremented. The block is only freed when every sharing sequence has been deallocated.
Prefix caching is most effective for workloads with a long, stable system prompt. The first request pays the full prefill cost; every subsequent request with the same prompt hits the cache and only processes its unique tokens.

Build docs developers (and LLMs) love