Nano-vLLM manages GPU memory for attention key/value tensors through a paged allocator inspired by virtual memory systems. The BlockManager treats the KV cache as a pool of fixed-size blocks and assigns them to sequences on demand.
Physical Layout
The KV cache is a single contiguous tensor allocated once during ModelRunner.allocate_kv_cache():
self.kv_cache = torch.empty(
    2,                            # key and value
    hf_config.num_hidden_layers,  # one entry per transformer layer
    config.num_kvcache_blocks,    # number of physical blocks
    self.block_size,              # tokens per block (default 256)
    num_kv_heads,                 # KV heads per tensor-parallel rank
    head_dim,                     # head dimension
)
Dimension 0 separates keys (kv_cache[0]) from values (kv_cache[1]). Each attention layer gets a slice of dimension 1 assigned at startup:
layer_id = 0
for module in self.model.modules():
    if hasattr(module, "k_cache") and hasattr(module, "v_cache"):
        module.k_cache = self.kv_cache[0, layer_id]
        module.v_cache = self.kv_cache[1, layer_id]
        layer_id += 1
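The axis order can be made concrete with a NumPy stand-in for the torch allocation; all dimensions below are toy values chosen for illustration, not a real model config:

```python
import numpy as np

# Toy dimensions (assumed for illustration, not a real model config)
num_layers, num_blocks, block_size, num_kv_heads, head_dim = 4, 8, 256, 2, 64

# NumPy stand-in for the torch.empty allocation, same axis order
kv_cache = np.empty(
    (2, num_layers, num_blocks, block_size, num_kv_heads, head_dim),
    dtype=np.float16,
)

# Each attention layer receives views of the key plane and the value plane
layer_caches = [(kv_cache[0, i], kv_cache[1, i]) for i in range(num_layers)]

k0, v0 = layer_caches[0]
print(k0.shape)  # (8, 256, 2, 64): blocks x tokens x kv_heads x head_dim
```

Because these are views, writes into a layer's `k_cache` slice land directly in the shared pool; no per-layer copies exist.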
Block Size and Count
Each block holds block_size tokens (default 256, must be a multiple of 256 per the config assertion). The number of blocks is computed at runtime from available GPU memory:
def allocate_kv_cache(self):
    config = self.config
    hf_config = config.hf_config
    free, total = torch.cuda.mem_get_info()
    used = total - free
    peak = torch.cuda.memory_stats()["allocated_bytes.all.peak"]
    current = torch.cuda.memory_stats()["allocated_bytes.all.current"]
    num_kv_heads = hf_config.num_key_value_heads // self.world_size
    head_dim = getattr(hf_config, "head_dim",
                       hf_config.hidden_size // hf_config.num_attention_heads)
    block_bytes = (
        2 * hf_config.num_hidden_layers
        * self.block_size * num_kv_heads * head_dim
        * hf_config.torch_dtype.itemsize
    )
    config.num_kvcache_blocks = (
        int(total * config.gpu_memory_utilization - used - peak + current)
        // block_bytes
    )
The formula accounts for:
- total * gpu_memory_utilization — the GPU memory budget (default 90% of VRAM).
- used - peak + current — memory already committed to model weights and other allocations, adjusted by the delta between peak and current to avoid double-counting the warmup spike.
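Plugging toy numbers into the formula shows how the budget translates into a block count; every value below is assumed for illustration only:

```python
# All numbers below are assumed for illustration only
GiB = 1024**3
total = 24 * GiB       # physical VRAM
used = 6 * GiB         # total - free, i.e. weights and other allocations
peak = 7 * GiB         # allocator peak during warmup
current = 6 * GiB      # allocator usage after warmup
gpu_memory_utilization = 0.9

# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, fp16 (2 bytes),
# block_size 256; the leading 2 accounts for both K and V
block_bytes = 2 * 32 * 256 * 8 * 128 * 2   # 32 MiB per block

num_kvcache_blocks = (
    int(total * gpu_memory_utilization - used - peak + current) // block_bytes
)
print(num_kvcache_blocks)  # 467
```

Under these assumptions the budget is roughly 14.6 GiB after subtracting committed memory, yielding 467 blocks of 32 MiB, or about 119k cacheable tokens.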
The Block Object
class Block:
    def __init__(self, block_id):
        self.block_id = block_id
        self.ref_count = 0   # 0 = free, ≥1 = in use
        self.hash = -1       # -1 = not yet hashed / incomplete block
        self.token_ids = []  # token content for hash verification
A block with ref_count == 0 is on the free list. ref_count > 1 means two or more sequences share this block (prefix cache hit).
BlockManager Operations
can_allocate
Checks whether enough free blocks exist to cover all logical blocks required by a sequence:
def can_allocate(self, seq: Sequence) -> bool:
    return len(self.free_block_ids) >= seq.num_blocks
seq.num_blocks is ceil(len(seq) / block_size).
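The ceiling division can be written without floats; `num_blocks` below is a hypothetical helper mirroring what `Sequence.num_blocks` computes:

```python
block_size = 256

def num_blocks(seq_len: int) -> int:
    # Integer ceiling division: ceil(seq_len / block_size).
    # Hypothetical helper mirroring Sequence.num_blocks.
    return (seq_len + block_size - 1) // block_size

print(num_blocks(1), num_blocks(256), num_blocks(257))  # 1 1 2
```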
allocate
Assigns physical blocks to a sequence at prefill time. For each logical block it first attempts a hash cache hit:
def allocate(self, seq: Sequence):
    assert not seq.block_table
    h = -1
    cache_miss = False
    for i in range(seq.num_blocks):
        token_ids = seq.block(i)
        h = self.compute_hash(token_ids, h) if len(token_ids) == self.block_size else -1
        block_id = self.hash_to_block_id.get(h, -1)
        if block_id == -1 or self.blocks[block_id].token_ids != token_ids:
            cache_miss = True
        if cache_miss:
            block_id = self.free_block_ids[0]
            block = self._allocate_block(block_id)
        else:
            seq.num_cached_tokens += self.block_size
            if block_id in self.used_block_ids:
                block = self.blocks[block_id]
                block.ref_count += 1  # share the block
            else:
                block = self._allocate_block(block_id)
        if h != -1:
            block.update(h, token_ids)
            self.hash_to_block_id[h] = block_id
        seq.block_table.append(block_id)
Once a cache miss occurs for any block, all subsequent blocks are treated as misses too — there is no point trying to match non-contiguous cached suffixes.
The last (possibly partial) block of a prompt is never hashed (h is set to -1 when len(token_ids) < block_size) because its content may still change as the decode phase appends tokens.
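The hit-and-miss logic can be exercised with a toy manager. This is a minimal sketch, not the real class: the names are illustrative, `hashlib.blake2b` stands in for `xxhash.xxh64`, and free/used bookkeeping is simplified away:

```python
from hashlib import blake2b

BLOCK = 4  # tiny block size for illustration; the real default is 256

def compute_hash(token_ids, prefix=-1):
    # stdlib stand-in for the xxhash.xxh64 chaining used by nano-vLLM
    h = blake2b(digest_size=8)
    if prefix != -1:
        h.update(prefix.to_bytes(8, "little"))
    h.update(bytes(token_ids))
    return int.from_bytes(h.digest(), "little")

class ToyBlockManager:
    # Minimal sketch of allocate()'s hit/miss path; names are illustrative
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.hash_to_block = {}
        self.tokens = {}   # block_id -> token content (collision guard)
        self.ref = {}      # block_id -> ref_count

    def allocate(self, prompt):
        table, h, cache_miss = [], -1, False
        for i in range(0, len(prompt), BLOCK):
            chunk = prompt[i:i + BLOCK]
            h = compute_hash(chunk, h) if len(chunk) == BLOCK else -1
            bid = self.hash_to_block.get(h, -1)
            if bid == -1 or self.tokens.get(bid) != chunk:
                cache_miss = True  # latches: later blocks cannot hit either
            if cache_miss:
                bid = self.free.pop(0)
                self.ref[bid] = 1
            else:
                self.ref[bid] += 1  # prefix cache hit: share the block
            if h != -1:
                self.tokens[bid] = chunk
                self.hash_to_block[h] = bid
            table.append(bid)
        return table

mgr = ToyBlockManager(8)
a = mgr.allocate([1, 2, 3, 4, 5, 6, 7, 8])
b = mgr.allocate([1, 2, 3, 4, 9, 9, 9, 9])  # shares its first block with a
print(a, b)  # [0, 1] [0, 2]
```

The second prompt reuses physical block 0 (its first four tokens match), driving that block's reference count to 2, and only allocates a fresh block for its divergent tail.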
deallocate
Releases a sequence’s blocks in reverse order. A block is only returned to the free list when its reference count reaches zero:
def deallocate(self, seq: Sequence):
    for block_id in reversed(seq.block_table):
        block = self.blocks[block_id]
        block.ref_count -= 1
        if block.ref_count == 0:
            self._deallocate_block(block_id)
    seq.num_cached_tokens = 0
    seq.block_table.clear()
Blocks shared via prefix caching retain their hash and token_ids even after ref_count drops back to zero. They sit on the free list but are still in hash_to_block_id, so a future allocation can reclaim them without re-computing KV values.
can_append and may_append
During decode, each step appends exactly one token, so at most one fresh physical block is ever needed per step. That happens exactly when the current sequence length is one past a block boundary; the boolean comparison below coerces to 0 or 1 required free blocks:
def can_append(self, seq: Sequence) -> bool:
    return len(self.free_block_ids) >= (len(seq) % self.block_size == 1)
may_append performs the actual work:
def may_append(self, seq: Sequence):
    block_table = seq.block_table
    last_block = self.blocks[block_table[-1]]
    if len(seq) % self.block_size == 1:
        # first token of a new block: allocate a fresh block
        assert last_block.hash != -1
        block_id = self.free_block_ids[0]
        self._allocate_block(block_id)
        block_table.append(block_id)
    elif len(seq) % self.block_size == 0:
        # just completed a full block: hash it for future cache hits
        assert last_block.hash == -1
        token_ids = seq.block(seq.num_blocks - 1)
        prefix = self.blocks[block_table[-2]].hash if len(block_table) > 1 else -1
        h = self.compute_hash(token_ids, prefix)
        last_block.update(h, token_ids)
        self.hash_to_block_id[h] = last_block.block_id
    # else: mid-block, no structural change needed
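The three branches partition decode steps by sequence length modulo the block size. A small classifier makes the boundaries concrete; the helper name is illustrative, not from the source:

```python
block_size = 256

def decode_step_action(seq_len: int) -> str:
    # Classifies may_append's bookkeeping once the sequence holds
    # seq_len tokens (illustrative helper, not part of nano-vLLM)
    r = seq_len % block_size
    if r == 1:
        return "allocate fresh block"
    if r == 0:
        return "hash completed block"
    return "no structural change"

print(decode_step_action(257))  # allocate fresh block
print(decode_step_action(512))  # hash completed block
print(decode_step_action(300))  # no structural change
```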
Hash-Based Prefix Caching
Prefix caching allows sequences that share a common prompt prefix to reuse each other’s KV cache without re-computing attention. This is especially effective for chat templates and system prompts that are identical across many requests.
Hashes are chained: each block’s hash incorporates the hash of the previous block. This makes the hash a fingerprint of the entire token sequence up to that point, not just the block’s own tokens.
@classmethod
def compute_hash(cls, token_ids: list[int], prefix: int = -1):
    h = xxhash.xxh64()
    if prefix != -1:
        h.update(prefix.to_bytes(8, "little"))
    h.update(np.array(token_ids).tobytes())
    return h.intdigest()
xxhash.xxh64 is a non-cryptographic hash chosen for speed. The 64-bit digest is stored as a Python int. A hash of -1 is used as a sentinel meaning “not yet computed” (the xxh64 digest space is [0, 2^64) so -1 is never a valid digest).
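The chaining property, that identical blocks with different histories get different hashes, can be demonstrated with a stdlib stand-in for xxhash that keeps the same structure:

```python
from hashlib import blake2b

def compute_hash(token_ids, prefix=-1):
    # stdlib stand-in for xxhash.xxh64, keeping the same chaining structure
    h = blake2b(digest_size=8)
    if prefix != -1:
        h.update(prefix.to_bytes(8, "little"))
    h.update(bytes(token_ids))
    return int.from_bytes(h.digest(), "little")

block = [5, 6, 7, 8]
h_a = compute_hash(block, compute_hash([1, 2, 3, 4]))
h_b = compute_hash(block, compute_hash([9, 9, 9, 9]))
print(h_a != h_b)  # True: identical tokens, different history, different hash
```

This is why a cache hit on block i implies the entire prefix up to block i matches, not merely that one block's contents.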
Cache Hit Path
During allocate(), for each complete block:
- Compute the chained hash.
- Look up hash_to_block_id.
- If found, verify block.token_ids == token_ids (collision guard).
- If the block is in used_block_ids, increment ref_count and share it.
- If the block is in free_block_ids, re-activate it with _allocate_block.
- Increment seq.num_cached_tokens by block_size.
The scheduler then charges only len(seq) - seq.num_cached_tokens tokens against the max_num_batched_tokens budget, and prepare_prefill() skips the cached blocks when building the attention inputs.
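The resulting token accounting is simple arithmetic; the numbers below are assumed for illustration:

```python
max_num_batched_tokens = 4096        # illustrative scheduler budget
block_size = 256

seq_len = 1000                       # prompt tokens in this request
num_cached_tokens = 3 * block_size   # three full blocks hit the prefix cache

charged = seq_len - num_cached_tokens
print(charged)  # 232 tokens charged against the batch budget
```

With 768 of 1000 tokens served from cache, the request consumes only 232 tokens of prefill budget, letting the scheduler pack more requests into one batch.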
Reference Counting
| ref_count | Meaning |
|---|---|
| 0 | Block is on the free list; may be reused |
| 1 | Block is exclusively owned by one sequence |
| ≥ 2 | Block is shared by multiple sequences (prefix cache hit) |
Blocks with ref_count >= 2 must never be modified. The allocator guarantees this: decode tokens are written only to blocks owned by a single sequence (the last, incomplete block), and completed full blocks are immediately registered in the hash map and become eligible for sharing.