Nano-vLLM manages GPU memory for attention key/value tensors through a paged allocator inspired by virtual memory systems. The BlockManager treats the KV cache as a pool of fixed-size blocks and assigns them to sequences on demand.

Physical Layout

The KV cache is a single contiguous tensor allocated once during ModelRunner.allocate_kv_cache():
self.kv_cache = torch.empty(
    2,                            # key and value
    hf_config.num_hidden_layers,  # one entry per transformer layer
    config.num_kvcache_blocks,    # number of physical blocks
    self.block_size,              # tokens per block (default 256)
    num_kv_heads,                 # KV heads per tensor-parallel rank
    head_dim,                     # head dimension
)
Dimension 0 separates keys (kv_cache[0]) from values (kv_cache[1]). Each attention layer gets a slice of dimension 1 assigned at startup:
layer_id = 0
for module in self.model.modules():
    if hasattr(module, "k_cache") and hasattr(module, "v_cache"):
        module.k_cache = self.kv_cache[0, layer_id]
        module.v_cache = self.kv_cache[1, layer_id]
        layer_id += 1
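Given this layout, a token's K/V vectors for a layer live at `kv_cache[0/1, layer, block_id, offset]`. A hypothetical helper (the name `kv_slot` is illustrative, not part of nano-vLLM) shows how a logical token position maps through the block table to its physical slot:

```python
def kv_slot(block_table: list[int], pos: int, block_size: int = 256):
    """Map a logical token position to its (block_id, offset) slot.

    block_table holds the physical block ids assigned to a sequence,
    in logical order; pos is the 0-based token index.
    """
    block_id = block_table[pos // block_size]
    offset = pos % block_size
    return block_id, offset

# Token 300 of a sequence whose logical blocks 0 and 1 map to
# physical blocks 7 and 3 lives in physical block 3 at offset 44:
print(kv_slot([7, 3], 300))  # (3, 44)
```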

Block Size and Count

Each block holds block_size tokens (default 256, must be a multiple of 256 per the config assertion). The number of blocks is computed at runtime from available GPU memory:
def allocate_kv_cache(self):
    config = self.config
    hf_config = config.hf_config
    free, total = torch.cuda.mem_get_info()
    used = total - free
    peak = torch.cuda.memory_stats()["allocated_bytes.all.peak"]
    current = torch.cuda.memory_stats()["allocated_bytes.all.current"]
    num_kv_heads = hf_config.num_key_value_heads // self.world_size
    head_dim = getattr(hf_config, "head_dim",
                       hf_config.hidden_size // hf_config.num_attention_heads)
    block_bytes = (
        2 * hf_config.num_hidden_layers
        * self.block_size * num_kv_heads * head_dim
        * hf_config.torch_dtype.itemsize
    )
    config.num_kvcache_blocks = (
        int(total * config.gpu_memory_utilization - used - peak + current)
        // block_bytes
    )
The formula accounts for:
  • total * gpu_memory_utilization — the overall GPU memory budget (default 90% of VRAM).
  • used — memory already committed, chiefly model weights and other allocations (used = total - free).
  • peak - current — headroom reserved for transient activations: torch allocations hit peak bytes during the warmup run but only current bytes remain afterward, and that spike can recur during inference, so the difference is held back from the budget.
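As a worked example with hypothetical model shapes (24 layers, 2 KV heads per rank, head_dim 128, bfloat16, the default block size of 256 — not taken from any particular config):

```python
# block_bytes = 2 (K and V) * layers * block_size * kv_heads * head_dim * dtype size
num_layers, block_size, num_kv_heads, head_dim, itemsize = 24, 256, 2, 128, 2
block_bytes = 2 * num_layers * block_size * num_kv_heads * head_dim * itemsize
print(block_bytes)  # 6291456 bytes = 6 MiB per block

# On a 24 GiB GPU at 90% utilization with 16 GiB already committed
# to weights and warmup peak, the remaining budget buys:
budget = int(0.9 * 24 * 2**30) - 16 * 2**30
print(budget // block_bytes)  # 955 blocks
```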

The Block Object

class Block:
    def __init__(self, block_id):
        self.block_id = block_id
        self.ref_count = 0    # 0 = free, ≥1 = in use
        self.hash = -1        # -1 = not yet hashed / incomplete block
        self.token_ids = []   # token content for hash verification
A block with ref_count == 0 is on the free list. ref_count > 1 means two or more sequences share this block (prefix cache hit).
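The update method invoked later by allocate() and may_append() is not shown in the source excerpt; a minimal sketch consistent with its call sites (it records the chained hash and the token content used for collision checks):

```python
class Block:
    def __init__(self, block_id):
        self.block_id = block_id
        self.ref_count = 0    # 0 = free, >= 1 = in use
        self.hash = -1        # -1 = not yet hashed / incomplete block
        self.token_ids = []   # token content for hash verification

    def update(self, hash, token_ids):
        # Record the chained hash and the block's token content so a
        # later allocation can verify a prefix-cache hit against it.
        self.hash = hash
        self.token_ids = token_ids

block = Block(0)
block.update(0xDEADBEEF, list(range(256)))
print(block.hash, len(block.token_ids))
```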

BlockManager Operations

can_allocate

Checks whether enough free blocks exist to cover all logical blocks required by a sequence:
def can_allocate(self, seq: Sequence) -> bool:
    return len(self.free_block_ids) >= seq.num_blocks
seq.num_blocks is ceil(len(seq) / block_size).
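The ceiling division behind seq.num_blocks can be sketched as (illustrative helper, not the actual Sequence property):

```python
def num_blocks(seq_len: int, block_size: int = 256) -> int:
    # Ceiling division: a 257-token sequence needs two 256-token blocks.
    return (seq_len + block_size - 1) // block_size

print(num_blocks(256))  # 1
print(num_blocks(257))  # 2
```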

allocate

Assigns physical blocks to a sequence at prefill time. For each logical block it first attempts a hash cache hit:
def allocate(self, seq: Sequence):
    assert not seq.block_table
    h = -1
    cache_miss = False
    for i in range(seq.num_blocks):
        token_ids = seq.block(i)
        h = self.compute_hash(token_ids, h) if len(token_ids) == self.block_size else -1
        block_id = self.hash_to_block_id.get(h, -1)
        if block_id == -1 or self.blocks[block_id].token_ids != token_ids:
            cache_miss = True
        if cache_miss:
            block_id = self.free_block_ids[0]
            block = self._allocate_block(block_id)
        else:
            seq.num_cached_tokens += self.block_size
            if block_id in self.used_block_ids:
                block = self.blocks[block_id]
                block.ref_count += 1      # share the block
            else:
                block = self._allocate_block(block_id)
        if h != -1:
            block.update(h, token_ids)
            self.hash_to_block_id[h] = block_id
        seq.block_table.append(block_id)
Once a cache miss occurs for any block, all subsequent blocks are treated as misses too — there is no point trying to match non-contiguous cached suffixes. The last (possibly partial) block of a prompt is never hashed (h is set to -1 when len(token_ids) < block_size) because its content may still change as the decode phase appends tokens.
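The sticky cache_miss flag can be distilled into a small illustration (hypothetical helper, not nano-vLLM code): once any logical block misses, every later block is a miss too, because a chained hash is only meaningful when the entire prefix matched.

```python
def classify_blocks(hits: list[bool]) -> list[str]:
    # Mirrors the sticky flag in allocate(): the first miss
    # permanently flips cache_miss for the rest of the loop.
    cache_miss = False
    out = []
    for hit in hits:
        if not hit:
            cache_miss = True
        out.append("miss" if cache_miss else "hit")
    return out

# Block 3 matching is useless if block 2 missed:
print(classify_blocks([True, True, False, True]))
# ['hit', 'hit', 'miss', 'miss']
```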

deallocate

Releases a sequence’s blocks in reverse order. A block is only returned to the free list when its reference count reaches zero:
def deallocate(self, seq: Sequence):
    for block_id in reversed(seq.block_table):
        block = self.blocks[block_id]
        block.ref_count -= 1
        if block.ref_count == 0:
            self._deallocate_block(block_id)
    seq.num_cached_tokens = 0
    seq.block_table.clear()
Blocks shared via prefix caching retain their hash and token_ids even after ref_count drops back to zero. They sit on the free list but are still in hash_to_block_id, so a future allocation can reclaim them without re-computing KV values.
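The _allocate_block and _deallocate_block helpers are not shown in the excerpts above; a sketch consistent with their call sites, assuming free_block_ids is an ordered container (a deque here) and used_block_ids a set:

```python
from collections import deque

class Block:
    def __init__(self, block_id):
        self.block_id = block_id
        self.ref_count = 0
        self.hash = -1
        self.token_ids = []

class BlockManagerSketch:
    def __init__(self, num_blocks: int):
        self.blocks = [Block(i) for i in range(num_blocks)]
        self.free_block_ids = deque(range(num_blocks))
        self.used_block_ids = set()

    def _allocate_block(self, block_id: int) -> Block:
        block = self.blocks[block_id]
        assert block.ref_count == 0
        # A new owner starts from a clean slate; on a prefix-cache
        # hit the caller re-attaches hash/token_ids via update().
        block.ref_count, block.hash, block.token_ids = 1, -1, []
        self.free_block_ids.remove(block_id)
        self.used_block_ids.add(block_id)
        return block

    def _deallocate_block(self, block_id: int):
        assert self.blocks[block_id].ref_count == 0
        # hash and token_ids are left intact, so the freed block
        # stays reachable through hash_to_block_id for future hits.
        self.used_block_ids.remove(block_id)
        self.free_block_ids.append(block_id)
```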

can_append and may_append

During decode, each step appends exactly one token. A new physical block is needed only when the current sequence length is exactly one past a block boundary:
def can_append(self, seq: Sequence) -> bool:
    # The boolean coerces to 0 or 1: a free block is required
    # only when the next token starts a new block.
    return len(self.free_block_ids) >= (len(seq) % self.block_size == 1)
may_append performs the actual work:
def may_append(self, seq: Sequence):
    block_table = seq.block_table
    last_block = self.blocks[block_table[-1]]
    if len(seq) % self.block_size == 1:
        # first token of a new block: allocate a fresh block
        assert last_block.hash != -1
        block_id = self.free_block_ids[0]
        self._allocate_block(block_id)
        block_table.append(block_id)
    elif len(seq) % self.block_size == 0:
        # just completed a full block: hash it for future cache hits
        assert last_block.hash == -1
        token_ids = seq.block(seq.num_blocks - 1)
        prefix = self.blocks[block_table[-2]].hash if len(block_table) > 1 else -1
        h = self.compute_hash(token_ids, prefix)
        last_block.update(h, token_ids)
        self.hash_to_block_id[h] = last_block.block_id
    # else: mid-block, no structural change needed
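The modular arithmetic driving these three branches, with the default block_size of 256 (illustrative helper, not nano-vLLM code):

```python
block_size = 256

def append_action(seq_len: int) -> str:
    # seq_len is the current sequence length at this decode step.
    if seq_len % block_size == 1:
        return "allocate new block"    # token landed in a fresh block
    if seq_len % block_size == 0:
        return "hash completed block"  # last block just filled up
    return "no-op"                     # mid-block

for n in (255, 256, 257, 258):
    print(n, append_action(n))
# 255 no-op / 256 hash completed block / 257 allocate new block / 258 no-op
```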

Hash-Based Prefix Caching

Prefix caching allows sequences that share a common prompt prefix to reuse each other’s KV cache without re-computing attention. This is especially effective for chat templates and system prompts that are identical across many requests.
Hashes are chained: each block’s hash incorporates the hash of the previous block. This makes the hash a fingerprint of the entire token sequence up to that point, not just the block’s own tokens.
@classmethod
def compute_hash(cls, token_ids: list[int], prefix: int = -1):
    h = xxhash.xxh64()
    if prefix != -1:
        h.update(prefix.to_bytes(8, "little"))
    h.update(np.array(token_ids).tobytes())
    return h.intdigest()
xxhash.xxh64 is a non-cryptographic hash chosen for speed. The 64-bit digest is stored as a Python int. A hash of -1 is used as a sentinel meaning “not yet computed” (the xxh64 digest space is [0, 2^64) so -1 is never a valid digest).
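The chaining property can be demonstrated with a stand-in hash — hashlib.blake2b here instead of xxhash, so the snippet has no third-party dependency; the mixing scheme mirrors compute_hash but the byte encoding is simplified for illustration. Two sequences that share block 0 but differ in block 1 agree on the first hash and diverge from the second onward:

```python
import hashlib

def chained_hash(token_ids: list[int], prefix: int = -1) -> int:
    # blake2b stands in for xxh64; the previous block's digest is
    # mixed in first, making the result a fingerprint of the full prefix.
    h = hashlib.blake2b(digest_size=8)
    if prefix != -1:
        h.update(prefix.to_bytes(8, "little"))
    h.update(b"".join(t.to_bytes(4, "little") for t in token_ids))
    return int.from_bytes(h.digest(), "little")

shared = list(range(256))           # identical first block
h0 = chained_hash(shared)
h1a = chained_hash([1] * 256, h0)   # second blocks differ...
h1b = chained_hash([2] * 256, h0)
print(h0 == chained_hash(shared))   # True: same prefix, same hash
print(h1a == h1b)                   # False: divergence propagates
```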

Cache Hit Path

During allocate(), for each complete block:
  1. Compute the chained hash.
  2. Look up hash_to_block_id.
  3. If found, verify block.token_ids == token_ids (collision guard).
  4. If the block is in used_block_ids, increment ref_count and share it.
  5. If the block is in free_block_ids, re-activate it with _allocate_block.
  6. Increment seq.num_cached_tokens by block_size.
The scheduler then charges only len(seq) - seq.num_cached_tokens tokens against the max_num_batched_tokens budget, and prepare_prefill() skips the cached blocks when building the attention inputs.
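For example (hypothetical numbers): a 1,000-token prompt whose first two 256-token blocks hit the cache charges only the uncached remainder against the batch budget.

```python
block_size = 256
seq_len = 1000
num_cached_tokens = 2 * block_size   # two full-block prefix hits
charged = seq_len - num_cached_tokens
print(charged)  # 488 tokens count toward max_num_batched_tokens
```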

Reference Counting

ref_count | Meaning
--------- | -------
0         | Block is on the free list; may be reused
1         | Block is exclusively owned by one sequence
≥ 2       | Block is shared by multiple sequences (prefix cache hit)
Blocks with ref_count >= 2 must never be modified. The allocator guarantees this: decode tokens are written only to blocks owned by a single sequence (the last, incomplete block), and completed full blocks are immediately registered in the hash map and become eligible for sharing.
