Nano-vLLM manages GPU memory for attention key/value tensors through a paged allocator inspired by virtual memory systems. The BlockManager treats the KV cache as a pool of fixed-size blocks and assigns them to sequences on demand.
Physical Layout
The KV cache is a single contiguous tensor allocated once during ModelRunner.allocate_kv_cache():
self.kv_cache = torch.empty(
    2,                            # key and value
    hf_config.num_hidden_layers,  # one entry per transformer layer
    config.num_kvcache_blocks,    # number of physical blocks
    self.block_size,              # tokens per block (default 256)
    num_kv_heads,                 # KV heads per tensor-parallel rank
    head_dim,                     # head dimension
)
Dimension 0 separates keys (kv_cache[0]) from values (kv_cache[1]). Each attention layer gets a slice of dimension 1 assigned at startup:
layer_id = 0
for module in self.model.modules():
    if hasattr(module, "k_cache") and hasattr(module, "v_cache"):
        module.k_cache = self.kv_cache[0, layer_id]
        module.v_cache = self.kv_cache[1, layer_id]
        layer_id += 1
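The axis order can be made concrete with a NumPy stand-in for the torch allocation; all dimensions below are toy values chosen for illustration, not a real model config:

```python
import numpy as np

# Toy dimensions (assumed for illustration, not a real model config)
num_layers, num_blocks, block_size, num_kv_heads, head_dim = 4, 8, 256, 2, 64

# NumPy stand-in for the torch.empty allocation, same axis order
kv_cache = np.empty(
    (2, num_layers, num_blocks, block_size, num_kv_heads, head_dim),
    dtype=np.float16,
)

# Each attention layer receives views of the key plane and the value plane
layer_caches = [(kv_cache[0, i], kv_cache[1, i]) for i in range(num_layers)]

k0, v0 = layer_caches[0]
print(k0.shape)  # (8, 256, 2, 64): blocks x tokens x kv_heads x head_dim
```

Because these are views, writes into a layer's `k_cache` slice land directly in the shared pool; no per-layer copies exist.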
Block Size and Count
Each block holds block_size tokens (default 256, must be a multiple of 256 per the config assertion). The number of blocks is computed at runtime from available GPU memory:
def allocate_kv_cache(self):
    config = self.config
    hf_config = config.hf_config
    free, total = torch.cuda.mem_get_info()
    used = total - free
    peak = torch.cuda.memory_stats()["allocated_bytes.all.peak"]
    current = torch.cuda.memory_stats()["allocated_bytes.all.current"]
    num_kv_heads = hf_config.num_key_value_heads // self.world_size
    head_dim = getattr(hf_config, "head_dim",
                       hf_config.hidden_size // hf_config.num_attention_heads)
    block_bytes = (
        2 * hf_config.num_hidden_layers
        * self.block_size * num_kv_heads * head_dim
        * hf_config.torch_dtype.itemsize
    )
    config.num_kvcache_blocks = (
        int(total * config.gpu_memory_utilization - used - peak + current)
        // block_bytes
    )
The formula accounts for:
- total * gpu_memory_utilization — the GPU memory budget (default 90% of VRAM).
- used - peak + current — memory already committed to model weights and other allocations, adjusted by the delta between peak and current to avoid double-counting the warmup spike.
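Plugging toy numbers into the formula shows how the budget translates into a block count; every value below is assumed for illustration only:

```python
# All numbers below are assumed for illustration only
GiB = 1024**3
total = 24 * GiB       # physical VRAM
used = 6 * GiB         # total - free, i.e. weights and other allocations
peak = 7 * GiB         # allocator peak during warmup
current = 6 * GiB      # allocator usage after warmup
gpu_memory_utilization = 0.9

# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, fp16 (2 bytes),
# block_size 256; the leading 2 accounts for both K and V
block_bytes = 2 * 32 * 256 * 8 * 128 * 2   # 32 MiB per block

num_kvcache_blocks = (
    int(total * gpu_memory_utilization - used - peak + current) // block_bytes
)
print(num_kvcache_blocks)  # 467
```

Under these assumptions the budget is roughly 14.6 GiB after subtracting committed memory, yielding 467 blocks of 32 MiB, or about 119k cacheable tokens.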
The Block Object
class Block:
    def __init__(self, block_id):
        self.block_id = block_id
        self.ref_count = 0   # 0 = free, ≥1 = in use
        self.hash = -1       # -1 = not yet hashed / incomplete block
        self.token_ids = []  # token content for hash verification
A block with ref_count == 0 is on the free list. ref_count > 1 means two or more sequences share this block (prefix cache hit).
BlockManager Operations
can_allocate
Checks whether enough free blocks exist to cover all logical blocks required by a sequence:
def can_allocate(self, seq: Sequence) -> bool:
    return len(self.free_block_ids) >= seq.num_blocks
seq.num_blocks is ceil(len(seq) / block_size).
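The ceiling division can be written without floats; `num_blocks` below is a hypothetical helper mirroring what `Sequence.num_blocks` computes:

```python
block_size = 256

def num_blocks(seq_len: int) -> int:
    # Integer ceiling division: ceil(seq_len / block_size).
    # Hypothetical helper mirroring Sequence.num_blocks.
    return (seq_len + block_size - 1) // block_size

print(num_blocks(1), num_blocks(256), num_blocks(257))  # 1 1 2
```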
allocate
Assigns physical blocks to a sequence at prefill time. For each logical block it first attempts a hash cache hit:
def allocate(self, seq: Sequence):
    assert not seq.block_table
    h = -1
    cache_miss = False
    for i in range(seq.num_blocks):
        token_ids = seq.block(i)
        h = self.compute_hash(token_ids, h) if len(token_ids) == self.block_size else -1
        block_id = self.hash_to_block_id.get(h, -1)
        if block_id == -1 or self.blocks[block_id].token_ids != token_ids:
            cache_miss = True
        if cache_miss:
            block_id = self.free_block_ids[0]
            block = self._allocate_block(block_id)
        else:
            seq.num_cached_tokens += self.block_size
            if block_id in self.used_block_ids:
                block = self.blocks[block_id]
                block.ref_count += 1  # share the block
            else:
                block = self._allocate_block(block_id)
        if h != -1:
            block.update(h, token_ids)
            self.hash_to_block_id[h] = block_id
        seq.block_table.append(block_id)
Once a cache miss occurs for any block, all subsequent blocks are treated as misses too — there is no point trying to match non-contiguous cached suffixes.
The last (possibly partial) block of a prompt is never hashed (h is set to -1 when len(token_ids) < block_size) because its content may still change as the decode phase appends tokens.
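The hit-and-miss logic can be exercised with a toy manager. This is a minimal sketch, not the real class: the names are illustrative, `hashlib.blake2b` stands in for `xxhash.xxh64`, and free/used bookkeeping is simplified away:

```python
from hashlib import blake2b

BLOCK = 4  # tiny block size for illustration; the real default is 256

def compute_hash(token_ids, prefix=-1):
    # stdlib stand-in for the xxhash.xxh64 chaining used by nano-vLLM
    h = blake2b(digest_size=8)
    if prefix != -1:
        h.update(prefix.to_bytes(8, "little"))
    h.update(bytes(token_ids))
    return int.from_bytes(h.digest(), "little")

class ToyBlockManager:
    # Minimal sketch of allocate()'s hit/miss path; names are illustrative
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.hash_to_block = {}
        self.tokens = {}   # block_id -> token content (collision guard)
        self.ref = {}      # block_id -> ref_count

    def allocate(self, prompt):
        table, h, cache_miss = [], -1, False
        for i in range(0, len(prompt), BLOCK):
            chunk = prompt[i:i + BLOCK]
            h = compute_hash(chunk, h) if len(chunk) == BLOCK else -1
            bid = self.hash_to_block.get(h, -1)
            if bid == -1 or self.tokens.get(bid) != chunk:
                cache_miss = True  # latches: later blocks cannot hit either
            if cache_miss:
                bid = self.free.pop(0)
                self.ref[bid] = 1
            else:
                self.ref[bid] += 1  # prefix cache hit: share the block
            if h != -1:
                self.tokens[bid] = chunk
                self.hash_to_block[h] = bid
            table.append(bid)
        return table

mgr = ToyBlockManager(8)
a = mgr.allocate([1, 2, 3, 4, 5, 6, 7, 8])
b = mgr.allocate([1, 2, 3, 4, 9, 9, 9, 9])  # shares its first block with a
print(a, b)  # [0, 1] [0, 2]
```

The second prompt reuses physical block 0 (its first four tokens match), driving that block's reference count to 2, and only allocates a fresh block for its divergent tail.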
deallocate
Releases a sequence’s blocks in reverse order. A block is only returned to the free list when its reference count reaches zero:
def deallocate(self, seq: Sequence):
    for block_id in reversed(seq.block_table):
        block = self.blocks[block_id]
        block.ref_count -= 1
        if block.ref_count == 0:
            self._deallocate_block(block_id)
    seq.num_cached_tokens = 0
    seq.block_table.clear()
Blocks shared via prefix caching retain their hash and token_ids even after ref_count drops back to zero. They sit on the free list but are still in hash_to_block_id, so a future allocation can reclaim them without re-computing KV values.
can_append and may_append
During decode, each step appends exactly one token, so at most one fresh physical block is ever needed per step. That happens exactly when the current sequence length is one past a block boundary; the boolean comparison below coerces to 0 or 1 required free blocks:
def can_append(self, seq: Sequence) -> bool:
    return len(self.free_block_ids) >= (len(seq) % self.block_size == 1)
may_append performs the actual work:
def may_append(self, seq: Sequence):
    block_table = seq.block_table
    last_block = self.blocks[block_table[-1]]
    if len(seq) % self.block_size == 1:
        # first token of a new block: allocate a fresh block
        assert last_block.hash != -1
        block_id = self.free_block_ids[0]
        self._allocate_block(block_id)
        block_table.append(block_id)
    elif len(seq) % self.block_size == 0:
        # just completed a full block: hash it for future cache hits
        assert last_block.hash == -1
        token_ids = seq.block(seq.num_blocks - 1)
        prefix = self.blocks[block_table[-2]].hash if len(block_table) > 1 else -1
        h = self.compute_hash(token_ids, prefix)
        last_block.update(h, token_ids)
        self.hash_to_block_id[h] = last_block.block_id
    # else: mid-block, no structural change needed
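The three branches partition decode steps by sequence length modulo the block size. A small classifier makes the boundaries concrete; the helper name is illustrative, not from the source:

```python
block_size = 256

def decode_step_action(seq_len: int) -> str:
    # Classifies may_append's bookkeeping once the sequence holds
    # seq_len tokens (illustrative helper, not part of nano-vLLM)
    r = seq_len % block_size
    if r == 1:
        return "allocate fresh block"
    if r == 0:
        return "hash completed block"
    return "no structural change"

print(decode_step_action(257))  # allocate fresh block
print(decode_step_action(512))  # hash completed block
print(decode_step_action(300))  # no structural change
```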
Hash-Based Prefix Caching
Prefix caching allows sequences that share a common prompt prefix to reuse each other’s KV cache without re-computing attention. This is especially effective for chat templates and system prompts that are identical across many requests.
Hashes are chained: each block’s hash incorporates the hash of the previous block. This makes the hash a fingerprint of the entire token sequence up to that point, not just the block’s own tokens.
@classmethod
def compute_hash(cls, token_ids: list[int], prefix: int = -1):
    h = xxhash.xxh64()
    if prefix != -1:
        h.update(prefix.to_bytes(8, "little"))
    h.update(np.array(token_ids).tobytes())
    return h.intdigest()
xxhash.xxh64 is a non-cryptographic hash chosen for speed. The 64-bit digest is stored as a Python int. A hash of -1 is used as a sentinel meaning “not yet computed” (the xxh64 digest space is [0, 2^64) so -1 is never a valid digest).
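The chaining property, that identical blocks with different histories get different hashes, can be demonstrated with a stdlib stand-in for xxhash that keeps the same structure:

```python
from hashlib import blake2b

def compute_hash(token_ids, prefix=-1):
    # stdlib stand-in for xxhash.xxh64, keeping the same chaining structure
    h = blake2b(digest_size=8)
    if prefix != -1:
        h.update(prefix.to_bytes(8, "little"))
    h.update(bytes(token_ids))
    return int.from_bytes(h.digest(), "little")

block = [5, 6, 7, 8]
h_a = compute_hash(block, compute_hash([1, 2, 3, 4]))
h_b = compute_hash(block, compute_hash([9, 9, 9, 9]))
print(h_a != h_b)  # True: identical tokens, different history, different hash
```

This is why a cache hit on block i implies the entire prefix up to block i matches, not merely that one block's contents.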
Cache Hit Path
During allocate(), for each complete block:
- Compute the chained hash.
- Look up hash_to_block_id.
- If found, verify block.token_ids == token_ids (collision guard).
- If the block is in used_block_ids, increment ref_count and share it.
- If the block is in free_block_ids, re-activate it with _allocate_block.
- Increment seq.num_cached_tokens by block_size.
The scheduler then charges only len(seq) - seq.num_cached_tokens tokens against the max_num_batched_tokens budget, and prepare_prefill() skips the cached blocks when building the attention inputs.
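The resulting token accounting is simple arithmetic; the numbers below are assumed for illustration:

```python
max_num_batched_tokens = 4096        # illustrative scheduler budget
block_size = 256

seq_len = 1000                       # prompt tokens in this request
num_cached_tokens = 3 * block_size   # three full blocks hit the prefix cache

charged = seq_len - num_cached_tokens
print(charged)  # 232 tokens charged against the batch budget
```

With 768 of 1000 tokens served from cache, the request consumes only 232 tokens of prefill budget, letting the scheduler pack more requests into one batch.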
Reference Counting
| ref_count | Meaning |
|---|---|
| 0 | Block is on the free list; may be reused |
| 1 | Block is exclusively owned by one sequence |
| ≥ 2 | Block is shared by multiple sequences (prefix cache hit) |
Blocks with ref_count >= 2 must never be modified. The allocator guarantees this: decode tokens are written only to blocks owned by a single sequence (the last, incomplete block), and completed full blocks are immediately registered in the hash map and become eligible for sharing.