## LLMEngine
`LLMEngine` (nanovllm/engine/llm_engine.py) is the orchestrator that glues together the config, tokenizer, model runner, and scheduler. The public `LLM` class is a direct subclass with no additional logic.

### Initialisation

If `tensor_parallel_size > 1`, worker processes are spawned — one per additional GPU (ranks 1…N-1). Rank 0 (the `ModelRunner` on the main process) coordinates them via shared memory and `multiprocessing.Event` signals.

### Methods

| Method | Description |
|---|---|
| `add_request(prompt, sampling_params)` | Tokenises the prompt (if a string) and enqueues a new `Sequence` in the scheduler. |
| `step()` | Runs one scheduling + inference step. Returns `(outputs, num_tokens)`. |
| `is_finished()` | Delegates to `Scheduler.is_finished()`. |
| `generate(prompts, sampling_params, use_tqdm)` | Full generation loop; collects results ordered by original prompt index. |
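Putting the methods above together, the driving pattern of `generate` can be sketched as below. `ToyEngine` is an invented stand-in for the real engine (its internals are fake); only the `add_request` / `step` / `is_finished` loop structure mirrors the table above.

```python
# Hypothetical sketch: how a generate() loop drives the engine API.
# ToyEngine fakes LLMEngine; real nano-vllm internals differ.
class ToyEngine:
    def __init__(self):
        self.next_id = 0
        self.running = {}  # seq_id -> fake completion tokens

    def add_request(self, prompt, sampling_params=None):
        # The real engine tokenises the prompt; here we fabricate two tokens.
        self.running[self.next_id] = [len(prompt), len(prompt) + 1]
        self.next_id += 1

    def step(self):
        # Pretend every running sequence finishes in a single step.
        outputs = []
        for seq_id in list(self.running):
            outputs.append((seq_id, self.running.pop(seq_id)))
        return outputs, sum(len(t) for _, t in outputs)

    def is_finished(self):
        return not self.running

def generate(engine, prompts, sampling_params=None):
    for p in prompts:
        engine.add_request(p, sampling_params)
    results = {}
    while not engine.is_finished():
        step_outputs, _num_tokens = engine.step()
        for seq_id, token_ids in step_outputs:
            results[seq_id] = token_ids
    # seq_ids are assigned in submission order, so sorting
    # restores the original prompt order.
    return [results[i] for i in sorted(results)]

out = generate(ToyEngine(), ["ab", "abc"])
```

The point of the sketch is the shape of the loop: submit everything, then repeatedly `step()` until `is_finished()`, collecting per-sequence outputs as they complete.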
## Scheduler
`Scheduler` (nanovllm/engine/scheduler.py) manages two FIFO queues — `waiting` and `running` — and decides which sequences to process on each step.

### Scheduling

`schedule()` returns `(list[Sequence], is_prefill: bool)`. It always tries to schedule a prefill batch first; only when no prefill work is available does it build a decode batch from the running queue. A sequence is marked `FINISHED` if EOS is hit or `max_tokens` is reached, and its KV cache blocks are deallocated.
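The prefill-first policy can be illustrated with a toy version of the decision (queue names follow the text; `max_batch` and the function body are invented for the sketch):

```python
# Toy sketch of a prefill-first scheduling decision over two FIFO queues.
from collections import deque

def schedule(waiting: deque, running: deque, max_batch: int = 8):
    if waiting:
        # Prefill batch: admit sequences from `waiting` into `running`.
        batch = []
        while waiting and len(batch) < max_batch:
            seq = waiting.popleft()
            running.append(seq)
            batch.append(seq)
        return batch, True            # (sequences, is_prefill=True)
    # No prefill work: decode one token for (up to) max_batch running sequences.
    return list(running)[:max_batch], False
```

A real scheduler also has to respect token and KV-block budgets when sizing the batch; the sketch keeps only the queue mechanics.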
## Sequence
`Sequence` (nanovllm/engine/sequence.py) represents a single in-flight request. Each instance is assigned a monotonically increasing `seq_id`.

### Status

Each sequence carries a `status` field that the scheduler updates; `FINISHED` marks a completed sequence.

### Properties

| Property | Type | Description |
|---|---|---|
| `is_finished` | `bool` | True when `status == FINISHED`. |
| `num_completion_tokens` | `int` | Tokens generated so far (excludes the prompt). |
| `prompt_token_ids` | `list[int]` | Slice of `token_ids` up to `num_prompt_tokens`. |
| `completion_token_ids` | `list[int]` | Slice of `token_ids` after `num_prompt_tokens`. |
| `num_blocks` | `int` | Number of KV cache blocks needed: `ceil(num_tokens / block_size)`. |
| `num_cached_blocks` | `int` | Blocks already present in the KV cache (prefix cache hits). |
| `last_block_num_tokens` | `int` | Tokens occupying the last (possibly partial) block. |
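The derived block-count properties reduce to a little integer arithmetic. A minimal sketch (the `block_size` value and class body are assumptions for the example, not nano-vllm's exact code):

```python
# Minimal sketch of Sequence's derived properties from the table above.
class ToySequence:
    block_size = 256  # assumed; configured per engine in practice

    def __init__(self, token_ids, num_prompt_tokens):
        self.token_ids = token_ids
        self.num_prompt_tokens = num_prompt_tokens

    @property
    def num_tokens(self):
        return len(self.token_ids)

    @property
    def prompt_token_ids(self):
        return self.token_ids[:self.num_prompt_tokens]

    @property
    def completion_token_ids(self):
        return self.token_ids[self.num_prompt_tokens:]

    @property
    def num_blocks(self):
        # ceil(num_tokens / block_size) using integer arithmetic.
        return (self.num_tokens + self.block_size - 1) // self.block_size

    @property
    def last_block_num_tokens(self):
        # Tokens that land in the final, possibly partial, block.
        return self.num_tokens - (self.num_blocks - 1) * self.block_size
```

For example, 300 tokens with a block size of 256 need 2 blocks, the second holding 44 tokens.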
`Sequence` implements `__getstate__` / `__setstate__` for efficient pickling when sequences are passed to worker processes over shared memory. Only the minimum needed state is serialised — for sequences that have started generating, only the `last_token` is included rather than the full `token_ids` list.
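The scheme can be sketched as follows (a hypothetical reduction, with field names assumed; the real class serialises more state):

```python
# Sketch of slim pickling: once decoding has begun, only the newest
# token needs to cross the process boundary.
import pickle

class ToySeq:
    def __init__(self, token_ids, num_prompt_tokens):
        self.token_ids = token_ids
        self.num_prompt_tokens = num_prompt_tokens

    @property
    def last_token(self):
        return self.token_ids[-1]

    def __getstate__(self):
        decoding = len(self.token_ids) > self.num_prompt_tokens
        # Full token list during prefill, single token once decoding.
        payload = self.last_token if decoding else self.token_ids
        return (self.num_prompt_tokens, decoding, payload)

    def __setstate__(self, state):
        self.num_prompt_tokens, decoding, payload = state
        self.token_ids = [payload] if decoding else list(payload)
```

During decode each worker already holds the sequence's KV cache, so the latest sampled token is all it needs to run the next step.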
## BlockManager
`BlockManager` (nanovllm/engine/block_manager.py) manages a fixed pool of `Block` objects. It implements prefix caching: if a sequence's prompt tokens match blocks already resident in the cache, those blocks are reused without re-computation.

### Block

Blocks whose `hash != -1` are considered fully written and content-addressable. Multiple sequences can share such a block (ref-counted).

### Methods

| Method | Description |
|---|---|
| `can_allocate(seq)` | Returns True if there are enough free blocks for the full sequence. |
| `allocate(seq)` | Walks `seq.num_blocks` blocks, reusing cached blocks where the hash matches, and allocating new blocks for cache misses. Sets `seq.num_cached_tokens`. |
| `deallocate(seq)` | Decrements ref counts; blocks that reach `ref_count == 0` are returned to the free pool. |
| `can_append(seq)` | Returns True if a new decode step can proceed — i.e. no new block is needed, or at least one free block exists. |
| `may_append(seq)` | Allocates a new block if `len(seq) % block_size == 1` (start of a new block), or finalises the hash of the just-completed block if `len(seq) % block_size == 0`. |
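The allocate path can be illustrated with a toy, dict-based version (the real manager uses `Block` objects and a free-block queue; `hash_to_block` and `free_blocks` are assumed structures for the sketch):

```python
# Toy sketch of prefix-cache-aware allocation: reuse blocks whose hash
# matches, ref-counting shared blocks; take fresh blocks on cache misses.
def allocate(block_hashes, hash_to_block, free_blocks):
    table, cached = [], 0
    for h in block_hashes:
        if h in hash_to_block:        # prefix cache hit: share the block
            blk = hash_to_block[h]
            blk["ref"] += 1
            cached += 1
        else:                         # miss: take a block from the free pool
            blk = free_blocks.pop()
            blk.update(hash=h, ref=1)
            hash_to_block[h] = blk
        table.append(blk)
    return table, cached
```

Allocating the same prompt twice therefore consumes blocks from the pool only once; the second sequence just bumps the ref counts.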
Block hashes are computed with `xxhash.xxh64`, chaining the previous block's hash as a prefix so that the hash uniquely identifies the entire token prefix up to that block.
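The chaining scheme can be sketched as below; `hashlib.sha256` stands in for `xxhash.xxh64` so the example runs without extra dependencies, and the function name and byte encodings are assumptions, but the prefix-chaining structure is the point:

```python
# Sketch of chained block hashing: each block's hash folds in the
# previous block's hash, so it identifies the whole token prefix.
import hashlib

def compute_block_hash(token_ids, prefix_hash: int = -1) -> int:
    h = hashlib.sha256()  # stand-in for xxhash.xxh64
    if prefix_hash != -1:
        # Fold in the previous block's hash so this digest covers the
        # entire prefix, not just this block's tokens.
        h.update(prefix_hash.to_bytes(8, "little"))
    for t in token_ids:
        h.update(t.to_bytes(4, "little"))
    # Truncate to 64 bits, mimicking an xxh64-sized digest.
    return int.from_bytes(h.digest()[:8], "little")
```

Because of the chaining, two blocks hash equal only if every block before them hashed equal too, which is exactly the property prefix caching needs for safe block sharing.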