This page documents the internal classes that power the nano-vLLM inference loop. These are not part of the public API but are useful for understanding how the engine works or for extending it.
LLMEngine (nanovllm/engine/llm_engine.py) is the orchestrator that glues together the config, tokenizer, model runner, and scheduler. The public LLM class is a direct subclass with no additional logic.

Initialisation
class LLMEngine:
    def __init__(self, model, **kwargs):
        config_fields = {field.name for field in fields(Config)}
        config_kwargs = {k: v for k, v in kwargs.items() if k in config_fields}
        config = Config(model, **config_kwargs)
        self.ps = []
        self.events = []
        ctx = mp.get_context("spawn")
        for i in range(1, config.tensor_parallel_size):
            event = ctx.Event()
            process = ctx.Process(target=ModelRunner, args=(config, i, event))
            process.start()
            self.ps.append(process)
            self.events.append(event)
        self.model_runner = ModelRunner(config, 0, self.events)
        self.tokenizer = AutoTokenizer.from_pretrained(config.model, use_fast=True)
        config.eos = self.tokenizer.eos_token_id
        self.scheduler = Scheduler(config)
        atexit.register(self.exit)
For tensor_parallel_size > 1, worker processes are spawned — one per additional GPU (ranks 1…N-1). Rank 0 (ModelRunner on the main process) coordinates them via shared memory and multiprocessing.Event signals.

Methods
- add_request(prompt, sampling_params): Tokenises the prompt (if a string) and enqueues a new Sequence in the scheduler.
- step(): Runs one scheduling + inference step. Returns (outputs, num_tokens).
- is_finished(): Delegates to Scheduler.is_finished().
- generate(prompts, sampling_params, use_tqdm): Full generation loop; collects results ordered by original prompt index.
generate loop
def generate(self, prompts, sampling_params, use_tqdm=True):
    ...
    outputs = {}
    while not self.is_finished():
        t = perf_counter()
        output, num_tokens = self.step()
        ...
        for seq_id, token_ids in output:
            outputs[seq_id] = token_ids
    outputs = [outputs[seq_id] for seq_id in sorted(outputs.keys())]
    outputs = [{"text": self.tokenizer.decode(token_ids), "token_ids": token_ids} for token_ids in outputs]
    return outputs
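The reordering at the end of the loop is what makes generate return results in submission order: seq_ids are assigned monotonically at add_request time, so sorting the collected outputs by seq_id recovers the original prompt order even when sequences finish out of order. A minimal sketch of that logic (the fake_steps data below is hypothetical, standing in for real step() output):

```python
# Outputs arrive keyed by seq_id in whatever order sequences finish.
fake_steps = [
    [(2, [30, 31])],              # seq 2 happens to finish first
    [(0, [10]), (1, [20, 21])],   # seqs 0 and 1 finish later
]

outputs = {}
for step_output in fake_steps:
    for seq_id, token_ids in step_output:
        outputs[seq_id] = token_ids

# seq_ids are monotonically increasing per add_request call, so sorting
# by seq_id restores the original submission order.
ordered = [outputs[seq_id] for seq_id in sorted(outputs)]
```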
Scheduler (nanovllm/engine/scheduler.py) manages two FIFO queues — waiting and running — and decides which sequences to process on each step.

Initialisation
class Scheduler:
    def __init__(self, config: Config):
        self.max_num_seqs = config.max_num_seqs
        self.max_num_batched_tokens = config.max_num_batched_tokens
        self.eos = config.eos
        self.block_manager = BlockManager(config.num_kvcache_blocks, config.kvcache_block_size)
        self.waiting: deque[Sequence] = deque()
        self.running: deque[Sequence] = deque()
schedule() — prefill vs. decode

schedule() returns (list[Sequence], is_prefill: bool). It always tries to schedule a prefill batch first:
def schedule(self) -> tuple[list[Sequence], bool]:
    # prefill
    scheduled_seqs = []
    num_seqs = 0
    num_batched_tokens = 0
    while self.waiting and num_seqs < self.max_num_seqs:
        seq = self.waiting[0]
        if num_batched_tokens + len(seq) > self.max_num_batched_tokens \
                or not self.block_manager.can_allocate(seq):
            break
        num_seqs += 1
        self.block_manager.allocate(seq)
        num_batched_tokens += len(seq) - seq.num_cached_tokens
        seq.status = SequenceStatus.RUNNING
        self.waiting.popleft()
        self.running.append(seq)
        scheduled_seqs.append(seq)
    if scheduled_seqs:
        return scheduled_seqs, True

    # decode
    while self.running and num_seqs < self.max_num_seqs:
        seq = self.running.popleft()
        while not self.block_manager.can_append(seq):
            if self.running:
                self.preempt(self.running.pop())
            else:
                self.preempt(seq)
                break
        else:
            num_seqs += 1
            self.block_manager.may_append(seq)
            scheduled_seqs.append(seq)
    ...
    return scheduled_seqs, False
If no waiting sequences can be prefilled, it falls through to decode all currently-running sequences, preempting those that cannot get a new KV cache block.

postprocess(seqs, token_ids)

Called after the model returns new tokens. Updates each sequence with its new token, marks it FINISHED if EOS is hit or max_tokens is reached, and deallocates its KV cache blocks.
def postprocess(self, seqs: list[Sequence], token_ids: list[int]):
    for seq, token_id in zip(seqs, token_ids):
        seq.append_token(token_id)
        if (not seq.ignore_eos and token_id == self.eos) \
                or seq.num_completion_tokens == seq.max_tokens:
            seq.status = SequenceStatus.FINISHED
            self.block_manager.deallocate(seq)
            self.running.remove(seq)
preempt(seq)

Moves a running sequence back to the head of the waiting queue and frees all its KV cache blocks. The sequence will be re-prefilled (with prefix-cache reuse where possible) on a future step.
def preempt(self, seq: Sequence):
    seq.status = SequenceStatus.WAITING
    self.block_manager.deallocate(seq)
    self.waiting.appendleft(seq)
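Note the appendleft: a preempted sequence goes to the head of the waiting queue, ahead of requests that have never run, so it is the first candidate at the next prefill pass. The queue discipline can be illustrated with a plain deque (the request names are hypothetical):

```python
from collections import deque

# Two requests that have never run are already waiting.
waiting = deque(["new_req_a", "new_req_b"])
preempted = "victim_seq"

# preempt() uses appendleft, not append: the victim jumps the queue.
waiting.appendleft(preempted)

# schedule() always pops from the left (self.waiting[0] / popleft), so
# the preempted sequence is re-prefilled before the never-run requests.
first = waiting[0]
```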
Sequence (nanovllm/engine/sequence.py) represents a single in-flight request. Each instance is assigned a monotonically increasing seq_id.

Status
class SequenceStatus(Enum):
    WAITING = auto()
    RUNNING = auto()
    FINISHED = auto()
Constructor
class Sequence:
    block_size = 256
    counter = count()

    def __init__(self, token_ids: list[int], sampling_params=SamplingParams()):
        self.seq_id = next(Sequence.counter)
        self.status = SequenceStatus.WAITING
        self.token_ids = copy(token_ids)
        self.last_token = token_ids[-1]
        self.num_tokens = len(self.token_ids)
        self.num_prompt_tokens = len(token_ids)
        self.num_cached_tokens = 0
        self.block_table = []
        self.temperature = sampling_params.temperature
        self.max_tokens = sampling_params.max_tokens
        self.ignore_eos = sampling_params.ignore_eos
Key properties
- is_finished (bool): True when status == FINISHED.
- num_completion_tokens (int): Tokens generated so far (excludes prompt).
- prompt_token_ids (list[int]): Slice of token_ids up to num_prompt_tokens.
- completion_token_ids (list[int]): Slice of token_ids after num_prompt_tokens.
- num_blocks (int): Number of KV cache blocks needed: ceil(num_tokens / block_size).
- num_cached_blocks (int): Blocks already present in the KV cache (prefix cache hits).
- last_block_num_tokens (int): Tokens occupying the last (possibly partial) block.
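These derived properties all follow from the counters set in the constructor. A sketch reconstructed from the descriptions above (the actual bodies in sequence.py may differ in detail):

```python
# Sketch of the derived Sequence properties; reconstructed from their
# documented behaviour, not copied from sequence.py.
class SequenceSketch:
    block_size = 256

    def __init__(self, token_ids):
        self.token_ids = list(token_ids)
        self.num_tokens = len(token_ids)
        self.num_prompt_tokens = len(token_ids)
        self.num_cached_tokens = 0

    @property
    def num_completion_tokens(self):
        return self.num_tokens - self.num_prompt_tokens

    @property
    def num_blocks(self):
        # ceil(num_tokens / block_size) via integer arithmetic
        return (self.num_tokens + self.block_size - 1) // self.block_size

    @property
    def num_cached_blocks(self):
        # Only whole blocks count as cached.
        return self.num_cached_tokens // self.block_size

    @property
    def last_block_num_tokens(self):
        # Tokens in the final, possibly partial, block.
        return self.num_tokens - (self.num_blocks - 1) * self.block_size
```

For a 300-token prompt with block_size 256, this gives num_blocks == 2 and last_block_num_tokens == 44.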
Serialisation

Sequence implements __getstate__ / __setstate__ for efficient pickling when sequences are passed to worker processes over shared memory. Only the minimum needed state is serialised — for sequences that have started generating, only the last_token is included rather than the full token_ids list.
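A plausible sketch of that optimisation, assuming a tuple-based state (the exact field layout in sequence.py is an assumption here):

```python
import pickle

# Hypothetical sketch of the pickling trick described above, not the
# actual Sequence implementation.
class SeqSketch:
    def __init__(self, token_ids):
        self.token_ids = list(token_ids)
        self.num_tokens = len(token_ids)
        self.num_prompt_tokens = len(token_ids)
        self.last_token = token_ids[-1]

    def __getstate__(self):
        state = (self.num_tokens, self.num_prompt_tokens, self.last_token)
        if self.num_tokens == self.num_prompt_tokens:
            # Prefill: workers need the full prompt tokens.
            state += (self.token_ids,)
        # Decode: workers only need the newest token, so the (potentially
        # long) token_ids list is dropped from the payload.
        return state

    def __setstate__(self, state):
        self.num_tokens, self.num_prompt_tokens, self.last_token = state[:3]
        if len(state) > 3:
            self.token_ids = state[3]
```

A prefill-stage sequence round-trips with its full prompt; once it has generated a token, only the counters and last_token survive pickling.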
BlockManager (nanovllm/engine/block_manager.py) manages a fixed pool of Block objects. It implements prefix caching: if a sequence’s prompt tokens match blocks already resident in the cache, those blocks are reused without re-computation.

Block
class Block:
    def __init__(self, block_id):
        self.block_id = block_id
        self.ref_count = 0
        self.hash = -1       # -1 means unhashed / mutable
        self.token_ids = []
Blocks with hash != -1 are considered fully written and content-addressable. Multiple sequences can share such a block (ref-counted).

BlockManager constructor
class BlockManager:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.blocks: list[Block] = [Block(i) for i in range(num_blocks)]
        self.hash_to_block_id: dict[int, int] = dict()
        self.free_block_ids: deque[int] = deque(range(num_blocks))
        self.used_block_ids: set[int] = set()
Key methods
- can_allocate(seq): Returns True if there are enough free blocks for the full sequence.
- allocate(seq): Walks seq.num_blocks blocks, reusing cached blocks where the hash matches and allocating new blocks for cache misses. Sets seq.num_cached_tokens.
- deallocate(seq): Decrements ref counts; blocks that reach ref_count == 0 are returned to the free pool.
- can_append(seq): Returns True if the next decode step can proceed: either no new block is needed, or at least one free block exists.
- may_append(seq): Allocates a new block when len(seq) % block_size == 1 (the token starts a new block), or finalises the hash of the just-completed block when len(seq) % block_size == 0.
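The may_append branches reduce to modular arithmetic on the sequence length. With block_size = 256 (as in Sequence), the three cases look like this (boundary_action is a hypothetical helper naming them, not an actual BlockManager method):

```python
block_size = 256

def boundary_action(seq_len):
    """Name the three cases may_append() distinguishes for a given length."""
    if seq_len % block_size == 1:
        return "allocate"   # e.g. token 257 is the first slot of block 2
    if seq_len % block_size == 0:
        return "finalise"   # e.g. token 256 completes block 1: hash it
    return "noop"           # token lands in a partially filled block

actions = [boundary_action(n) for n in (255, 256, 257, 258)]
```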
Hashing

Block content is hashed with xxhash.xxh64, chaining the previous block’s hash as a prefix so that the hash uniquely identifies the entire token prefix up to that block:
@classmethod
def compute_hash(cls, token_ids: list[int], prefix: int = -1):
    h = xxhash.xxh64()
    if prefix != -1:
        h.update(prefix.to_bytes(8, "little"))
    h.update(np.array(token_ids).tobytes())
    return h.intdigest()
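The effect of the chaining is that two blocks with identical token contents hash differently when their prefixes differ. A stdlib-only demonstration of that property, using hashlib.sha256 as a stand-in for xxhash.xxh64 and plain int.to_bytes in place of numpy (the chaining scheme, not the hash function, is what matters):

```python
import hashlib

def compute_hash(token_ids, prefix=-1):
    # Same chaining shape as BlockManager.compute_hash, with sha256
    # substituted for xxh64 so the demo needs no third-party packages.
    h = hashlib.sha256()
    if prefix != -1:
        h.update(prefix.to_bytes(8, "little"))
    h.update(b"".join(t.to_bytes(8, "little") for t in token_ids))
    return int.from_bytes(h.digest()[:8], "little")

block = [7, 8, 9]
h_a = compute_hash(block, prefix=compute_hash([1, 2, 3]))
h_b = compute_hash(block, prefix=compute_hash([4, 5, 6]))

# Identical block contents, different prefixes, different hashes: a block
# hash identifies the whole token prefix, not just one block's tokens.
assert h_a != h_b
```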
