This page documents the internal classes that power the nano-vLLM inference loop. These are not part of the public API but are useful for understanding how the engine works or for extending it.
LLMEngine (nanovllm/engine/llm_engine.py) is the orchestrator that glues together the config, tokenizer, model runner, and scheduler. The public LLM class is a direct subclass with no additional logic.

Initialisation
class LLMEngine:
    def __init__(self, model, **kwargs):
        config_fields = {field.name for field in fields(Config)}
        config_kwargs = {k: v for k, v in kwargs.items() if k in config_fields}
        config = Config(model, **config_kwargs)
        self.ps = []
        self.events = []
        ctx = mp.get_context("spawn")
        for i in range(1, config.tensor_parallel_size):
            event = ctx.Event()
            process = ctx.Process(target=ModelRunner, args=(config, i, event))
            process.start()
            self.ps.append(process)
            self.events.append(event)
        self.model_runner = ModelRunner(config, 0, self.events)
        self.tokenizer = AutoTokenizer.from_pretrained(config.model, use_fast=True)
        config.eos = self.tokenizer.eos_token_id
        self.scheduler = Scheduler(config)
        atexit.register(self.exit)
For tensor_parallel_size > 1, worker processes are spawned — one per additional GPU (ranks 1…N-1). Rank 0 (ModelRunner on the main process) coordinates them via shared memory and multiprocessing.Event signals.

Methods
- add_request(prompt, sampling_params): Tokenises the prompt (if a string) and enqueues a new Sequence in the scheduler.
- step(): Runs one scheduling + inference step. Returns (outputs, num_tokens).
- is_finished(): Delegates to Scheduler.is_finished().
- generate(prompts, sampling_params, use_tqdm): Full generation loop; collects results ordered by original prompt index.
generate loop
def generate(self, prompts, sampling_params, use_tqdm=True):
    ...
    outputs = {}
    while not self.is_finished():
        t = perf_counter()
        output, num_tokens = self.step()
        ...
        for seq_id, token_ids in output:
            outputs[seq_id] = token_ids
    outputs = [outputs[seq_id] for seq_id in sorted(outputs.keys())]
    outputs = [{"text": self.tokenizer.decode(token_ids), "token_ids": token_ids} for token_ids in outputs]
    return outputs
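The reordering at the end of the loop is what makes generate return results in submission order: seq_ids are assigned monotonically at add_request time, so sorting the collected outputs by seq_id recovers the original prompt order even when sequences finish out of order. A minimal sketch of that logic (the fake_steps data below is hypothetical, standing in for real step() output):

```python
# Outputs arrive keyed by seq_id in whatever order sequences finish.
fake_steps = [
    [(2, [30, 31])],              # seq 2 happens to finish first
    [(0, [10]), (1, [20, 21])],   # seqs 0 and 1 finish later
]

outputs = {}
for step_output in fake_steps:
    for seq_id, token_ids in step_output:
        outputs[seq_id] = token_ids

# seq_ids are monotonically increasing per add_request call, so sorting
# by seq_id restores the original submission order.
ordered = [outputs[seq_id] for seq_id in sorted(outputs)]
```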
Scheduler (nanovllm/engine/scheduler.py) manages two FIFO queues — waiting and running — and decides which sequences to process on each step.

Initialisation
class Scheduler:
    def __init__(self, config: Config):
        self.max_num_seqs = config.max_num_seqs
        self.max_num_batched_tokens = config.max_num_batched_tokens
        self.eos = config.eos
        self.block_manager = BlockManager(config.num_kvcache_blocks, config.kvcache_block_size)
        self.waiting: deque[Sequence] = deque()
        self.running: deque[Sequence] = deque()
schedule() — prefill vs. decode

schedule() returns (list[Sequence], is_prefill: bool). It always tries to schedule a prefill batch first:
def schedule(self) -> tuple[list[Sequence], bool]:
    # prefill
    scheduled_seqs = []
    num_seqs = 0
    num_batched_tokens = 0
    while self.waiting and num_seqs < self.max_num_seqs:
        seq = self.waiting[0]
        if num_batched_tokens + len(seq) > self.max_num_batched_tokens \
                or not self.block_manager.can_allocate(seq):
            break
        num_seqs += 1
        self.block_manager.allocate(seq)
        num_batched_tokens += len(seq) - seq.num_cached_tokens
        seq.status = SequenceStatus.RUNNING
        self.waiting.popleft()
        self.running.append(seq)
        scheduled_seqs.append(seq)
    if scheduled_seqs:
        return scheduled_seqs, True

    # decode
    while self.running and num_seqs < self.max_num_seqs:
        seq = self.running.popleft()
        while not self.block_manager.can_append(seq):
            if self.running:
                self.preempt(self.running.pop())
            else:
                self.preempt(seq)
                break
        else:
            num_seqs += 1
            self.block_manager.may_append(seq)
            scheduled_seqs.append(seq)
    ...
    return scheduled_seqs, False
If no waiting sequences can be prefilled, it falls through to decode all currently-running sequences, preempting those that cannot get a new KV cache block.

postprocess(seqs, token_ids)

Called after the model returns new tokens. Updates each sequence with its new token, marks it FINISHED if EOS is hit or max_tokens is reached, and deallocates its KV cache blocks.
def postprocess(self, seqs: list[Sequence], token_ids: list[int]):
    for seq, token_id in zip(seqs, token_ids):
        seq.append_token(token_id)
        if (not seq.ignore_eos and token_id == self.eos) \
                or seq.num_completion_tokens == seq.max_tokens:
            seq.status = SequenceStatus.FINISHED
            self.block_manager.deallocate(seq)
            self.running.remove(seq)
preempt(seq)

Moves a running sequence back to the head of the waiting queue and frees all its KV cache blocks. The sequence will be re-prefilled (with prefix-cache reuse where possible) on a future step.
def preempt(self, seq: Sequence):
    seq.status = SequenceStatus.WAITING
    self.block_manager.deallocate(seq)
    self.waiting.appendleft(seq)
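Note the appendleft: a preempted sequence goes to the head of the waiting queue, ahead of requests that have never run, so it is the first candidate at the next prefill pass. The queue discipline can be illustrated with a plain deque (the request names are hypothetical):

```python
from collections import deque

# Two requests that have never run are already waiting.
waiting = deque(["new_req_a", "new_req_b"])
preempted = "victim_seq"

# preempt() uses appendleft, not append: the victim jumps the queue.
waiting.appendleft(preempted)

# schedule() always pops from the left (self.waiting[0] / popleft), so
# the preempted sequence is re-prefilled before the never-run requests.
first = waiting[0]
```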
Sequence (nanovllm/engine/sequence.py) represents a single in-flight request. Each instance is assigned a monotonically increasing seq_id.

Status
class SequenceStatus(Enum):
    WAITING = auto()
    RUNNING = auto()
    FINISHED = auto()
Constructor
class Sequence:
    block_size = 256
    counter = count()

    def __init__(self, token_ids: list[int], sampling_params=SamplingParams()):
        self.seq_id = next(Sequence.counter)
        self.status = SequenceStatus.WAITING
        self.token_ids = copy(token_ids)
        self.last_token = token_ids[-1]
        self.num_tokens = len(self.token_ids)
        self.num_prompt_tokens = len(token_ids)
        self.num_cached_tokens = 0
        self.block_table = []
        self.temperature = sampling_params.temperature
        self.max_tokens = sampling_params.max_tokens
        self.ignore_eos = sampling_params.ignore_eos
Key properties
- is_finished (bool): True when status == FINISHED.
- num_completion_tokens (int): Tokens generated so far (excludes prompt).
- prompt_token_ids (list[int]): Slice of token_ids up to num_prompt_tokens.
- completion_token_ids (list[int]): Slice of token_ids after num_prompt_tokens.
- num_blocks (int): Number of KV cache blocks needed: ceil(num_tokens / block_size).
- num_cached_blocks (int): Blocks already present in the KV cache (prefix cache hits).
- last_block_num_tokens (int): Tokens occupying the last (possibly partial) block.
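These derived properties all follow from the counters set in the constructor. A sketch reconstructed from the descriptions above (the actual bodies in sequence.py may differ in detail):

```python
# Sketch of the derived Sequence properties; reconstructed from their
# documented behaviour, not copied from sequence.py.
class SequenceSketch:
    block_size = 256

    def __init__(self, token_ids):
        self.token_ids = list(token_ids)
        self.num_tokens = len(token_ids)
        self.num_prompt_tokens = len(token_ids)
        self.num_cached_tokens = 0

    @property
    def num_completion_tokens(self):
        return self.num_tokens - self.num_prompt_tokens

    @property
    def num_blocks(self):
        # ceil(num_tokens / block_size) via integer arithmetic
        return (self.num_tokens + self.block_size - 1) // self.block_size

    @property
    def num_cached_blocks(self):
        # Only whole blocks count as cached.
        return self.num_cached_tokens // self.block_size

    @property
    def last_block_num_tokens(self):
        # Tokens in the final, possibly partial, block.
        return self.num_tokens - (self.num_blocks - 1) * self.block_size
```

For a 300-token prompt with block_size 256, this gives num_blocks == 2 and last_block_num_tokens == 44.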
Serialisation

Sequence implements __getstate__ / __setstate__ for efficient pickling when sequences are passed to worker processes over shared memory. Only the minimum needed state is serialised — for sequences that have started generating, only the last_token is included rather than the full token_ids list.
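A plausible sketch of that optimisation, assuming a tuple-based state (the exact field layout in sequence.py is an assumption here):

```python
import pickle

# Hypothetical sketch of the pickling trick described above, not the
# actual Sequence implementation.
class SeqSketch:
    def __init__(self, token_ids):
        self.token_ids = list(token_ids)
        self.num_tokens = len(token_ids)
        self.num_prompt_tokens = len(token_ids)
        self.last_token = token_ids[-1]

    def __getstate__(self):
        state = (self.num_tokens, self.num_prompt_tokens, self.last_token)
        if self.num_tokens == self.num_prompt_tokens:
            # Prefill: workers need the full prompt tokens.
            state += (self.token_ids,)
        # Decode: workers only need the newest token, so the (potentially
        # long) token_ids list is dropped from the payload.
        return state

    def __setstate__(self, state):
        self.num_tokens, self.num_prompt_tokens, self.last_token = state[:3]
        if len(state) > 3:
            self.token_ids = state[3]
```

A prefill-stage sequence round-trips with its full prompt; once it has generated a token, only the counters and last_token survive pickling.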
BlockManager (nanovllm/engine/block_manager.py) manages a fixed pool of Block objects. It implements prefix caching: if a sequence’s prompt tokens match blocks already resident in the cache, those blocks are reused without re-computation.

Block
class Block:
    def __init__(self, block_id):
        self.block_id = block_id
        self.ref_count = 0
        self.hash = -1       # -1 means unhashed / mutable
        self.token_ids = []
Blocks with hash != -1 are considered fully written and content-addressable. Multiple sequences can share such a block (ref-counted).

BlockManager constructor
class BlockManager:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.blocks: list[Block] = [Block(i) for i in range(num_blocks)]
        self.hash_to_block_id: dict[int, int] = dict()
        self.free_block_ids: deque[int] = deque(range(num_blocks))
        self.used_block_ids: set[int] = set()
Key methods
- can_allocate(seq): Returns True if there are enough free blocks for the full sequence.
- allocate(seq): Walks seq.num_blocks blocks, reusing cached blocks where the hash matches and allocating new blocks for cache misses. Sets seq.num_cached_tokens.
- deallocate(seq): Decrements ref counts; blocks that reach ref_count == 0 are returned to the free pool.
- can_append(seq): Returns True if the next decode step can proceed: either no new block is needed, or at least one free block exists.
- may_append(seq): Allocates a new block when len(seq) % block_size == 1 (the token starts a new block), or finalises the hash of the just-completed block when len(seq) % block_size == 0.
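The may_append branches reduce to modular arithmetic on the sequence length. With block_size = 256 (as in Sequence), the three cases look like this (boundary_action is a hypothetical helper naming them, not an actual BlockManager method):

```python
block_size = 256

def boundary_action(seq_len):
    """Name the three cases may_append() distinguishes for a given length."""
    if seq_len % block_size == 1:
        return "allocate"   # e.g. token 257 is the first slot of block 2
    if seq_len % block_size == 0:
        return "finalise"   # e.g. token 256 completes block 1: hash it
    return "noop"           # token lands in a partially filled block

actions = [boundary_action(n) for n in (255, 256, 257, 258)]
```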
Hashing

Block content is hashed with xxhash.xxh64, chaining the previous block’s hash as a prefix so that the hash uniquely identifies the entire token prefix up to that block:
@classmethod
def compute_hash(cls, token_ids: list[int], prefix: int = -1):
    h = xxhash.xxh64()
    if prefix != -1:
        h.update(prefix.to_bytes(8, "little"))
    h.update(np.array(token_ids).tobytes())
    return h.intdigest()
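The effect of the chaining is that two blocks with identical token contents hash differently when their prefixes differ. A stdlib-only demonstration of that property, using hashlib.sha256 as a stand-in for xxhash.xxh64 and plain int.to_bytes in place of numpy (the chaining scheme, not the hash function, is what matters):

```python
import hashlib

def compute_hash(token_ids, prefix=-1):
    # Same chaining shape as BlockManager.compute_hash, with sha256
    # substituted for xxh64 so the demo needs no third-party packages.
    h = hashlib.sha256()
    if prefix != -1:
        h.update(prefix.to_bytes(8, "little"))
    h.update(b"".join(t.to_bytes(8, "little") for t in token_ids))
    return int.from_bytes(h.digest()[:8], "little")

block = [7, 8, 9]
h_a = compute_hash(block, prefix=compute_hash([1, 2, 3]))
h_b = compute_hash(block, prefix=compute_hash([4, 5, 6]))

# Identical block contents, different prefixes, different hashes: a block
# hash identifies the whole token prefix, not just one block's tokens.
assert h_a != h_b
```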
