Architecture Overview

Nano-vLLM is structured around a small set of focused components. Each one has a single responsibility, and they communicate through well-defined interfaces. The diagram below shows how a request flows through the system from the user-facing API down to GPU execution.

Component Map

┌─────────────────────────────────────────────────────────────┐
│                        LLM (user API)                       │
│              generate() / add_request() / step()           │
└───────────────────────────┬─────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                        LLMEngine                            │
│         tokenize → schedule → run → postprocess             │
└──────────────┬────────────────────────────┬─────────────────┘
               │                            │
               ▼                            ▼
┌──────────────────────────┐   ┌────────────────────────────┐
│         Scheduler        │   │        ModelRunner         │
│  waiting deque           │   │  dist init / model load    │
│  running deque           │   │  prefill / decode paths    │
│  preemption logic        │   │  CUDA graph replay         │
└──────────┬───────────────┘   └──────────┬─────────────────┘
           │                              │
           ▼                              ▼
┌──────────────────────────┐   ┌────────────────────────────┐
│       BlockManager       │   │    Qwen3ForCausalLM        │
│  free / used block sets  │   │    + Sampler               │
│  hash → block_id map     │   │                            │
│  ref-counted allocation  │   │                            │
└──────────────────────────┘   └────────────────────────────┘

Components

LLMEngine

LLMEngine is the central orchestrator. On construction it spawns one worker process per tensor-parallel rank beyond rank 0, creates the ModelRunner for rank 0 in the main process, loads the tokenizer, and wires up the Scheduler. The generate() loop repeatedly calls step(), which in turn calls scheduler.schedule(), dispatches to the model runner, and hands new token IDs back to scheduler.postprocess().

def step(self):
    seqs, is_prefill = self.scheduler.schedule()
    token_ids = self.model_runner.call("run", seqs, is_prefill)
    self.scheduler.postprocess(seqs, token_ids)
    outputs = [(seq.seq_id, seq.completion_token_ids) for seq in seqs if seq.is_finished]
    num_tokens = sum(len(seq) for seq in seqs) if is_prefill else -len(seqs)
    return outputs, num_tokens

Scheduler

Scheduler maintains two deque objects — waiting and running — and decides each step whether to run a prefill batch or a decode batch. It delegates memory decisions to BlockManager and handles preemption when GPU memory is exhausted. See Scheduler for details.

BlockManager

BlockManager owns the KV cache address space. It tracks free and used block IDs, assigns physical blocks to sequences at prefill time, shares blocks across sequences that share a common prefix via hash-based deduplication, and releases blocks when a sequence finishes or is preempted. See KV Cache Management for details.

ModelRunner

ModelRunner runs on each GPU rank. It initialises the NCCL process group, loads the model weights, warms up to measure peak memory, allocates the physical KV cache tensor, and (unless enforce_eager=True) captures a set of CUDA graphs for decode batch sizes [1, 2, 4, 8, 16, 32, …, 512]. See Model Runner for details.

Sequence

Sequence is the per-request data object. It holds the full list of token IDs (prompt + completion), a block_table mapping logical block indices to physical KV cache block IDs, a status flag (WAITING / RUNNING / FINISHED), and sampling parameters.

class Sequence:
    block_size = 256

    def __init__(self, token_ids: list[int], sampling_params=SamplingParams()):
        self.seq_id = next(Sequence.counter)
        self.status = SequenceStatus.WAITING
        self.token_ids = copy(token_ids)
        self.num_cached_tokens = 0
        self.block_table = []
        self.temperature = sampling_params.temperature
        self.max_tokens = sampling_params.max_tokens
        self.ignore_eos = sampling_params.ignore_eos

Qwen3ForCausalLM & Sampler

Qwen3ForCausalLM is the transformer model. Its forward() pass returns hidden states; compute_logits() projects them through the language model head. The Sampler turns logits into the next token ID using per-sequence temperatures.

Request Lifecycle

add_request

LLMEngine.add_request() tokenizes the prompt string (or accepts a pre-tokenized list) and wraps it in a Sequence with status WAITING. The sequence is pushed onto Scheduler.waiting.

schedule

On the next call to step(), Scheduler.schedule() attempts a prefill batch first. It pops sequences from waiting, checks the token budget (max_num_batched_tokens) and block availability, calls BlockManager.allocate() for each sequence, marks them RUNNING, and returns them together with is_prefill=True. If no waiting sequences are ready, it assembles a decode batch from running.

run

ModelRunner.call("run", seqs, is_prefill) prepares input tensors — token IDs, positions, slot mappings, block tables — and executes the model. For prefill it calls the model directly via flash_attn_varlen_func; for decode it replays a CUDA graph (unless enforce_eager=True).

postprocess

Scheduler.postprocess() appends the new token ID to each sequence and checks stop conditions: EOS token match (unless ignore_eos) or reaching max_tokens. Finished sequences are marked FINISHED and their KV cache blocks are released.

output

step() collects all FINISHED sequences from the batch, returns (seq_id, completion_token_ids) pairs to generate(), which accumulates them. Once is_finished() is true, the completion token IDs are decoded to strings and returned.

Detailed Pages

Scheduler

Prefill and decode scheduling, preemption strategy, and the postprocess loop.

KV Cache Management

Block allocation, hash-based prefix caching, and reference counting.

Model Runner

GPU setup, tensor parallelism, CUDA graph capture, and inference execution paths.

Get Started

Guides

Architecture

Architecture Overview

Component Map

Components

LLMEngine

Scheduler

BlockManager

ModelRunner

Sequence

Qwen3ForCausalLM & Sampler

Request Lifecycle

Detailed Pages

Scheduler

KV Cache Management

Model Runner

Build docs developers (and LLMs) love

Get Started

Guides

Architecture

​Component Map

​Components

​LLMEngine

​Scheduler

​BlockManager

​ModelRunner

​Sequence

​Qwen3ForCausalLM & Sampler

​Request Lifecycle

​Detailed Pages

Scheduler

KV Cache Management

Model Runner

Build docs developers (and LLMs) love

Component Map

Components

LLMEngine

Scheduler

BlockManager

ModelRunner

Sequence

Qwen3ForCausalLM & Sampler

Request Lifecycle

Detailed Pages