ModelRunner is the lowest layer of nano-vLLM. One instance runs on each GPU rank. Rank 0 lives in the main process and is driven directly by LLMEngine; ranks 1+ are spawned as separate processes and enter a command loop that mirrors whatever rank 0 does.

Initialization Sequence

1. Distributed init

Every rank calls dist.init_process_group("nccl", "tcp://localhost:2333", ...) and torch.cuda.set_device(rank). The rendezvous address is fixed at localhost:2333.
2. Model load

Qwen3ForCausalLM is instantiated on the GPU with the model’s native dtype and weights are loaded via load_model(). Tensor-parallel layers partition their weight matrices column- or row-wise across ranks automatically.
3. Warmup

warmup_model() runs a synthetic prefill with up to max_num_seqs sequences of length max_model_len (capped so the batch fits within max_num_batched_tokens) to bring GPU memory to its steady-state peak:
def warmup_model(self):
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    max_num_batched_tokens = self.config.max_num_batched_tokens
    max_model_len = self.config.max_model_len
    num_seqs = min(max_num_batched_tokens // max_model_len, self.config.max_num_seqs)
    seqs = [Sequence([0] * max_model_len) for _ in range(num_seqs)]
    self.run(seqs, True)
    torch.cuda.empty_cache()
The peak memory stat recorded here is used by allocate_kv_cache() to know how much VRAM the model itself consumes.
4. KV cache allocation

allocate_kv_cache() computes how many KV cache blocks fit in the remaining GPU memory budget and allocates the cache tensor. See KV Cache Management for the calculation.
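As a rough sketch of that calculation (the helper and its argument names here are illustrative, not nano-vLLM's API; see KV Cache Management for the authoritative version), the block count is the leftover VRAM budget divided by the per-block byte cost:

```python
def estimate_num_blocks(total, used, peak, current,
                        gpu_memory_utilization, block_bytes):
    # Budget: a configured fraction of total VRAM, minus memory already in
    # use on the device, minus the warmup peak beyond what is currently
    # allocated (i.e. transient activation headroom the model needs).
    free = total * gpu_memory_utilization - used - peak + current
    return max(int(free) // block_bytes, 0)
```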
5. CUDA graph capture

Unless enforce_eager=True, capture_cudagraph() records a CUDA graph for each decode batch size in [1, 2, 4, 8, 16, 32, 48, …, 512]. All graphs share a single memory pool to minimise overhead.

Tensor Parallelism

Nano-vLLM implements tensor parallelism by splitting attention heads and MLP weight matrices across GPUs. Head counts and weight dimensions are divided by tensor_parallel_size inside each parallel layer.
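The partitioning arithmetic can be sketched with a toy helper (shard_shapes is hypothetical, not part of nano-vLLM): a column-parallel linear splits its output dimension across ranks, while a row-parallel linear splits its input dimension and sums partial outputs with an all-reduce.

```python
def shard_shapes(in_features, out_features, tp_size):
    assert out_features % tp_size == 0 and in_features % tp_size == 0
    # Column-parallel (e.g. QKV / gate-up projections): each rank owns a
    # slice of the output dimension; no communication is needed afterwards.
    column = (in_features, out_features // tp_size)
    # Row-parallel (e.g. o_proj / down_proj): each rank owns a slice of the
    # input dimension; the partial results are summed via all-reduce.
    row = (in_features // tp_size, out_features)
    return column, row
```

Stacking a column-parallel layer directly before a row-parallel one is what lets an entire MLP run with a single all-reduce at the end.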

Process Architecture

LLMEngine spawns tensor_parallel_size - 1 worker processes before creating rank 0’s ModelRunner:
ctx = mp.get_context("spawn")
for i in range(1, config.tensor_parallel_size):
    event = ctx.Event()
    process = ctx.Process(target=ModelRunner, args=(config, i, event))
    process.start()
    self.ps.append(process)
    self.events.append(event)
self.model_runner = ModelRunner(config, 0, self.events)

IPC via SharedMemory

Commands from rank 0 to worker ranks travel through a POSIX shared memory segment named "nanovllm" (1 MiB). Rank 0 serialises [method_name, *args] with pickle and signals each worker via a multiprocessing.Event:
def write_shm(self, method_name, *args):
    data = pickle.dumps([method_name, *args])
    n = len(data)
    self.shm.buf[0:4] = n.to_bytes(4, "little")
    self.shm.buf[4:n+4] = data
    for event in self.event:
        event.set()
Worker ranks block on event.wait(), read the shared memory, and call the same method locally:
def loop(self):
    while True:
        method_name, args = self.read_shm()
        self.call(method_name, *args)
        if method_name == "exit":
            break
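The length-prefixed framing can be exercised in isolation. The functions below are hypothetical stand-ins mirroring write_shm and the worker's read path, with the Event signalling omitted:

```python
import pickle
from multiprocessing.shared_memory import SharedMemory

def write_message(shm: SharedMemory, method_name: str, *args) -> None:
    # Length-prefixed pickle: 4-byte little-endian length, then the payload.
    data = pickle.dumps([method_name, *args])
    n = len(data)
    shm.buf[0:4] = n.to_bytes(4, "little")
    shm.buf[4:n + 4] = data

def read_message(shm: SharedMemory):
    # A worker rank would block on event.wait() before reading.
    n = int.from_bytes(shm.buf[0:4], "little")
    method_name, *args = pickle.loads(bytes(shm.buf[4:n + 4]))
    return method_name, args
```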
NCCL collective operations (all-reduce for tensor-parallel layers) synchronise the actual computation in-kernel, without going through the shared memory channel. Shared memory carries only control messages (method name plus sequence metadata); the heavy data movement happens over NCCL directly between GPUs.
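The dispatch that keeps ranks in lock-step can be modelled with a minimal class (illustrative only; the list channel stands in for the shared memory segment):

```python
class MiniRunner:
    """Toy model of the rank-0 command fan-out, not the real ModelRunner."""

    def __init__(self, rank: int, world_size: int, channel: list):
        self.rank = rank
        self.world_size = world_size
        self.channel = channel  # stand-in for the "nanovllm" shm segment

    def call(self, method_name: str, *args):
        # Rank 0 broadcasts the command first, then executes it locally;
        # worker ranks receive the same command in loop() and execute it too.
        if self.world_size > 1 and self.rank == 0:
            self.channel.append((method_name, args))
        return getattr(self, method_name)(*args)

    def run(self, n_seqs: int):
        return list(range(n_seqs))
```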

Prefill Execution Path

prepare_prefill() builds variable-length input tensors for flash_attn_varlen_func. Each sequence contributes only its uncached tokens to the query, but the full sequence length to the key/value side (to allow attending over cached prefix blocks):
def prepare_prefill(self, seqs: list[Sequence]):
    input_ids = []
    positions = []
    cu_seqlens_q = [0]
    cu_seqlens_k = [0]
    slot_mapping = []
    block_tables = None
    for seq in seqs:
        seqlen = len(seq)
        input_ids.extend(seq[seq.num_cached_tokens:])
        positions.extend(list(range(seq.num_cached_tokens, seqlen)))
        seqlen_q = seqlen - seq.num_cached_tokens
        seqlen_k = seqlen
        cu_seqlens_q.append(cu_seqlens_q[-1] + seqlen_q)
        cu_seqlens_k.append(cu_seqlens_k[-1] + seqlen_k)
        ...
    if cu_seqlens_k[-1] > cu_seqlens_q[-1]:    # prefix cache
        block_tables = self.prepare_block_tables(seqs)
The attention layer (Attention.forward) dispatches to flash_attn_varlen_func with the cumulative sequence length tensors and an optional block_table for prefix-cached sequences.
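The cumulative-length bookkeeping is easy to check with a worked example (build_cu_seqlens is a hypothetical helper): a sequence of length 10 with 8 cached tokens contributes only 2 query tokens but all 10 key/value positions.

```python
def build_cu_seqlens(seq_lens, cached_lens):
    # Queries cover only uncached tokens; keys/values cover the full
    # sequence so the cached prefix blocks can still be attended to.
    cu_q, cu_k = [0], [0]
    for total, cached in zip(seq_lens, cached_lens):
        cu_q.append(cu_q[-1] + (total - cached))
        cu_k.append(cu_k[-1] + total)
    return cu_q, cu_k
```

For seq_lens=[10, 7] and cached_lens=[8, 0] this yields cu_seqlens_q=[0, 2, 9] and cu_seqlens_k=[0, 10, 17]; because the key total exceeds the query total, block tables would be prepared for prefix-cache attention.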

Decode Execution Path

prepare_decode() is much simpler: one token per sequence, a scalar position, and a slot mapping that points to the single new KV slot:
def prepare_decode(self, seqs: list[Sequence]):
    input_ids = []
    positions = []
    slot_mapping = []
    context_lens = []
    for seq in seqs:
        input_ids.append(seq.last_token)
        positions.append(len(seq) - 1)
        context_lens.append(len(seq))
        slot_mapping.append(
            seq.block_table[-1] * self.block_size + seq.last_block_num_tokens - 1
        )
The attention layer then calls flash_attn_with_kvcache with the block table and per-sequence context lengths.
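The slot arithmetic can be verified with a toy example (decode_slot is a hypothetical helper): with block_size=256, a sequence whose last physical block is block 7 and holds 5 tokens writes the new token's KV pair at slot 7 * 256 + 5 - 1 = 1796.

```python
def decode_slot(block_table, block_size, last_block_num_tokens):
    # The new token's KV pair lands in the sequence's last physical block,
    # at the offset of its final (just-appended) token.
    return block_table[-1] * block_size + last_block_num_tokens - 1
```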

CUDA Graph Capture

CUDA graphs eliminate CPU-side kernel launch overhead for decode, which dominates at small batch sizes. capture_cudagraph() iterates from largest to smallest batch size so that the first graph establishes the memory pool that all subsequent graphs share:
@torch.inference_mode()
def capture_cudagraph(self):
    config = self.config
    hf_config = config.hf_config
    max_bs = min(self.config.max_num_seqs, 512)
    max_num_blocks = (config.max_model_len + self.block_size - 1) // self.block_size
    input_ids    = torch.zeros(max_bs, dtype=torch.int64)
    positions    = torch.zeros(max_bs, dtype=torch.int64)
    slot_mapping = torch.zeros(max_bs, dtype=torch.int32)
    context_lens = torch.zeros(max_bs, dtype=torch.int32)
    block_tables = torch.zeros(max_bs, max_num_blocks, dtype=torch.int32)
    outputs      = torch.zeros(max_bs, hf_config.hidden_size)
    self.graph_bs = [1, 2, 4, 8] + list(range(16, max_bs + 1, 16))
    self.graphs = {}
    self.graph_pool = None

    for bs in reversed(self.graph_bs):
        graph = torch.cuda.CUDAGraph()
        set_context(False, slot_mapping=slot_mapping[:bs],
                    context_lens=context_lens[:bs],
                    block_tables=block_tables[:bs])
        outputs[:bs] = self.model(input_ids[:bs], positions[:bs])  # warmup
        with torch.cuda.graph(graph, self.graph_pool):
            outputs[:bs] = self.model(input_ids[:bs], positions[:bs])  # capture
        if self.graph_pool is None:
            self.graph_pool = graph.pool()
        self.graphs[bs] = graph
        torch.cuda.synchronize()
        reset_context()

    # Keep handles to the shared input/output tensors for run_model().
    self.graph_vars = dict(
        input_ids=input_ids, positions=positions, slot_mapping=slot_mapping,
        context_lens=context_lens, block_tables=block_tables, outputs=outputs,
    )
At inference time, run_model() finds the smallest captured batch size that accommodates the current request count, copies inputs into the pre-allocated graph tensors, and replays:
@torch.inference_mode()
def run_model(self, input_ids: torch.Tensor, positions: torch.Tensor, is_prefill: bool):
    if is_prefill or self.enforce_eager or input_ids.size(0) > 512:
        return self.model.compute_logits(self.model(input_ids, positions))
    else:
        bs = input_ids.size(0)
        context = get_context()
        graph = self.graphs[next(x for x in self.graph_bs if x >= bs)]
        graph_vars = self.graph_vars
        graph_vars["input_ids"][:bs] = input_ids
        graph_vars["positions"][:bs] = positions
        graph_vars["slot_mapping"].fill_(-1)
        graph_vars["slot_mapping"][:bs] = context.slot_mapping
        graph_vars["context_lens"].zero_()
        graph_vars["context_lens"][:bs] = context.context_lens
        graph_vars["block_tables"][:bs, :context.block_tables.size(1)] = context.block_tables
        graph.replay()
        return self.model.compute_logits(graph_vars["outputs"][:bs])
Three conditions bypass graph replay and fall back to eager execution:
  • The batch is a prefill step (is_prefill=True).
  • enforce_eager=True was set in the config.
  • Batch size exceeds 512 (no graph was captured for it).
Use enforce_eager=True when debugging or when running on hardware where CUDA graph capture is unreliable. It removes the graph replay path entirely and runs every step through the standard PyTorch eager executor, at the cost of higher per-step latency for decode.
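The bucket selection in run_model() can also be checked in isolation (pick_graph_bs is a hypothetical helper): a decode batch of 9 sequences replays the 16-sequence graph, with the unused slots padded out (slot_mapping filled with -1, context_lens zeroed).

```python
def pick_graph_bs(graph_bs, bs):
    # Smallest captured batch size that can hold the current batch.
    return next(x for x in graph_bs if x >= bs)

# Mirrors the capture schedule: [1, 2, 4, 8, 16, 32, 48, ..., 512].
graph_bs = [1, 2, 4, 8] + list(range(16, 513, 16))
```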

run() Entry Point

The public method called by LLMEngine.step() ties the two paths together:
def run(self, seqs: list[Sequence], is_prefill: bool) -> list[int]:
    input_ids, positions = (
        self.prepare_prefill(seqs) if is_prefill else self.prepare_decode(seqs)
    )
    temperatures = self.prepare_sample(seqs) if self.rank == 0 else None
    logits = self.run_model(input_ids, positions, is_prefill)
    token_ids = self.sampler(logits, temperatures).tolist() if self.rank == 0 else None
    reset_context()
    return token_ids
Only rank 0 collects temperatures and runs the sampler; worker ranks execute the model in lock-step but discard their logits. Rank 0 returns the sampled token IDs to the scheduler.
