ModelRunner is the lowest layer of nano-vLLM. One instance runs on each GPU rank. Rank 0 lives in the main process and is driven directly by LLMEngine; ranks 1+ are spawned as separate processes and enter a command loop that mirrors whatever rank 0 does.
Initialization Sequence
Distributed init
Every rank calls dist.init_process_group("nccl", "tcp://localhost:2333", ...) and torch.cuda.set_device(rank). The rendezvous address is fixed at localhost:2333.
Model load
Qwen3ForCausalLM is instantiated on the GPU with the model’s native dtype and weights are loaded via load_model(). Tensor-parallel layers partition their weight matrices column- or row-wise across ranks automatically.
Warmup
warmup_model() runs a synthetic prefill with as many max_model_len-token sequences as fit within max_num_batched_tokens (capped at max_num_seqs) to bring GPU memory to its steady-state peak:

```python
def warmup_model(self):
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    max_num_batched_tokens = self.config.max_num_batched_tokens
    max_model_len = self.config.max_model_len
    num_seqs = min(max_num_batched_tokens // max_model_len, self.config.max_num_seqs)
    seqs = [Sequence([0] * max_model_len) for _ in range(num_seqs)]
    self.run(seqs, True)
    torch.cuda.empty_cache()
```
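The batch sizing above can be worked through with concrete numbers; the config values here are illustrative, not nano-vLLM defaults:

```python
# Worked example of the warmup batch sizing: as many full-length sequences
# as fit in one batch, capped at max_num_seqs. Values are illustrative.
max_num_batched_tokens = 16384
max_model_len = 4096
max_num_seqs = 256

num_seqs = min(max_num_batched_tokens // max_model_len, max_num_seqs)
print(num_seqs)  # 4
```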
The peak memory stat recorded here is used by allocate_kv_cache() to know how much VRAM the model itself consumes.

KV cache allocation

allocate_kv_cache() computes how many KV cache blocks fit in the remaining GPU memory budget and allocates the cache tensor. See KV Cache Management for the calculation.

CUDA graph capture
Unless enforce_eager=True, capture_cudagraph() records a CUDA graph for each decode batch size in [1, 2, 4, 8, 16, 32, 48, …, 512]. All graphs share a single memory pool to minimise overhead.
Tensor Parallelism
Nano-vLLM implements tensor parallelism by splitting attention heads and MLP weight matrices across GPUs. Head counts and weight dimensions are divided by tensor_parallel_size inside each parallel layer.
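The split can be illustrated at the shape level in pure Python (names and shapes here are illustrative, not nano-vLLM's actual layer classes): a column-parallel linear feeding a row-parallel linear needs no communication between the two layers, and a single all-reduce (modelled below as an element-wise sum) recovers the unsharded result:

```python
# Sketch of column-parallel followed by row-parallel linear layers across
# two "ranks". The all-reduce is modelled as an element-wise sum.

def matmul(x, w):  # x: [n][k], w: [k][m] -> [n][m]
    return [[sum(x[i][t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(x))]

x = [[1, 2], [3, 4]]                   # input activations, shape [2, 2]
w1 = [[1, 0, 2, 1], [0, 1, 1, 2]]      # first linear, shape [2, 4]
w2 = [[1, 1], [2, 0], [0, 1], [1, 2]]  # second linear, shape [4, 2]

# Column-parallel: each rank holds half of w1's output columns.
w1_shards = [[row[:2] for row in w1], [row[2:] for row in w1]]
# Row-parallel: each rank holds the matching half of w2's input rows.
w2_shards = [w2[:2], w2[2:]]

# Each rank computes its partial output locally, with no communication
# between the column-parallel and row-parallel layers.
partials = [matmul(matmul(x, w1_shards[r]), w2_shards[r]) for r in range(2)]

# All-reduce (sum) combines the partial outputs into the final result.
out = [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(*partials)]
assert out == matmul(matmul(x, w1), w2)  # matches the unsharded computation
print(out)
```

This is why decode needs only one all-reduce per MLP (and one per attention block): the intermediate activations stay sharded.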
Process Architecture
LLMEngine spawns tensor_parallel_size - 1 worker processes before creating rank 0’s ModelRunner:
```python
ctx = mp.get_context("spawn")
for i in range(1, config.tensor_parallel_size):
    event = ctx.Event()
    process = ctx.Process(target=ModelRunner, args=(config, i, event))
    process.start()
    self.ps.append(process)
    self.events.append(event)
self.model_runner = ModelRunner(config, 0, self.events)
```
IPC via SharedMemory
Commands from rank 0 to worker ranks travel through a POSIX shared memory segment named "nanovllm" (1 MiB). Rank 0 serialises [method_name, *args] with pickle and signals each worker via a multiprocessing.Event:
```python
def write_shm(self, method_name, *args):
    data = pickle.dumps([method_name, *args])
    n = len(data)
    self.shm.buf[0:4] = n.to_bytes(4, "little")
    self.shm.buf[4:n+4] = data
    for event in self.event:
        event.set()
```
Worker ranks block on event.wait(), read the shared memory, and call the same method locally:
```python
def loop(self):
    while True:
        method_name, args = self.read_shm()
        self.call(method_name, *args)
        if method_name == "exit":
            break
```
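The length-prefixed pickle framing is simple enough to exercise end-to-end in a single process. A sketch of the protocol, with the Event signalling elided and an anonymous segment instead of the fixed "nanovllm" name:

```python
# One-process round-trip of the framing used over the shared memory segment:
# a 4-byte little-endian length prefix followed by a pickled [method, *args].
import pickle
from multiprocessing import shared_memory

shm = shared_memory.SharedMemory(create=True, size=2**20)  # 1 MiB
try:
    # Rank 0 side: serialise [method_name, *args], prefix a 4-byte length.
    data = pickle.dumps(["run", [1, 2, 3], True])
    shm.buf[0:4] = len(data).to_bytes(4, "little")
    shm.buf[4:4 + len(data)] = data

    # Worker side: read the length, then unpickle the payload.
    n = int.from_bytes(shm.buf[0:4], "little")
    method_name, *args = pickle.loads(shm.buf[4:4 + n])
finally:
    shm.close()
    shm.unlink()

print(method_name, args)  # run [[1, 2, 3], True]
```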
The shared memory channel carries only control messages (method name plus sequence metadata). The heavy lifting, the tensor computation itself and the all-reduce collectives that tensor-parallel layers require, happens directly between GPUs via NCCL and never touches the shared memory channel.
Prefill Execution Path
prepare_prefill() builds variable-length input tensors for flash_attn_varlen_func. Each sequence contributes only its uncached tokens to the query, but the full sequence length to the key/value side (to allow attending over cached prefix blocks):
```python
def prepare_prefill(self, seqs: list[Sequence]):
    input_ids = []
    positions = []
    cu_seqlens_q = [0]
    cu_seqlens_k = [0]
    slot_mapping = []
    block_tables = None
    for seq in seqs:
        seqlen = len(seq)
        input_ids.extend(seq[seq.num_cached_tokens:])
        positions.extend(list(range(seq.num_cached_tokens, seqlen)))
        seqlen_q = seqlen - seq.num_cached_tokens
        seqlen_k = seqlen
        cu_seqlens_q.append(cu_seqlens_q[-1] + seqlen_q)
        cu_seqlens_k.append(cu_seqlens_k[-1] + seqlen_k)
        ...
    if cu_seqlens_k[-1] > cu_seqlens_q[-1]:  # prefix cache
        block_tables = self.prepare_block_tables(seqs)
```
The attention layer (Attention.forward) dispatches to flash_attn_varlen_func with the cumulative sequence length tensors and an optional block_table for prefix-cached sequences.
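For concreteness, the cumulative-length bookkeeping can be walked through in pure Python for two hypothetical sequences, one of which has a 256-token cached prefix:

```python
# Walk-through of the cu_seqlens loop for two (length, num_cached_tokens)
# pairs. The query side counts only uncached tokens; the key/value side
# counts the full sequence so attention can reach the cached prefix.
seqs = [(300, 256), (120, 0)]

cu_seqlens_q, cu_seqlens_k = [0], [0]
for seqlen, num_cached in seqs:
    cu_seqlens_q.append(cu_seqlens_q[-1] + (seqlen - num_cached))
    cu_seqlens_k.append(cu_seqlens_k[-1] + seqlen)

print(cu_seqlens_q)  # [0, 44, 164]
print(cu_seqlens_k)  # [0, 300, 420]
# cu_seqlens_k[-1] > cu_seqlens_q[-1], so block tables are built (prefix cache)
```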
Decode Execution Path
prepare_decode() is much simpler: one token per sequence, a scalar position, and a slot mapping that points to the single new KV slot:
```python
def prepare_decode(self, seqs: list[Sequence]):
    input_ids = []
    positions = []
    slot_mapping = []
    context_lens = []
    for seq in seqs:
        input_ids.append(seq.last_token)
        positions.append(len(seq) - 1)
        context_lens.append(len(seq))
        slot_mapping.append(
            seq.block_table[-1] * self.block_size + seq.last_block_num_tokens - 1
        )
```
The attention layer then calls flash_attn_with_kvcache with the block table and per-sequence context lengths.
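The slot arithmetic above maps the newest token to a flat index in the KV cache: the sequence's last physical block number times the block size, plus the token's offset within that block. With illustrative numbers:

```python
# Slot index for the newest decoded token. Values are illustrative.
block_size = 256
last_block_id = 5           # seq.block_table[-1]: last physical block
last_block_num_tokens = 3   # tokens in that block, including the new one

slot = last_block_id * block_size + last_block_num_tokens - 1
print(slot)  # 1282: the 3rd slot (offset 2) of physical block 5
```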
CUDA Graph Capture
CUDA graphs eliminate CPU-side kernel launch overhead for decode, which dominates at small batch sizes. capture_cudagraph() iterates from largest to smallest batch size so that the first graph establishes the memory pool that all subsequent graphs share:
```python
@torch.inference_mode()
def capture_cudagraph(self):
    config = self.config
    hf_config = config.hf_config
    max_bs = min(self.config.max_num_seqs, 512)
    max_num_blocks = (config.max_model_len + self.block_size - 1) // self.block_size
    input_ids = torch.zeros(max_bs, dtype=torch.int64)
    positions = torch.zeros(max_bs, dtype=torch.int64)
    slot_mapping = torch.zeros(max_bs, dtype=torch.int32)
    context_lens = torch.zeros(max_bs, dtype=torch.int32)
    block_tables = torch.zeros(max_bs, max_num_blocks, dtype=torch.int32)
    outputs = torch.zeros(max_bs, hf_config.hidden_size)
    self.graph_bs = [1, 2, 4, 8] + list(range(16, max_bs + 1, 16))
    self.graphs = {}
    self.graph_pool = None
    for bs in reversed(self.graph_bs):
        graph = torch.cuda.CUDAGraph()
        set_context(False, slot_mapping=slot_mapping[:bs],
                    context_lens=context_lens[:bs],
                    block_tables=block_tables[:bs])
        outputs[:bs] = self.model(input_ids[:bs], positions[:bs])  # warmup
        with torch.cuda.graph(graph, self.graph_pool):
            outputs[:bs] = self.model(input_ids[:bs], positions[:bs])  # capture
        if self.graph_pool is None:
            self.graph_pool = graph.pool()
        self.graphs[bs] = graph
        torch.cuda.synchronize()
        reset_context()
```
At inference time, run_model() finds the smallest captured batch size that accommodates the current request count, copies inputs into the pre-allocated graph tensors, and replays:
```python
@torch.inference_mode()
def run_model(self, input_ids: torch.Tensor, positions: torch.Tensor, is_prefill: bool):
    if is_prefill or self.enforce_eager or input_ids.size(0) > 512:
        return self.model.compute_logits(self.model(input_ids, positions))
    else:
        bs = input_ids.size(0)
        context = get_context()
        graph = self.graphs[next(x for x in self.graph_bs if x >= bs)]
        graph_vars = self.graph_vars
        graph_vars["input_ids"][:bs] = input_ids
        graph_vars["positions"][:bs] = positions
        graph_vars["slot_mapping"].fill_(-1)
        graph_vars["slot_mapping"][:bs] = context.slot_mapping
        graph_vars["context_lens"].zero_()
        graph_vars["context_lens"][:bs] = context.context_lens
        graph_vars["block_tables"][:bs, :context.block_tables.size(1)] = context.block_tables
        graph.replay()
        return self.model.compute_logits(graph_vars["outputs"][:bs])
```
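The bucketing expression `next(x for x in self.graph_bs if x >= bs)` pads every decode batch up to the smallest captured graph that covers it. Isolated as a pure-Python sketch with the same batch-size list:

```python
# Decode batch-size bucketing: pick the smallest captured graph size that
# covers the live batch; inputs are padded up to that size before replay.
graph_bs = [1, 2, 4, 8] + list(range(16, 512 + 1, 16))

def pick_graph_bs(bs: int) -> int:
    return next(x for x in graph_bs if x >= bs)

print(pick_graph_bs(3))    # 4
print(pick_graph_bs(17))   # 32
print(pick_graph_bs(512))  # 512
```

Padding wastes a few rows of compute but lets a small, fixed set of graphs cover every batch size up to 512.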
Three conditions bypass graph replay and fall back to eager execution:
- The batch is a prefill step (is_prefill=True).
- enforce_eager=True was set in the config.
- The batch size exceeds 512 (no graph was captured for it).
Use enforce_eager=True when debugging or when running on hardware where CUDA graph capture is unreliable. It removes the graph replay path entirely and runs every step through the standard PyTorch eager executor, at the cost of higher per-step latency for decode.
run() Entry Point
The public method called by LLMEngine.step() ties the two paths together:
```python
def run(self, seqs: list[Sequence], is_prefill: bool) -> list[int]:
    input_ids, positions = (
        self.prepare_prefill(seqs) if is_prefill else self.prepare_decode(seqs)
    )
    temperatures = self.prepare_sample(seqs) if self.rank == 0 else None
    logits = self.run_model(input_ids, positions, is_prefill)
    token_ids = self.sampler(logits, temperatures).tolist() if self.rank == 0 else None
    reset_context()
    return token_ids
```
Only rank 0 collects temperatures and runs the sampler; worker ranks execute the model in lock-step but discard their logits. Rank 0 returns the sampled token IDs to the scheduler.
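Rank 0's sampling step can be sketched on the CPU. nano-vLLM's actual Sampler is a batched GPU module; the per-row function below, and its convention that temperature 0 means greedy argmax, are assumptions of this sketch:

```python
# Illustrative CPU stand-in for per-sequence temperature sampling.
import math
import random

def sample_row(logits: list[float], temperature: float) -> int:
    if temperature == 0.0:
        # Greedy: pick the highest-logit token deterministically.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature-scaled softmax, sampled with random.choices.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return random.choices(range(len(logits)), weights=weights, k=1)[0]

print(sample_row([0.1, 2.0, -1.0], 0.0))  # 1 (greedy picks the max logit)
```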