Component Map
Components
LLMEngine
LLMEngine is the central orchestrator. On construction it spawns one worker process per tensor-parallel rank beyond rank 0, creates the ModelRunner for rank 0 in the main process, loads the tokenizer, and wires up the Scheduler. The generate() loop repeatedly calls step(), which in turn calls scheduler.schedule(), dispatches to the model runner, and hands new token IDs back to scheduler.postprocess().
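The step loop described above can be sketched as follows. This is a hypothetical, simplified outline: the constructor wiring (worker spawning, tokenizer loading, rank-0 ModelRunner creation) is elided, and the scheduler/runner interfaces shown here are illustrative, not the real signatures.

```python
# Illustrative sketch of the generate()/step() loop; interfaces are
# simplified stand-ins for the real Scheduler and ModelRunner.
class LLMEngine:
    def __init__(self, scheduler, model_runner):
        self.scheduler = scheduler
        self.model_runner = model_runner

    def step(self):
        # Ask the scheduler for a prefill or decode batch, run it,
        # and hand the new token IDs back for postprocessing.
        seqs, is_prefill = self.scheduler.schedule()
        token_ids = self.model_runner.run(seqs, is_prefill)
        self.scheduler.postprocess(seqs, token_ids)
        return [seq for seq in seqs if seq.finished]

    def generate(self):
        # Drive step() until every sequence has finished.
        finished = []
        while not self.scheduler.is_finished():
            finished.extend(self.step())
        return finished
```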
Scheduler
Scheduler maintains two deque objects — waiting and running — and decides each step whether to run a prefill batch or a decode batch. It delegates memory decisions to BlockManager and handles preemption when GPU memory is exhausted. See Scheduler for details.
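A toy sketch of the two-queue policy: prefill takes priority, and only when no waiting sequences can be admitted does the scheduler assemble a decode batch. The real schedule() also enforces a token budget and consults BlockManager before admitting a sequence; those checks are omitted here, and max_num_seqs is an illustrative parameter name.

```python
from collections import deque

# Toy two-queue scheduler: prefill-first, decode otherwise.
class Scheduler:
    def __init__(self, max_num_seqs=8):
        self.waiting = deque()
        self.running = deque()
        self.max_num_seqs = max_num_seqs

    def schedule(self):
        # Prefer a prefill batch: admit waiting sequences first.
        batch = []
        while self.waiting and len(batch) < self.max_num_seqs:
            seq = self.waiting.popleft()
            self.running.append(seq)
            batch.append(seq)
        if batch:
            return batch, True   # is_prefill=True
        # Otherwise decode one token for every running sequence.
        return list(self.running), False
```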
BlockManager
BlockManager owns the KV cache address space. It tracks free and used block IDs, assigns physical blocks to sequences at prefill time, shares blocks across sequences that share a common prefix via hash-based deduplication, and releases blocks when a sequence finishes or is preempted. See KV Cache Management for details.
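The allocation and sharing logic can be illustrated with a toy version. Note one deliberate simplification: this sketch hashes each full block of tokens independently, whereas a real implementation chains each block's hash with its prefix so that identical token chunks at different positions are never confused. Block size and method names here are illustrative.

```python
# Toy sketch of hash-based block reuse with reference counting.
class BlockManager:
    def __init__(self, num_blocks, block_size=4):
        self.block_size = block_size
        self.free_ids = list(range(num_blocks))
        self.ref_count = {}    # block_id -> number of sequences using it
        self.hash_to_id = {}   # hash of a full token chunk -> block_id

    def allocate(self, token_ids):
        """Return a block table for a prompt, reusing hashed full blocks."""
        table = []
        for i in range(0, len(token_ids), self.block_size):
            chunk = tuple(token_ids[i:i + self.block_size])
            # Only full blocks are cached; a partial tail block is not.
            h = hash(chunk) if len(chunk) == self.block_size else None
            if h is not None and h in self.hash_to_id:
                bid = self.hash_to_id[h]      # shared prefix: reuse block
                self.ref_count[bid] += 1
            else:
                bid = self.free_ids.pop()
                self.ref_count[bid] = 1
                if h is not None:
                    self.hash_to_id[h] = bid
            table.append(bid)
        return table

    def free(self, table):
        # Release on finish/preemption; a block returns to the free list
        # only when its last user is gone.
        for bid in table:
            self.ref_count[bid] -= 1
            if self.ref_count[bid] == 0:
                self.free_ids.append(bid)
```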
ModelRunner
ModelRunner runs on each GPU rank. It initializes the NCCL process group, loads the model weights, runs a warm-up pass to measure peak memory, allocates the physical KV cache tensor, and (unless enforce_eager=True) captures a set of CUDA graphs for decode batch sizes [1, 2, 4, 8, 16, 32, …, 512]. See Model Runner for details.
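Because graphs are captured only for fixed batch sizes, a decode batch is typically padded up to the nearest captured size at replay time. A minimal sketch of that lookup, assuming the captured sizes are the powers of two listed above (the helper name is illustrative):

```python
import bisect

# Captured decode batch sizes, per the list above (powers of two up to 512).
CAPTURED_SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]

def graph_size_for(batch_size):
    """Pick the smallest captured CUDA-graph batch size that fits."""
    i = bisect.bisect_left(CAPTURED_SIZES, batch_size)
    if i == len(CAPTURED_SIZES):
        raise ValueError("batch too large for any captured graph")
    return CAPTURED_SIZES[i]
```

A batch of 3 sequences would replay the size-4 graph, with the padding slots ignored when results are read back.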
Sequence
Sequence is the per-request data object. It holds the full list of token IDs (prompt + completion), a block_table mapping logical block indices to physical KV cache block IDs, a status flag (WAITING / RUNNING / FINISHED), and sampling parameters.
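A minimal sketch of this data object, assuming a dataclass layout; fields beyond those named in the text (num_prompt_tokens, temperature, max_tokens) are illustrative stand-ins for the sampling parameters.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class SequenceStatus(Enum):
    WAITING = auto()
    RUNNING = auto()
    FINISHED = auto()

# Illustrative per-request object: token IDs, block table, status,
# and sampling parameters.
@dataclass
class Sequence:
    token_ids: list                 # prompt + completion token IDs
    num_prompt_tokens: int
    block_table: list = field(default_factory=list)   # logical -> physical block IDs
    status: SequenceStatus = SequenceStatus.WAITING
    temperature: float = 1.0        # sampling parameters (illustrative)
    max_tokens: int = 64

    @property
    def num_completion_tokens(self):
        return len(self.token_ids) - self.num_prompt_tokens
```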
Qwen3ForCausalLM & Sampler
Qwen3ForCausalLM is the transformer model. Its forward() pass returns hidden states; compute_logits() projects them through the language model head. The Sampler turns logits into the next token ID using per-sequence temperatures.
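A pure-Python sketch of temperature sampling over one logits row. Treating temperature 0 as greedy argmax is a common convention assumed here, not something the text specifies, and the real Sampler operates on batched GPU tensors rather than Python lists.

```python
import math
import random

def sample(logits, temperature, rng=random):
    """Sample a token index from one row of logits at the given temperature."""
    if temperature == 0.0:
        # Assumed convention: temperature 0 means greedy argmax.
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    r = rng.random() * total                 # inverse-CDF sampling
    for i, e in enumerate(exps):
        r -= e
        if r <= 0:
            return i
    return len(logits) - 1
```

Higher temperatures flatten the distribution; lower temperatures concentrate mass on the largest logit.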
Request Lifecycle
add_request
LLMEngine.add_request() tokenizes the prompt string (or accepts a pre-tokenized list) and wraps it in a Sequence with status WAITING. The sequence is pushed onto Scheduler.waiting.
schedule
On the next call to step(), Scheduler.schedule() attempts a prefill batch first. It pops sequences from waiting, checks the token budget (max_num_batched_tokens) and block availability, calls BlockManager.allocate() for each sequence, marks them RUNNING, and returns them together with is_prefill=True. If no waiting sequences are ready, it assembles a decode batch from running.
run
ModelRunner.call("run", seqs, is_prefill) prepares input tensors — token IDs, positions, slot mappings, block tables — and executes the model. For prefill it calls the model directly via flash_attn_varlen_func; for decode it replays a captured CUDA graph (unless enforce_eager=True).
postprocess
Scheduler.postprocess() appends the new token ID to each sequence and checks stop conditions: EOS token match (unless ignore_eos) or reaching max_tokens. Finished sequences are marked FINISHED and their KV cache blocks are released.
Detailed Pages
Scheduler
Prefill and decode scheduling, preemption strategy, and the postprocess loop.
KV Cache Management
Block allocation, hash-based prefix caching, and reference counting.
Model Runner
GPU setup, tensor parallelism, CUDA graph capture, and inference execution paths.