The Scheduler sits between `LLMEngine` and `BlockManager`. Each call to `schedule()` returns a list of sequences and a boolean flag indicating whether the batch is a prefill or a decode step. The engine drives this loop without any concept of request priority — sequences are served in arrival order.
Queues
The scheduler maintains exactly two queues:

| Queue | Type | Contents |
|---|---|---|
| `waiting` | `deque[Sequence]` | Sequences not yet assigned KV cache blocks |
| `running` | `deque[Sequence]` | Sequences with allocated blocks, actively being decoded |
New sequences enter `waiting` via `Scheduler.add()`. They graduate to `running` when a prefill batch is assembled. Preempted sequences move back from `running` to the front of `waiting`.
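The queue movement above can be sketched as follows. This is a minimal illustration of the two-queue structure, not the real class (which carries far more state); the method names `add` and `preempt` follow the text.

```python
from collections import deque

class Scheduler:
    """Minimal sketch of the two-queue structure described above."""

    def __init__(self):
        self.waiting = deque()  # sequences without KV cache blocks
        self.running = deque()  # sequences with allocated blocks

    def add(self, seq):
        # New sequences always join at the back of waiting.
        self.waiting.append(seq)

    def preempt(self, seq):
        # Preempted sequences go to the FRONT of waiting, so they are
        # re-prefilled before any newly arrived sequence.
        self.running.remove(seq)
        self.waiting.appendleft(seq)

s = Scheduler()
s.add("a")
s.add("b")
s.running.append("c")
s.preempt("c")  # "c" jumps ahead of "a" and "b"
```

The `appendleft` in `preempt` is what gives preempted work priority over fresh arrivals.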
Scheduling Logic
Nano-vLLM uses continuous batching: every call to `step()` assembles a fresh batch. There are no fixed batch slots. Sequences that finish mid-flight free their slots immediately, and new sequences can fill them on the very next step.

Prefill Phase
Prefill is attempted first. The scheduler walks `waiting` from the front and greedily accumulates sequences until either the token budget or the sequence count limit is reached, or `BlockManager` cannot provide enough free blocks.
A sequence is admitted only if all three checks pass:

- `num_batched_tokens + len(seq) <= max_num_batched_tokens` — prevents a single step from processing more tokens than the model was warmed up for (default 16 384).
- `block_manager.can_allocate(seq)` — verifies that `len(free_block_ids) >= seq.num_blocks`.
- `num_seqs < max_num_seqs` — caps concurrency (default 512).
`num_batched_tokens` is charged only for the uncached portion of the prompt (`len(seq) - seq.num_cached_tokens`), so prefix cache hits reduce the effective cost of admitting a sequence.
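The admission loop with all three checks and the prefix-cache discount can be sketched like this. `FakeSeq` and `FakeBlockManager` are hypothetical stand-ins modeling only the fields the loop touches; the real classes differ.

```python
from collections import deque

class FakeSeq:
    """Hypothetical stand-in for the real Sequence object."""
    def __init__(self, length, cached=0):
        self._length = length
        self.num_cached_tokens = cached  # tokens already resident via prefix cache
    def __len__(self):
        return self._length

class FakeBlockManager:
    """Hypothetical stand-in: pretends free blocks are always available."""
    def can_allocate(self, seq):
        return True
    def allocate(self, seq):
        pass

def schedule_prefill(waiting, running, block_manager,
                     max_num_batched_tokens=16384, max_num_seqs=512):
    scheduled = []
    num_batched_tokens = 0
    while waiting and len(running) < max_num_seqs:
        seq = waiting[0]
        if num_batched_tokens + len(seq) > max_num_batched_tokens:
            break  # token budget exhausted
        if not block_manager.can_allocate(seq):
            break  # not enough free KV cache blocks
        # Charge only the uncached portion of the prompt.
        num_batched_tokens += len(seq) - seq.num_cached_tokens
        block_manager.allocate(seq)
        running.append(waiting.popleft())
        scheduled.append(seq)
    return scheduled

waiting = deque([FakeSeq(100), FakeSeq(200, cached=50)])
running = deque()
admitted = schedule_prefill(waiting, running, FakeBlockManager())
```

Note that the budget comparison still uses the full `len(seq)`, while the charge afterwards subtracts `num_cached_tokens` — cached prompts make room for more admissions later in the loop.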
Decode Phase
If no sequences were admitted for prefill, the scheduler builds a decode batch from `running`. Each sequence needs one new KV cache slot (one token per step). `can_append()` returns `True` when no new block is needed, or when at least one free block exists for the single case where the current block is full:
`len(seq) % self.block_size == 1` evaluates to `True` (which compares equal to 1) exactly when the sequence length is one past a block boundary — i.e., the latest token is the first token of a new block — and `False` (0) otherwise.
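The bool-as-integer trick can be isolated into a small standalone check. This is a sketch over plain integers, not the real method signature:

```python
def can_append(seq_len: int, block_size: int, num_free_blocks: int) -> bool:
    # True compares equal to 1 and False to 0, so the boolean doubles as
    # the number of new blocks this decode step requires.
    blocks_needed = seq_len % block_size == 1
    return num_free_blocks >= blocks_needed

# With block_size = 4: length 6 sits mid-block (no new block needed),
# length 5 is one past a boundary (one new block needed).
```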
Preemption
When a decode sequence needs a new block but none are free, the scheduler preempts sequences from the back of `running` (youngest first) until enough blocks are recovered. The preempted sequence's blocks are deallocated and it is pushed to the front of `waiting` so it will be re-prefilled on the next available step.
The `while not can_append` loop handles the edge case where the sequence being scheduled is itself the one that needs preempting:

The `while`/`else` construct (Python runs the `else` block only when the loop exits without hitting `break`) ensures the `scheduled_seqs.append(seq)` path is only taken when `can_append` eventually returns `True`.
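The preemption control flow, including the self-preemption edge case and the `while`/`else`, can be sketched as below. `FakeBlockManager` is a hypothetical fixture tracking only a free-block counter — just enough to exercise the branches.

```python
from collections import deque

class FakeBlockManager:
    """Hypothetical fixture: only tracks a free-block count."""
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks
    def can_append(self, seq):
        return self.free_blocks >= 1
    def deallocate(self, seq):
        self.free_blocks += 1  # the victim's blocks return to the pool

def schedule_decode(running, waiting, block_manager):
    scheduled_seqs = []
    while running:
        seq = running.popleft()
        while not block_manager.can_append(seq):
            if running:
                # Evict the youngest running sequence to recover blocks.
                victim = running.pop()
                block_manager.deallocate(victim)
                waiting.appendleft(victim)
            else:
                # No other candidates: the sequence preempts itself.
                block_manager.deallocate(seq)
                waiting.appendleft(seq)
                break
        else:
            # while/else: reached only if the loop ended without break,
            # i.e. can_append eventually returned True.
            scheduled_seqs.append(seq)
    return scheduled_seqs

running = deque(["a", "b", "c"])
waiting = deque()
bm = FakeBlockManager(0)
decoded = schedule_decode(running, waiting, bm)
```

With zero free blocks, "c" (the youngest) is evicted to make room, after which "a" and "b" both decode.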
Postprocessing
After the model runner returns a token ID for each sequence in the batch, `postprocess()` appends it and evaluates stop conditions:
- EOS match — the generated token equals `tokenizer.eos_token_id` and `ignore_eos` is `False`.
- Length limit — `num_completion_tokens == max_tokens` (from `SamplingParams`).
When either condition is met, the sequence is marked `FINISHED`, its KV cache blocks are returned to the free pool, and it is removed from `running`. The engine's `step()` then surfaces it as a completed output.
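The two stop conditions can be sketched as a predicate over a hypothetical sequence object. `FakeSeq` and its field names follow the text (`ignore_eos`, `max_tokens`, `num_completion_tokens`) but are assumptions, not the real `Sequence` class.

```python
class FakeSeq:
    """Hypothetical sequence holding only the fields the check needs."""
    def __init__(self, ignore_eos=False, max_tokens=4):
        self.ignore_eos = ignore_eos
        self.max_tokens = max_tokens
        self.completion_token_ids = []

    @property
    def num_completion_tokens(self):
        return len(self.completion_token_ids)

    def append_token(self, token_id):
        self.completion_token_ids.append(token_id)

def postprocess(seq, token_id, eos_token_id):
    """Append the sampled token and report whether the sequence finished."""
    seq.append_token(token_id)
    if not seq.ignore_eos and token_id == eos_token_id:
        return True  # EOS match
    if seq.num_completion_tokens == seq.max_tokens:
        return True  # length limit
    return False

seq = FakeSeq(max_tokens=4)
hit_eos = postprocess(seq, token_id=2, eos_token_id=2)

capped = FakeSeq(ignore_eos=True, max_tokens=2)
postprocess(capped, token_id=2, eos_token_id=2)   # EOS suppressed by ignore_eos
hit_cap = postprocess(capped, token_id=2, eos_token_id=2)
```

Note the ordering: the token is appended before the checks, so the EOS token itself counts toward `num_completion_tokens`.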