The Scheduler sits between LLMEngine and BlockManager. Each call to schedule() returns a list of sequences and a boolean flag indicating whether the batch is a prefill or a decode step. The engine drives this loop without any concept of request priority — sequences are served in arrival order.

Queues

The scheduler maintains exactly two queues:
  • waiting (deque[Sequence]): sequences not yet assigned KV cache blocks.
  • running (deque[Sequence]): sequences with allocated blocks, actively being decoded.
New sequences enter waiting via Scheduler.add(). They graduate to running when a prefill batch is assembled. Preempted sequences move back from running to the front of waiting.
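The two-queue flow can be sketched with plain deques (illustrative sequence IDs, not the real Scheduler class):

```python
from collections import deque

waiting = deque()   # sequences awaiting KV cache allocation
running = deque()   # sequences with allocated blocks

waiting.append("seq_a")          # Scheduler.add(): new arrivals join the back
waiting.append("seq_b")

seq = waiting.popleft()          # prefill admits from the front (FIFO)
running.append(seq)

preempted = running.pop()        # preemption evicts the youngest running seq
waiting.appendleft(preempted)    # ...and re-queues it at the FRONT of waiting
```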

Scheduling Logic

Nano-vLLM uses continuous batching: every call to step() assembles a fresh batch. There are no fixed batch slots. Sequences that finish mid-flight free their slots immediately, and new sequences can fill them on the very next step.
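A toy model of this behavior (sequence IDs, token counts, and the cap of 2 are all made up) shows a slot freed by a finished sequence being refilled on the very next step:

```python
from collections import deque

# Each entry is (seq_id, tokens still to generate).
waiting = deque([("a", 2), ("b", 3), ("c", 1)])
running = []
max_num_seqs = 2
steps = []          # which sequences ran on each step

while waiting or running:
    while waiting and len(running) < max_num_seqs:   # fill freed slots
        running.append(list(waiting.popleft()))
    for seq in running:
        seq[1] -= 1                                  # one decode step each
    steps.append([sid for sid, _ in running])
    running = [s for s in running if s[1] > 0]       # finished seqs free slots now
```

Sequence "c" never waits for a fixed batch to drain; it joins as soon as "a" finishes.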

Prefill Phase

Prefill is attempted first. The scheduler walks waiting from the front and greedily accumulates sequences until either the token budget or the sequence count limit is reached, or BlockManager cannot provide enough free blocks.
def schedule(self) -> tuple[list[Sequence], bool]:
    # prefill
    scheduled_seqs = []
    num_seqs = 0
    num_batched_tokens = 0
    while self.waiting and num_seqs < self.max_num_seqs:
        seq = self.waiting[0]
        if num_batched_tokens + len(seq) > self.max_num_batched_tokens \
                or not self.block_manager.can_allocate(seq):
            break
        num_seqs += 1
        self.block_manager.allocate(seq)
        num_batched_tokens += len(seq) - seq.num_cached_tokens
        seq.status = SequenceStatus.RUNNING
        self.waiting.popleft()
        self.running.append(seq)
        scheduled_seqs.append(seq)
    if scheduled_seqs:
        return scheduled_seqs, True
Key constraints checked before admitting each sequence:
  • num_batched_tokens + len(seq) <= max_num_batched_tokens — prevents a single step from processing more tokens than the model was warmed up for (default 16 384).
  • block_manager.can_allocate(seq) — verifies that len(free_block_ids) >= seq.num_blocks.
  • num_seqs < max_num_seqs — caps concurrency (default 512).
Note that num_batched_tokens is charged only for the uncached portion of the prompt (len(seq) - seq.num_cached_tokens), so prefix cache hits reduce the effective cost of admitting a sequence.
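A worked example of that accounting (the prompt and cache sizes are illustrative; only the 16 384 default comes from the constraints above):

```python
max_num_batched_tokens = 16_384   # default token budget per step

prompt_len = 4_096                # len(seq)
num_cached_tokens = 3_072         # prefix already present in the KV cache

# The admission check compares against the FULL prompt length...
fits = 0 + prompt_len <= max_num_batched_tokens
# ...but the budget is charged only for the uncached suffix.
charged = prompt_len - num_cached_tokens
```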

Decode Phase

If no sequences were admitted for prefill, the scheduler builds a decode batch from running. Each sequence needs one new KV cache slot (one token per step). can_append() returns True when no new block is needed, or when at least one free block exists for the one case where the current block is full:
def can_append(self, seq: Sequence) -> bool:
    return len(self.free_block_ids) >= (len(seq) % self.block_size == 1)
The comparison len(seq) % self.block_size == 1 evaluates to True exactly when the sequence length is one past a block boundary, i.e. when the latest token is the first in a new block, and False otherwise. Because Python booleans compare as integers, the check requires at least one free block in that case and zero free blocks otherwise.
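The predicate can be exercised in isolation (a block_size of 256 is an assumed value here):

```python
block_size = 256

def needs_new_block(seq_len: int) -> bool:
    # True exactly when seq_len is one past a block boundary,
    # i.e. the latest KV slot is the first token of a fresh block.
    return seq_len % block_size == 1
```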
    # decode
    while self.running and num_seqs < self.max_num_seqs:
        seq = self.running.popleft()
        while not self.block_manager.can_append(seq):
            if self.running:
                self.preempt(self.running.pop())
            else:
                self.preempt(seq)
                break
        else:
            num_seqs += 1
            self.block_manager.may_append(seq)
            scheduled_seqs.append(seq)
    assert scheduled_seqs
    self.running.extendleft(reversed(scheduled_seqs))
    return scheduled_seqs, False
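The extendleft(reversed(...)) idiom on the last line is worth unpacking: extendleft prepends items one at a time, which reverses their order, so reversing the input first restores the original relative order at the front of running:

```python
from collections import deque

running = deque(["c", "d"])          # sequences not scheduled this step
scheduled = ["a", "b"]               # sequences just decoded

# extendleft appends "b" then "a" to the LEFT, so the reversed input
# lands back in its original order ahead of the unscheduled sequences.
running.extendleft(reversed(scheduled))
```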

Preemption

When a decode sequence needs a new block but none are free, the scheduler preempts sequences from the back of running (youngest first) until enough blocks are recovered. The preempted sequence’s blocks are deallocated and it is pushed to the front of waiting so it will be re-prefilled on the next available step.
def preempt(self, seq: Sequence):
    seq.status = SequenceStatus.WAITING
    self.block_manager.deallocate(seq)
    self.waiting.appendleft(seq)
Preemption discards the sequence’s KV cache. When the sequence is re-admitted it will be re-prefilled from the beginning (or from the furthest prefix cache hit). Long decode runs under heavy concurrency can trigger repeated preemption cycles.
The inner while not can_append loop handles the edge case where the sequence being scheduled is itself the one that needs preempting:
while not self.block_manager.can_append(seq):
    if self.running:
        self.preempt(self.running.pop())   # preempt youngest other sequence
    else:
        self.preempt(seq)                  # no other choice: preempt self
        break
The while/else construct ensures the scheduled_seqs.append(seq) path is only taken when the loop exits normally, i.e. when can_append eventually returns True; a break (self-preemption) skips the else branch entirely.
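A self-contained toy of that control flow (the block accounting and names here are illustrative, not the real BlockManager):

```python
free_blocks = 0
others = ["young_seq"]        # other running sequences (eviction candidates)
scheduled = []

def can_append() -> bool:
    return free_blocks >= 1   # one block is enough for the next token

while not can_append():
    if others:
        others.pop()          # preempt youngest other sequence...
        free_blocks += 1      # ...returning its blocks to the free pool
    else:
        break                 # preempt self; the else branch below is skipped
else:
    scheduled.append("seq")   # reached only when can_append() returned True
```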

Postprocessing

After the model runner returns a token ID for each sequence in the batch, postprocess() appends it and evaluates stop conditions:
def postprocess(self, seqs: list[Sequence], token_ids: list[int]) -> list[bool]:
    for seq, token_id in zip(seqs, token_ids):
        seq.append_token(token_id)
        if (not seq.ignore_eos and token_id == self.eos) \
                or seq.num_completion_tokens == seq.max_tokens:
            seq.status = SequenceStatus.FINISHED
            self.block_manager.deallocate(seq)
            self.running.remove(seq)
Two stop conditions are evaluated:
  1. EOS match: the generated token equals tokenizer.eos_token_id and ignore_eos is False.
  2. Length limit: num_completion_tokens == max_tokens (from SamplingParams).
When either fires the sequence is marked FINISHED, its KV cache blocks are returned to the free pool, and it is removed from running. The engine’s step() then surfaces it as a completed output.
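The stop logic can be distilled into a standalone predicate (parameter names mirror the snippet above; the EOS id of 2 is an assumed value):

```python
EOS = 2   # illustrative; in practice this is tokenizer.eos_token_id

def is_finished(token_id: int, ignore_eos: bool,
                num_completion_tokens: int, max_tokens: int) -> bool:
    # Condition 1: EOS generated and the sequence is allowed to stop on it.
    # Condition 2: the completion has hit its length limit.
    return (not ignore_eos and token_id == EOS) \
        or num_completion_tokens == max_tokens
```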
