Scheduler

Scheduler decides which sequences to run on each forward pass and manages the allocation of KV cache blocks through BlockManager. It maintains two queues, waiting and running, and on each step builds either a prefill batch (promoting sequences from waiting) or a decode batch (stepping running sequences by one token), with prefill taking priority.

Constructor

Scheduler(
    max_num_sequences: int,
    max_num_batched_tokens: int,
    max_cached_blocks: int,
    block_size: int,
    eos: int,
)

max_num_sequences

int

required

Maximum number of sequences that can be scheduled in a single batch, whether during prefill or decode.

max_num_batched_tokens

int

required

Maximum total tokens across all sequences in one batch. Prefill counts the full prompt length; decode counts one token per sequence.
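As a rough illustration of the counting rule above, the sketch below computes how many tokens a batch contributes toward max_num_batched_tokens. The helper name `batch_token_count` is hypothetical and is not part of the Scheduler API.

```python
def batch_token_count(prompt_lengths: list[int], is_prefill: bool) -> int:
    """Tokens a batch counts toward max_num_batched_tokens (illustrative only)."""
    if is_prefill:
        # Prefill processes every prompt token of every sequence.
        return sum(prompt_lengths)
    # Decode steps each sequence by exactly one token.
    return len(prompt_lengths)


lengths = [120, 200, 180]
prefill_tokens = batch_token_count(lengths, is_prefill=True)
decode_tokens = batch_token_count(lengths, is_prefill=False)
```

With a budget of 512, these three prompts fit in a single prefill batch, while the same three sequences cost only three tokens per decode step.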

max_cached_blocks

int

required

Total number of KV cache blocks available in the pool. Passed directly to BlockManager. At runtime this is overridden by ModelRunner.allocate_kv_cache() to the actual GPU-measured value.

block_size

int

required

Number of tokens per KV cache block. Must match the value used by ModelRunner.
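To see how block_size relates to the block pool, the sketch below assumes a sequence of n tokens occupies ceil(n / block_size) blocks. This is the usual accounting for a paged KV cache, but it is an assumption here; the helper `blocks_needed` is hypothetical and BlockManager's exact bookkeeping may differ.

```python
def blocks_needed(num_tokens: int, block_size: int) -> int:
    """Ceiling division: blocks required to hold num_tokens (assumed model)."""
    return -(-num_tokens // block_size)


short_prompt = blocks_needed(4, 16)    # fits in one partially filled block
exact_fit = blocks_needed(16, 16)      # fills one block exactly
spill_over = blocks_needed(17, 16)     # one extra token needs a second block
```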

eos

int

required

EOS token ID used to detect end-of-sequence during postprocessing.

Queue model

The scheduler maintains two deque objects:

waiting — sequences that have been added but not yet allocated KV cache blocks.
running — sequences that own KV cache blocks and are actively being decoded.

On each call to schedule(), the scheduler first tries to promote sequences from waiting to running (prefill). Only if no waiting sequence can be admitted does it then schedule a decode step over the sequences in running.
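The two-queue model can be pictured with a pair of `collections.deque` objects. This is a toy sketch that uses strings as stand-ins for Sequence objects and skips all limit checks.

```python
from collections import deque

waiting = deque()   # added, but no KV cache blocks yet
running = deque()   # own KV cache blocks, being decoded

# New sequences join the back of the waiting queue.
for name in ("seq-a", "seq-b"):
    waiting.append(name)

# Prefill admits from the front of waiting and moves the sequence to running.
admitted = waiting.popleft()
running.append(admitted)
```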

Methods

`add_sequence`

scheduler.add_sequence(sequence: Sequence) -> None

Appends a Sequence object to the end of the waiting queue. Called by LLMEngine.add_prompt() after tokenization.

sequence

Sequence

required

A fully constructed Sequence object. The sequence must already have its token IDs and SamplingParams fields set.

`schedule`

scheduler.schedule() -> tuple[list[Sequence], bool]

Selects sequences for the next batch and returns them together with a flag indicating whether the batch is a prefill or a decode.

Prefill path: the scheduler iterates the waiting queue front-to-back.

Checks BlockManager.can_allocate(seq) and the max_num_batched_tokens / max_num_sequences limits.
Allocates KV blocks, sets seq.status = RUNNING, and moves the sequence to running.
Stops as soon as any limit is hit or the queue is empty.
Returns (scheduled_sequences, True) if any sequences were admitted.
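The prefill steps above can be sketched roughly as follows. This is a simplified model, not the actual implementation: `ToySeq`, `blocks_for`, and `schedule_prefill` are hypothetical names, the block pool is reduced to a single free-block counter, and the ceil(num_tokens / block_size) block cost is an assumption.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ToySeq:
    name: str
    num_tokens: int
    status: str = "WAITING"


def blocks_for(seq, block_size):
    # Assumed cost model: ceil(num_tokens / block_size) blocks per sequence.
    return -(-seq.num_tokens // block_size)


def schedule_prefill(waiting, running, free_blocks, *, max_num_sequences,
                     max_num_batched_tokens, block_size):
    scheduled, batched_tokens = [], 0
    while waiting and len(scheduled) < max_num_sequences:
        seq = waiting[0]                      # peek at the front of the queue
        need = blocks_for(seq, block_size)
        over_budget = batched_tokens + seq.num_tokens > max_num_batched_tokens
        if over_budget or need > free_blocks:
            break                             # stop at the first limit hit
        free_blocks -= need                   # "allocate" KV blocks
        batched_tokens += seq.num_tokens
        seq.status = "RUNNING"
        running.append(waiting.popleft())
        scheduled.append(seq)
    return scheduled, free_blocks


waiting = deque([ToySeq("a", 20), ToySeq("b", 30), ToySeq("c", 40)])
running = deque()
scheduled, free_blocks = schedule_prefill(
    waiting, running, free_blocks=8,
    max_num_sequences=8, max_num_batched_tokens=64, block_size=16,
)
```

Here "a" and "b" are admitted (50 tokens, 4 blocks), while "c" would push the batch to 90 tokens and stays in waiting.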

Decode path: only reached when no sequence was admitted from waiting.

Iterates the running queue and checks BlockManager.can_append(seq).
If a running sequence has no free block for its next token, the scheduler frees space by preemption: the least-recently-scheduled running sequence is moved back to waiting and its blocks are released.
Stops when max_num_batched_tokens or max_num_sequences is reached.
Returns (scheduled_sequences, False).
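A minimal sketch of the decode path, under several assumptions that are not confirmed by the text above: the block pool is a plain counter, a sequence needs a new block only when all of its blocks are full, and the preemption victim is taken from the tail of the running deque as a stand-in for "least recently scheduled". `ToySeq`, `needs_new_block`, and `schedule_decode` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ToySeq:
    name: str
    num_tokens: int   # tokens already held in the KV cache
    num_blocks: int   # blocks currently owned
    status: str = "RUNNING"


def needs_new_block(seq, block_size):
    # Assumed rule: the next token needs a fresh block when all are full.
    return seq.num_tokens == seq.num_blocks * block_size


def schedule_decode(waiting, running, free_blocks, *, max_num_sequences,
                    max_num_batched_tokens, block_size):
    # Decode costs one token per sequence, so both limits cap the batch size.
    limit = min(max_num_sequences, max_num_batched_tokens)
    scheduled = []
    while running and len(scheduled) < limit:
        seq = running.popleft()
        evicted_self = False
        while needs_new_block(seq, block_size) and free_blocks == 0:
            victim = running.pop() if running else seq   # assumed victim choice
            free_blocks += victim.num_blocks             # release its blocks
            victim.num_blocks = 0
            victim.status = "WAITING"
            waiting.appendleft(victim)                   # front of waiting
            if victim is seq:
                evicted_self = True
                break
        if evicted_self:
            break
        if needs_new_block(seq, block_size):
            free_blocks -= 1
            seq.num_blocks += 1
        scheduled.append(seq)
    running.extendleft(reversed(scheduled))  # keep scheduled order at the front
    return scheduled, free_blocks


waiting = deque()
running = deque([ToySeq("a", 4, 1), ToySeq("b", 3, 1), ToySeq("c", 5, 2)])
scheduled, free_blocks = schedule_decode(
    waiting, running, free_blocks=0,
    max_num_sequences=8, max_num_batched_tokens=512, block_size=4,
)
```

In this run "a" needs a new block while none are free, so "c" is preempted and its two blocks are released; "a" takes one of them and "b" proceeds without allocation.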

Return value

scheduled_sequences

list[Sequence]

Sequences selected for this forward pass. Empty when both queues are empty.

is_prefill

bool

True when the batch is a prefill pass; False for a decode pass.

`postprocess`

scheduler.postprocess(seqs: list[Sequence], token_ids: list[int]) -> None

Called after ModelRunner.run() returns sampled token IDs. For each (seq, token_id) pair:

Appends token_id to the sequence’s token list.
Evaluates the three stopping conditions below.
If any condition is met, sets seq.status = FINISHED, deallocates its KV blocks, and removes it from the running queue.

Stopping conditions

| Condition | Triggered when |
| --- | --- |
| EOS token | `token_id == eos` and `seq.ignore_eos` is `False` |
| Max completion tokens | `seq.num_completion_tokens >= seq.max_tokens` |
| Max model length | `seq.max_model_length is not None` and `seq.num_tokens >= seq.max_model_length` |

seqs

list[Sequence]

required

The same list returned by the preceding schedule() call.

token_ids

list[int]

required

One sampled token ID per sequence, in the same order as seqs.
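The three stopping conditions for postprocess can be expressed as a single predicate. The sketch below uses a hypothetical `ToySeq` dataclass carrying only the fields named in the conditions, and assumes the counters already include the token that was just appended.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToySeq:
    num_tokens: int                 # prompt + completion tokens so far
    num_completion_tokens: int      # completion tokens so far
    max_tokens: int
    ignore_eos: bool = False
    max_model_length: Optional[int] = None


def should_stop(seq: ToySeq, token_id: int, eos: int) -> bool:
    """Return True if any of the three stopping conditions holds."""
    if token_id == eos and not seq.ignore_eos:
        return True
    if seq.num_completion_tokens >= seq.max_tokens:
        return True
    if seq.max_model_length is not None and seq.num_tokens >= seq.max_model_length:
        return True
    return False


EOS = 151645
mid_generation = ToySeq(num_tokens=10, num_completion_tokens=3, max_tokens=32)
hit_eos = should_stop(mid_generation, EOS, EOS)
keeps_going = should_stop(mid_generation, 42, EOS)
```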

`preempt`

scheduler.preempt(seq: Sequence) -> None

Moves a running sequence back to the front of the waiting queue. Its KV cache blocks are deallocated so they can be used by other sequences. When the sequence is rescheduled later it will go through prefill again.

seq

Sequence

required

A sequence currently in the running queue.
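The effect of preempt on the two queues can be sketched as below. This is an illustration under assumptions: `ToySeq` and the free-standing `preempt` function are hypothetical, and the block pool is modelled as a plain list of free block IDs rather than a BlockManager.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToySeq:
    name: str
    block_ids: list = field(default_factory=list)
    status: str = "RUNNING"


def preempt(seq, waiting, running, free_block_ids):
    running.remove(seq)                   # leave the running queue
    free_block_ids.extend(seq.block_ids)  # hand KV blocks back to the pool
    seq.block_ids.clear()
    seq.status = "WAITING"
    waiting.appendleft(seq)               # front of waiting: prefilled again first


waiting = deque([ToySeq("new", status="WAITING")])
running = deque([ToySeq("a", [0, 1]), ToySeq("b", [2])])
free_block_ids = [3]

victim = running[1]
preempt(victim, waiting, running, free_block_ids)
```

Placing the preempted sequence at the front of waiting means it is the first candidate for the next prefill batch, ahead of sequences that have never run.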

`is_finished`

scheduler.is_finished() -> bool

Returns True when both the waiting and running queues are empty, indicating that all submitted sequences have completed generation.

Example

from myvllm.engine.scheduler import Scheduler
from myvllm.engine.sequence import Sequence
from myvllm.sampling_parameters import SamplingParams

scheduler = Scheduler(
    max_num_sequences=8,
    max_num_batched_tokens=512,
    max_cached_blocks=256,
    block_size=16,
    eos=151645,
)

params = SamplingParams(temperature=0.7, max_tokens=32)
seq = Sequence(token_ids=[1, 2, 3, 4], block_size=16, sampling_params=params)
scheduler.add_sequence(seq)

while not scheduler.is_finished():
    scheduled, is_prefill = scheduler.schedule()
    # ... run model and obtain token_ids ...
    token_ids = [42] * len(scheduled)  # placeholder
    scheduler.postprocess(scheduled, token_ids)
