Scheduler decides which sequences to run on each forward pass and manages the allocation of KV cache blocks through BlockManager. It maintains two queues — waiting and running — and alternates between prefill batches (promoting sequences from waiting) and decode batches (stepping running sequences by one token).
Constructor
- Maximum number of sequences that can be scheduled in a single batch, whether during prefill or decode.
- Maximum total tokens across all sequences in one batch. Prefill counts the full prompt length; decode counts one token per sequence.
- Total number of KV cache blocks available in the pool. Passed directly to BlockManager. At runtime this is overridden by ModelRunner.allocate_kv_cache() to the actual GPU-measured value.
- Number of tokens per KV cache block. Must match the value used by ModelRunner.
- EOS token ID used to detect end-of-sequence during postprocessing.
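
The token and block limits above can be made concrete with a little arithmetic. The sketch below is illustrative only: `batch_token_count` and `blocks_needed` are hypothetical helper names that do not come from the library, and the ceiling division assumes that a sequence occupies whole fixed-size blocks.

```python
def batch_token_count(prompt_lengths, is_prefill):
    """Tokens charged against the per-batch token limit (hypothetical helper).

    Prefill counts each sequence's full prompt length;
    decode counts exactly one token per sequence.
    """
    if is_prefill:
        return sum(prompt_lengths)
    return len(prompt_lengths)


def blocks_needed(num_tokens, block_size):
    """KV cache blocks required to hold num_tokens tokens.

    Assumption: blocks are fixed-size and a partially filled block
    still occupies a whole block, hence the ceiling division.
    """
    return -(-num_tokens // block_size)


prompts = [5, 7, 3]
prefill_cost = batch_token_count(prompts, is_prefill=True)
decode_cost = batch_token_count(prompts, is_prefill=False)
```

Under these assumptions a prefill batch uses up the token budget far faster than a decode batch over the same sequences, which is why the two limits tend to bind in different phases.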
Queue model
The scheduler maintains two deque objects:
- waiting: sequences that have been added but not yet allocated KV cache blocks.
- running: sequences that own KV cache blocks and are actively being decoded.
On each call to schedule(), the scheduler first tries to promote sequences from waiting to running (prefill). Only if no waiting sequence can be admitted does it schedule a decode step over the sequences in running.
Methods
add_sequence
Appends a Sequence object to the end of the waiting queue. Called by LLMEngine.add_prompt() after tokenization.
A fully constructed Sequence object. The sequence must already have its token IDs and SamplingParams fields set.

schedule
Selects the sequences for the next forward pass. schedule() first attempts a prefill batch by walking the waiting queue front-to-back:
- Checks BlockManager.can_allocate(seq) and the max_num_batched_tokens / max_num_sequences limits.
- Allocates KV blocks, sets seq.status = RUNNING, and moves the sequence to running.
- Stops as soon as any limit is hit or the queue is empty.
- Returns (scheduled_sequences, True) if any sequences were admitted.
If no sequence could be admitted from waiting, schedule() builds a decode batch instead:
- Iterates the running queue and checks BlockManager.can_append(seq).
- If a running sequence cannot be appended (no free block for the next token), preemption is triggered: the least-recently-scheduled running sequence is moved back to waiting.
- Stops when max_num_batched_tokens or max_num_sequences is reached.
- Returns (scheduled_sequences, False).
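
The prefill-first policy described above can be sketched with a toy scheduler. This is a simplified illustration, not the library's implementation: `ToySeq` and `ToyScheduler` are hypothetical names, KV block allocation and preemption are left out entirely, and only the two batch limits are enforced.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ToySeq:
    """Hypothetical stand-in for a Sequence: just token IDs and a status."""
    token_ids: list
    status: str = "WAITING"


class ToyScheduler:
    """Toy model of the two-queue policy (no BlockManager, no preemption)."""

    def __init__(self, max_num_sequences, max_num_batched_tokens):
        self.max_num_sequences = max_num_sequences
        self.max_num_batched_tokens = max_num_batched_tokens
        self.waiting = deque()
        self.running = deque()

    def add_sequence(self, seq):
        self.waiting.append(seq)

    def schedule(self):
        scheduled, num_tokens = [], 0
        # Prefill: admit waiting sequences front-to-back until a limit is hit.
        while self.waiting and len(scheduled) < self.max_num_sequences:
            seq = self.waiting[0]
            # Prefill charges the full prompt length against the token budget.
            if num_tokens + len(seq.token_ids) > self.max_num_batched_tokens:
                break
            self.waiting.popleft()
            seq.status = "RUNNING"
            self.running.append(seq)
            scheduled.append(seq)
            num_tokens += len(seq.token_ids)
        if scheduled:
            return scheduled, True
        # Decode: each running sequence costs one token in the batch.
        limit = min(self.max_num_sequences, self.max_num_batched_tokens)
        for seq in self.running:
            if len(scheduled) >= limit:
                break
            scheduled.append(seq)
        return scheduled, False


sched = ToyScheduler(max_num_sequences=2, max_num_batched_tokens=8)
for ids in ([1, 2, 3], [4, 5, 6, 7], [8, 9]):
    sched.add_sequence(ToySeq(ids))

first = sched.schedule()   # prefill: admits the first two prompts
second = sched.schedule()  # prefill: admits the remaining prompt
third = sched.schedule()   # decode: waiting is now empty
```

In this toy, decode batches only start once the waiting queue can no longer contribute a sequence, which mirrors the ordering described in the two lists above.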
Return value
- Sequences selected for this forward pass. Empty when both queues are empty.
- True when the batch is a prefill pass; False for a decode pass.

postprocess
Called after ModelRunner.run() returns sampled token IDs. For each (seq, token_id) pair:
- Appends token_id to the sequence's token list.
- Evaluates the three stopping conditions below.
- If any condition is met, sets seq.status = FINISHED, deallocates its KV blocks, and removes it from the running queue.
Stopping conditions
| Condition | Triggered when |
|---|---|
| EOS token | token_id == eos and seq.ignore_eos is False |
| Max completion tokens | seq.num_completion_tokens >= seq.max_tokens |
| Max model length | seq.max_model_length is not None and seq.num_tokens >= seq.max_model_length |
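
The three conditions in the table can be expressed as a single predicate. The sketch below is a minimal illustration: `StopInfo` and `should_stop` are hypothetical names, and `StopInfo` carries only the fields the table refers to rather than the full Sequence class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StopInfo:
    """Hypothetical stand-in holding only the fields used by the table."""
    num_tokens: int                 # prompt + completion tokens so far
    num_completion_tokens: int      # completion tokens so far
    max_tokens: int                 # completion-token cap
    ignore_eos: bool = False
    max_model_length: Optional[int] = None


def should_stop(seq, token_id, eos):
    """Return True if any of the three stopping conditions holds."""
    # EOS token: only counts when the sequence does not ignore EOS.
    if token_id == eos and not seq.ignore_eos:
        return True
    # Max completion tokens.
    if seq.num_completion_tokens >= seq.max_tokens:
        return True
    # Max model length: only checked when a limit is configured.
    if seq.max_model_length is not None and seq.num_tokens >= seq.max_model_length:
        return True
    return False
```

The conditions are independent, so the order of the checks does not change the result; any one of them is enough to finish the sequence.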
- The same list returned by the preceding schedule() call.
- One sampled token ID per sequence, in the same order as seqs.

preempt
A sequence currently in the running queue.

is_finished
True when both the waiting and running queues are empty, indicating that all submitted sequences have completed generation.