The Scheduler in miniVLLM decides which sequences execute at every iteration. It manages two queues, enforces memory constraints via BlockManager, and ensures the system makes progress even when GPU memory is fully utilized.
The two queues
waiting: holds new sequences that have been submitted but have not yet had blocks allocated. Sequences at the front of the deque have the highest priority for the next prefill slot.
running: holds sequences that have been prefilled at least once and are generating tokens one step at a time.
Scheduling priority: prefill before decode
At every call to schedule(), the scheduler tries to move sequences from waiting into running before advancing any already-running sequences. This prefill-first policy favors throughput: new work is admitted as soon as memory and the token budget allow, rather than waiting behind ongoing decodes.
schedule() returns a boolean is_prefill flag alongside the list of scheduled sequences. The model runner uses the flag to choose between flash_attention_prefill and paged_attention_decode.
Prefill scheduling
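A minimal sketch of the admission loop for this phase may help make the checks concrete. It is not the miniVLLM source: Seq, Status, ToyBlockManager, and schedule_prefill are hypothetical stand-ins, the stub block manager ignores prefix caching, and stopping at the first sequence that does not fit (rather than skipping it) is an assumption.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum, auto


class Status(Enum):
    WAITING = auto()
    RUNNING = auto()


@dataclass
class Seq:
    """Hypothetical sequence record; field names are assumptions."""
    num_tokens: int                  # prompt length in tokens
    num_blocks: int                  # KV blocks needed to hold the prompt
    status: Status = Status.WAITING


class ToyBlockManager:
    """Stand-in for BlockManager that only tracks free block ids."""

    def __init__(self, num_blocks: int):
        self.free_block_ids = deque(range(num_blocks))

    def can_allocate(self, seq: Seq) -> bool:
        return len(self.free_block_ids) >= seq.num_blocks

    def allocate(self, seq: Seq) -> None:
        # The real allocate also applies prefix caching; omitted here.
        for _ in range(seq.num_blocks):
            self.free_block_ids.popleft()


def schedule_prefill(waiting, running, block_manager, max_num_batched_tokens):
    """Admit sequences from the front of `waiting` while both checks pass."""
    scheduled = []
    current_scheduled_tokens = 0
    while waiting:
        seq = waiting[0]
        # Check 1: enough free blocks for the whole prompt.
        if not block_manager.can_allocate(seq):
            break
        # Check 2: stay within the per-step token budget.
        if current_scheduled_tokens + seq.num_tokens > max_num_batched_tokens:
            break
        block_manager.allocate(seq)
        current_scheduled_tokens += seq.num_tokens
        seq.status = Status.RUNNING
        waiting.popleft()
        running.append(seq)
        scheduled.append(seq)
    # is_prefill is True only if at least one sequence was admitted.
    return scheduled, bool(scheduled)


# Usage: the second sequence would exceed the token budget, so only the
# first one is admitted in this step.
waiting = deque([Seq(100, 2), Seq(300, 4), Seq(50, 1)])
running = deque()
bm = ToyBlockManager(num_blocks=8)
scheduled, is_prefill = schedule_prefill(waiting, running, bm, 350)
```

In this sketch the third sequence is not considered even though it would fit, because admission stops at the first failure; that preserves queue order at the cost of some batch utilization.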
Check block availability
block_manager.can_allocate(seq) returns True when len(free_block_ids) >= seq.num_blocks. This guarantees all of the sequence's initial blocks can be reserved in one atomic step.
Check token budget
The scheduler also tracks current_scheduled_tokens. Adding a new sequence must not push the total above max_num_batched_tokens, which caps the size of the prefill forward pass and prevents OOM on activations.
Allocate and transition
If both checks pass: allocate(seq) is called (applying prefix caching), seq.status is set to RUNNING, and the sequence is appended to both running and the return list.
Decode scheduling
If no sequences were admitted in the prefill phase, the scheduler advances the currently running sequences by one decode step.
Preemption
When a running sequence needs a new block but no free blocks remain, the scheduler preempts the lowest-priority sequence (the tail of running) to reclaim its blocks.
When the preempted sequence is later rescheduled, allocate will re-run, potentially recovering prefix-cached blocks for free.
miniVLLM uses recompute preemption semantics: the preempted sequence gives up its blocks, is returned to the waiting queue, and must go through prefill again. Production vLLM also supports swapping KV blocks to CPU memory, but miniVLLM omits this for simplicity.
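The decode step and the preemption rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the miniVLLM implementation: BLOCK_SIZE, needs_new_block, reserve_slot, and preempt are hypothetical names, placing a preempted sequence at the front of waiting is an assumption, and so is the fallback of preempting the sequence itself when nothing else is running.

```python
from collections import deque
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # hypothetical number of tokens per KV block


@dataclass
class Seq:
    name: str
    num_tokens: int                       # tokens already stored in the cache
    block_ids: list = field(default_factory=list)
    status: str = "RUNNING"


class ToyBlockManager:
    """Stand-in for BlockManager; pure bookkeeping, no tensors."""

    def __init__(self, num_blocks: int):
        self.free_block_ids = deque(range(num_blocks))

    def needs_new_block(self, seq: Seq) -> bool:
        # The next token needs a fresh block when every owned block is full.
        return seq.num_tokens == len(seq.block_ids) * BLOCK_SIZE

    def reserve_slot(self, seq: Seq) -> None:
        if self.needs_new_block(seq):
            seq.block_ids.append(self.free_block_ids.popleft())

    def deallocate(self, seq: Seq) -> None:
        self.free_block_ids.extend(seq.block_ids)
        seq.block_ids.clear()


def preempt(seq, waiting, block_manager):
    """Recompute-style preemption: free the blocks and requeue for prefill."""
    seq.status = "WAITING"
    block_manager.deallocate(seq)
    waiting.appendleft(seq)  # assumption: preempted work is retried first


def schedule_decode(running, waiting, block_manager):
    scheduled = []
    while running:
        seq = running.popleft()
        preempted_self = False
        while block_manager.needs_new_block(seq) and not block_manager.free_block_ids:
            if running:
                # Tail of `running` is treated as the lowest priority.
                preempt(running.pop(), waiting, block_manager)
            else:
                # Nothing else to evict: give up this sequence for now.
                preempt(seq, waiting, block_manager)
                preempted_self = True
                break
        if not preempted_self:
            block_manager.reserve_slot(seq)
            scheduled.append(seq)
    running.extend(scheduled)  # survivors keep their relative order
    return scheduled
```

For example, with two blocks in total, a sequence holding one full block, and a second sequence holding the other block, a decode step evicts the second sequence so the first can take its block; the evicted sequence then sits at the head of waiting with no blocks.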
Postprocessing
After the model produces a new token for each running sequence, postprocess appends the token and checks the following stopping conditions:
EOS token
The model emitted the end-of-sequence token. The sequence stops unless sampling_params.ignore_eos=True.
max_tokens
The number of generated (completion) tokens reached sampling_params.max_tokens. The prompt tokens are not counted.
max_model_length
The total token count (prompt + completion) reached sampling_params.max_model_length. Useful for enforcing context-window limits.
Interaction with BlockManager
BlockManager is a pure bookkeeping object — it maintains the free_block_ids deque and hash_to_block_id dict but does not touch GPU memory directly. The actual KV tensors are read and written by the Triton kernels in layers/attention.py.
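To illustrate what "pure bookkeeping" can look like, here is a small sketch built around the two structures named above, free_block_ids and hash_to_block_id. Everything else is an assumption made for illustration: BLOCK_SIZE, the allocate signature taking raw token ids, chaining each block's hash over the previous block's hash, and using Python's built-in hash as a stand-in for whatever hash function the project uses. Reference counting and eviction are left out.

```python
from collections import deque

BLOCK_SIZE = 4  # hypothetical number of tokens per KV block


class ToyBlockManager:
    """Tracks block ids only; no GPU memory is touched anywhere."""

    def __init__(self, num_blocks: int):
        self.free_block_ids = deque(range(num_blocks))
        self.hash_to_block_id = {}

    def allocate(self, token_ids):
        """Return (block_ids, cached_tokens) for a prompt.

        Full blocks are keyed by a hash chained over the preceding block,
        so a block is reused only when the entire prefix matches.
        """
        block_ids = []
        cached_tokens = 0
        prev_hash = 0
        for start in range(0, len(token_ids), BLOCK_SIZE):
            chunk = tuple(token_ids[start:start + BLOCK_SIZE])
            is_full = len(chunk) == BLOCK_SIZE
            # Partial (last) blocks are never cached in this sketch.
            key = hash((prev_hash, chunk)) if is_full else None
            if is_full and key in self.hash_to_block_id:
                block_id = self.hash_to_block_id[key]   # prefix-cache hit
                cached_tokens += BLOCK_SIZE
            else:
                block_id = self.free_block_ids.popleft()
                if is_full:
                    self.hash_to_block_id[key] = block_id
            block_ids.append(block_id)
            prev_hash = key
        return block_ids, cached_tokens


# Usage: the second prompt shares its first eight tokens with the first,
# so its two full blocks are served from the cache.
bm = ToyBlockManager(num_blocks=8)
first_blocks, first_cached = bm.allocate([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
second_blocks, second_cached = bm.allocate([1, 2, 3, 4, 5, 6, 7, 8, 99])
```

The block ids returned here are just integers; in the design described above, it is the attention kernels that use such ids to locate the corresponding KV tensors.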