miniVLLM is a custom implementation of the vLLM LLM inference engine. It is built for educational clarity and functional correctness, replicating vLLM’s core mechanisms with self-contained Triton GPU kernels rather than depending on external attention libraries. The project is based on Nano-vLLM but extends it with a fully self-contained paged attention and flash attention implementation.
Why miniVLLM exists
Large language model inference engines like vLLM are complex systems. miniVLLM exists to make these systems understandable by providing a clean, readable reference implementation that you can run, modify, and learn from. It is both:

- Educational: each component maps directly to a concept in vLLM’s architecture, making it a practical study companion
- Functional: it runs real inference with paged attention, KV cache management, and continuous batching
How it relates to vLLM
miniVLLM replicates the core concepts that make vLLM efficient:

| Concept | Description |
|---|---|
| PagedAttention | Non-contiguous KV cache blocks managed by a block manager, enabling high GPU memory utilization |
| Flash attention | Memory-efficient online softmax algorithm for the prefill phase, using O(N) memory in the sequence length N instead of materializing the full N×N score matrix; implemented as a custom Triton kernel |
| Continuous batching | Iteration-level scheduling that mixes prefill and decode sequences across steps |
| CUDA graphs | Optional graph capture for decode steps to reduce kernel launch overhead |
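
The online softmax idea behind the flash attention row can be illustrated without a GPU. The following is a minimal pure-Python sketch for a single query vector, not the Triton kernel that miniVLLM ships; the function names `naive_attention` and `online_softmax_attention` and the `block_size` parameter are hypothetical and chosen only for this example. It shows that keeping a running maximum, a running normalizer, and a rescaled accumulator gives the same output as computing all scores up front.

```python
import math


def naive_attention(q, keys, values):
    """Reference version: computes every score first, then normalizes."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def online_softmax_attention(q, keys, values, block_size=2):
    """Same result, but visits keys/values one block at a time.

    Only a running max, a running sum, and one accumulator vector are
    kept, so the full score list is never stored.
    """
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim

    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]

        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old max.
        # On the first block this is exp(-inf) == 0.0, which is harmless.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]

        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]

        running_max = new_max

    return [a / running_sum for a in acc]
```

A real kernel applies the same recurrence to tiles of queries and keys in on-chip memory; the scalar loops here only make the bookkeeping visible.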
Key components
The `src/myvllm/` package is organized into the following layers:

- `LLMEngine`: the top-level entry point. Accepts prompts and returns generated text.
- `Scheduler`: decides which sequences to prefill or decode on each iteration, and allocates KV cache blocks via the block manager.
- `ModelRunner`: runs the forward pass on GPU, handling both prefill and decode modes. Supports multi-GPU tensor parallelism.
- `layers/`: contains the custom Triton kernels for flash attention (prefill) and paged attention (decode).
- `models/`: wires the layers into complete model architectures (currently Qwen3).
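
To make the interaction between the scheduler and the block manager more concrete, here is a toy sketch in plain Python. It is not miniVLLM's code: the classes `ToyBlockManager`, `ToySequence`, and `ToyScheduler`, the helper `physical_slot`, the `BLOCK_SIZE` value, and the prefill-first policy are all assumptions made for illustration. It shows two ideas: a sequence's KV cache lives in fixed-size blocks that need not be contiguous, and scheduling decisions are made once per iteration rather than once per request.

```python
from collections import deque
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV cache block (illustrative value)


@dataclass
class ToySequence:
    seq_id: int
    num_tokens: int   # prompt tokens plus tokens generated so far
    max_tokens: int   # sequence finishes once num_tokens reaches this
    block_table: list = field(default_factory=list)  # logical -> physical block


class ToyBlockManager:
    def __init__(self, num_blocks):
        self.free = deque(range(num_blocks))

    def ensure_capacity(self, seq):
        """Grow seq.block_table to cover seq.num_tokens; False if out of blocks."""
        needed = -(-seq.num_tokens // BLOCK_SIZE)  # ceiling division
        missing = needed - len(seq.block_table)
        if missing > len(self.free):
            return False
        for _ in range(missing):
            seq.block_table.append(self.free.popleft())
        return True

    def release(self, seq):
        self.free.extend(seq.block_table)
        seq.block_table.clear()


def physical_slot(seq, position):
    """Map a token position in the sequence to a slot in the physical cache."""
    block = seq.block_table[position // BLOCK_SIZE]
    return block * BLOCK_SIZE + position % BLOCK_SIZE


class ToyScheduler:
    def __init__(self, block_manager):
        self.bm = block_manager
        self.waiting = deque()
        self.running = []

    def add(self, seq):
        self.waiting.append(seq)

    def step(self):
        """One iteration: admit waiting sequences if blocks allow, else decode."""
        admitted = []
        while self.waiting and self.bm.ensure_capacity(self.waiting[0]):
            seq = self.waiting.popleft()
            self.running.append(seq)
            admitted.append(seq.seq_id)
        if admitted:
            return "prefill", admitted
        if not self.running:
            return "idle", []

        decoded = []
        for seq in list(self.running):
            seq.num_tokens += 1  # pretend the model produced one token
            if not self.bm.ensure_capacity(seq):
                # A real engine would preempt a sequence here instead.
                raise RuntimeError("out of KV cache blocks")
            decoded.append(seq.seq_id)
            if seq.num_tokens >= seq.max_tokens:
                self.bm.release(seq)  # blocks become reusable immediately
                self.running.remove(seq)
        return "decode", decoded
```

Because finished sequences return their blocks right away, a waiting sequence can be admitted on the very next iteration, and the blocks it receives may be scattered across the cache. The block table is what lets the attention kernel find them.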
Requirements
- Python `>=3.11, <3.12`
- A CUDA-capable GPU
- `uv` package manager
- Core dependencies: `torch`, `transformers`, `xxhash`, `vllm>=0.15.0`
Get started
- Quick start: run your first inference in a few commands.
- Installation: detailed setup guide with troubleshooting.