miniVLLM is a minimal, readable implementation of the vLLM LLM inference engine. Built on top of Nano-vLLM, it features fully self-contained custom Triton kernels for both paged attention (decode) and flash attention (prefill), making it an ideal resource for learning how production LLM serving systems work, and for running them.
Quick Start
Run your first inference in under 5 minutes with a working code example.
Installation
Install miniVLLM and its dependencies with uv.
Core Concepts
Understand paged attention, flash attention, KV caching, and scheduling.
API Reference
Full reference for LLMEngine, SamplingParams, and all public APIs.
What is miniVLLM?
miniVLLM implements the full LLM inference pipeline from scratch, including:
- Custom Triton kernels: paged attention for decode, flash attention (O(N) memory) for prefill
- Paged KV cache: memory-efficient KV cache management with prefix caching
- Iteration-level scheduler: prefill-first scheduling with preemption support
- Multi-GPU tensor parallelism: distributed inference via NCCL
- CUDA graph optimization: low-latency decode via captured replay graphs
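To make the prefix caching item above concrete, here is a minimal sketch of one common way to do it: hash each full block of token ids together with the hash of the preceding block, so that two prompts map to the same cached block only when everything before it also matches. The names `block_hashes`, `PrefixCache`, and `lookup_or_allocate` are hypothetical and chosen for illustration; they are not taken from miniVLLM's code, which may hash and track blocks differently.

```python
import hashlib


def block_hashes(token_ids, block_size):
    """Chain-hash every full block; a trailing partial block is skipped."""
    hashes, prev = [], b""
    for i in range(len(token_ids) // block_size):
        chunk = token_ids[i * block_size:(i + 1) * block_size]
        # Mixing in the previous hash makes each hash depend on the whole prefix.
        prev = hashlib.sha256(prev + repr(chunk).encode()).digest()
        hashes.append(prev)
    return hashes


class PrefixCache:
    """Hypothetical cache mapping a prefix hash to a physical KV block id."""

    def __init__(self):
        self.hash_to_block = {}
        self.next_block = 0

    def lookup_or_allocate(self, token_ids, block_size):
        """Return (block ids for the full blocks, number of cache hits)."""
        block_ids, hits = [], 0
        for h in block_hashes(token_ids, block_size):
            if h in self.hash_to_block:
                hits += 1  # KV for this block was already computed
            else:
                self.hash_to_block[h] = self.next_block
                self.next_block += 1
            block_ids.append(self.hash_to_block[h])
        return block_ids, hits


cache = PrefixCache()
first_blocks, first_hits = cache.lookup_or_allocate([1, 2, 3, 4, 5, 6], block_size=2)
second_blocks, second_hits = cache.lookup_or_allocate([1, 2, 3, 4, 9, 9, 7], block_size=2)
```

In this sketch the second prompt reuses the two blocks covering its shared four-token prefix and allocates one new block; its final token sits in a partial block and is not cached.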
Getting started
Install dependencies
Install uv, then run uv sync from the project root to install the dependencies.
Explore the architecture
Read the Architecture Guide to understand how the components fit together, or follow the step-by-step implementation guide.
Explore by topic
Paged Attention
How KV cache is managed in fixed-size pages to eliminate fragmentation.
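As a rough illustration of the paging idea, the sketch below keeps a pool of fixed-size physical blocks and a per-sequence block table that maps logical token positions to (block, offset) slots. A sequence takes a new block only when its last one is full, so at most one partially filled block is wasted per sequence. `BlockAllocator` and its methods are hypothetical names for this example, not miniVLLM's API.

```python
class BlockAllocator:
    """Hands out fixed-size KV-cache pages from a shared free pool."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = list(range(num_blocks))      # physical block ids not in use
        self.tables: dict[int, list[int]] = {}   # seq id -> block table
        self.lengths: dict[int, int] = {}        # seq id -> tokens stored

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one more token; return (physical block, offset)."""
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # last page is full, or no page yet
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def release(self, seq_id: int) -> None:
        """Return all of a finished sequence's blocks to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = BlockAllocator(num_blocks=4, block_size=2)
slots = [alloc.append_token(seq_id=0) for _ in range(3)]  # spans two blocks
alloc.release(seq_id=0)
```

Because any free block can serve any sequence, the physical blocks of one sequence need not be contiguous; the attention kernel follows the block table instead.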
Flash Attention
O(N) memory attention via online softmax, implemented in Triton.
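The online softmax trick can be sketched in plain Python for a single query vector. Keys and values are processed one tile at a time while a running maximum, a running normalizer, and an unnormalized output accumulator are maintained, so the full score row is never stored. This is an illustrative reference under simplifying assumptions (one query, no masking, no batching), not the Triton kernel itself; the function names are made up for the example.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_reference(q, keys, values):
    """Standard softmax attention: materializes every score before normalizing."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [_dot(q, k) * scale for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def attention_online(q, keys, values, tile=2):
    """Same result, but only one tile of scores exists at any time."""
    scale = 1.0 / math.sqrt(len(q))
    m = float("-inf")               # running max of the scores seen so far
    l = 0.0                         # running sum of exp(score - m)
    acc = [0.0] * len(values[0])    # running sum of exp(score - m) * value
    for start in range(0, len(keys), tile):
        scores = [_dot(q, k) * scale for k in keys[start:start + tile]]
        m_new = max(m, max(scores))
        rescale = math.exp(m - m_new)   # 0.0 on the first tile, since m is -inf
        l *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, values[start:start + tile]):
            p = math.exp(s - m_new)
            l += p
            acc = [a + p * x for a, x in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]
```

The rescale step is what keeps earlier partial sums valid when a later tile raises the running maximum; the extra state per query is a constant number of scalars plus one output vector.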
Scheduling
Iteration-level prefill/decode scheduling with preemption.
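The following toy scheduler shows one way a prefill-first policy with preemption could work, assuming a simple block-count budget. Each call to `step` either admits as many waiting sequences as fit (a prefill iteration) or advances every running sequence by one token (a decode iteration); when a decode needs a block and none is free, the most recently admitted sequence is evicted and requeued for recomputation. `Seq`, `Scheduler`, and the field names are hypothetical, and details such as batch-size limits and token sampling are left out.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Seq:
    seq_id: int
    prompt_len: int
    generated: int = 0
    blocks: int = 0  # KV blocks currently held

    @property
    def num_tokens(self) -> int:
        return self.prompt_len + self.generated


class Scheduler:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = num_blocks
        self.waiting: deque = deque()
        self.running: deque = deque()

    def add(self, seq: Seq) -> None:
        self.waiting.append(seq)

    def _blocks_for(self, num_tokens: int) -> int:
        return -(-num_tokens // self.block_size)  # ceiling division

    def _preempt(self, seq: Seq) -> None:
        # Drop the victim's KV blocks; it is prefetched again from scratch later.
        self.free_blocks += seq.blocks
        seq.blocks = 0
        self.waiting.appendleft(seq)

    def step(self):
        """Return ("prefill" or "decode", sequences scheduled this iteration)."""
        batch = []
        while self.waiting:  # prefill first: admit whatever fits
            need = self._blocks_for(self.waiting[0].num_tokens)
            if need > self.free_blocks:
                break
            seq = self.waiting.popleft()
            seq.blocks, self.free_blocks = need, self.free_blocks - need
            self.running.append(seq)
            batch.append(seq)
        if batch:
            return "prefill", batch

        pending, scheduled = self.running, []
        self.running = deque()
        while pending:  # decode: one new token per running sequence
            seq = pending.popleft()
            need = self._blocks_for(seq.num_tokens + 1) - seq.blocks
            while need > self.free_blocks and pending:
                self._preempt(pending.pop())  # evict the newest sequence
            if need > self.free_blocks:
                self._preempt(seq)  # nothing left to evict but itself
                continue
            seq.blocks += need
            self.free_blocks -= need
            seq.generated += 1
            scheduled.append(seq)
        self.running.extend(scheduled)
        return "decode", scheduled
```

Scheduling at the granularity of a single iteration is what lets new requests join, and finished or preempted ones leave, between any two forward passes.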
Multi-GPU
Tensor parallelism across GPUs using NCCL all-reduce.
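To show why an all-reduce is needed, here is a single-process simulation of one common tensor-parallel layout, assumed here for illustration: the weight matrix of a linear layer is split along its input dimension, each rank multiplies its shard by the matching slice of the input, and the partial outputs are summed across ranks. The `all_reduce_sum` function is a stand-in for the NCCL collective and the ranks are simulated with a loop; no GPU or distributed runtime is involved, and miniVLLM's actual layer split may differ.

```python
def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def all_reduce_sum(partials):
    """Stand-in for an all-reduce: every rank receives the elementwise sum."""
    total = [sum(vals) for vals in zip(*partials)]
    return [list(total) for _ in partials]


def row_parallel_linear(w, x, world_size):
    """Split the input dimension across ranks, then all-reduce the partial outputs."""
    in_features = len(x)
    assert in_features % world_size == 0, "input dim must divide evenly"
    shard = in_features // world_size
    partials = []
    for rank in range(world_size):  # each iteration plays the role of one GPU
        lo, hi = rank * shard, (rank + 1) * shard
        w_shard = [row[lo:hi] for row in w]  # this rank's columns of the weight
        partials.append(matvec(w_shard, x[lo:hi]))
    return all_reduce_sum(partials)


w = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
x = [1, 1, 1, 1]
per_rank_outputs = row_parallel_linear(w, x, world_size=2)
```

Each rank holds only a fraction of the weights, yet after the reduction every rank has the same full output as the unsharded `matvec(w, x)`.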
Benchmarks
Comparative benchmarks of PyTorch, Triton, and Flash Attention.
Models
Qwen3 and Llama 3.2 implementations built on parallel layers.