Introduction

miniVLLM is a custom implementation of the vLLM LLM inference engine, built for educational clarity and functional correctness. It replicates vLLM’s core mechanisms with self-contained Triton GPU kernels rather than depending on external attention libraries. The project is based on Nano-vLLM and extends it with its own paged attention and flash attention implementations.

Why miniVLLM exists

Large language model inference engines like vLLM are complex systems. miniVLLM exists to make these systems understandable by providing a clean, readable reference implementation that you can run, modify, and learn from. It is both:

Educational — each component maps directly to a concept in vLLM’s architecture, making it a practical study companion
Functional — it runs real inference with paged attention, KV cache management, and continuous batching

How it relates to vLLM

miniVLLM replicates the core concepts that make vLLM efficient:

Concept	Description
PagedAttention	Non-contiguous KV cache blocks managed by a block manager, enabling high GPU memory utilization
Flash attention	Online softmax algorithm for the prefill phase that needs O(N) memory in the sequence length N instead of materializing the O(N²) score matrix, implemented as a custom Triton kernel
Continuous batching	Iteration-level scheduling that mixes prefill and decode sequences across steps
CUDA graphs	Optional graph capture for decode steps to reduce kernel launch overhead
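
The online softmax idea behind flash attention can be illustrated without a GPU. The sketch below is not miniVLLM's Triton kernel; it is a plain-Python illustration for a single query row, and the function names and the block_size parameter are made up for this example. It processes keys and values block by block, keeping only a running maximum, a running normalizer, and a running weighted sum, and it should match a naive implementation that materializes every score first.

```python
import math


def naive_attention_row(q, keys, values):
    """Reference: compute every score, apply softmax, then take the weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def online_attention_row(q, keys, values, block_size=2):
    """Same result, but keys/values are visited one block at a time.

    Only three pieces of state survive between blocks: the running max,
    the running softmax denominator, and the unnormalized output.
    """
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim

    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]

        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new max.
        # On the first block this is exp(-inf) == 0.0, which is harmless.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]

        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max

    return [a / running_sum for a in acc]
```

Because the per-block state has a fixed size, memory grows with the sequence length only through the inputs themselves, which is the property the table refers to as O(N).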

Key components

The src/myvllm/ package is organized into the following layers:

src/myvllm/
├── engine/
│   ├── llm_engine.py      # Public generation API (LLMEngine)
│   ├── scheduler.py       # Iteration-level sequence scheduling
│   ├── model_runner.py    # Prefill and decode execution
│   └── sequence.py        # Sequence and block definitions
├── models/                # Model implementations (e.g. Qwen3)
├── layers/                # Attention, MLP, normalization layers
├── utils/                 # Shared utilities
└── sampling_parameters.py # SamplingParams dataclass

LLMEngine — the top-level entry point. Accepts prompts and returns generated text.
Scheduler — decides which sequences to prefill or decode on each iteration, and allocates KV cache blocks via the block manager.
ModelRunner — runs the forward pass on GPU, handling both prefill and decode modes. Supports multi-GPU tensor parallelism.
layers/ — contains the custom Triton kernels for flash attention (prefill) and paged attention (decode).
models/ — wires the layers into complete model architectures (currently Qwen3).
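
To make the interaction between the scheduler and the block manager concrete, here is a toy sketch of iteration-level scheduling over a paged KV cache. It is an illustration under simplifying assumptions, not the code in scheduler.py: the class names (ToySequence, ToyBlockManager, ToyScheduler), the BLOCK_SIZE value, and the prefill-first policy are all hypothetical, one token is counted per decode step, and a sequence that cannot get a block is simply skipped, where a real engine would preempt.

```python
from collections import deque
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV cache block (illustrative value)


@dataclass
class ToySequence:
    seq_id: int
    num_prompt_tokens: int
    max_new_tokens: int
    num_generated: int = 0
    # Block table: logical block index -> physical block id (need not be contiguous).
    block_table: list = field(default_factory=list)

    @property
    def num_tokens(self):
        return self.num_prompt_tokens + self.num_generated

    @property
    def finished(self):
        return self.num_generated >= self.max_new_tokens


class ToyBlockManager:
    def __init__(self, num_blocks):
        self.free = deque(range(num_blocks))

    def blocks_needed(self, seq, num_tokens):
        total = -(-num_tokens // BLOCK_SIZE)  # ceiling division
        return max(0, total - len(seq.block_table))

    def can_fit(self, seq, num_tokens):
        return self.blocks_needed(seq, num_tokens) <= len(self.free)

    def grow(self, seq, num_tokens):
        for _ in range(self.blocks_needed(seq, num_tokens)):
            seq.block_table.append(self.free.popleft())

    def release(self, seq):
        self.free.extend(seq.block_table)
        seq.block_table.clear()


class ToyScheduler:
    def __init__(self, block_manager):
        self.bm = block_manager
        self.waiting = deque()
        self.running = []

    def add(self, seq):
        self.waiting.append(seq)

    def step(self):
        """Run one iteration: prefill newly admitted sequences, else decode."""
        admitted = []
        while self.waiting and self.bm.can_fit(
            self.waiting[0], self.waiting[0].num_prompt_tokens
        ):
            seq = self.waiting.popleft()
            self.bm.grow(seq, seq.num_prompt_tokens)
            admitted.append(seq)
        if admitted:
            self.running.extend(admitted)
            return "prefill", [s.seq_id for s in admitted]

        batch = []
        for seq in list(self.running):  # copy: finished sequences are removed below
            if not self.bm.can_fit(seq, seq.num_tokens + 1):
                continue  # toy behavior; a real engine would preempt here
            self.bm.grow(seq, seq.num_tokens + 1)
            seq.num_generated += 1
            batch.append(seq.seq_id)
            if seq.finished:
                self.bm.release(seq)
                self.running.remove(seq)
        return "decode", batch
```

With four blocks and three queued sequences, calling step() repeatedly shows the key behavior: as soon as one sequence finishes and frees its blocks, a waiting sequence is admitted while the others are still mid-generation, rather than waiting for the whole batch to drain.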

Requirements

Python >=3.11, <3.12
A CUDA-capable GPU
uv package manager
Core dependencies: torch, transformers, xxhash, vllm>=0.15.0
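
A quick way to check part of this list before installing is a small preflight script. The sketch below is an assumption-laden helper, not something shipped with the project: the function names are hypothetical, it only checks the interpreter version range and whether the listed packages are importable, and it does not verify the vllm version or the presence of a GPU.

```python
import importlib.util
import sys


def python_supported(version=sys.version_info):
    """Return True when the interpreter satisfies >=3.11, <3.12."""
    return (3, 11) <= tuple(version[:2]) < (3, 12)


def missing_packages(names=("torch", "transformers", "xxhash", "vllm")):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    if not python_supported():
        print("Unsupported Python version:", sys.version.split()[0])
    for name in missing_packages():
        print("Missing package:", name)
```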

Get started

Quick start

Run your first inference in a few commands.

Installation

Detailed setup guide with troubleshooting.
