miniVLLM is a custom reimplementation of the vLLM inference engine for large language models (LLMs). It is built for educational clarity and functional correctness: it replicates vLLM’s core mechanisms with self-contained Triton GPU kernels instead of depending on external attention libraries. The project is based on Nano-vLLM and extends it with its own paged attention and flash attention implementations.

Why miniVLLM exists

Large language model inference engines like vLLM are complex systems. miniVLLM exists to make these systems understandable by providing a clean, readable reference implementation that you can run, modify, and learn from. It is both:
  • Educational — each component maps directly to a concept in vLLM’s architecture, making it a practical study companion
  • Functional — it runs real inference with paged attention, KV cache management, and continuous batching

How it relates to vLLM

miniVLLM replicates the core concepts that make vLLM efficient:
| Concept | Description |
| --- | --- |
| PagedAttention | Non-contiguous KV cache blocks managed by a block manager, enabling high GPU memory utilization |
| Flash attention | Memory-efficient attention for the prefill phase, implemented as a custom Triton kernel. It uses an online softmax, so extra memory grows as O(N) in sequence length instead of the O(N²) needed to materialize the full score matrix |
| Continuous batching | Iteration-level scheduling that mixes prefill and decode sequences across steps |
| CUDA graphs | Optional graph capture for decode steps to reduce kernel launch overhead |
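
To make the online softmax idea concrete, here is a minimal pure-Python sketch for a single query vector. It is not miniVLLM's Triton kernel: the function names (`naive_attention`, `online_softmax_attention`) and the `block_size` parameter are illustrative assumptions, and the usual 1/sqrt(d) score scaling is omitted for brevity. The online version visits keys and values one block at a time and keeps only a running maximum, a running normalizer, and an unnormalized output accumulator, so the full score vector is never stored.

```python
import math


def naive_attention(q, keys, values):
    """Reference: compute all scores first, then a standard stable softmax."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def online_softmax_attention(q, keys, values, block_size=2):
    """Same result, but keys/values are consumed block by block."""
    dim = len(values[0])
    running_max = float("-inf")  # largest score seen so far
    running_sum = 0.0            # softmax normalizer relative to running_max
    acc = [0.0] * dim            # unnormalized weighted sum of values

    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_block]

        # Rescale earlier partial results to the new running maximum.
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)  # 0.0 on the first block
        running_sum *= scale
        acc = [a * scale for a in acc]

        # Fold the current block into the running statistics.
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max

    return [a / running_sum for a in acc]
```

Both functions should agree up to floating-point rounding for any block size. A GPU kernel would apply the same rescaling per tile of the score matrix rather than per Python list.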

Key components

The src/myvllm/ package is organized into the following layers:
src/myvllm/
├── engine/
│   ├── llm_engine.py      # Public generation API (LLMEngine)
│   ├── scheduler.py       # Iteration-level sequence scheduling
│   ├── model_runner.py    # Prefill and decode execution
│   └── sequence.py        # Sequence and block definitions
├── models/                # Model implementations (e.g. Qwen3)
├── layers/                # Attention, MLP, normalization layers
├── utils/                 # Shared utilities
└── sampling_parameters.py # SamplingParams dataclass
  • LLMEngine — the top-level entry point. Accepts prompts and returns generated text.
  • Scheduler — decides which sequences to prefill or decode on each iteration, and allocates KV cache blocks via the block manager.
  • ModelRunner — runs the forward pass on GPU, handling both prefill and decode modes. Supports multi-GPU tensor parallelism.
  • layers/ — contains the custom Triton kernels for flash attention (prefill) and paged attention (decode).
  • models/ — wires the layers into complete model architectures (currently Qwen3).
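
The interaction between the scheduler and the block manager described above can be sketched as a toy simulation. Everything here is hypothetical and chosen for illustration: `ToySequence`, `ToyBlockManager`, `physical_slot`, `step`, and the `BLOCK_SIZE` value are not miniVLLM's actual classes or settings. The sketch also assumes each step is either a prefill step or a decode step, and that a sequence that cannot get a block simply waits, where a real engine would typically preempt.

```python
from collections import deque
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV cache block (illustrative value, not miniVLLM's)


@dataclass
class ToySequence:
    seq_id: int
    prompt_len: int
    max_new_tokens: int
    num_generated: int = 0
    # Maps logical block index -> physical block id; blocks need not be contiguous.
    block_table: list = field(default_factory=list)

    @property
    def num_tokens(self):
        return self.prompt_len + self.num_generated

    @property
    def finished(self):
        return self.num_generated >= self.max_new_tokens


class ToyBlockManager:
    def __init__(self, num_blocks):
        self.free = deque(range(num_blocks))

    def _extra_blocks(self, seq, num_tokens):
        required = -(-num_tokens // BLOCK_SIZE)  # ceiling division
        return max(0, required - len(seq.block_table))

    def can_fit(self, seq, num_tokens):
        return self._extra_blocks(seq, num_tokens) <= len(self.free)

    def grow(self, seq, num_tokens):
        for _ in range(self._extra_blocks(seq, num_tokens)):
            seq.block_table.append(self.free.popleft())

    def release(self, seq):
        self.free.extend(seq.block_table)
        seq.block_table.clear()


def physical_slot(seq, token_pos):
    """Translate a logical token position into a slot in the physical cache."""
    block_id = seq.block_table[token_pos // BLOCK_SIZE]
    return block_id * BLOCK_SIZE + token_pos % BLOCK_SIZE


def step(waiting, running, blocks):
    """One scheduling iteration: admit new prompts if possible, else decode."""
    admitted = []
    while waiting and blocks.can_fit(waiting[0], waiting[0].prompt_len):
        seq = waiting.popleft()
        blocks.grow(seq, seq.prompt_len)
        admitted.append(seq)
    if admitted:
        running.extend(admitted)
        return "prefill"

    for seq in list(running):
        if not blocks.can_fit(seq, seq.num_tokens + 1):
            continue  # toy behavior: wait for blocks instead of preempting
        blocks.grow(seq, seq.num_tokens + 1)
        seq.num_generated += 1  # stands in for one decoded token
        if seq.finished:
            blocks.release(seq)
            running.remove(seq)
    return "decode"
```

Calling `step` in a loop until both `waiting` and `running` are empty shows the key property of iteration-level scheduling: a finished sequence returns its blocks immediately, so a waiting prompt can be admitted on the next step without waiting for the rest of the batch to finish.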

Requirements

  • Python >=3.11, <3.12
  • A CUDA-capable GPU
  • uv package manager
  • Core dependencies: torch, transformers, xxhash, vllm>=0.15.0
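
As a convenience, the requirements above can be checked with a short script before installing. This is a sketch, not part of miniVLLM: the helper names `python_version_supported` and `missing_packages` are made up for illustration, and the package list assumes each dependency is imported under the same name it is listed with. It does not verify the `vllm>=0.15.0` version bound or the presence of a CUDA-capable GPU.

```python
import importlib.util
import sys

# Import names assumed to match the dependency names listed above.
REQUIRED_PACKAGES = ("torch", "transformers", "xxhash", "vllm")


def python_version_supported(version_info=sys.version_info):
    """True when the interpreter satisfies Python >=3.11, <3.12."""
    return (3, 11) <= tuple(version_info[:2]) < (3, 12)


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the packages that cannot be located, without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python version OK:", python_version_supported())
    print("Missing packages:", missing_packages() or "none")
```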

Get started

Quick start

Run your first inference in a few commands.

Installation

Detailed setup guide with troubleshooting.
