Nano-vLLM is a clean, readable reimplementation of vLLM built from scratch. It delivers comparable — and in some configurations faster — throughput than vLLM, while keeping the entire codebase under 1,200 lines of Python. The API mirrors vLLM’s interface, so if you’re already familiar with vLLM you can start using Nano-vLLM immediately.
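The call pattern matches vLLM's offline API: construct an `LLM`, build a `SamplingParams`, and call `generate`. A minimal sketch, following the project README (the model path is a placeholder; it is not runnable without a GPU and downloaded weights):

```python
from nanovllm import LLM, SamplingParams

# enforce_eager=True disables CUDA graph capture;
# tensor_parallel_size=1 runs on a single GPU.
llm = LLM("/path/to/Qwen3-0.6B", enforce_eager=True, tensor_parallel_size=1)

sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Hello, Nano-vLLM."], sampling_params)
print(outputs[0]["text"])
```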

Performance

Benchmarked on an RTX 4070 Laptop GPU (8 GB) with the Qwen3-0.6B model: 256 sequences, with input and output lengths each sampled uniformly from 100–1024 tokens.
| Inference Engine | Output Tokens | Time (s) | Throughput (tokens/s) |
| --- | --- | --- | --- |
| vLLM | 133,966 | 98.37 | 1,361.84 |
| Nano-vLLM | 133,966 | 93.41 | 1,434.13 |
Benchmark results are hardware- and model-dependent. Run bench.py from the repository to reproduce these numbers or to measure performance on your own hardware.
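The throughput column is simply output tokens divided by wall-clock time, which is easy to check against the table (small rounding differences come from the truncated times):

```python
# Verify the table: throughput = output tokens / elapsed seconds.
runs = {
    "vLLM": (133_966, 98.37),
    "Nano-vLLM": (133_966, 93.41),
}
throughput = {engine: tokens / seconds for engine, (tokens, seconds) in runs.items()}
for engine, tps in throughput.items():
    print(f"{engine}: {tps:.2f} tokens/s")
```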

Key Features

Fast Offline Inference

Throughput comparable to — and sometimes exceeding — vLLM on the same hardware.

Readable Codebase

The entire engine is implemented in ~1,200 lines of Python, making it easy to read, understand, and modify.

Prefix Caching

Reuse KV-cache blocks for shared prompt prefixes, reducing redundant computation across requests.
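The idea can be sketched in a few lines: KV-cache blocks are keyed by a hash chained over the token ids they cover, so two requests that share a prompt prefix map to the same cache entries and only the divergent suffix needs fresh computation. A toy illustration of the scheme, not the engine's actual code:

```python
from hashlib import sha256

BLOCK_SIZE = 4  # tokens per KV-cache block (toy value)

def block_hashes(token_ids):
    """Chained hash per full block, so equal prefixes give equal keys."""
    hashes, prev = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        prev = sha256(prev + str(block).encode()).digest()
        hashes.append(prev)
    return hashes

cache = {}  # block hash -> KV block (placeholder payloads here)

def prefill(token_ids):
    """Return (reused, computed) block counts for one request."""
    reused = computed = 0
    for h in block_hashes(token_ids):
        if h in cache:
            reused += 1
        else:
            cache[h] = "kv-block"
            computed += 1
    return reused, computed

print(prefill(list(range(12))))                        # all blocks computed
print(prefill(list(range(12)) + [99, 98, 97, 96]))     # shared prefix reused
```

The second request reuses the three cached prefix blocks and computes only one new block.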

Tensor Parallelism

Distribute model weights across multiple GPUs with configurable tensor_parallel_size (1–8 GPUs).
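Conceptually, each linear layer's weight matrix is split across devices: every rank multiplies by its shard, and the per-rank results are concatenated (or all-reduced, depending on the split axis). A pure-Python sketch of column-wise sharding, with plain lists standing in for GPU tensors:

```python
def matmul(x, w):
    """x: length-k input, w: k x n matrix -> length-n output."""
    n = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(n)]

def shard_columns(w, tp_size):
    """Split a weight matrix column-wise across tp_size 'devices'."""
    step = len(w[0]) // tp_size
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(tp_size)]

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

full = matmul(x, w)
shards = shard_columns(w, tp_size=2)
partials = [matmul(x, s) for s in shards]    # each rank's local result
gathered = [v for p in partials for v in p]  # all-gather along columns
print(gathered)
```

Concatenating the shard outputs reproduces the unsharded result exactly, which is why the split is transparent to the model.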

CUDA Graphs

Capture and replay decode steps as CUDA graphs to reduce kernel launch overhead and improve throughput.
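The pattern in miniature: run the decode step once to record the exact sequence of kernel launches, then replay that recording on every later step so per-launch setup cost is paid only once. A toy Python analogy of the record/replay idea (real capture uses `torch.cuda.CUDAGraph`; nothing here touches CUDA):

```python
class ToyGraph:
    """Records a fixed sequence of 'kernels' once, then replays them."""
    def __init__(self):
        self.launches = []

    def capture(self, kernels):
        self.launches = list(kernels)  # one-time recording

    def replay(self, state):
        # No per-step scheduling or launch setup, just run the recording.
        for kernel in self.launches:
            state = kernel(state)
        return state

# A 'decode step' made of three tiny kernels acting on an integer state.
kernels = [lambda s: s + 1, lambda s: s * 2, lambda s: s - 3]

graph = ToyGraph()
graph.capture(kernels)

state = 5
for _ in range(3):  # replay the same captured step for each decode token
    state = graph.replay(state)
print(state)
```

The constraint this buys is the same as with real CUDA graphs: the replayed work must have a fixed shape, which is why decode (one token per step) captures well while variable-length prefill does not.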

Flash Attention

Uses flash-attn for fast, exact attention that avoids materializing the full attention matrix, reducing VRAM usage.

Next Steps

Installation

Install Nano-vLLM and its dependencies.

Quickstart

Run your first inference in minutes.
