## Performance

Benchmark run on an RTX 4070 Laptop GPU (8 GB) with the Qwen3-0.6B model: 256 sequences, with input and output lengths each randomly sampled from 100–1024 tokens.

| Inference Engine | Output Tokens | Time (s) | Throughput (tokens/s) |
|---|---|---|---|
| vLLM | 133,966 | 98.37 | 1,361.84 |
| Nano-vLLM | 133,966 | 93.41 | 1,434.13 |
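The throughput column is simply output tokens divided by wall-clock time; recomputing it from the table's own numbers reproduces the reported values to within rounding (the times are rounded to two decimals):

```python
# Throughput (tokens/s) = output tokens / wall-clock time (s),
# using the numbers from the benchmark table above.
runs = {
    "vLLM": (133_966, 98.37),
    "Nano-vLLM": (133_966, 93.41),
}

for engine, (tokens, seconds) in runs.items():
    throughput = tokens / seconds
    print(f"{engine}: {throughput:,.2f} tokens/s")
```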
Benchmark results are hardware- and model-dependent. Run `bench.py` from the repository to reproduce these numbers or to measure performance on your own hardware.

## Key Features
### Fast Offline Inference
Throughput comparable to — and sometimes exceeding — vLLM on the same hardware.
### Readable Codebase
The entire engine is implemented in ~1,200 lines of Python, making it easy to read, understand, and modify.
### Prefix Caching
Reuse KV-cache blocks for shared prompt prefixes, reducing redundant computation across requests.
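The core idea can be sketched in a few lines of plain Python (the names and block size below are illustrative, not the engine's actual data structures): KV blocks are keyed by a hash of the token prefix up to and including that block, so two requests that share a prompt prefix resolve to the same cached blocks.

```python
from hashlib import sha256

BLOCK_SIZE = 4  # tokens per KV-cache block (toy value for illustration)

def block_keys(token_ids):
    """Key each full block by a hash of the entire prefix up to and
    including that block, so a key identifies content *and* position."""
    keys = []
    for end in range(BLOCK_SIZE, len(token_ids) + 1, BLOCK_SIZE):
        keys.append(sha256(str(token_ids[:end]).encode()).hexdigest())
    return keys

cache = {}  # key -> (simulated) KV block

def compute_with_cache(token_ids):
    hits = misses = 0
    for key in block_keys(token_ids):
        if key in cache:
            hits += 1
        else:
            cache[key] = object()  # stand-in for the block's KV tensors
            misses += 1
    return hits, misses

shared = list(range(8))            # 8-token shared system prompt
a = shared + [100, 101, 102, 103]  # request A
b = shared + [200, 201, 202, 203]  # request B, same prefix as A

print(compute_with_cache(a))  # → (0, 3): every block is a miss at first
print(compute_with_cache(b))  # → (2, 1): the two shared-prefix blocks hit
```

Only the block covering B's distinct suffix needs fresh computation; the shared-prefix blocks are reused.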
### Tensor Parallelism

Distribute model weights across multiple GPUs with a configurable `tensor_parallel_size` (1–8 GPUs).

### CUDA Graphs
Capture and replay decode steps as CUDA graphs to reduce kernel launch overhead and improve throughput.
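The tensor-parallel scheme described above rests on a simple identity: splitting a weight matrix column-wise across devices, computing partial outputs, and concatenating them (an all-gather in a real engine) reproduces the full matmul. A toy sketch in plain Python, with device placement and communication elided:

```python
def matmul(x, w):
    """Row vector x (list) times weight matrix w (list of rows)."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]

def split_columns(w, n):
    """Shard a weight matrix column-wise across n 'devices'."""
    per = len(w[0]) // n
    return [[row[d * per:(d + 1) * per] for row in w] for d in range(n)]

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

# Each 'device' computes a partial output from its own column shard.
shards = split_columns(w, n=2)
partials = [matmul(x, shard) for shard in shards]

# Concatenating the partials recovers the unsharded result.
gathered = [v for p in partials for v in p]
assert gathered == matmul(x, w)
```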
### Flash Attention

Uses `flash-attn` for memory-efficient attention computation with reduced VRAM usage.

## Next Steps
- **Installation**: Install Nano-vLLM and its dependencies.
- **Quickstart**: Run your first inference in minutes.