Nano-vLLM is a minimal implementation of a high-performance LLM inference engine. It delivers throughput comparable to vLLM while keeping the entire codebase readable and hackable — perfect for learning, research, and lightweight production deployments.

Quick Start

Run your first inference in minutes with a working code example.

Installation

Install Nano-vLLM and download model weights from Hugging Face.

Inference Guide

Learn how to batch requests, tune throughput, and interpret outputs.

API Reference

Explore the full public API — LLM, SamplingParams, and Config.

Why Nano-vLLM?

Nano-vLLM was built to answer a simple question: how much of vLLM’s performance can be achieved in a clean, readable codebase? The answer: essentially all of it.
| Engine    | Output Tokens | Time (s) | Throughput (tok/s) |
| --------- | ------------- | -------- | ------------------ |
| vLLM      | 133,966       | 98.37    | 1,361              |
| Nano-vLLM | 133,966       | 93.41    | 1,434              |

Benchmark run on an RTX 4070 Laptop (8GB) with Qwen3-0.6B, 256 sequences, input/output lengths randomly sampled between 100 and 1024 tokens.

Key features

Prefix caching

Reuse KV cache blocks across requests that share a common prompt prefix.
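The core idea: each KV cache block is keyed by a hash of the tokens it covers, chained with the previous block's hash, so a block is shared only when its entire prefix matches. A minimal pure-Python sketch of this bookkeeping (the block size, hash scheme, and function names here are illustrative, not Nano-vLLM's actual internals):

```python
from hashlib import sha256

BLOCK_SIZE = 4  # tokens per KV cache block (illustrative; real engines use larger blocks)

def block_hashes(token_ids):
    """Hash each full block, chaining in the previous block's hash so a
    block is only shared when its entire prefix matches."""
    hashes, prev = [], b""
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK_SIZE, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        h = sha256(prev + str(block).encode()).hexdigest()
        hashes.append(h)
        prev = h.encode()
    return hashes

cache = {}  # block hash -> physical block id, shared across requests

def allocate(token_ids):
    """Return (reused, newly allocated) block counts for a request."""
    reused = new = 0
    for h in block_hashes(token_ids):
        if h in cache:
            reused += 1
        else:
            cache[h] = len(cache)
            new += 1
    return reused, new

# Two prompts sharing an 8-token prefix: the second reuses 2 blocks.
prompt_a = list(range(12))
prompt_b = list(range(8)) + [99, 98, 97, 96]
print(allocate(prompt_a))  # (0, 3)
print(allocate(prompt_b))  # (2, 1)
```

Because hashes are chained, a block's identity encodes its whole prefix, which is what makes cross-request reuse safe: matching hashes imply matching attention context.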

Tensor parallelism

Scale across multiple GPUs using PyTorch NCCL for larger models.
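The underlying trick is to shard each weight matrix across GPUs, compute a partial result on each rank, then recombine with an NCCL collective (all-gather or all-reduce). A toy column-parallel matmul over simulated ranks, in plain Python with no GPUs (names and shapes are illustrative only):

```python
def matmul(x, w):
    """Naive matrix multiply: x is (m x k), w is (k x n)."""
    m, k, n = len(x), len(w), len(w[0])
    return [[sum(x[i][p] * w[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def column_parallel(x, w, world_size):
    """Shard w's output columns across `world_size` simulated ranks,
    compute each shard locally, then all-gather (concatenate) the results."""
    n = len(w[0])
    shard = n // world_size
    partials = []
    for rank in range(world_size):
        w_shard = [row[rank * shard:(rank + 1) * shard] for row in w]
        partials.append(matmul(x, w_shard))  # local compute on this "GPU"
    # all-gather: concatenate the column shards back into the full output
    return [[v for part in partials for v in part[i]] for i in range(len(x))]

x = [[1, 2]]
w = [[1, 2, 3, 4], [5, 6, 7, 8]]
assert column_parallel(x, w, world_size=2) == matmul(x, w)
```

In a real engine each rank holds only its shard, so per-GPU weight memory drops roughly by a factor of `world_size`, at the cost of one collective per sharded layer.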

CUDA graphs

Capture decode batches as CUDA graphs for lower kernel-launch overhead.
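The quick start below passes `enforce_eager=True`, which keeps the engine in eager mode. Assuming the flag mirrors vLLM's flag of the same name, setting it to `False` lets the engine capture decode batches as CUDA graphs and replay them, amortizing kernel-launch overhead (requires a CUDA GPU and downloaded weights):

```python
from nanovllm import LLM

# enforce_eager=False (assumed to mirror vLLM's flag) enables CUDA graph
# capture for decode batches; use enforce_eager=True when debugging.
llm = LLM("/path/to/Qwen3-0.6B", enforce_eager=False)
```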

Architecture deep-dive

Understand every component — scheduler, KV cache, model runner.

Get started

1. Install Nano-vLLM

pip install git+https://github.com/GeeeekExplorer/nano-vllm.git
2. Download a model

huggingface-cli download --resume-download Qwen/Qwen3-0.6B \
  --local-dir ~/huggingface/Qwen3-0.6B/ \
  --local-dir-use-symlinks False
3. Run inference

from nanovllm import LLM, SamplingParams

# enforce_eager=True skips CUDA graph capture; set tensor_parallel_size > 1
# to shard the model across multiple GPUs.
llm = LLM("/path/to/Qwen3-0.6B", enforce_eager=True, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Hello, Nano-vLLM."], sampling_params)
print(outputs[0]["text"])
