Nano-vLLM is a minimal implementation of a high-performance LLM inference engine. It delivers throughput comparable to vLLM while keeping the entire codebase readable and hackable — perfect for learning, research, and lightweight production deployments.

Quick Start

Run your first inference in minutes with a working code example.

Installation

Install Nano-vLLM and download model weights from Hugging Face.

Inference Guide

Learn how to batch requests, tune throughput, and interpret outputs.

API Reference

Explore the full public API — LLM, SamplingParams, and Config.

Why Nano-vLLM?

Nano-vLLM was built to answer a simple question: how much of vLLM’s performance can be achieved in a clean, readable codebase? The answer: essentially all of it.
| Engine    | Output Tokens | Time (s) | Throughput (tok/s) |
| --------- | ------------- | -------- | ------------------ |
| vLLM      | 133,966       | 98.37    | 1,361              |
| Nano-vLLM | 133,966       | 93.41    | 1,434              |

Benchmark run on an RTX 4070 Laptop (8GB) with Qwen3-0.6B, 256 sequences, input/output lengths randomly sampled between 100 and 1024 tokens.

Key features

Prefix caching

Reuse KV cache blocks across requests that share a common prompt prefix.
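The core idea: each KV cache block is keyed by a hash of the tokens it covers, chained with the previous block's hash, so a block is shared only when its entire prefix matches. A minimal pure-Python sketch of this bookkeeping (the block size, hash scheme, and function names here are illustrative, not Nano-vLLM's actual internals):

```python
from hashlib import sha256

BLOCK_SIZE = 4  # tokens per KV cache block (illustrative; real engines use larger blocks)

def block_hashes(token_ids):
    """Hash each full block, chaining in the previous block's hash so a
    block is only shared when its entire prefix matches."""
    hashes, prev = [], b""
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK_SIZE, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        h = sha256(prev + str(block).encode()).hexdigest()
        hashes.append(h)
        prev = h.encode()
    return hashes

cache = {}  # block hash -> physical block id, shared across requests

def allocate(token_ids):
    """Return (reused, newly allocated) block counts for a request."""
    reused = new = 0
    for h in block_hashes(token_ids):
        if h in cache:
            reused += 1
        else:
            cache[h] = len(cache)
            new += 1
    return reused, new

# Two prompts sharing an 8-token prefix: the second reuses 2 blocks.
prompt_a = list(range(12))
prompt_b = list(range(8)) + [99, 98, 97, 96]
print(allocate(prompt_a))  # (0, 3)
print(allocate(prompt_b))  # (2, 1)
```

Because hashes are chained, a block's identity encodes its whole prefix, which is what makes cross-request reuse safe: matching hashes imply matching attention context.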

Tensor parallelism

Scale across multiple GPUs using PyTorch NCCL for larger models.
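The underlying trick is to shard each weight matrix across GPUs, compute a partial result on each rank, then recombine with an NCCL collective (all-gather or all-reduce). A toy column-parallel matmul over simulated ranks, in plain Python with no GPUs (names and shapes are illustrative only):

```python
def matmul(x, w):
    """Naive matrix multiply: x is (m x k), w is (k x n)."""
    m, k, n = len(x), len(w), len(w[0])
    return [[sum(x[i][p] * w[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def column_parallel(x, w, world_size):
    """Shard w's output columns across `world_size` simulated ranks,
    compute each shard locally, then all-gather (concatenate) the results."""
    n = len(w[0])
    shard = n // world_size
    partials = []
    for rank in range(world_size):
        w_shard = [row[rank * shard:(rank + 1) * shard] for row in w]
        partials.append(matmul(x, w_shard))  # local compute on this "GPU"
    # all-gather: concatenate the column shards back into the full output
    return [[v for part in partials for v in part[i]] for i in range(len(x))]

x = [[1, 2]]
w = [[1, 2, 3, 4], [5, 6, 7, 8]]
assert column_parallel(x, w, world_size=2) == matmul(x, w)
```

In a real engine each rank holds only its shard, so per-GPU weight memory drops roughly by a factor of `world_size`, at the cost of one collective per sharded layer.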

CUDA graphs

Capture decode batches as CUDA graphs for lower kernel-launch overhead.
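The quick start below passes `enforce_eager=True`, which keeps the engine in eager mode. Assuming the flag mirrors vLLM's flag of the same name, setting it to `False` lets the engine capture decode batches as CUDA graphs and replay them, amortizing kernel-launch overhead (requires a CUDA GPU and downloaded weights):

```python
from nanovllm import LLM

# enforce_eager=False (assumed to mirror vLLM's flag) enables CUDA graph
# capture for decode batches; use enforce_eager=True when debugging.
llm = LLM("/path/to/Qwen3-0.6B", enforce_eager=False)
```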

Architecture deep-dive

Understand every component — scheduler, KV cache, model runner.

Get started

1. Install Nano-vLLM

pip install git+https://github.com/GeeeekExplorer/nano-vllm.git
2. Download a model

huggingface-cli download --resume-download Qwen/Qwen3-0.6B \
  --local-dir ~/huggingface/Qwen3-0.6B/ \
  --local-dir-use-symlinks False
3. Run inference

from nanovllm import LLM, SamplingParams

# enforce_eager=True skips CUDA graph capture; set tensor_parallel_size > 1
# to shard the model across multiple GPUs.
llm = LLM("/path/to/Qwen3-0.6B", enforce_eager=True, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Hello, Nano-vLLM."], sampling_params)
print(outputs[0]["text"])
