Quick Start
Run your first inference in minutes with a working code example.
Installation
Install Nano-vLLM and download model weights from Hugging Face.
Inference Guide
Learn how to batch requests, tune throughput, and interpret outputs.
API Reference
Explore the full public API — LLM, SamplingParams, and Config.

Why Nano-vLLM?
Nano-vLLM was built to answer a simple question: how much of vLLM’s performance can be achieved in a clean, readable codebase? The answer: essentially all of it.

| Engine | Output Tokens | Time (s) | Throughput (tok/s) |
|---|---|---|---|
| vLLM | 133,966 | 98.37 | 1,361 |
| Nano-vLLM | 133,966 | 93.41 | 1,434 |
Benchmark run on an RTX 4070 Laptop GPU (8 GB) with Qwen3-0.6B: 256 sequences, with input and output lengths randomly sampled from 100 to 1024 tokens.
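The throughput column follows directly from the other two: total output tokens divided by wall-clock time, truncated to a whole number. A quick check of the table's figures:

```python
# Recompute throughput (tok/s) from the benchmark table above.
rows = {
    "vLLM": (133_966, 98.37),
    "Nano-vLLM": (133_966, 93.41),
}
for engine, (tokens, seconds) in rows.items():
    throughput = int(tokens / seconds)  # truncate, matching the table
    print(f"{engine}: {throughput} tok/s")
# vLLM: 1361 tok/s
# Nano-vLLM: 1434 tok/s
```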
Key features
Prefix caching
Reuse KV cache blocks across requests that share a common prompt prefix.
Tensor parallelism
Scale across multiple GPUs using PyTorch NCCL for larger models.
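One building block of tensor parallelism is the column-parallel linear layer: each rank holds a column slice of the weight matrix, computes its shard locally, and the shards are concatenated. In a real engine the gather happens over NCCL; plain Python lists stand in for it in this sketch, and all names here are illustrative:

```python
# Minimal column-parallel matmul: each "rank" owns a column slice of W,
# and concatenating per-rank outputs reproduces the full x @ W.

def matmul(x, W):
    """Plain row-major matmul for lists of lists."""
    return [[sum(xi * wij for xi, wij in zip(row, col)) for col in zip(*W)]
            for row in x]

def split_columns(W, world_size):
    """Shard W column-wise across world_size ranks."""
    n = len(W[0]) // world_size
    return [[row[r * n:(r + 1) * n] for row in W] for r in range(world_size)]

x = [[1.0, 2.0]]
W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

shards = split_columns(W, world_size=2)
partials = [matmul(x, w) for w in shards]       # each rank computes locally
gathered = [sum((p[0] for p in partials), [])]  # stand-in for NCCL all-gather
print(gathered == matmul(x, W))  # prints: True
```

The per-rank work shrinks with the number of GPUs while the communication step stays a single gather per layer, which is why this layout scales well for large models.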
CUDA graphs
Capture decode batches as CUDA graphs for lower kernel-launch overhead.
Architecture deep-dive
Understand every component — scheduler, KV cache, model runner.