## Performance

Benchmark run on an RTX 4070 Laptop GPU (8 GB) with the Qwen3-0.6B model: 256 sequences, with input and output lengths each randomly sampled from 100–1024 tokens.

| Inference Engine | Output Tokens | Time (s) | Throughput (tokens/s) |
|---|---|---|---|
| vLLM | 133,966 | 98.37 | 1,361.84 |
| Nano-vLLM | 133,966 | 93.41 | 1,434.13 |
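The throughput column is simply output tokens divided by wall-clock time; recomputing it from the table's own numbers reproduces the reported values to within rounding (the times are rounded to two decimals):

```python
# Throughput (tokens/s) = output tokens / wall-clock time (s),
# using the numbers from the benchmark table above.
runs = {
    "vLLM": (133_966, 98.37),
    "Nano-vLLM": (133_966, 93.41),
}

for engine, (tokens, seconds) in runs.items():
    throughput = tokens / seconds
    print(f"{engine}: {throughput:,.2f} tokens/s")
```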
Benchmark results are hardware- and model-dependent. Run `bench.py` from the repository to reproduce these numbers or to measure performance on your own hardware.

## Key Features
### Fast Offline Inference
Throughput comparable to — and sometimes exceeding — vLLM on the same hardware.
### Readable Codebase
The entire engine is implemented in ~1,200 lines of Python, making it easy to read, understand, and modify.
### Prefix Caching
Reuse KV-cache blocks for shared prompt prefixes, reducing redundant computation across requests.
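The core idea can be sketched in a few lines of plain Python (the names and block size below are illustrative, not the engine's actual data structures): KV blocks are keyed by a hash of the token prefix up to and including that block, so two requests that share a prompt prefix resolve to the same cached blocks.

```python
from hashlib import sha256

BLOCK_SIZE = 4  # tokens per KV-cache block (toy value for illustration)

def block_keys(token_ids):
    """Key each full block by a hash of the entire prefix up to and
    including that block, so a key identifies content *and* position."""
    keys = []
    for end in range(BLOCK_SIZE, len(token_ids) + 1, BLOCK_SIZE):
        keys.append(sha256(str(token_ids[:end]).encode()).hexdigest())
    return keys

cache = {}  # key -> (simulated) KV block

def compute_with_cache(token_ids):
    hits = misses = 0
    for key in block_keys(token_ids):
        if key in cache:
            hits += 1
        else:
            cache[key] = object()  # stand-in for the block's KV tensors
            misses += 1
    return hits, misses

shared = list(range(8))            # 8-token shared system prompt
a = shared + [100, 101, 102, 103]  # request A
b = shared + [200, 201, 202, 203]  # request B, same prefix as A

print(compute_with_cache(a))  # → (0, 3): every block is a miss at first
print(compute_with_cache(b))  # → (2, 1): the two shared-prefix blocks hit
```

Only the block covering B's distinct suffix needs fresh computation; the shared-prefix blocks are reused.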
### Tensor Parallelism

Distribute model weights across multiple GPUs with a configurable `tensor_parallel_size` (1–8 GPUs).

### CUDA Graphs
Capture and replay decode steps as CUDA graphs to reduce kernel launch overhead and improve throughput.
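The tensor-parallel scheme described above rests on a simple identity: splitting a weight matrix column-wise across devices, computing partial outputs, and concatenating them (an all-gather in a real engine) reproduces the full matmul. A toy sketch in plain Python, with device placement and communication elided:

```python
def matmul(x, w):
    """Row vector x (list) times weight matrix w (list of rows)."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]

def split_columns(w, n):
    """Shard a weight matrix column-wise across n 'devices'."""
    per = len(w[0]) // n
    return [[row[d * per:(d + 1) * per] for row in w] for d in range(n)]

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

# Each 'device' computes a partial output from its own column shard.
shards = split_columns(w, n=2)
partials = [matmul(x, shard) for shard in shards]

# Concatenating the partials recovers the unsharded result.
gathered = [v for p in partials for v in p]
assert gathered == matmul(x, w)
```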
### Flash Attention

Uses `flash-attn` for memory-efficient attention computation with reduced VRAM usage.

## Next Steps
- **Installation**: Install Nano-vLLM and its dependencies.
- **Quickstart**: Run your first inference in minutes.