Requirements

  • Python >=3.10, <3.13
  • CUDA-capable GPU (NVIDIA)
  • The following Python packages (installed automatically):
Package       Minimum Version
torch         >=2.4.0
triton        >=3.0.0
transformers  >=4.51.0
flash-attn    latest
xxhash        latest
flash-attn is compiled against your local CUDA toolkit during installation and can take several minutes to build. Ensure your CUDA version is compatible with your PyTorch installation before proceeding.
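One way to sanity-check this before installing is to print the CUDA version your PyTorch build targets (a sketch; it assumes PyTorch may or may not be installed yet and skips the check if it is missing):

```python
import importlib.util

# Pre-flight check: report the CUDA version PyTorch was built against.
# flash-attn compiles against the local CUDA toolkit, so this should match
# (at least in major version) the toolkit reported by `nvcc --version`.
if importlib.util.find_spec("torch") is None:
    print("torch is not installed yet; install it before building flash-attn")
else:
    import torch
    print("torch version:", torch.__version__)
    print("built for CUDA:", torch.version.cuda)
    print("GPU visible:", torch.cuda.is_available())
```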

Install Nano-vLLM

1. Install from GitHub

Install the latest version of Nano-vLLM directly from the source repository:
pip install git+https://github.com/GeeeekExplorer/nano-vllm.git
This will also install all required dependencies listed above.
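If you want to keep the install isolated from your system packages, a fresh virtual environment works as well (a sketch; the environment name is arbitrary):

```shell
# Optional: install into a dedicated virtual environment
# (remember: Python 3.10-3.12 is required).
python3 -m venv nano-vllm-env
source nano-vllm-env/bin/activate
pip install git+https://github.com/GeeeekExplorer/nano-vllm.git
```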
2. Download a model

Nano-vLLM loads models from a local directory. Use huggingface-cli to download model weights:
huggingface-cli download --resume-download Qwen/Qwen3-0.6B \
  --local-dir ~/huggingface/Qwen3-0.6B/ \
  --local-dir-use-symlinks False
Replace Qwen/Qwen3-0.6B and the --local-dir path with the model and location of your choice. Any model supported by transformers and flash-attn should work.
If huggingface-cli is not installed, run pip install huggingface_hub first.
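If you prefer to script the download, the same weights can be fetched from Python with huggingface_hub's snapshot_download (a sketch; the repo id and target directory mirror the CLI example above):

```python
import os
from huggingface_hub import snapshot_download

# Fetch the model repository into a local directory.
# Interrupted downloads resume automatically.
snapshot_download(
    repo_id="Qwen/Qwen3-0.6B",
    local_dir=os.path.expanduser("~/huggingface/Qwen3-0.6B"),
)
```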
3. Verify the installation

Confirm that Nano-vLLM imports successfully:
from nanovllm import LLM, SamplingParams
print("Nano-vLLM installed successfully")

Next Steps

Once installation is complete, follow the Quickstart guide to run your first inference.