Running the benchmark

Nano-vLLM ships a self-contained benchmark script at bench.py. It generates a synthetic workload of random token sequences and measures sustained output throughput.
1. Download the model

huggingface-cli download --resume-download Qwen/Qwen3-0.6B \
  --local-dir ~/huggingface/Qwen3-0.6B/ \
  --local-dir-use-symlinks False
2. Install Nano-vLLM

pip install git+https://github.com/GeeeekExplorer/nano-vllm.git
3. Run the benchmark

python bench.py
Expected output:
Total: 133966tok, Time: 93.41s, Throughput: 1434.13tok/s

Benchmark script

The full bench.py source:
import os
import time
from random import randint, seed
from nanovllm import LLM, SamplingParams
# from vllm import LLM, SamplingParams


def main():
    seed(0)
    num_seqs = 256
    max_input_len = 1024
    max_output_len = 1024

    path = os.path.expanduser("~/huggingface/Qwen3-0.6B/")
    llm = LLM(path, enforce_eager=False, max_model_len=4096)

    prompt_token_ids = [[randint(0, 10000) for _ in range(randint(100, max_input_len))] for _ in range(num_seqs)]
    sampling_params = [SamplingParams(temperature=0.6, ignore_eos=True, max_tokens=randint(100, max_output_len)) for _ in range(num_seqs)]
    # uncomment the following line for vllm
    # prompt_token_ids = [dict(prompt_token_ids=p) for p in prompt_token_ids]

    llm.generate(["Benchmark: "], SamplingParams())
    t = time.time()
    llm.generate(prompt_token_ids, sampling_params, use_tqdm=False)
    t = (time.time() - t)
    total_tokens = sum(sp.max_tokens for sp in sampling_params)
    throughput = total_tokens / t
    print(f"Total: {total_tokens}tok, Time: {t:.2f}s, Throughput: {throughput:.2f}tok/s")


if __name__ == "__main__":
    main()

Parameters explained

| Parameter      | Value | Description                                                                       |
| -------------- | ----- | --------------------------------------------------------------------------------- |
| num_seqs       | 256   | Number of independent requests in the batch                                        |
| max_input_len  | 1024  | Upper bound for input length; actual lengths are uniformly sampled in [100, 1024]  |
| max_output_len | 1024  | Upper bound for output length; actual lengths are uniformly sampled in [100, 1024] |
| seed           | 0     | Fixed random seed for reproducibility                                              |
| ignore_eos     | True  | Prevents early termination so sequences run to their sampled max_tokens            |
A single warm-up call (llm.generate(["Benchmark: "], SamplingParams())) is made before the timed run to ensure CUDA graphs and any JIT compilation are fully initialised. Throughput is measured as output tokens per second — only the generated tokens count, not the prompt tokens:
total_tokens = sum(sp.max_tokens for sp in sampling_params)
throughput = total_tokens / t
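As a quick sanity check, the reported numbers are self-consistent: dividing the total output tokens by the wall-clock time reproduces the printed throughput (up to the second decimal, since the time is rounded to two places in the output line).

```python
# Recompute the throughput from the reported benchmark line:
# Total: 133966tok, Time: 93.41s, Throughput: 1434.13tok/s
total_tokens = 133966
elapsed_s = 93.41  # rounded in the printed output
throughput = total_tokens / elapsed_s
print(f"{throughput:.2f} tok/s")  # → 1434.17 tok/s, within rounding of 1434.13
```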

Results

Test configuration:
  • Hardware: RTX 4070 Laptop (8 GB VRAM)
  • Model: Qwen3-0.6B
  • Requests: 256 sequences
  • Input length: randomly sampled between 100–1024 tokens
  • Output length: randomly sampled between 100–1024 tokens
| Inference Engine | Output Tokens | Time (s) | Throughput (tok/s) |
| ---------------- | ------------- | -------- | ------------------ |
| vLLM             | 133,966       | 98.37    | 1,361.84           |
| Nano-vLLM        | 133,966       | 93.41    | 1,434.13           |
Nano-vLLM achieves ~5% higher throughput than vLLM on this workload with significantly fewer lines of code.
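The ~5% figure follows directly from the throughput column:

```python
# Relative throughput gain of Nano-vLLM over vLLM, from the table above
vllm_tps = 1361.84
nano_tps = 1434.13
gain = nano_tps / vllm_tps - 1
print(f"{gain:.1%}")  # → 5.3%
```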

Performance tips

Enable CUDA graphs

Keep enforce_eager=False (the default). Nano-vLLM captures CUDA graphs for decode batch sizes [1, 2, 4, 8, 16, 32, …, 512], which eliminates Python overhead and reduces kernel launch latency during decode.
llm = LLM(path, enforce_eager=False)
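A decode batch whose size falls between two captured sizes runs on the next-larger graph. The helper below is an illustrative sketch of that bucketing logic under the power-of-two capture list quoted above; it is not nano-vLLM's internal API.

```python
from bisect import bisect_left

# Illustrative only: power-of-two decode batch sizes up to 512,
# matching the capture list described in the text.
CAPTURE_SIZES = [2**i for i in range(10)]  # [1, 2, 4, ..., 512]

def graph_batch_size(batch_size: int) -> int:
    """Smallest captured graph size that can hold this decode batch."""
    idx = bisect_left(CAPTURE_SIZES, batch_size)
    if idx == len(CAPTURE_SIZES):
        raise ValueError("batch exceeds largest captured graph")
    return CAPTURE_SIZES[idx]

print(graph_batch_size(3))    # → 4
print(graph_batch_size(256))  # → 256
```

Padding a batch of 3 up to a graph captured for 4 wastes a little compute, but replaying a pre-captured graph still beats launching every kernel from Python.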

Set max_model_len appropriately

max_model_len is used during the warm-up pass to measure peak GPU memory usage and calculate how many KV cache blocks can be allocated. Setting it higher than your actual sequence lengths wastes VRAM on the warm-up forward pass.
llm = LLM(path, max_model_len=2048)  # if sequences are short

Maximise batch size

The scheduler fills decode batches up to max_num_seqs (default 512). Sending more concurrent requests keeps the GPU fully occupied. In the benchmark, all 256 requests are submitted before generate() starts so the scheduler can always find work.
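As a toy illustration of why submitting everything up front helps, the sketch below simulates a scheduler that refills its decode batch from a request queue on every step. This is a deliberately simplified model (fixed output length, no prefill phase), not nano-vLLM's actual scheduler.

```python
def decode_steps(num_reqs: int, max_num_seqs: int, out_len: int = 100) -> int:
    """Toy model: each step decodes one token for up to max_num_seqs
    running requests; finished slots are refilled from the waiting queue."""
    remaining = [out_len] * num_reqs     # tokens left per request
    waiting = list(range(num_reqs))      # requests not yet scheduled
    running: list[int] = []
    steps = 0
    while waiting or running:
        # refill the decode batch up to max_num_seqs
        while waiting and len(running) < max_num_seqs:
            running.append(waiting.pop())
        steps += 1
        for r in running:
            remaining[r] -= 1
        running = [r for r in running if remaining[r] > 0]
    return steps

# All 256 requests fit in one 512-slot batch, so the workload
# finishes in a single "wave" of out_len decode steps.
print(decode_steps(256, max_num_seqs=512))  # → 100
# Halving the batch capacity to 128 forces two waves.
print(decode_steps(256, max_num_seqs=128))  # → 200
```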

Use token ID input for benchmarks

Passing list[list[int]] instead of list[str] skips the tokenizer and removes it from the critical path. The benchmark does this explicitly:
prompt_token_ids = [
    [randint(0, 10000) for _ in range(randint(100, 1024))]
    for _ in range(num_seqs)
]
llm.generate(prompt_token_ids, sampling_params)
