Running the benchmark

Nano-vLLM ships a self-contained benchmark script at bench.py. It generates a synthetic workload of random token sequences and measures sustained output throughput.
1. Download the model

huggingface-cli download --resume-download Qwen/Qwen3-0.6B \
  --local-dir ~/huggingface/Qwen3-0.6B/ \
  --local-dir-use-symlinks False
2. Install Nano-vLLM

pip install git+https://github.com/GeeeekExplorer/nano-vllm.git
3. Run the benchmark

python bench.py
Expected output:
Total: 133966tok, Time: 93.41s, Throughput: 1434.13tok/s

Benchmark script

The full bench.py source:
import os
import time
from random import randint, seed
from nanovllm import LLM, SamplingParams
# from vllm import LLM, SamplingParams


def main():
    seed(0)
    num_seqs = 256
    max_input_len = 1024
    max_output_len = 1024

    path = os.path.expanduser("~/huggingface/Qwen3-0.6B/")
    llm = LLM(path, enforce_eager=False, max_model_len=4096)

    prompt_token_ids = [[randint(0, 10000) for _ in range(randint(100, max_input_len))] for _ in range(num_seqs)]
    sampling_params = [SamplingParams(temperature=0.6, ignore_eos=True, max_tokens=randint(100, max_output_len)) for _ in range(num_seqs)]
    # uncomment the following line for vllm
    # prompt_token_ids = [dict(prompt_token_ids=p) for p in prompt_token_ids]

    llm.generate(["Benchmark: "], SamplingParams())
    t = time.time()
    llm.generate(prompt_token_ids, sampling_params, use_tqdm=False)
    t = (time.time() - t)
    total_tokens = sum(sp.max_tokens for sp in sampling_params)
    throughput = total_tokens / t
    print(f"Total: {total_tokens}tok, Time: {t:.2f}s, Throughput: {throughput:.2f}tok/s")


if __name__ == "__main__":
    main()

Parameters explained

| Parameter      | Value | Description                                                                       |
| -------------- | ----- | --------------------------------------------------------------------------------- |
| num_seqs       | 256   | Number of independent requests in the batch                                        |
| max_input_len  | 1024  | Upper bound for input length; actual lengths are uniformly sampled in [100, 1024]  |
| max_output_len | 1024  | Upper bound for output length; actual lengths are uniformly sampled in [100, 1024] |
| seed           | 0     | Fixed random seed for reproducibility                                              |
| ignore_eos     | True  | Prevents early termination so sequences run to their sampled max_tokens            |
A single warm-up call (llm.generate(["Benchmark: "], SamplingParams())) is made before the timed run to ensure CUDA graphs and any JIT compilation are fully initialised. Throughput is measured as output tokens per second — only the generated tokens count, not the prompt tokens:
total_tokens = sum(sp.max_tokens for sp in sampling_params)
throughput = total_tokens / t
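As a quick sanity check, the reported numbers are self-consistent: dividing the total output tokens by the wall-clock time reproduces the printed throughput (up to the second decimal, since the time is rounded to two places in the output line).

```python
# Recompute the throughput from the reported benchmark line:
# Total: 133966tok, Time: 93.41s, Throughput: 1434.13tok/s
total_tokens = 133966
elapsed_s = 93.41  # rounded in the printed output
throughput = total_tokens / elapsed_s
print(f"{throughput:.2f} tok/s")  # → 1434.17 tok/s, within rounding of 1434.13
```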

Results

Test configuration:
  • Hardware: RTX 4070 Laptop (8 GB VRAM)
  • Model: Qwen3-0.6B
  • Requests: 256 sequences
  • Input length: randomly sampled between 100–1024 tokens
  • Output length: randomly sampled between 100–1024 tokens
| Inference Engine | Output Tokens | Time (s) | Throughput (tok/s) |
| ---------------- | ------------- | -------- | ------------------ |
| vLLM             | 133,966       | 98.37    | 1,361.84           |
| Nano-vLLM        | 133,966       | 93.41    | 1,434.13           |
Nano-vLLM achieves ~5% higher throughput than vLLM on this workload with significantly fewer lines of code.
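The ~5% figure follows directly from the throughput column:

```python
# Relative throughput gain of Nano-vLLM over vLLM, from the table above
vllm_tps = 1361.84
nano_tps = 1434.13
gain = nano_tps / vllm_tps - 1
print(f"{gain:.1%}")  # → 5.3%
```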

Performance tips

Enable CUDA graphs

Keep enforce_eager=False (the default). Nano-vLLM captures CUDA graphs for decode batch sizes [1, 2, 4, 8, 16, 32, …, 512], which eliminates Python overhead and reduces kernel launch latency during decode.
llm = LLM(path, enforce_eager=False)
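A decode batch whose size falls between two captured sizes runs on the next-larger graph. The helper below is an illustrative sketch of that bucketing logic under the power-of-two capture list quoted above; it is not nano-vLLM's internal API.

```python
from bisect import bisect_left

# Illustrative only: power-of-two decode batch sizes up to 512,
# matching the capture list described in the text.
CAPTURE_SIZES = [2**i for i in range(10)]  # [1, 2, 4, ..., 512]

def graph_batch_size(batch_size: int) -> int:
    """Smallest captured graph size that can hold this decode batch."""
    idx = bisect_left(CAPTURE_SIZES, batch_size)
    if idx == len(CAPTURE_SIZES):
        raise ValueError("batch exceeds largest captured graph")
    return CAPTURE_SIZES[idx]

print(graph_batch_size(3))    # → 4
print(graph_batch_size(256))  # → 256
```

Padding a batch of 3 up to a graph captured for 4 wastes a little compute, but replaying a pre-captured graph still beats launching every kernel from Python.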

Set max_model_len appropriately

max_model_len is used during the warm-up pass to measure peak GPU memory usage and calculate how many KV cache blocks can be allocated. Setting it higher than your actual sequence lengths wastes VRAM on the warm-up forward pass.
llm = LLM(path, max_model_len=2048)  # if sequences are short

Maximise batch size

The scheduler fills decode batches up to max_num_seqs (default 512). Sending more concurrent requests keeps the GPU fully occupied. In the benchmark, all 256 requests are submitted before generate() starts so the scheduler can always find work.
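As a toy illustration of why submitting everything up front helps, the sketch below simulates a scheduler that refills its decode batch from a request queue on every step. This is a deliberately simplified model (fixed output length, no prefill phase), not nano-vLLM's actual scheduler.

```python
def decode_steps(num_reqs: int, max_num_seqs: int, out_len: int = 100) -> int:
    """Toy model: each step decodes one token for up to max_num_seqs
    running requests; finished slots are refilled from the waiting queue."""
    remaining = [out_len] * num_reqs     # tokens left per request
    waiting = list(range(num_reqs))      # requests not yet scheduled
    running: list[int] = []
    steps = 0
    while waiting or running:
        # refill the decode batch up to max_num_seqs
        while waiting and len(running) < max_num_seqs:
            running.append(waiting.pop())
        steps += 1
        for r in running:
            remaining[r] -= 1
        running = [r for r in running if remaining[r] > 0]
    return steps

# All 256 requests fit in one 512-slot batch, so the workload
# finishes in a single "wave" of out_len decode steps.
print(decode_steps(256, max_num_seqs=512))  # → 100
# Halving the batch capacity to 128 forces two waves.
print(decode_steps(256, max_num_seqs=128))  # → 200
```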

Use token ID input for benchmarks

Passing list[list[int]] instead of list[str] skips the tokenizer and removes it from the critical path. The benchmark does this explicitly:
prompt_token_ids = [
    [randint(0, 10000) for _ in range(randint(100, 1024))]
    for _ in range(num_seqs)
]
llm.generate(prompt_token_ids, sampling_params)
