Running the benchmark
Nano-vLLM ships a self-contained benchmark script at `bench.py`. It generates a synthetic workload of random token sequences and measures sustained output throughput.
Benchmark script
The full `bench.py` source:
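In place of the listing, here is a minimal sketch of the script's overall shape, consistent with the parameters described below. The helper names, the 10,000 vocabulary bound, and the injected engine object are illustrative assumptions, not the actual source:

```python
import time
from random import randint, seed


def build_workload(num_seqs=256, max_input_len=1024, max_output_len=1024):
    """Synthetic workload: random token-ID prompts and sampled output lengths."""
    seed(0)  # fixed seed for reproducibility
    prompts = [
        [randint(0, 10_000) for _ in range(randint(100, max_input_len))]
        for _ in range(num_seqs)
    ]
    output_lens = [randint(100, max_output_len) for _ in range(num_seqs)]
    return prompts, output_lens


def run_benchmark(llm, sampling_params_cls):
    """`llm` and `sampling_params_cls` stand in for the engine and its
    SamplingParams class; construction details are omitted here."""
    prompts, output_lens = build_workload()
    params = [sampling_params_cls(ignore_eos=True, max_tokens=n) for n in output_lens]
    llm.generate(["Benchmark: "], sampling_params_cls())  # warm-up pass
    start = time.time()
    llm.generate(prompts, params)  # timed run
    elapsed = time.time() - start
    print(f"{sum(output_lens) / elapsed:.2f} tok/s")
```

The workload is built entirely from token IDs, so tokenization never appears in the timed region.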
Parameters explained
| Parameter | Value | Description |
|---|---|---|
| `num_seqs` | 256 | Number of independent requests in the batch |
| `max_input_len` | 1024 | Upper bound for input length; actual lengths are uniformly sampled in [100, 1024] |
| `max_output_len` | 1024 | Upper bound for output length; actual lengths are uniformly sampled in [100, 1024] |
| `seed` | 0 | Fixed random seed for reproducibility |
| `ignore_eos` | True | Prevents early termination so sequences run to their sampled `max_tokens` |
A warm-up call, `llm.generate(["Benchmark: "], SamplingParams())`, is made before the timed run to ensure CUDA graphs and any JIT compilation are fully initialised.
Throughput is measured as output tokens per second; only the generated tokens count, not the prompt tokens.
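Using the Nano-vLLM numbers from the results table below, the computation is just a division (the tiny difference from the table's 1,434.13 comes from the reported time being rounded to two decimals):

```python
# Only generated tokens count toward throughput; prompt tokens are excluded.
output_tokens = 133_966  # total generated tokens across all 256 requests
elapsed_s = 93.41        # wall-clock time of the timed generate() call
throughput = output_tokens / elapsed_s
print(f"{throughput:.2f} tok/s")  # ~1434 tok/s
```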
Results
Test configuration:
- Hardware: RTX 4070 Laptop (8 GB VRAM)
- Model: Qwen3-0.6B
- Requests: 256 sequences
- Input length: randomly sampled between 100–1024 tokens
- Output length: randomly sampled between 100–1024 tokens
| Inference Engine | Output Tokens | Time (s) | Throughput (tok/s) |
|---|---|---|---|
| vLLM | 133,966 | 98.37 | 1,361.84 |
| Nano-vLLM | 133,966 | 93.41 | 1,434.13 |
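Because both engines generate the identical 133,966 output tokens (same seed, same sampled lengths), the wall-clock times compare directly. A quick check of the relative speedup:

```python
# Same workload on both engines, so elapsed time is the whole comparison.
vllm_s, nano_s = 98.37, 93.41
speedup = vllm_s / nano_s
print(f"{(speedup - 1) * 100:.1f}% faster")  # ~5% faster wall-clock
```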
Performance tips
Enable CUDA graphs

Keep `enforce_eager=False` (the default). Nano-vLLM captures CUDA graphs for decode batch sizes [1, 2, 4, 8, 16, 32, …, 512], which eliminates Python overhead and reduces kernel launch latency during decode.

Set `max_model_len` appropriately
`max_model_len` is used during the warm-up pass to measure peak GPU memory usage and calculate how many KV cache blocks can be allocated. Setting it higher than your actual sequence lengths wastes VRAM on the warm-up forward pass.

Maximise batch size
The scheduler fills decode batches up to `max_num_seqs` (default 512). Sending more concurrent requests keeps the GPU fully occupied. In the benchmark, all 256 requests are submitted before `generate()` starts so the scheduler can always find work.

Use token ID input for benchmarks
Passing `list[list[int]]` instead of `list[str]` skips the tokenizer and removes it from the critical path. The benchmark does this explicitly:
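The exact lines are not reproduced here; a minimal sketch of building such a token-ID batch (the 10,000 vocabulary bound is an arbitrary assumption):

```python
from random import randint, seed

seed(0)
# A list[list[int]] batch: raw token IDs instead of strings, so no
# tokenizer work sits inside the timed generate() call.
prompt_token_ids: list[list[int]] = [
    [randint(0, 10_000) for _ in range(randint(100, 1024))]
    for _ in range(256)
]
# llm.generate(prompt_token_ids, sampling_params) consumes these directly,
# whereas llm.generate(["some prompt", ...], ...) must tokenize first.
```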