The TPS benchmark measures end-to-end inference throughput: how many output tokens the engine produces per second across all sequences in a batch. Run it with:
What TPS measures
TPS (tokens per second) is calculated as the total number of output tokens divided by the elapsed time, measured from the generate() call to the last cuda_sync(). This is end-to-end throughput: it includes both the prefill cost for each prompt and the decode cost for every generated token.
TPS is a throughput metric, not a latency metric. A high TPS means the engine is keeping the GPU busy across the full batch. A low per-sequence latency does not necessarily imply high TPS if the batch is small.
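The calculation described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual code: `measure_tps`, `generate_fn`, `sync_fn`, and `fake_generate` are hypothetical names, the engine is assumed to return one list of generated token IDs per prompt, and the returned keys are illustrative rather than the fields the benchmark reports. On a GPU, `sync_fn` would typically be something like `torch.cuda.synchronize`, so that the clock stops only after all queued kernels have finished.

```python
import time
from typing import Callable, List, Sequence


def measure_tps(
    generate_fn: Callable[[Sequence[str]], List[List[int]]],
    prompts: Sequence[str],
    sync_fn: Callable[[], None] = lambda: None,
) -> dict:
    """Time one batched generate call and return blended tokens per second."""
    sync_fn()  # make sure no earlier device work leaks into the timed region
    start = time.perf_counter()
    outputs = generate_fn(prompts)  # prefill and decode both happen in here
    sync_fn()  # wait for outstanding device work before stopping the clock
    elapsed = time.perf_counter() - start

    # Count only generated (output) tokens, summed over every sequence.
    total_tokens = sum(len(token_ids) for token_ids in outputs)
    return {
        "total_tokens": total_tokens,
        "elapsed_s": elapsed,
        "tps": total_tokens / elapsed,
    }


def fake_generate(prompts):
    """Stand-in engine (assumption): emits 4 dummy tokens per prompt."""
    time.sleep(0.01)
    return [[0, 1, 2, 3] for _ in prompts]


result = measure_tps(fake_generate, ["a", "b", "c"])
print(result["total_tokens"], round(result["tps"], 1))
```

Because the timed region wraps the whole call, a small batch can finish quickly per sequence and still yield a low TPS: the numerator grows with the number of sequences, while the elapsed time grows much more slowly as long as the GPU has spare capacity.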
Engines compared
miniVLLM
The custom engine under development. Uses paged attention, a block-based KV cache, and an iteration-level scheduler. Configured via a Python config dict.
vLLM
The reference production engine. Serves as the performance target. Loaded with gpu_memory_utilization=0.75 and max_model_len=256.
Transformers
The Hugging Face baseline. No paging and no batched scheduling: it uses model.generate() directly with padded inputs.
Benchmark setup
The benchmark targets Qwen/Qwen3-0.6B with three prompts that vary in expected output length:
miniVLLM engine configuration
The config dict used in the benchmark:
How TPS is reported
Each engine function returns a dict with three fields:
Prefill TPS vs. decode TPS
The benchmark reports a single blended TPS number that covers both phases:
- Prefill TPS
- Decode TPS
Prefill processes all input prompt tokens in parallel. It is compute-bound and typically fast per token. For short prompts with long outputs, prefill is a small fraction of total time.
Prefill cost is roughly proportional to sum(prompt_lengths) and grows with sequence length due to the O(N²) attention computation.
Factors that affect TPS
Batch size (max_num_sequences)
Larger batches amortize the fixed per-step overhead across more sequences and keep the GPU’s streaming multiprocessors busy. The benchmark uses max_num_sequences=16. Increasing this value (if GPU memory allows) is the single most effective way to raise throughput.
max_num_batched_tokens
Controls how many tokens the scheduler is allowed to prefill in a single iteration. A higher limit lets the engine process more prompt tokens per step during prefill, reducing the number of prefill iterations. The benchmark uses max_num_batched_tokens=1024.
block_size
The number of token positions stored per physical KV cache block. A larger block_size reduces block table overhead but increases internal fragmentation (unused slots at the end of each sequence’s last block). The benchmark uses block_size=256.
GPU memory (gpu_memory_utilization)
Set to 0.9 for miniVLLM: 90% of available GPU memory is reserved for the KV cache pool (max_cached_blocks=1024). More memory means more sequences can be cached simultaneously before eviction is needed.
Sequence length
Longer sequences mean more tokens to attend over at each decode step, which increases per-step latency and reduces decode TPS. The benchmark caps generation at OUTPUT_TOKENS=256 and the model’s max_model_length=128.
enforce_eager
Set to True in the benchmark, which disables CUDA graph capture. Enabling CUDA graphs (enforce_eager=False) can significantly improve decode TPS for fixed batch sizes by reducing kernel launch overhead.
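The settings quoted in the sections above can be gathered into a single dict, which also makes the block_size trade-off easy to check numerically. This is a sketch assembled only from the values mentioned on this page: the real benchmark config may contain additional keys, and `kv_block_usage` is a hypothetical helper for illustration, not part of miniVLLM.

```python
import math

# Values as quoted on this page; the actual benchmark dict may hold more keys.
config = {
    "max_num_sequences": 16,
    "max_num_batched_tokens": 1024,
    "block_size": 256,
    "max_cached_blocks": 1024,
    "gpu_memory_utilization": 0.9,
    "max_model_length": 128,
    "enforce_eager": True,
}


def kv_block_usage(seq_len: int, block_size: int) -> tuple:
    """Return (blocks needed, unused slots in the last block) for one sequence."""
    blocks = math.ceil(seq_len / block_size)
    wasted = blocks * block_size - seq_len
    return blocks, wasted


# A 300-token sequence with 256-slot blocks needs 2 blocks, leaving 212 slots unused.
print(kv_block_usage(300, config["block_size"]))  # (2, 212)
# With hypothetical 16-slot blocks it needs 19 blocks but wastes only 4 slots.
print(kv_block_usage(300, 16))  # (19, 4)
```

Smaller blocks waste fewer slots but need a longer block table per sequence, which is the trade-off described under block_size.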