Prerequisites
Before you begin, confirm that you have:
- Python 3.11 exactly (`>=3.11, <3.12` is required)
- A CUDA-capable GPU (required for Triton kernels and torch CUDA ops)
- Git to clone the repository
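The prerequisites above can be checked with a short script. This is a minimal sketch, not part of the repository: the helper names are hypothetical, and the CUDA check only works once PyTorch is installed (it reports `False` otherwise).

```python
import shutil
import sys


def python_version_ok(version_info=sys.version_info):
    """Return True only for Python 3.11.x (>=3.11, <3.12)."""
    return (version_info[0], version_info[1]) == (3, 11)


def cuda_available():
    """Return True if PyTorch is importable and sees a CUDA device."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed yet, so CUDA cannot be probed this way.
        return False
    return torch.cuda.is_available()


def git_available():
    """Return True if a `git` executable is on PATH."""
    return shutil.which("git") is not None


if __name__ == "__main__":
    print("Python 3.11:", python_version_ok())
    print("CUDA GPU:   ", cuda_available())
    print("Git:        ", git_available())
```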
Installation
Install uv
miniVLLM uses uv for dependency management. Install it with:
Run the inference demo
The main demo runs end-to-end inference through the custom engine:
- Loads the `Qwen/Qwen3-0.6B` tokenizer
- Initializes `LLMEngine` with a small Qwen3 model (random weights for speed)
- Creates chat prompts and tokenizes them using the model’s chat template
- Processes them through the engine using paged attention and KV cache management
- Generates up to 256 new tokens per prompt with temperature sampling
- Prints each prompt alongside its completion
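To make the "paged attention and KV cache management" step more concrete, here is a toy sketch of the bookkeeping behind a paged KV cache: each sequence holds a block table that maps logical token positions to fixed-size physical blocks. The class and function names are hypothetical and this is not the engine's actual allocator; it only illustrates the general idea.

```python
def blocks_needed(num_tokens, block_size):
    """Number of fixed-size blocks required to hold num_tokens (ceiling division)."""
    return (num_tokens + block_size - 1) // block_size


def locate_token(position, block_table, block_size):
    """Map a logical token position to (physical_block_id, offset_in_block)."""
    logical_block = position // block_size
    return block_table[logical_block], position % block_size


class BlockAllocator:
    """Toy free-list allocator over a fixed pool of KV cache blocks."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self, count):
        if count > len(self.free):
            raise RuntimeError("out of KV cache blocks")
        taken, self.free = self.free[:count], self.free[count:]
        return taken

    def release(self, block_ids):
        # Blocks of a finished sequence go back to the pool for reuse.
        self.free.extend(block_ids)


allocator = BlockAllocator(num_blocks=8)
# A 10-token sequence with 4 tokens per block needs 3 blocks.
block_table = allocator.allocate(blocks_needed(num_tokens=10, block_size=4))
# Token 9 sits in the sequence's third block, at offset 1.
block_id, offset = locate_token(9, block_table, block_size=4)
```

Because blocks are allocated on demand and returned when a sequence finishes, sequences of different lengths can share one memory pool without reserving space for the maximum length up front.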
Configuration
The demo is configured by the `config` dict at the top of `main.py`. The engine requires both scheduling/memory keys and model architecture keys:
| Key | Description |
|---|---|
| `world_size` | Number of GPUs. Set to `n` for tensor parallelism across `n` GPUs. |
| `max_num_sequences` | Maximum sequences scheduled in one iteration. |
| `max_num_batched_tokens` | Maximum total tokens in a single forward pass (scheduler). |
| `max_num_batch_tokens` | Maximum tokens in the warmup dry-run (model runner). |
| `max_cached_blocks` | KV cache block pool hint; overridden by measured GPU memory at runtime. |
| `block_size` | Tokens per KV cache block. |
| `eos` | End-of-sequence token ID. For Qwen3-0.6B: `151645`. |
| `enforce_eager` | Disable CUDA graph capture when `True` (safer for debugging). |
| `gpu_memory_utilization` | Fraction of free GPU memory reserved for the KV cache pool. |
| `max_model_length` | Maximum total sequence length (prompt + completion tokens). |
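As an illustration of the scheduling/memory keys in the table, the sketch below builds a config dict and checks it for missing keys. All values except `eos` are illustrative assumptions rather than the defaults in `main.py`, the model architecture keys are omitted, and `validate_config` is a hypothetical helper, not part of the engine.

```python
# Scheduling/memory keys listed in the table.
REQUIRED_KEYS = {
    "world_size", "max_num_sequences", "max_num_batched_tokens",
    "max_num_batch_tokens", "max_cached_blocks", "block_size", "eos",
    "enforce_eager", "gpu_memory_utilization", "max_model_length",
}

# Illustrative values only; they are assumptions, not the defaults in main.py.
config = {
    "world_size": 1,
    "max_num_sequences": 16,
    "max_num_batched_tokens": 4096,
    "max_num_batch_tokens": 4096,
    "max_cached_blocks": 1024,
    "block_size": 256,
    "eos": 151645,  # Qwen3-0.6B end-of-sequence token ID (from the table)
    "enforce_eager": True,
    "gpu_memory_utilization": 0.9,
    "max_model_length": 1024,
    # The engine also requires model architecture keys; they are omitted here.
}


def validate_config(cfg):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - set(cfg))]
    if not 0.0 < cfg.get("gpu_memory_utilization", 0.5) <= 1.0:
        problems.append("gpu_memory_utilization must be in (0, 1]")
    if cfg.get("world_size", 1) < 1:
        problems.append("world_size must be >= 1")
    return problems


print(validate_config(config))
```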
Multi-GPU setup
To run with multiple GPUs, set `world_size` in the config dict to the number of GPUs you want to use:
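A minimal sketch of that change, assuming `config` is an ordinary Python dict. The helpers `with_world_size` and `visible_gpus` are hypothetical and only guard against requesting more GPUs than PyTorch can see.

```python
def with_world_size(config, num_gpus, available_gpus):
    """Return a copy of config with world_size set, after a sanity check."""
    if not 1 <= num_gpus <= available_gpus:
        raise ValueError(
            f"requested {num_gpus} GPUs, but {available_gpus} are visible"
        )
    updated = dict(config)
    updated["world_size"] = num_gpus
    return updated


def visible_gpus():
    """Count CUDA devices PyTorch can see; 0 if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


# Example: request 2 GPUs on a machine assumed to expose 4.
multi_gpu_config = with_world_size({"world_size": 1}, num_gpus=2, available_gpus=4)
```

On a real machine, `available_gpus=visible_gpus()` would replace the hard-coded count.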
Run benchmarks
The benchmarks compare the following attention implementations:
- PyTorch standard attention (O(N²) memory)
- Naive Triton kernel (O(N²) memory, limited to ≤128 tokens)
- Flash attention Triton kernel (O(N) memory)
- Naive PyTorch loop over paged KV cache
- Optimized PyTorch with vectorized gathering and masking
- Custom Triton paged attention kernel
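The memory difference between standard and flash-style attention comes from how the softmax is computed. The sketch below is a pure-Python illustration for a single query vector, not the repository's Triton kernels: the naive version stores one score per key before normalizing, while the streaming version walks the keys block by block and keeps only a running maximum, a running normalizer, and one output accumulator (the online-softmax idea). Function names and the block size are assumptions chosen for illustration.

```python
import math


def naive_attention(q, keys, values):
    """Single-query attention that materializes every score before the softmax."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def streaming_attention(q, keys, values, block_size=2):
    """Same result, but visits keys block by block with an online softmax."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    normalizer = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        block_k = keys[start:start + block_size]
        block_v = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) * scale for k in block_k]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        correction = math.exp(running_max - new_max)
        normalizer *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, block_v):
            w = math.exp(s - new_max)
            normalizer += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / normalizer for a in acc]
```

Both functions return the same vector up to floating-point error. With N queries, the naive approach holds an N×N score matrix, which is where the O(N²) memory figure comes from, whereas the streaming form needs only per-query running state.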
Use the API directly
You can use `LLMEngine` and `SamplingParams` directly in your own scripts:
SamplingParams fields
| Parameter | Type | Default | Description |
|---|---|---|---|
| `temperature` | `float` | `1.0` | Sampling temperature. Must be > 0 (greedy decoding is not supported). |
| `max_tokens` | `int` | `64` | Maximum number of tokens to generate (completion tokens only). |
| `max_model_length` | `int \| None` | `None` | Maximum total sequence length (prompt + completion). |
| `ignore_eos` | `bool` | `False` | Continue generating past the EOS token. |
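To see why `temperature` must be strictly positive, consider how temperature sampling is usually defined: logits are divided by the temperature before the softmax, so a value of 0 would divide by zero. The sketch below is a generic pure-Python illustration of that technique with hypothetical function names; it is not the engine's sampler.

```python
import math
import random


def temperature_probs(logits, temperature):
    """Softmax of logits / temperature; temperature must be strictly positive."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Lower temperatures concentrate probability on the largest logit.
print(temperature_probs([1.0, 2.0, 3.0], 1.0))
print(temperature_probs([1.0, 2.0, 3.0], 0.1))
```

A very small positive temperature approximates greedy decoding, since nearly all probability mass lands on the highest-scoring token.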
Understanding the output
During a `generate()` call, the engine prints throughput statistics for each scheduling step:
- Prefilling processes all input prompt tokens in parallel — throughput is high.
- Decoding generates one token per active sequence per step — throughput reflects the cost of paged attention over the growing KV cache.
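The gap between prefill and decode throughput follows from how many tokens each step handles. The sketch below is a simplified model with hypothetical helper names and made-up timings: it ignores scheduler limits such as `max_num_batched_tokens` and does not reproduce the engine's log format.

```python
def tokens_in_step(is_prefill, prompt_lengths, num_active_sequences):
    """Tokens processed in one scheduling step under the simplified model."""
    if is_prefill:
        return sum(prompt_lengths)  # every prompt token goes through together
    return num_active_sequences     # one new token per running sequence


def throughput(num_tokens, elapsed_seconds):
    """Tokens per second for a step that took elapsed_seconds."""
    return num_tokens / elapsed_seconds


# Hypothetical timings, chosen only to show the arithmetic.
prefill_tokens = tokens_in_step(True, [100, 200, 50], 3)   # 350 tokens
decode_tokens = tokens_in_step(False, [100, 200, 50], 3)   # 3 tokens
print(throughput(prefill_tokens, 0.5))   # tokens/s for a 0.5 s prefill step
print(throughput(decode_tokens, 0.02))   # tokens/s for a 20 ms decode step
```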
The `outputs` dict contains:
- `outputs["text"]`: list of decoded completion strings, one per prompt
- `outputs["token_ids"]`: list of token ID lists for each completion
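Since both lists are ordered per prompt, they can be zipped back together with the inputs. The helper below is a hypothetical sketch that relies only on the two keys described above; the sample data is a stand-in shaped like that dict, not real engine output.

```python
def pair_completions(prompts, outputs):
    """Zip prompts with their completions from an outputs dict."""
    texts = outputs["text"]
    token_ids = outputs["token_ids"]
    if not (len(prompts) == len(texts) == len(token_ids)):
        raise ValueError("expected one completion per prompt")
    return [
        {"prompt": p, "completion": t, "num_tokens": len(ids)}
        for p, t, ids in zip(prompts, texts, token_ids)
    ]


# Stand-in data shaped like the dict described above (not real engine output).
fake_outputs = {"text": ["Hi there.", "4"], "token_ids": [[9, 8, 7], [6]]}
rows = pair_completions(["Say hello", "What is 2 + 2?"], fake_outputs)
for row in rows:
    print(f"{row['prompt']!r} -> {row['completion']!r} ({row['num_tokens']} tokens)")
```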