Prerequisites

Before you begin, confirm that you have:
  • Python 3.11 (the project requires >=3.11, <3.12)
  • A CUDA-capable GPU (required for Triton kernels and torch CUDA ops)
  • Git to clone the repository
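
The first two prerequisites can be checked from Python before installing anything. The sketch below is a hypothetical helper, not part of the repository. It treats PyTorch as optional, on the assumption that torch only becomes importable after the project's dependencies are installed.

```python
import sys


def python_version_ok(version_info=sys.version_info):
    """Return True when the interpreter satisfies >=3.11, <3.12."""
    return (3, 11) <= tuple(version_info[:2]) < (3, 12)


def cuda_status():
    """Report CUDA availability, or 'unknown' when torch is not installed yet."""
    try:
        import torch
    except ImportError:
        return "unknown (torch not installed)"
    return "available" if torch.cuda.is_available() else "not available"


if __name__ == "__main__":
    print("Python 3.11:", "ok" if python_version_ok() else "wrong version")
    print("CUDA:", cuda_status())
```

Running the check again with uv run python after the dependencies are synced should give a definite CUDA answer, since torch is then importable.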

Installation

1. Install uv

miniVLLM uses uv for dependency management. Install it with:
curl -LsSf https://astral.sh/uv/install.sh | sh

2. Clone the repository

git clone https://github.com/Wenyueh/MinivLLM.git
cd MinivLLM

3. Sync dependencies

uv sync
uv sync reads pyproject.toml and installs all dependencies into an isolated virtual environment. No manual pip install is needed.

Run the inference demo

The main demo runs end-to-end inference through the custom engine:
uv run python main.py
This script:
  1. Loads the Qwen/Qwen3-0.6B tokenizer
  2. Initializes LLMEngine with a small Qwen3 model (random weights for speed)
  3. Creates chat prompts and tokenizes them using the model’s chat template
  4. Processes them through the engine using paged attention and KV cache management
  5. Requests up to 256 new tokens per prompt (max_tokens=256) with temperature sampling; total length is also bounded by max_model_length
  6. Prints each prompt alongside its completion

Configuration

The demo is configured by the config dict at the top of main.py. The engine requires both scheduling/memory keys and model architecture keys:
config = {
    # Scheduling and memory
    "model_name_or_path": "Qwen/Qwen3-0.6B",
    "world_size": 1,
    "max_num_sequences": 16,
    "max_num_batched_tokens": 1024,
    "max_cached_blocks": 1024,
    "block_size": 256,
    "eos": 151645,
    "enforce_eager": True,
    "gpu_memory_utilization": 0.9,
    "max_num_batch_tokens": 4096,   # warmup batch size
    "max_model_length": 128,        # max total tokens (prompt + completion)
    # Qwen3-0.6B model architecture
    "vocab_size": 151936,
    "hidden_size": 1024,
    "num_heads": 16,
    "head_dim": 128,
    "num_kv_heads": 8,
    "intermediate_size": 3072,
    "num_layers": 28,
    "tie_word_embeddings": True,
    "base": 1000000,
    "rms_norm_epsilon": 1e-6,
    "qkv_bias": False,
    "scale": 1,
    "max_position": 32768,
    "ffn_bias": False,
}
Key                     Description
world_size              Number of GPUs. Set to n for tensor parallelism across n GPUs.
max_num_sequences       Maximum sequences scheduled in one iteration.
max_num_batched_tokens  Maximum total tokens in a single forward pass (scheduler).
max_num_batch_tokens    Maximum tokens in the warmup dry-run (model runner).
max_cached_blocks       KV cache block pool hint; overridden by measured GPU memory at runtime.
block_size              Tokens per KV cache block.
eos                     End-of-sequence token ID. For Qwen3-0.6B: 151645.
enforce_eager           Disable CUDA graph capture when True (safer for debugging).
gpu_memory_utilization  Fraction of free GPU memory reserved for the KV cache pool.
max_model_length        Maximum total sequence length (prompt + completion tokens).
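
To get a feel for how block_size and the architecture keys interact, the sketch below estimates the memory taken by one KV cache block. It is a back-of-the-envelope calculation, not the engine's own accounting. It assumes a conventional layout (separate K and V tensors per layer), 16-bit cache entries, and KV heads split evenly across GPUs. The kv_block_bytes helper is hypothetical.

```python
def kv_block_bytes(config, dtype_bytes=2):
    """Estimate the bytes one KV cache block occupies on a single GPU.

    Assumed layout (not confirmed by the source): per layer, one K and one V
    tensor, each storing block_size * kv_heads_per_gpu * head_dim elements.
    dtype_bytes=2 assumes fp16/bf16 cache entries.
    """
    kv_heads_per_gpu = config["num_kv_heads"] // config["world_size"]
    per_layer = 2 * config["block_size"] * kv_heads_per_gpu * config["head_dim"]
    return per_layer * config["num_layers"] * dtype_bytes


# Only the keys this estimate needs, copied from the demo config.
config = {
    "world_size": 1,
    "block_size": 256,
    "num_kv_heads": 8,
    "head_dim": 128,
    "num_layers": 28,
}

print(kv_block_bytes(config) / 2**20, "MiB per block")  # 28.0 MiB per block
```

Under these assumptions, 1024 blocks would need about 28 GiB, which would explain why max_cached_blocks is only a hint and the pool is sized from measured GPU memory.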

Multi-GPU setup

To run with multiple GPUs, set world_size in the config dict to the number of GPUs you want to use:
config = {
    ...
    "world_size": 2,  # use 2 GPUs
    ...
}
The engine uses tensor parallelism and spawns one worker process per additional GPU rank.
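
Tensor parallelism commonly shards attention heads across GPUs, so world_size usually has to divide the head counts evenly. The sketch below lists the world sizes that pass this check for the demo's head counts. Whether MinivLLM enforces exactly this rule, or adds others (for example on intermediate_size or vocab_size), is an assumption. The valid_world_sizes helper is hypothetical.

```python
def valid_world_sizes(num_heads, num_kv_heads, max_gpus=8):
    """World sizes that split both query heads and KV heads evenly.

    Assumes head-sharded tensor parallelism; the real engine may impose
    additional constraints that this check does not cover.
    """
    return [
        n
        for n in range(1, max_gpus + 1)
        if num_heads % n == 0 and num_kv_heads % n == 0
    ]


# Head counts from the demo config: num_heads=16, num_kv_heads=8.
print(valid_world_sizes(16, 8))  # [1, 2, 4, 8]
```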

Run benchmarks

uv run python benchmark_prefilling.py
The prefill benchmark compares three attention implementations during the prompt-processing phase:
  1. PyTorch standard attention (O(N²) memory)
  2. Naive Triton kernel (O(N²) memory, limited to ≤128 tokens)
  3. Flash attention Triton kernel (O(N) memory)
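
The difference between O(N²) and O(N) memory comes from the online softmax trick: instead of materializing every attention score before normalizing, the flash-style approach visits keys and values in blocks and keeps only a running maximum, a running sum, and an accumulator. The pure-Python sketch below illustrates that idea for a single query row. It is not the repository's Triton kernel, and it leaves out the 1/sqrt(d) scaling and causal masking for brevity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_full(q, keys, values):
    """Reference: materialize every score for this query, then softmax."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(len(values[0]))
    ]


def attention_streaming(q, keys, values, block=2):
    """Online softmax: visit K/V in blocks, keeping only running statistics."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        scores = [dot(q, k) for k in keys[start:start + block]]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (0.0 on block one).
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, values[start:start + block]):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.0, -0.3], [0.5, 0.5], [-0.4, 0.9], [0.3, 0.3]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [0.1, 0.2]]

full = attention_full(q, keys, values)
streamed = attention_streaming(q, keys, values)
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(full, streamed)))
```

Both functions compute the same weighted average. The streaming version never holds more than one block of scores at a time.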
The decode benchmark compares three paged attention implementations during token generation:
  1. Naive PyTorch loop over paged KV cache
  2. Optimized PyTorch with vectorized gathering and masking
  3. Custom Triton paged attention kernel
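
All three decode implementations read keys and values through a paged KV cache, where a per-sequence block table maps logical token positions to physical cache blocks. The sketch below shows that address arithmetic in isolation. The function names and the example block table are hypothetical, and the engine's actual data structures may differ.

```python
def blocks_needed(num_tokens, block_size):
    """Number of KV cache blocks required to hold num_tokens (ceiling division)."""
    return -(-num_tokens // block_size)


def locate_token(position, block_table, block_size):
    """Map a logical token position to (physical_block_id, offset_in_block)."""
    logical_block = position // block_size
    return block_table[logical_block], position % block_size


block_size = 256         # matches block_size in the demo config
block_table = [7, 2, 9]  # hypothetical physical block ids for one sequence

print(blocks_needed(600, block_size))               # 3
print(locate_token(300, block_table, block_size))   # (2, 44)
```

A naive loop performs this lookup token by token. A vectorized or kernel-based implementation would do the same mapping for many positions at once.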

Use the API directly

You can use LLMEngine and SamplingParams directly in your own scripts:
import sys
from pathlib import Path

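# Make the src/ package importable; the relative path assumes the script runs from the repository root
sys.path.insert(0, str(Path(".") / "src"))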

from myvllm.engine.llm_engine import LLMEngine
from myvllm.sampling_parameters import SamplingParams
from transformers import AutoTokenizer

config = {
    "model_name_or_path": "Qwen/Qwen3-0.6B",
    "world_size": 1,
    "max_num_sequences": 16,
    "max_num_batched_tokens": 1024,
    "max_cached_blocks": 1024,
    "block_size": 256,
    "eos": 151645,
    "enforce_eager": True,
    "gpu_memory_utilization": 0.9,
    "max_num_batch_tokens": 4096,
    "max_model_length": 128,
    # Qwen3-0.6B architecture
    "vocab_size": 151936,
    "hidden_size": 1024,
    "num_heads": 16,
    "head_dim": 128,
    "num_kv_heads": 8,
    "intermediate_size": 3072,
    "num_layers": 28,
    "tie_word_embeddings": True,
    "base": 1000000,
    "rms_norm_epsilon": 1e-6,
    "qkv_bias": False,
    "scale": 1,
    "max_position": 32768,
    "ffn_bias": False,
}

tokenizer = AutoTokenizer.from_pretrained(config["model_name_or_path"])
llm = LLMEngine(config=config)

sampling_params = SamplingParams(temperature=0.6, max_tokens=256, max_model_length=128)

prompts = ["introduce yourself", "list all prime numbers within 100"]
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": p}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for p in prompts
]

outputs = llm.generate(prompts, sampling_params)

for prompt, text in zip(prompts, outputs["text"]):
    print(f"Prompt: {prompt}")
    print(f"Completion: {text}\n")

SamplingParams fields

Parameter         Type        Default  Description
temperature       float       1.0      Sampling temperature. Must be > 0 (greedy decoding is not supported).
max_tokens        int         64       Maximum number of tokens to generate (completion tokens only).
max_model_length  int | None  None     Maximum total sequence length (prompt + completion).
ignore_eos        bool        False    Continue generating past the EOS token.
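
The requirement that temperature be greater than 0 follows from how temperature sampling is usually defined: logits are divided by the temperature before the softmax, so a value of 0 would divide by zero. The sketch below shows the standard formulation in pure Python. It illustrates the general technique, not the engine's sampler, and both function names are hypothetical.

```python
import math
import random


def temperature_probs(logits, temperature):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng=random):
    """Draw one token index according to the temperature-scaled distribution."""
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]
print(temperature_probs(logits, 1.0))
print(temperature_probs(logits, 0.6))  # lower temperature sharpens the distribution
print(sample_token(logits, 0.6, random.Random(0)))
```

Lower temperatures concentrate probability on the highest logit, which approaches greedy decoding without ever reaching it.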

Understanding the output

During a generate() call, the engine prints throughput statistics for each scheduling step:
1024 number of processed tokens 18432.5 tokens/sec during prefilling
16 number of processed tokens 312.7 tokens/sec during decoding
  • Prefilling processes all input prompt tokens in parallel, so throughput is high.
  • Decoding generates one token per active sequence per step, so throughput is lower and reflects the cost of paged attention over the growing KV cache.
The final outputs dict contains:
  • outputs["text"] — list of decoded completion strings, one per prompt
  • outputs["token_ids"] — list of token ID lists for each completion
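
To track throughput across a run, the printed statistics can be captured and parsed. The sketch below is a hypothetical helper that matches the two sample lines shown above. The exact log format is taken from those samples and may differ between versions of the engine.

```python
import re

# Pattern derived from the sample lines above; treat the format as an assumption.
LOG_LINE = re.compile(
    r"^(?P<tokens>\d+) number of processed tokens "
    r"(?P<rate>[\d.]+) tokens/sec during (?P<phase>prefilling|decoding)$"
)


def parse_throughput(line):
    """Return a dict for a throughput line, or None if the line does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "tokens": int(match["tokens"]),
        "tokens_per_sec": float(match["rate"]),
        "phase": match["phase"],
    }


samples = [
    "1024 number of processed tokens 18432.5 tokens/sec during prefilling",
    "16 number of processed tokens 312.7 tokens/sec during decoding",
]
for line in samples:
    print(parse_throughput(line))
```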
