Quick start

Prerequisites

Before you begin, confirm that you have:

Python 3.11 exactly (the project requires >=3.11, <3.12)
A CUDA-capable GPU (required for Triton kernels and torch CUDA ops)
Git to clone the repository
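A quick preflight check can catch a mismatched interpreter before you sync dependencies. The sketch below is not part of the repository: the helper names `is_supported_python` and `cuda_available` are hypothetical, and the CUDA check only gives an answer once PyTorch is importable.

```python
import sys


def is_supported_python(version_info=None):
    """Return True when the interpreter satisfies >=3.11, <3.12."""
    major, minor = (version_info or sys.version_info)[:2]
    return (major, minor) == (3, 11)


def cuda_available():
    """Return torch's answer, or None when torch is not installed yet."""
    try:
        import torch
    except ImportError:
        return None
    return torch.cuda.is_available()


if __name__ == "__main__":
    print("Python 3.11:", is_supported_python())
    print("CUDA available:", cuda_available())
```

Run it once with the system interpreter before installing, and again with `uv run python` after syncing to confirm the GPU is visible from inside the project environment.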

Installation

Install uv

miniVLLM uses uv for dependency management. Install it with:

curl -LsSf https://astral.sh/uv/install.sh | sh

Clone the repository

git clone https://github.com/Wenyueh/MinivLLM.git
cd MinivLLM

Sync dependencies

uv sync

uv sync reads pyproject.toml and installs all dependencies into an isolated virtual environment. No manual pip install is needed.

Run the inference demo

The main demo runs end-to-end inference through the custom engine:

uv run python main.py

This script:

Loads the Qwen/Qwen3-0.6B tokenizer
Initializes LLMEngine with a small Qwen3 model (random weights for speed)
Creates chat prompts and tokenizes them using the model’s chat template
Processes them through the engine using paged attention and KV cache management
Requests up to 256 new tokens per prompt with temperature sampling (max_model_length in the config still caps prompt plus completion at 128 tokens)
Prints each prompt alongside its completion

Configuration

The demo is configured by the config dict at the top of main.py. The engine requires both scheduling/memory keys and model architecture keys:

config = {
    # Scheduling and memory
    "model_name_or_path": "Qwen/Qwen3-0.6B",
    "world_size": 1,
    "max_num_sequences": 16,
    "max_num_batched_tokens": 1024,
    "max_cached_blocks": 1024,
    "block_size": 256,
    "eos": 151645,
    "enforce_eager": True,
    "gpu_memory_utilization": 0.9,
    "max_num_batch_tokens": 4096,   # warmup batch size
    "max_model_length": 128,        # max total tokens (prompt + completion)
    # Qwen3-0.6B model architecture
    "vocab_size": 151936,
    "hidden_size": 1024,
    "num_heads": 16,
    "head_dim": 128,
    "num_kv_heads": 8,
    "intermediate_size": 3072,
    "num_layers": 28,
    "tie_word_embeddings": True,
    "base": 1000000,
    "rms_norm_epsilon": 1e-6,
    "qkv_bias": False,
    "scale": 1,
    "max_position": 32768,
    "ffn_bias": False,
}

Key	Description
`world_size`	Number of GPUs. Set to `n` for tensor parallelism across `n` GPUs.
`max_num_sequences`	Maximum sequences scheduled in one iteration.
`max_num_batched_tokens`	Maximum total tokens in a single forward pass (scheduler).
`max_num_batch_tokens`	Maximum tokens in the warmup dry-run (model runner).
`max_cached_blocks`	KV cache block pool hint; overridden by measured GPU memory at runtime.
`block_size`	Tokens per KV cache block.
`eos`	End-of-sequence token ID. For Qwen3-0.6B: `151645`.
`enforce_eager`	Disable CUDA graph capture when `True` (safer for debugging).
`gpu_memory_utilization`	Fraction of free GPU memory reserved for the KV cache pool.
`max_model_length`	Maximum total sequence length (prompt + completion tokens).
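To get a feel for how block_size interacts with the architecture keys, the following back-of-the-envelope sketch estimates the memory footprint of one KV cache block. It assumes a 2-byte element type (fp16 or bf16), a block that stores both keys and values for every layer, and a single GPU. The function names are hypothetical and the engine's actual allocation may differ.

```python
import math


def kv_block_bytes(config, dtype_bytes=2):
    """Bytes for one KV cache block across all layers (keys and values)."""
    per_token = config["num_kv_heads"] * config["head_dim"] * dtype_bytes
    return 2 * config["num_layers"] * config["block_size"] * per_token


def blocks_per_sequence(config):
    """Blocks needed to hold one sequence of max_model_length tokens."""
    return math.ceil(config["max_model_length"] / config["block_size"])


# Values copied from the demo config above.
config = {
    "num_layers": 28,
    "num_kv_heads": 8,
    "head_dim": 128,
    "block_size": 256,
    "max_model_length": 128,
}

print(kv_block_bytes(config) / 2**20, "MiB per block")       # 28.0 MiB per block
print(blocks_per_sequence(config), "block(s) per sequence")  # 1 block(s) per sequence
```

Under these assumptions a 256-token block costs 28 MiB, and a sequence limited to 128 total tokens never needs more than one block.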

Multi-GPU setup

To run with multiple GPUs, set world_size in the config dict to the number of GPUs you want to use:

config = {
    ...
    "world_size": 2,  # use 2 GPUs
    ...
}

The engine uses tensor parallelism and spawns one worker process per additional GPU rank.
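Tensor parallelism commonly splits attention heads evenly across ranks, so it can help to check divisibility before launching. The sketch below assumes MinivLLM follows that common scheme, which the text above does not confirm. `heads_per_rank` is a hypothetical helper, not an engine API.

```python
def heads_per_rank(num_heads, num_kv_heads, world_size):
    """Split query and KV heads evenly across tensor-parallel ranks."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if num_heads % world_size or num_kv_heads % world_size:
        raise ValueError(
            f"world_size={world_size} does not evenly divide "
            f"num_heads={num_heads} and num_kv_heads={num_kv_heads}"
        )
    return num_heads // world_size, num_kv_heads // world_size


# Qwen3-0.6B values from the config: 16 query heads, 8 KV heads.
print(heads_per_rank(16, 8, world_size=2))  # (8, 4)
```

With 16 query heads and 8 KV heads, world sizes of 1, 2, 4, and 8 pass this check.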

Run benchmarks

uv run python benchmark_prefilling.py

The prefill benchmark compares three attention implementations during the prompt-processing phase:

PyTorch standard attention (O(N²) memory)
Naive Triton kernel (O(N²) memory, limited to ≤128 tokens)
Flash attention Triton kernel (O(N) memory)
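The O(N²) figure comes from materializing the full N × N attention score matrix for every head. As a rough illustration, the sketch below computes that matrix's size for a single sequence, assuming 2-byte scores and 16 heads. The benchmark's actual dtype and batch shape may differ.

```python
def score_matrix_bytes(seq_len, num_heads, dtype_bytes=2):
    """Bytes needed to hold one N x N score matrix per head."""
    return num_heads * seq_len * seq_len * dtype_bytes


for n in (128, 1024, 4096):
    mib = score_matrix_bytes(n, num_heads=16) / 2**20
    print(f"N={n:>5}: {mib:g} MiB")
# N=  128: 0.5 MiB
# N= 1024: 32 MiB
# N= 4096: 512 MiB
```

Growing the prompt 4x grows this buffer 16x. A flash-attention kernel avoids that growth by processing the scores in tiles rather than storing the whole matrix.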

The decode benchmark compares three paged attention implementations during token generation:

Naive PyTorch loop over paged KV cache
Optimized PyTorch with vectorized gathering and masking
Custom Triton paged attention kernel
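All three decode variants have to solve the same lookup problem: a sequence's tokens live in non-contiguous physical blocks, and a per-sequence block table maps logical positions to them. The toy sketch below shows the naive loop version of that lookup in plain Python. The names `block_table`, `kv_pool`, and both functions are hypothetical, and the layout is simplified for illustration rather than taken from the repository.

```python
def locate_token(position, block_table, block_size):
    """Map a logical token position to (physical block id, offset in block)."""
    logical_block, offset = divmod(position, block_size)
    return block_table[logical_block], offset


def gather_keys(block_table, kv_pool, seq_len, block_size):
    """Naive loop: collect a sequence's entries from scattered blocks in order."""
    gathered = []
    for pos in range(seq_len):
        block_id, offset = locate_token(pos, block_table, block_size)
        gathered.append(kv_pool[block_id][offset])
    return gathered


# Toy pool: 4 physical blocks of 4 slots; entries are labeled strings.
pool = [[f"b{b}s{s}" for s in range(4)] for b in range(4)]

# This sequence owns physical blocks 3 then 1, and holds 6 tokens.
print(gather_keys([3, 1], pool, seq_len=6, block_size=4))
# ['b3s0', 'b3s1', 'b3s2', 'b3s3', 'b1s0', 'b1s1']
```

A vectorized implementation would replace the Python loop with a batched index operation, and a fused kernel would perform the lookup while computing attention.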

Use the API directly

You can use LLMEngine and SamplingParams directly in your own scripts:

import sys
from pathlib import Path

sys.path.insert(0, str(Path(".") / "src"))

from myvllm.engine.llm_engine import LLMEngine
from myvllm.sampling_parameters import SamplingParams
from transformers import AutoTokenizer

config = {
    "model_name_or_path": "Qwen/Qwen3-0.6B",
    "world_size": 1,
    "max_num_sequences": 16,
    "max_num_batched_tokens": 1024,
    "max_cached_blocks": 1024,
    "block_size": 256,
    "eos": 151645,
    "enforce_eager": True,
    "gpu_memory_utilization": 0.9,
    "max_num_batch_tokens": 4096,
    "max_model_length": 128,
    # Qwen3-0.6B architecture
    "vocab_size": 151936,
    "hidden_size": 1024,
    "num_heads": 16,
    "head_dim": 128,
    "num_kv_heads": 8,
    "intermediate_size": 3072,
    "num_layers": 28,
    "tie_word_embeddings": True,
    "base": 1000000,
    "rms_norm_epsilon": 1e-6,
    "qkv_bias": False,
    "scale": 1,
    "max_position": 32768,
    "ffn_bias": False,
}

tokenizer = AutoTokenizer.from_pretrained(config["model_name_or_path"])
llm = LLMEngine(config=config)

sampling_params = SamplingParams(temperature=0.6, max_tokens=256, max_model_length=128)

prompts = ["introduce yourself", "list all prime numbers within 100"]
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": p}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for p in prompts
]

outputs = llm.generate(prompts, sampling_params)

for prompt, text in zip(prompts, outputs["text"]):
    print(f"Prompt: {prompt}")
    print(f"Completion: {text}\n")

SamplingParams fields

Parameter	Type	Default	Description
`temperature`	`float`	`1.0`	Sampling temperature. Must be > 0 (greedy decoding is not supported).
`max_tokens`	`int`	`64`	Maximum number of tokens to generate (completion tokens only).
`max_model_length`	`int | None`	`None`	Maximum total sequence length (prompt + completion).
`ignore_eos`	`bool`	`False`	Continue generating past the EOS token.
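To illustrate why temperature must be strictly positive, here is a minimal pure-Python sketch of temperature sampling: logits are divided by the temperature before the softmax, so a value of zero would divide by zero. This is the textbook form, not the engine's sampler, and the function names are hypothetical.

```python
import math
import random


def temperature_probs(logits, temperature):
    """Softmax over logits / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng=random):
    """Draw one token index according to the tempered distribution."""
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]
print(temperature_probs(logits, 1.0))
print(temperature_probs(logits, 0.6))  # sharper: more mass on index 0
print(sample_token(logits, 0.6, random.Random(0)))
```

Lower temperatures concentrate probability on the highest logit, while higher temperatures flatten the distribution.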

Understanding the output

During a generate() call, the engine prints throughput statistics for each scheduling step:

1024 number of processed tokens 18432.5 tokens/sec during prefilling
16 number of processed tokens 312.7 tokens/sec during decoding

Prefilling processes all prompt tokens in parallel, so throughput is high.
Decoding generates one token per active sequence per step, so throughput is lower and reflects the cost of paged attention over the growing KV cache.
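If you capture the engine's stdout, these lines are easy to turn into numbers for plotting or comparison. The sketch below assumes the log format matches the two sample lines above exactly. The pattern and the `parse_step` helper are hypothetical and would need adjusting if the engine's wording changes.

```python
import re

# Matches lines such as:
#   1024 number of processed tokens 18432.5 tokens/sec during prefilling
STEP_LINE = re.compile(
    r"^(?P<tokens>\d+) number of processed tokens "
    r"(?P<rate>\d+(?:\.\d+)?) tokens/sec during (?P<phase>prefilling|decoding)$"
)


def parse_step(line):
    """Return (tokens, tokens_per_sec, phase), or None for unrelated lines."""
    match = STEP_LINE.match(line.strip())
    if match is None:
        return None
    return int(match["tokens"]), float(match["rate"]), match["phase"]


sample = "16 number of processed tokens 312.7 tokens/sec during decoding"
print(parse_step(sample))  # (16, 312.7, 'decoding')
```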

The final outputs dict contains:

outputs["text"] — list of decoded completion strings, one per prompt
outputs["token_ids"] — list of token ID lists for each completion
