miniVLLM provides two built-in causal LM implementations that follow a common interface:

- `forward(input_ids)` returns the final hidden states.
- `compute_logits(hidden_states)` projects hidden states to vocabulary logits via the LM head.
- A class-level `packed_module_mapping` dict maps model parameter names to checkpoint keys, enabling correct loading of fused/sharded weights.
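The shared interface can be written down as a structural type. This is a minimal sketch, not code from miniVLLM: the `CausalLM` protocol name, the `ToyCausalLM` stand-in, and the `greedy_next_token` helper are all hypothetical, and plain Python lists stand in for tensors.

```python
from typing import ClassVar, Protocol, runtime_checkable


@runtime_checkable
class CausalLM(Protocol):
    """Hypothetical name for the interface both built-in models follow."""

    packed_module_mapping: ClassVar[dict]

    def forward(self, input_ids): ...

    def compute_logits(self, hidden_states): ...


class ToyCausalLM:
    """Stand-in model: lists replace tensors, arithmetic replaces the network."""

    packed_module_mapping = {}

    def __init__(self, vocab_size, hidden_size):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size

    def forward(self, input_ids):
        # One hidden vector per token (a real model runs embedding + decoder layers).
        return [[float(t)] * self.hidden_size for t in input_ids]

    def compute_logits(self, hidden_states):
        # Score every vocabulary entry; the peak sits at sum(h) mod vocab_size.
        rows = []
        for h in hidden_states:
            target = int(sum(h)) % self.vocab_size
            rows.append([-float((v - target) ** 2) for v in range(self.vocab_size)])
        return rows


def greedy_next_token(model, input_ids):
    """Two-step call pattern: hidden states first, then logits for the last token."""
    hidden = model.forward(input_ids)
    logits = model.compute_logits(hidden[-1:])[0]
    return max(range(len(logits)), key=logits.__getitem__)


next_token = greedy_next_token(ToyCausalLM(vocab_size=8, hidden_size=4), [1, 2, 3])
```

Keeping `forward` and `compute_logits` separate lets a caller project only the positions it needs (here, the last token) instead of the whole sequence.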
- `Qwen3ForCausalLM`: Qwen3 architecture with Q/K norms and GQA
- `LlamaForCausalLM`: Llama 3 architecture with NTK-scaled RoPE
Qwen3ForCausalLM
myvllm.models.qwen3.Qwen3ForCausalLM
Full Qwen3 causal language model. Stacks num_layers of Qwen3DecoderLayer inside a Qwen3Model backbone, then attaches a ParallelLMHead for next-token prediction. Supports optional weight tying between the token embedding and the LM head.
Key architectural differences from Llama 3:
- Per-head Q and K RMSNorm layers inside each attention block (applied when `qkv_bias=False`).
- Default RoPE base of `10000` (versus `500000` for Llama 3).
- Default max position of `16384`.
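To see what the RoPE base changes in practice, the sketch below computes the standard RoPE inverse-frequency schedule for both defaults. It is illustrative only: `rope_inv_freq` and `longest_wavelength` are hypothetical helper names, and the head dimension of 128 is an assumption rather than a value taken from miniVLLM.

```python
import math


def rope_inv_freq(head_dim, base):
    """Standard RoPE schedule: inv_freq[i] = base ** (-2i / head_dim)."""
    return [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim, base):
    """Wavelength (in positions) of the slowest-rotating dimension pair."""
    return 2 * math.pi / rope_inv_freq(head_dim, base)[-1]


HEAD_DIM = 128  # assumed for illustration

qwen3_wavelength = longest_wavelength(HEAD_DIM, 10000.0)    # Qwen3 default base
llama3_wavelength = longest_wavelength(HEAD_DIM, 500000.0)  # Llama 3 default base

print(f"base=10000  -> longest wavelength ~ {qwen3_wavelength:,.0f} positions")
print(f"base=500000 -> longest wavelength ~ {llama3_wavelength:,.0f} positions")
```

A larger base stretches the slowest rotations over more positions, which is why the larger base is paired with a longer default context.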
Constructor
Vocabulary size. Controls the size of the token embedding table and the LM head.
Model hidden dimension (embedding size and residual stream width).
Total number of query attention heads across all tensor-parallel ranks.
Dimension of each attention head. Defaults to `hidden_size // num_heads`.
Attention scale multiplier applied alongside `1 / sqrt(head_dim)`.
Total number of key/value heads. Set to a value smaller than `num_heads` for grouped-query attention (GQA). Defaults to `num_heads`.
Epsilon used in all RMSNorm layers.
Whether to include bias in the QKV projection. When `False`, per-head Q and K norms are applied before attention.
RoPE base frequency.
Maximum sequence length the positional embedding cache is pre-computed for.
Hidden dimension of the MLP feed-forward layers.
Whether to include bias in the MLP projections.
Number of transformer decoder layers.
If `True`, the LM head shares the same weight tensor as the token embedding.
Paged KV cache block size, passed through to each `Attention` module.
forward
Runs the model and returns the final hidden states. To obtain logits, call `compute_logits` separately.
| Argument | Shape | Description |
|---|---|---|
| `input_ids` | `(total_tokens,)` or `(batch_size, seq_len)` | Token IDs |
| returns | same leading dims, hidden_size last | Final hidden states |
compute_logits
Projects hidden states to vocabulary logits via `ParallelLMHead`. In a tensor-parallel setup, rank 0 gathers logits from all ranks and returns the full `(batch_or_tokens, vocab_size)` tensor; other ranks return a partial shard.
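The gather step can be simulated without any distributed runtime. This is a toy sketch under stated assumptions: the LM head weight is split evenly by vocabulary rows, plain lists stand in for tensors, and `shard_rows`, `matvec`, and `gather_logits` are hypothetical helpers rather than miniVLLM functions.

```python
def matvec(weight_rows, hidden):
    """Logit for each vocabulary row: dot(row, hidden)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight_rows]


def shard_rows(weight, tp_size):
    """Split the LM head by vocabulary rows, one contiguous slice per rank."""
    assert len(weight) % tp_size == 0, "sketch assumes vocab divides evenly"
    per_rank = len(weight) // tp_size
    return [weight[r * per_rank:(r + 1) * per_rank] for r in range(tp_size)]


def gather_logits(partials):
    """Concatenate per-rank shards along the vocabulary dimension (rank 0's view)."""
    full = []
    for shard in partials:
        full.extend(shard)
    return full


weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]  # vocab=4, hidden=2
hidden = [3.0, 5.0]

# Each simulated rank only sees its own slice of the vocabulary.
partials = [matvec(shard, hidden) for shard in shard_rows(weight, tp_size=2)]
full_logits = gather_logits(partials)
```

Concatenating the shards in rank order reproduces the logits a single unsharded head would produce, so sampling code on rank 0 does not need to know about the sharding.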
Sub-components
| Class | Role |
|---|---|
| `Qwen3Model` | Embedding + decoder layer stack + final norm |
| `Qwen3DecoderLayer` | Single transformer block (norm → attn → norm → MLP) with fused residual |
| `Qwen3Attention` | QKV projection, optional Q/K norms, RoPE, paged attention, output projection |
| `Qwen3MLP` | Gate+up projection (merged), SiluAndMul activation, down projection |
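The head-related constructor defaults that shape `Qwen3Attention` can be sketched as a small config object. This is an illustration, not the library's code: `AttnConfig` and `kv_head_for` are hypothetical, the parameter name `num_kv_heads` is assumed, and the grouping rule shown is the usual contiguous GQA layout.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttnConfig:
    """Hypothetical container for the head-related constructor arguments."""

    hidden_size: int
    num_heads: int
    num_kv_heads: Optional[int] = None  # assumed name; defaults to num_heads
    head_dim: Optional[int] = None      # defaults to hidden_size // num_heads

    def __post_init__(self):
        if self.num_kv_heads is None:
            self.num_kv_heads = self.num_heads
        if self.head_dim is None:
            self.head_dim = self.hidden_size // self.num_heads
        if self.num_heads % self.num_kv_heads != 0:
            raise ValueError("num_heads must be a multiple of num_kv_heads")

    @property
    def scale(self):
        """Base softmax scale, 1 / sqrt(head_dim)."""
        return 1.0 / math.sqrt(self.head_dim)

    def kv_head_for(self, q_head):
        """GQA: consecutive query heads share one key/value head."""
        group_size = self.num_heads // self.num_kv_heads
        return q_head // group_size


cfg = AttnConfig(hidden_size=4096, num_heads=32, num_kv_heads=8)
```

With 32 query heads and 8 key/value heads, each key/value head serves a group of 4 query heads, which shrinks the KV cache by the same factor.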
packed_module_mapping
`Qwen3ForCausalLM` defines a class attribute that maps internal parameter names to their corresponding keys in a HuggingFace-style checkpoint and the sub-index within a fused weight. The mapping is consumed by the `weight_loader` overload when loading pre-trained checkpoints.
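The actual contents of the mapping are not reproduced on this page. The sketch below shows one plausible shape, assuming HuggingFace-style `q_proj`/`k_proj`/`v_proj` and `gate_proj`/`up_proj` checkpoint names and fused `qkv_proj` and `gate_up_proj` parameters; the entries, the lookup direction, and the `resolve` and `load_merged_shard` helpers are all assumptions for illustration.

```python
# Hypothetical mapping: checkpoint sub-module name -> (fused parameter name, shard id).
PACKED_MODULE_MAPPING = {
    "q_proj": ("qkv_proj", "q"),
    "k_proj": ("qkv_proj", "k"),
    "v_proj": ("qkv_proj", "v"),
    "gate_proj": ("gate_up_proj", 0),
    "up_proj": ("gate_up_proj", 1),
}


def resolve(checkpoint_key, mapping=PACKED_MODULE_MAPPING):
    """Return (model parameter name, shard id); shard id is None for unfused weights."""
    for src, (dst, shard_id) in mapping.items():
        if src in checkpoint_key:
            return checkpoint_key.replace(src, dst), shard_id
    return checkpoint_key, None


def load_merged_shard(fused_rows, loaded_rows, shard_id):
    """Toy weight_loader: copy one equally sized shard into its slice of the fused weight."""
    n = len(loaded_rows)
    fused_rows[shard_id * n:(shard_id + 1) * n] = loaded_rows


# Lists of rows stand in for tensors: a fused gate+up weight with 2 rows per shard.
fused = [[0.0], [0.0], [0.0], [0.0]]
name, shard_id = resolve("model.layers.0.mlp.up_proj.weight")
load_merged_shard(fused, [[1.0], [2.0]], shard_id)
```

The loader needs the shard id because several checkpoint tensors land in different slices of the same fused parameter.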
Quick start
LlamaForCausalLM
myvllm.models.llama.LlamaForCausalLM
Llama 3 causal language model. Structurally identical to Qwen3ForCausalLM but with the following differences:
- No Q/K norms: `LlamaAttn` does not apply per-head RMSNorm to queries and keys.
- NTK-scaled RoPE: `RotaryEmbedding` is initialized with `is_llama3=True`, enabling the NTK-by-parts long-context frequency scaling.
- Higher RoPE base: defaults to `500000` instead of `10000`.
- Larger context window: defaults to `131072` instead of `16384`.
- `tie_word_embeddings=True` by default.
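The NTK-by-parts scaling can be sketched as follows. The piecewise rule mirrors the publicly documented Llama 3.1 RoPE scaling; the constants (`factor=8`, `low_freq_factor=1`, `high_freq_factor=4`, original context of 8192) and the head dimension of 128 are assumptions taken from the published Llama 3.1 configuration, not values confirmed for miniVLLM's `RotaryEmbedding`.

```python
import math


def rope_inv_freq(head_dim, base):
    """Standard RoPE schedule: inv_freq[i] = base ** (-2i / head_dim)."""
    return [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def llama3_scale(inv_freq, factor=8.0, low_freq_factor=1.0,
                 high_freq_factor=4.0, original_max_pos=8192):
    """Leave fast rotations alone, slow the slowest by `factor`, blend in between."""
    low_freq_wavelen = original_max_pos / low_freq_factor
    high_freq_wavelen = original_max_pos / high_freq_factor
    scaled = []
    for freq in inv_freq:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:
            scaled.append(freq)             # high frequency: unchanged
        elif wavelen > low_freq_wavelen:
            scaled.append(freq / factor)    # low frequency: fully scaled
        else:
            # Middle band: interpolate smoothly between the two regimes.
            smooth = (original_max_pos / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor
            )
            scaled.append((1 - smooth) * freq / factor + smooth * freq)
    return scaled


inv_freq = rope_inv_freq(head_dim=128, base=500000.0)  # head_dim assumed
scaled = llama3_scale(inv_freq)
```

Only the long-wavelength dimensions are stretched, so short-range positional detail is preserved while long-range positions stay within the rotation range the dimensions were trained on.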
Constructor
Vocabulary size.
Model hidden dimension.
Dimension of each attention head.
Total number of query/output heads across all tensor-parallel ranks.
Total number of key/value heads.
Whether to add bias in the QKV projection.
Epsilon for all RMSNorm layers.
RoPE base frequency. The higher value extends the effective context length.
Maximum sequence length the positional cache covers.
MLP feed-forward hidden dimension.
Whether to include bias in the MLP projections.
Number of transformer decoder layers.
Paged KV cache block size.
If `True`, the LM head shares the embedding weight.
forward
Same interface as `Qwen3ForCausalLM.forward`. Returns final hidden states.
compute_logits
Same interface as `Qwen3ForCausalLM.compute_logits`. Projects to vocabulary logits.
Sub-components
| Class | Role |
|---|---|
| `LlamaModel` | Embedding + decoder layer stack + final norm |
| `LlamaDecoderLayer` | Single transformer block with fused residual |
| `LlamaAttn` | QKV projection, NTK-scaled RoPE, paged attention, output projection |
| `LlamaMLP` | Gate+up projection (merged), SiluAndMul activation, down projection |
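The merged gate+up projection followed by SiluAndMul can be illustrated on a single vector. This sketch assumes the merged output stores the gate half first and the up half second; that layout, and the `silu_and_mul` helper itself, are assumptions for illustration rather than miniVLLM's implementation.

```python
import math


def silu(x):
    """SiLU (swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def silu_and_mul(gate_up):
    """Split the merged projection output in half and gate the `up` half."""
    half = len(gate_up) // 2
    gate, up = gate_up[:half], gate_up[half:]  # assumed layout: [gate | up]
    return [silu(g) * u for g, u in zip(gate, up)]


# Merged output for intermediate size 2: gate = [0.0, 1.0], up = [5.0, 2.0].
activated = silu_and_mul([0.0, 1.0, 5.0, 2.0])
```

Merging the two projections means one matrix multiply produces both halves, and the activation halves the width before the down projection.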
Adding a new model
Follow these steps to add a new architecture to miniVLLM.

1. Implement the ForCausalLM class. Create `src/myvllm/models/mymodel.py`. The class must expose `forward(input_ids)`, `compute_logits(hidden_states)`, and a class-level `packed_module_mapping`, matching the built-in models. Use the existing layer primitives from `myvllm.layers` (`ColumnParallelLinear`, `RowParallelLinear`, `LayerNorm`, `Attention`, etc.) to ensure correct tensor-parallel behavior.
2. Implement `weight_loader` for custom sharding. If your model has parameters that do not map 1-to-1 to checkpoint keys (e.g., fused QKV or merged gate+up projections), override `weight_loader` on the relevant `nn.Parameter`. Refer to `QKVColumnParallelLinear.weight_loader` and `MergedColumnParallelLinear.weight_loader` in `myvllm/layers/linear.py` for reference implementations.
3. Register in the model runner. Open the model runner's model registry and add an entry for your new class. The runner uses this registry to instantiate the correct class from the model config.
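The registration step can be pictured with a small stand-alone registry. The real registry's name and shape are not shown on this page, so everything here is hypothetical: `MODEL_REGISTRY`, `register_model`, `build_model`, and the `architectures` config key (borrowed from HuggingFace-style configs) are assumptions for illustration.

```python
MODEL_REGISTRY = {}  # hypothetical: architecture name -> model class


def register_model(architecture):
    """Decorator that records a model class under its architecture name."""
    def decorator(cls):
        MODEL_REGISTRY[architecture] = cls
        return cls
    return decorator


@register_model("MyModelForCausalLM")
class MyModelForCausalLM:
    """Skeleton exposing the three members the common interface requires."""

    packed_module_mapping = {}

    def __init__(self, **config):
        self.config = config

    def forward(self, input_ids):
        raise NotImplementedError("return final hidden states")

    def compute_logits(self, hidden_states):
        raise NotImplementedError("project hidden states to vocabulary logits")


def build_model(model_config):
    """Look up the class named in the config and instantiate it."""
    architecture = model_config["architectures"][0]
    if architecture not in MODEL_REGISTRY:
        raise ValueError(f"unsupported architecture: {architecture}")
    kwargs = {k: v for k, v in model_config.items() if k != "architectures"}
    return MODEL_REGISTRY[architecture](**kwargs)


model = build_model({"architectures": ["MyModelForCausalLM"], "hidden_size": 64})
```

Failing loudly on an unknown architecture name makes a missing registry entry easy to spot when a new model is first wired in.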