Models

miniVLLM provides two built-in causal LM implementations that follow a common interface:

forward(input_ids) returns the final hidden states.
compute_logits(hidden_states) projects hidden states to vocabulary logits via the LM head.
A class-level packed_module_mapping dict maps model parameter names to checkpoint keys, enabling correct loading of fused/sharded weights.
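
The two-step contract above can be pictured with a toy stand-in. This is a minimal sketch in plain Python rather than torch, and `ToyCausalLM` and `greedy_next_token` are hypothetical names, not miniVLLM classes. It only illustrates that `forward` yields hidden states and that logits come from a separate `compute_logits` call.

```python
class ToyCausalLM:
    """Hypothetical stand-in that mimics the forward / compute_logits split."""

    def __init__(self, embedding, lm_head):
        self.embedding = embedding  # vocab_size rows of hidden_size floats
        self.lm_head = lm_head      # vocab_size rows of hidden_size floats

    def forward(self, input_ids):
        # A real backbone would run decoder layers here; the toy only embeds.
        return [list(self.embedding[t]) for t in input_ids]

    def compute_logits(self, hidden_states):
        # One dot product per vocabulary row, for every position.
        return [
            [sum(h * w for h, w in zip(hidden, row)) for row in self.lm_head]
            for hidden in hidden_states
        ]


def greedy_next_token(model, input_ids):
    hidden = model.forward(input_ids)      # step 1: hidden states
    logits = model.compute_logits(hidden)  # step 2: vocabulary logits
    last = logits[-1]                      # the last position predicts the next token
    return max(range(len(last)), key=last.__getitem__)


weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
toy = ToyCausalLM(embedding=weights, lm_head=weights)  # tied weights, for brevity
next_token = greedy_next_token(toy, [0, 2, 1])
```

Keeping the projection separate means a caller can choose to compute logits only for the positions it needs, for example the last token of each sequence.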

Qwen3ForCausalLM: Qwen3 architecture with Q/K norms and GQA.
LlamaForCausalLM: Llama 3 architecture with NTK-scaled RoPE.

Qwen3ForCausalLM

myvllm.models.qwen3.Qwen3ForCausalLM

Full Qwen3 causal language model. Stacks num_layers of Qwen3DecoderLayer inside a Qwen3Model backbone, then attaches a ParallelLMHead for next-token prediction. Supports optional weight tying between the token embedding and the LM head.

Key architectural differences from Llama 3:

Per-head Q and K RMSNorm layers inside each attention block (applied when qkv_bias=False).
Default RoPE base of 10000 (versus 500000 for Llama 3).
Default max position of 16384.
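
To make the effect of the RoPE base concrete, the sketch below computes the standard RoPE inverse frequencies, `base ** (-2i / head_dim)`, for both defaults. It is plain Python, the helper names are hypothetical, and `head_dim=128` is just an illustrative value. It does not reproduce miniVLLM's RotaryEmbedding.

```python
import math


def rope_inv_freq(head_dim, base):
    """Standard RoPE inverse frequencies: base ** (-2i / head_dim)."""
    return [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim, base):
    # The slowest-rotating pair has the smallest inverse frequency.
    return 2 * math.pi / min(rope_inv_freq(head_dim, base))


qwen3_wavelength = longest_wavelength(128, 10_000)    # Qwen3 default base
llama3_wavelength = longest_wavelength(128, 500_000)  # Llama 3 default base
```

A larger base stretches the slowest rotations over many more positions, which is why it is usually paired with a longer context window.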

Constructor

Parameter	Type	Default	Description
`vocab_size`	`int`	required	Vocabulary size. Controls the size of the token embedding table and the LM head.
`hidden_size`	`int`	required	Model hidden dimension (embedding size and residual stream width).
`num_heads`	`int`	required	Total number of query attention heads across all tensor-parallel ranks.
`head_dim`	`int`	`hidden_size // num_heads`	Dimension of each attention head.
`scale`	`float`	`1.0`	Attention scale multiplier applied alongside 1 / sqrt(head_dim).
`num_kv_heads`	`int`	`num_heads`	Total number of key/value heads. Set to a value smaller than num_heads for grouped-query attention (GQA).
`rms_norm_epsilon`	`float`	`1e-5`	Epsilon used in all RMSNorm layers.
`qkv_bias`	`bool`	`False`	Whether to include bias in the QKV projection. When False, per-head Q and K norms are applied before attention.
`base`	`int`	`10000`	RoPE base frequency.
`max_position`	`int`	`16384`	Maximum sequence length the positional embedding cache is pre-computed for.
`intermediate_size`	`int`	`4096`	Hidden dimension of the MLP feed-forward layers.
`ffn_bias`	`bool`	`True`	Whether to include bias in the MLP projections.
`num_layers`	`int`	`12`	Number of transformer decoder layers.
`tie_word_embeddings`	`bool`	`False`	If True, the LM head shares the same weight tensor as the token embedding.
`block_size`	`int`	`256`	Paged KV cache block size, passed through to each Attention module.
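
Several of these parameters depend on one another. The sketch below shows one way the documented defaults could be resolved and how head counts would split across tensor-parallel ranks. `resolve_attention_shapes` and `tp_size` are hypothetical names, the even split across ranks is an assumption, and reading `scale` as a multiplier on `1 / sqrt(head_dim)` is an interpretation of the description above.

```python
import math


def resolve_attention_shapes(hidden_size, num_heads, num_kv_heads=None,
                             head_dim=None, scale=1.0, tp_size=1):
    """Apply the documented defaults and derive per-rank head counts."""
    head_dim = head_dim if head_dim is not None else hidden_size // num_heads
    num_kv_heads = num_kv_heads if num_kv_heads is not None else num_heads
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be a multiple of num_kv_heads")
    if num_heads % tp_size != 0 or num_kv_heads % tp_size != 0:
        # Assumption: heads are divided evenly, with no KV head replication.
        raise ValueError("head counts must divide evenly across ranks")
    return {
        "head_dim": head_dim,
        "queries_per_kv_head": num_heads // num_kv_heads,
        "q_heads_per_rank": num_heads // tp_size,
        "kv_heads_per_rank": num_kv_heads // tp_size,
        "softmax_scale": scale / math.sqrt(head_dim),
    }


shapes = resolve_attention_shapes(
    hidden_size=2048, num_heads=16, num_kv_heads=8, tp_size=2
)
```

With 16 query heads and 8 key/value heads, every key/value head serves two query heads, which is the grouped-query arrangement that `num_kv_heads` controls.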

forward

forward(input_ids: torch.Tensor) -> torch.Tensor

Runs the full Qwen3 backbone (embedding → decoder layers → final RMSNorm) and returns the hidden states. Does not project to logits; call compute_logits separately.

Argument	Shape	Description
`input_ids`	`(total_tokens,)` or `(batch_size, seq_len)`	Token IDs
returns	same leading dims, `hidden_size` last	Final hidden states
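
The flat `(total_tokens,)` form is typically what results from concatenating several variable-length sequences into one batch without padding. The sketch below shows that packing in plain Python. `pack_sequences` and `cu_seqlens` are hypothetical names, and whether miniVLLM tracks boundaries this way is an assumption.

```python
from itertools import accumulate


def pack_sequences(sequences):
    """Concatenate variable-length sequences into one flat token list."""
    flat = [token for seq in sequences for token in seq]
    # Cumulative boundaries: sequence i occupies flat[cu[i]:cu[i + 1]].
    cu_seqlens = [0, *accumulate(len(seq) for seq in sequences)]
    return flat, cu_seqlens


flat_ids, cu_seqlens = pack_sequences([[11, 12, 13], [21], [31, 32]])
```

The hidden states returned for such an input would then have one row per packed token, in the same order.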

compute_logits

compute_logits(hidden_states: torch.Tensor) -> torch.Tensor

Projects hidden states to vocabulary logits using ParallelLMHead. In a tensor-parallel setup, rank 0 gathers logits from all ranks and returns the full (batch_or_tokens, vocab_size) tensor; other ranks return a partial shard.
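
The gather step can be simulated without any distributed runtime. In the sketch below each rank holds a contiguous slice of the LM head rows and produces a partial logit vector, and rank 0 concatenates the slices in rank order. The contiguous, evenly divided layout and all function names are assumptions for illustration, not ParallelLMHead's actual code.

```python
def vocab_range(rank, tp_size, vocab_size):
    """Contiguous slice of the vocabulary owned by one rank (assumed layout)."""
    per_rank = vocab_size // tp_size
    return rank * per_rank, (rank + 1) * per_rank


def partial_logits(hidden, lm_head, rank, tp_size):
    start, end = vocab_range(rank, tp_size, len(lm_head))
    return [sum(h * w for h, w in zip(hidden, row)) for row in lm_head[start:end]]


def gather_on_rank0(shards):
    # Rank 0 concatenates the shards in rank order along the vocabulary axis.
    return [logit for shard in shards for logit in shard]


lm_head = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]  # vocab_size=4, hidden_size=2
hidden = [3.0, 5.0]
shards = [partial_logits(hidden, lm_head, rank, tp_size=2) for rank in range(2)]
full = gather_on_rank0(shards)
```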

Sub-components

Class	Role
`Qwen3Model`	Embedding + decoder layer stack + final norm
`Qwen3DecoderLayer`	Single transformer block (norm → attn → norm → MLP) with fused residual
`Qwen3Attention`	QKV projection, optional Q/K norms, RoPE, paged attention, output projection
`Qwen3MLP`	Gate+up projection (merged), SiluAndMul activation, down projection
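
The optional Q/K norms listed for `Qwen3Attention` are RMSNorm applied per head. As a rough sketch in plain Python (hypothetical helper names, and assuming a single weight vector of length `head_dim` shared by all heads), the operation looks like this:

```python
import math


def rms_norm(vector, weight, eps=1e-5):
    """RMSNorm: x / sqrt(mean(x^2) + eps), then an elementwise learned scale."""
    mean_square = sum(v * v for v in vector) / len(vector)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(vector, weight)]


def per_head_norm(heads, weight, eps=1e-5):
    # Each head is normalized on its own, over head_dim only;
    # the same weight vector is reused for every head (an assumption).
    return [rms_norm(head, weight, eps) for head in heads]


q_heads = [[3.0, 4.0], [0.3, 0.4]]  # two heads, head_dim=2
normed = per_head_norm(q_heads, weight=[1.0, 1.0])
```

Both heads point in the same direction but differ in magnitude by a factor of ten, and after normalization they are nearly identical. That is the intended effect: query and key magnitudes are equalized before the dot product.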

packed_module_mapping

Qwen3ForCausalLM defines a class attribute that maps internal parameter names to their corresponding keys in a HuggingFace-style checkpoint and the sub-index within a fused weight:

packed_module_mapping = {
    "q_proj":    ("q_proj", "q"),
    "k_proj":    ("k_proj", "k"),
    "v_proj":    ("v_proj", "v"),
    "gate_up":   ("gate_up_proj", "0"),
    "gate_down": ("gate_down_proj", "1"),
}
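
A loader might consume a mapping of this shape by substring-matching each tensor name, rewriting it, and picking up the shard id. The sketch below is an assumption about that flow, not the runner's actual code. `resolve_packed_name` and `example_mapping` are hypothetical, and which side is the checkpoint name versus the model parameter name depends on the real loader.

```python
def resolve_packed_name(weight_name, mapping):
    """Return (rewritten_name, shard_id); shard_id is None for unpacked weights."""
    for key, (target, shard_id) in mapping.items():
        if key in weight_name:
            return weight_name.replace(key, target), shard_id
    return weight_name, None


# Hypothetical mapping, used only to exercise the function.
example_mapping = {
    "q_proj": ("qkv_proj", "q"),
    "gate_proj": ("gate_up_proj", "0"),
}

fused = resolve_packed_name("layers.0.self_attn.q_proj.weight", example_mapping)
plain = resolve_packed_name("layers.0.input_layernorm.weight", example_mapping)
```

Names that match no key fall through unchanged with a shard id of `None`, which a loader could treat as a plain one-to-one copy.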

The model runner uses this mapping to call the correct weight_loader overload when loading pre-trained checkpoints.

Quick start

import torch
from myvllm.models.qwen3 import Qwen3ForCausalLM

model = Qwen3ForCausalLM(
    vocab_size=151936,
    hidden_size=2048,
    num_heads=16,
    num_kv_heads=8,
    intermediate_size=6144,
    num_layers=28,
    max_position=32768,
).cuda()

input_ids = torch.randint(0, 151936, (1, 64)).cuda()
hidden    = model(input_ids)         # (1, 64, 2048)
logits    = model.compute_logits(hidden)  # (1, 64, 151936) on rank 0

LlamaForCausalLM

myvllm.models.llama.LlamaForCausalLM

Llama 3 causal language model. Structurally identical to Qwen3ForCausalLM but with the following differences:

No Q/K norms — LlamaAttn does not apply per-head RMSNorm to queries and keys.
NTK-scaled RoPE — RotaryEmbedding is initialized with is_llama3=True, enabling the NTK-by-parts long-context frequency scaling.
Higher RoPE base — defaults to 500000 instead of 10000.
Larger context window — defaults to 131072 instead of 16384.
tie_word_embeddings=True by default.
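
The NTK-by-parts scaling mentioned above leaves fast-rotating frequency pairs untouched, divides slow-rotating ones by a fixed factor, and blends the band in between. The sketch below follows the publicly documented Llama 3.1 style rule in plain Python. The function name and the default values (`factor`, `low_freq_factor`, `high_freq_factor`, `original_max_position`) are illustrative assumptions and were not read from miniVLLM's RotaryEmbedding.

```python
import math


def llama3_scaled_inv_freq(inv_freq, factor=8.0, low_freq_factor=1.0,
                           high_freq_factor=4.0, original_max_position=8192):
    """Rescale RoPE inverse frequencies band by band (Llama 3.1 style)."""
    low_freq_wavelen = original_max_position / low_freq_factor
    high_freq_wavelen = original_max_position / high_freq_factor
    scaled = []
    for freq in inv_freq:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:
            scaled.append(freq)           # fast-rotating pairs: unchanged
        elif wavelen > low_freq_wavelen:
            scaled.append(freq / factor)  # slow-rotating pairs: fully scaled
        else:
            # Middle band: blend smoothly between the two regimes.
            smooth = (original_max_position / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor
            )
            scaled.append((1 - smooth) * freq / factor + smooth * freq)
    return scaled


head_dim, rope_base = 64, 500_000
inv_freq = [rope_base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]
scaled = llama3_scaled_inv_freq(inv_freq)
```

Only the low-frequency end of the spectrum is slowed down, so short-range positional detail is preserved while long-range positions are compressed into the range seen during training.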

Constructor

Parameter	Type	Default	Description
`vocab_size`	`int`	`128256`	Vocabulary size.
`hidden_size`	`int`	`2048`	Model hidden dimension.
`head_dim`	`int`	`64`	Dimension of each attention head.
`num_qo_heads`	`int`	`32`	Total number of query/output heads across all tensor-parallel ranks.
`num_kv_heads`	`int`	`8`	Total number of key/value heads.
`has_attn_bias`	`bool`	`False`	Whether to add bias in the QKV projection.
`rms_norm_epsilon`	`float`	`1e-5`	Epsilon for all RMSNorm layers.
`rope_base`	`int`	`500000`	RoPE base frequency. The higher value extends the effective context length.
`max_position_embeddings`	`int`	`131072`	Maximum sequence length the positional cache covers.
`intermediate_size`	`int`	`8192`	MLP feed-forward hidden dimension.
`ffn_bias`	`bool`	`False`	Whether to include bias in the MLP projections.
`num_layers`	`int`	`16`	Number of transformer decoder layers.
`block_size`	`int`	`256`	Paged KV cache block size.
`tie_word_embeddings`	`bool`	`True`	If True, the LM head shares the embedding weight.
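
For a sense of what `block_size` controls, the sketch below shows the usual paged-cache arithmetic: how many fixed-size blocks a sequence needs, and how a token position maps to a physical slot through a per-sequence block table. The helper names, the `block_table` structure, and the slot layout are assumptions that follow the common paged attention scheme, not miniVLLM internals.

```python
def blocks_needed(num_tokens, block_size=256):
    """Number of fixed-size KV cache blocks required to hold num_tokens."""
    return -(-num_tokens // block_size)  # ceiling division


def slot_for_token(position, block_table, block_size=256):
    # block_table lists the physical block ids assigned to one sequence,
    # in logical order; the result is the flat slot index of the token.
    block_id = block_table[position // block_size]
    return block_id * block_size + position % block_size


n_blocks = blocks_needed(1000)  # 4 blocks of 256 slots
slot = slot_for_token(300, block_table=[7, 2, 9, 4])
```

Larger blocks mean fewer table entries per sequence but more unused slots in the last block of each sequence.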

forward

forward(input_ids: torch.Tensor) -> torch.Tensor

Same interface as Qwen3ForCausalLM.forward. Returns final hidden states.

compute_logits

compute_logits(hidden_states: torch.Tensor) -> torch.Tensor

Same interface as Qwen3ForCausalLM.compute_logits. Projects to vocabulary logits.

Sub-components

Class	Role
`LlamaModel`	Embedding + decoder layer stack + final norm
`LlamaDecoderLayer`	Single transformer block with fused residual
`LlamaAttn`	QKV projection, NTK-scaled RoPE, paged attention, output projection
`LlamaMLP`	Gate+up projection (merged), SiluAndMul activation, down projection
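
The SiluAndMul step in the MLP row can be written out in a few lines. This plain-Python sketch assumes the merged projection output is laid out as `[gate | up]` along the last dimension. The ordering and the helper names are assumptions for illustration.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def silu_and_mul(gate_up):
    """Split a merged [gate | up] vector in half and return silu(gate) * up."""
    half = len(gate_up) // 2
    gate, up = gate_up[:half], gate_up[half:]
    return [silu(g) * u for g, u in zip(gate, up)]


activated = silu_and_mul([0.0, 2.0, 5.0, 3.0])  # gate=[0, 2], up=[5, 3]
```

Merging the gate and up projections lets both be computed with a single matrix multiply, after which the activation halves the width again before the down projection.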

Quick start

import torch
from myvllm.models.llama import LlamaForCausalLM

model = LlamaForCausalLM(
    vocab_size=128256,
    hidden_size=2048,
    head_dim=64,
    num_qo_heads=32,
    num_kv_heads=8,
    num_layers=16,
).cuda()

input_ids = torch.randint(0, 128256, (1, 32)).cuda()
hidden    = model(input_ids)
logits    = model.compute_logits(hidden)

Adding a new model

Follow these steps to add a new architecture to miniVLLM.

Implement the ForCausalLM class

Create src/myvllm/models/mymodel.py. The class must expose:

import torch
from torch import nn


class MyModelForCausalLM(nn.Module):
    # Maps internal param names to (checkpoint_key, sub_index)
    packed_module_mapping: dict[str, tuple[str, str]] = { ... }

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        """Return final hidden states."""
        ...

    def compute_logits(self, hidden_states: torch.Tensor) -> torch.Tensor:
        """Project hidden states to logits."""
        ...

Use the existing layer primitives from myvllm.layers (ColumnParallelLinear, RowParallelLinear, LayerNorm, Attention, etc.) to ensure correct tensor-parallel behavior.
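
For orientation, the usual tensor-parallel convention is that a column-parallel linear layer splits its output features across ranks and a row-parallel layer splits its input features. The sketch below computes the per-rank weight shapes under that convention. It assumes miniVLLM's ColumnParallelLinear and RowParallelLinear follow the same scheme, and the helper names are hypothetical.

```python
def column_parallel_shape(in_features, out_features, tp_size):
    """Column-parallel: each rank keeps a slice of the output features."""
    assert out_features % tp_size == 0
    return (out_features // tp_size, in_features)  # (out, in), as in nn.Linear


def row_parallel_shape(in_features, out_features, tp_size):
    """Row-parallel: each rank keeps a slice of the input features."""
    assert in_features % tp_size == 0
    return (out_features, in_features // tp_size)


# An MLP with hidden_size=2048 and intermediate_size=8192 on 2 ranks.
up_shape = column_parallel_shape(2048, 8192, tp_size=2)   # (4096, 2048)
down_shape = row_parallel_shape(8192, 2048, tp_size=2)    # (2048, 4096)
```

Pairing a column-parallel layer with a row-parallel one means the intermediate activation never has to be gathered; only the output of the second layer needs a reduction across ranks.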

Implement weight_loader for custom sharding

If your model has parameters that do not map 1-to-1 to checkpoint keys — e.g., fused QKV or merged gate+up projections — override weight_loader on the relevant nn.Parameter:

import torch
from torch import nn


def custom_weight_loader(
    param: nn.Parameter,
    loaded_weights: torch.Tensor,
    shard_id: str | int,    # 'q' / 'k' / 'v' or integer index
) -> None:
    # 1. Compute the offset and shard size for this rank
    # 2. Slice the correct rows/columns from loaded_weights
    # 3. Copy into param.data
    ...


# `layer` is the module whose weight needs custom sharding.
layer.weight.weight_loader = custom_weight_loader

Refer to QKVColumnParallelLinear.weight_loader and MergedColumnParallelLinear.weight_loader in myvllm/layers/linear.py for reference implementations.
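
The offset arithmetic in step 1 can be sketched for a fused QKV weight as follows. This assumes each rank stores its rows in `[Q | K | V]` order and that heads divide evenly across ranks. `qkv_shard_slice`, `checkpoint_rows`, and `tp_size` are hypothetical names, and the actual implementations in `myvllm/layers/linear.py` may differ.

```python
def qkv_shard_slice(shard_id, num_heads, num_kv_heads, head_dim, tp_size):
    """Return (offset, size) of one shard inside a rank's fused [Q | K | V] rows."""
    q_size = num_heads // tp_size * head_dim
    kv_size = num_kv_heads // tp_size * head_dim
    if shard_id == "q":
        return 0, q_size
    if shard_id == "k":
        return q_size, kv_size
    if shard_id == "v":
        return q_size + kv_size, kv_size
    raise ValueError(f"unknown shard_id: {shard_id!r}")


def checkpoint_rows(rank, size):
    # Rows of the unsharded checkpoint tensor that belong to this rank.
    return rank * size, (rank + 1) * size


offset, size = qkv_shard_slice(
    "k", num_heads=32, num_kv_heads=8, head_dim=64, tp_size=2
)
src_start, src_end = checkpoint_rows(rank=1, size=size)
```

A loader would then copy rows `src_start:src_end` of the checkpoint's K weight into rows `offset:offset + size` of the fused parameter on that rank.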

Register the model

Open the model runner’s model registry and add an entry for your new class:

MODEL_REGISTRY = {
    "qwen3":    Qwen3ForCausalLM,
    "llama":    LlamaForCausalLM,
    "mymodel":  MyModelForCausalLM,   # add this line
}

The runner uses this registry to instantiate the correct class from the model config.
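
A registry lookup of this kind is usually wrapped so that an unregistered model type fails with a readable message. The sketch below uses placeholder classes in place of the real ones, and `resolve_model_class` is a hypothetical helper, not part of the runner.

```python
class Qwen3ForCausalLM:  # placeholder standing in for the real class
    pass


class LlamaForCausalLM:  # placeholder standing in for the real class
    pass


MODEL_REGISTRY = {
    "qwen3": Qwen3ForCausalLM,
    "llama": LlamaForCausalLM,
}


def resolve_model_class(model_type):
    """Look up a model class and fail with a readable message if it is missing."""
    try:
        return MODEL_REGISTRY[model_type]
    except KeyError:
        known = ", ".join(sorted(MODEL_REGISTRY))
        raise ValueError(
            f"unknown model type {model_type!r}; registered: {known}"
        ) from None


model_cls = resolve_model_class("llama")
```

Forgetting the registry entry is an easy mistake when adding a model, and an error that lists the registered names makes it quick to spot.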

Export from the models package

Add the import to src/myvllm/models/__init__.py:

from .mymodel import MyModelForCausalLM

Keep each decoder layer’s forward signature consistent with Qwen3DecoderLayer — accepting (x, residual) and returning (x, residual) — so the backbone loop works without modification.
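
The `(x, residual)` contract can be illustrated with scalar stand-ins for tensors. In this sketch the toy layer and backbone loop are hypothetical: the first layer receives `residual=None` and starts the residual stream, later layers fold their input into it, and the arithmetic inside the layer is only a placeholder for norm, attention, and MLP. How miniVLLM's backbone treats the first layer and the final norm is an assumption here.

```python
def make_toy_layer(delta):
    """Build a hypothetical decoder layer that follows the (x, residual) contract."""
    def layer(x, residual):
        # Fold the incoming activation into the running residual stream.
        residual = x if residual is None else x + residual
        # Stand-in for norm -> attention -> norm -> MLP applied to the stream.
        x = residual * 0.5 + delta
        return x, residual
    return layer


def run_backbone(x, layers):
    residual = None  # the first layer starts the residual stream
    for layer in layers:
        x, residual = layer(x, residual)
    return x + residual  # a final fused norm would typically consume this sum


output = run_backbone(2.0, [make_toy_layer(1.0), make_toy_layer(-1.0)])
```

Because every layer takes and returns the same pair, the loop does not need to know which architecture it is driving.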
