Every model in miniVLLM is assembled from a small set of reusable layers defined in myvllm/layers/. Each layer owns its weight-loading logic so the same code works on one GPU or many.
Layer overview
SiluAndMul
Gated SiLU activation used in every MLP block.
LayerNorm
RMSNorm with optional fused residual addition.
Linear layers
Four parallel variants that shard weights across GPUs.
VocabParallelEmbedding / ParallelLMHead
Vocabulary sharded across GPUs with tied-weight support.
Attention
Routes to flash attention (prefill) or paged attention (decode).
RotaryEmbedding
RoPE with Llama 3 NTK/YARN long-context scaling.
SamplerLayer
Temperature-scaled multinomial sampling via Gumbel trick.
SiluAndMul
File: myvllm/layers/activation.py
Used in every MLP block as the gated activation function. The input tensor is split in half along the last dimension: the first half is passed through SiLU, and the result is multiplied element-wise by the second half.
@torch.compile fuses the two operations into a single CUDA kernel, which improves throughput for large tensors.
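The split-and-gate step can be sketched in plain Python so the arithmetic is visible without a GPU. This is a reference under stated assumptions, not the code in activation.py: the function name silu_and_mul and the list-based input are illustrative, and the real layer operates on torch tensors.

```python
import math


def silu(x: float) -> float:
    """SiLU (swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def silu_and_mul(row: list[float]) -> list[float]:
    """Split `row` in half and gate the second half with SiLU of the first."""
    if len(row) % 2 != 0:
        raise ValueError("last dimension must be even")
    half = len(row) // 2
    gate, up = row[:half], row[half:]
    return [silu(g) * u for g, u in zip(gate, up)]


# A row of width 4 produces an output of width 2.
out = silu_and_mul([0.0, 1.0, 3.0, 2.0])
```

The output is half as wide as the input, which is why the preceding projection in an MLP block produces gate and up values side by side.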
LayerNorm (RMSNorm)
File: myvllm/layers/layernorm.py
MiniVLLM uses Root Mean Square normalization — the mean-centering step of standard LayerNorm is skipped, reducing compute by ~30% with no measurable quality loss for large models.
When residual is provided, the layer adds the residual before normalizing and returns the pre-norm sum as the new residual. This fuses the residual connection into the norm op, saving a separate addition kernel at each decoder layer.
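The fused residual path described above can be sketched as follows. The function names, the eps default, and the list-based vectors are assumptions made for illustration; the actual layer works on torch tensors.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def add_rms_norm(x, residual, weight, eps=1e-6):
    """Fused variant: add the residual first, normalize the sum,
    and return the pre-norm sum as the next residual."""
    summed = [a + b for a, b in zip(x, residual)]
    return rms_norm(summed, weight, eps), summed


hidden, new_residual = add_rms_norm([1.0, 2.0], [2.0, 2.0], [1.0, 1.0])
```

Note that no mean is subtracted anywhere: only the scale of the vector is normalized.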
Linear layers
File: myvllm/layers/linear.py
All linear layers inherit from LinearBase, which attaches a weight_loader callable to every nn.Parameter. When loading a checkpoint, the engine calls this loader instead of copying the full weight, letting each GPU extract only its shard.
ColumnParallelLinear — split output features
Splits the output dimension (dim=0 of the weight matrix) evenly across tp_size GPUs. Each GPU computes its own slice of the output independently, so no communication is needed during the forward pass. Used for Q, K, V projections and MLP gate/up projections.
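A small simulation of the column split, with ranks modeled as loop iterations rather than real GPUs. The helper names matvec and column_shard are hypothetical, and the weight is stored as nested lists in [out_features][in_features] order.

```python
def matvec(weight, x):
    """y = W x for a weight stored as [out_features][in_features]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def column_shard(weight, tp_rank, tp_size):
    """Return this rank's contiguous block of output rows (dim=0)."""
    out_features = len(weight)
    if out_features % tp_size != 0:
        raise ValueError("out_features must be divisible by tp_size")
    shard = out_features // tp_size
    return weight[tp_rank * shard:(tp_rank + 1) * shard]


weight = [[1, 0], [0, 1], [2, 2], [1, -1]]   # 4 outputs, 2 inputs
x = [3, 5]

# Each simulated rank computes its slice with no communication.
per_rank = [matvec(column_shard(weight, r, 2), x) for r in range(2)]
full = matvec(weight, x)
```

Concatenating the per-rank slices in rank order reproduces the unsharded output.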
RowParallelLinear — split input features
Splits the input dimension (dim=1 of the weight matrix) across GPUs. Each GPU holds a column slice of the weight. Because each GPU only computes a partial dot product, a dist.all_reduce is required to sum the partial results after the matrix multiply. Always paired with a preceding ColumnParallelLinear: the column-parallel layer shards the output, which becomes the sharded input consumed by the row-parallel layer.
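The need for the reduction can be seen in a small simulation. Ranks are modeled as loop iterations, and an element-wise sum stands in for dist.all_reduce; the helper names are hypothetical.

```python
def matvec(weight, x):
    """y = W x for a weight stored as [out_features][in_features]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def row_shard(weight, tp_rank, tp_size):
    """Return this rank's slice of input columns (dim=1)."""
    in_features = len(weight[0])
    if in_features % tp_size != 0:
        raise ValueError("in_features must be divisible by tp_size")
    shard = in_features // tp_size
    lo = tp_rank * shard
    return [row[lo:lo + shard] for row in weight]


weight = [[1, 2, 3, 4], [0, 1, 0, 1]]   # 2 outputs, 4 inputs
x = [1, 1, 2, 2]
tp_size = 2
shard = len(x) // tp_size

# Each rank sees only its slice of the input and of the weight columns,
# so its result is a partial sum over a subset of the input features.
partials = [
    matvec(row_shard(weight, r, tp_size), x[r * shard:(r + 1) * shard])
    for r in range(tp_size)
]

# Stand-in for dist.all_reduce with a sum: add partial outputs element-wise.
reduced = [sum(vals) for vals in zip(*partials)]
```

Each partial result already has the full output width; only the values are incomplete until they are summed.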
MergedColumnParallelLinear — gate + up in one tensor
Extends ColumnParallelLinear to hold two or more sub-matrices stacked along dim=0. This accommodates checkpoints in which gate_proj.weight and up_proj.weight are stored separately, while the model keeps them as a single merged tensor. loaded_weight_id=0 loads the gate projection; loaded_weight_id=1 loads the up projection.
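The offset arithmetic behind such a loader might look like the sketch below. Only loaded_weight_id comes from the text above; the function name, the other parameters, and the nested-list weights are assumptions for illustration.

```python
def load_merged_shard(param_rows, loaded_rows, loaded_weight_id,
                      output_sizes, tp_rank, tp_size):
    """Copy this rank's rows of one sub-matrix into the merged parameter.

    param_rows:       this rank's merged weight, rows stacked along dim=0
    loaded_rows:      the full checkpoint tensor for one projection
    loaded_weight_id: index of the sub-matrix (0 = gate, 1 = up)
    output_sizes:     unsharded output size of every sub-matrix, in order
    """
    shard_size = output_sizes[loaded_weight_id] // tp_size
    # Destination offset: the shards of all earlier sub-matrices come first.
    dst = sum(size // tp_size for size in output_sizes[:loaded_weight_id])
    # Source offset: this rank's slice of the checkpoint tensor.
    src = tp_rank * shard_size
    param_rows[dst:dst + shard_size] = loaded_rows[src:src + shard_size]


gate = [["g0"], ["g1"], ["g2"], ["g3"]]
up = [["u0"], ["u1"], ["u2"], ["u3"]]
output_sizes = [len(gate), len(up)]
tp_size, tp_rank = 2, 1

merged = [None] * (sum(output_sizes) // tp_size)
load_merged_shard(merged, gate, 0, output_sizes, tp_rank, tp_size)
load_merged_shard(merged, up, 1, output_sizes, tp_rank, tp_size)
```

With two ranks, rank 1 ends up holding the second half of the gate rows followed by the second half of the up rows.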
QKVColumnParallelLinear — complete attention heads per GPU
A specialized column-parallel layer for Q, K, V projections in attention. Unlike a generic column split, this class ensures that each GPU owns complete attention heads (not fractional ones), which is required for grouped-query attention (GQA). The weight_loader accepts load_weight_id as 'q', 'k', or 'v' and computes the correct offset within the merged QKV parameter.
VocabParallelEmbedding and ParallelLMHead
File: myvllm/layers/embedding_head.py
The vocabulary is partitioned across GPUs along dim=0 (the token dimension, not the embedding dimension). Each GPU stores vocab_size // tp_size embedding rows.
ParallelLMHead extends VocabParallelEmbedding and is used as the output projection. During prefill it selects only the last token of each sequence before computing logits, then gathers partial logits from all GPUs to rank 0.
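The vocabulary partition can be illustrated with a short simulation. The masked lookup followed by a sum across ranks is a common way to implement a sharded embedding and is an assumption here, since the text above does not spell out the mechanism; all names are hypothetical.

```python
def vocab_range(vocab_size, tp_rank, tp_size):
    """Half-open range [start, end) of token IDs owned by this rank."""
    per_rank = vocab_size // tp_size
    start = tp_rank * per_rank
    return start, start + per_rank


def local_lookup(token_id, table, start, end):
    """Return the embedding row if this rank owns token_id, else zeros."""
    if start <= token_id < end:
        return list(table[token_id - start])
    return [0.0] * len(table[0])


full_table = [[0.0, 0.5], [1.0, 1.5], [2.0, 2.5], [3.0, 3.5]]  # vocab 4, dim 2
tp_size = 2

per_rank_rows = []
for rank in range(tp_size):
    start, end = vocab_range(len(full_table), rank, tp_size)
    shard = full_table[start:end]          # rows stored on this rank
    per_rank_rows.append(local_lookup(2, shard, start, end))

# Stand-in for a sum all-reduce: exactly one rank contributes non-zero values.
embedding = [sum(vals) for vals in zip(*per_rank_rows)]
```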
Weight tying (lm_head.weight = embed_tokens.weight) is supported and halves the memory consumed by vocabulary parameters.
Attention
File: myvllm/layers/attention.py
The Attention module dispatches to one of two Triton kernels depending on the inference phase:
- Prefill — flash_attention_prefill
- Decode — paged_attention_decode
For the prefill phase, all input tokens from all sequences are concatenated into a single flat tensor. The Triton kernel handles variable-length sequences via cu_seqlens (cumulative sequence lengths). The kernel implements online softmax in blocks (Flash Attention style) with a causal mask applied within each sequence boundary.
The Attention.forward method stores newly computed K and V into the cache before choosing which kernel to call.
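The cu_seqlens layout used for prefill can be illustrated in a few lines. This sketch only shows how the cumulative offsets delimit sequences inside the flat token tensor and how a causal mask respects those boundaries; it is not the Triton kernel, and the function names are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def build_cu_seqlens(seq_lens):
    """Cumulative offsets: sequence i occupies [cu[i], cu[i + 1])."""
    return [0] + list(accumulate(seq_lens))


def can_attend(q_idx, k_idx, cu_seqlens):
    """True if flat query token q_idx may attend to flat key token k_idx:
    both must lie in the same sequence and the key must not be in the future."""
    q_seq = bisect_right(cu_seqlens, q_idx) - 1
    k_seq = bisect_right(cu_seqlens, k_idx) - 1
    return q_seq == k_seq and k_idx <= q_idx


# Two sequences of lengths 3 and 2 packed into 5 flat token slots.
cu_seqlens = build_cu_seqlens([3, 2])
```

Token 3 is the first token of the second sequence, so it cannot attend to token 2 even though token 2 precedes it in the flat tensor.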
RotaryEmbedding
File: myvllm/layers/rotary_embedding.py
Rotary Position Embedding (RoPE) encodes token positions by rotating query and key vectors in the complex plane. The frequency spectrum spans from high frequency (captures local relationships) to low frequency (captures long-range relationships).
The cosine and sine values are precomputed for every position up to max_position and stored as a buffer (cos_sin_cache). At inference time only a lookup is needed.
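A minimal sketch of the precompute-then-lookup idea. The base of 10000, the pairing of element i with element i + head_dim/2, and the function names are assumptions; the layout used in rotary_embedding.py may differ.

```python
import math


def build_cos_sin_cache(head_dim, max_position, base=10000.0):
    """cache[pos] = (cos values, sin values), one entry per rotated pair."""
    inv_freq = [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]
    cache = []
    for pos in range(max_position):
        angles = [pos * f for f in inv_freq]
        cache.append(([math.cos(a) for a in angles],
                      [math.sin(a) for a in angles]))
    return cache


def apply_rope(vec, pos, cache):
    """Rotate each pair (vec[i], vec[i + half]) by the cached angle for pos."""
    cos, sin = cache[pos]
    half = len(vec) // 2
    x1, x2 = vec[:half], vec[half:]
    first = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
    second = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
    return first + second


cache = build_cos_sin_cache(head_dim=4, max_position=8)
q_rot = apply_rope([1.0, 2.0, 3.0, 4.0], pos=3, cache=cache)
```

Because every pair is rotated by an angle proportional to the position, the dot product between a rotated query and a rotated key depends only on the distance between their positions.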
Llama 3 NTK scaling. When is_llama3=True, inverse frequencies are adjusted before building the cache:
- High-frequency dimensions (short wavelength) are left unchanged: the model has seen many full cycles during training and can extrapolate.
- Low-frequency dimensions (long wavelength) are divided by llama3_rope_factor: the model has never seen a full cycle for these, so the position is compressed back into the training distribution.
- A smooth interpolation (smooth factor clamped to [0, 1]) is applied between the two regimes.
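The three regimes above can be expressed with a single clamped interpolation. In this sketch the parameter names low_freq_factor, high_freq_factor, and original_max_position, along with all default values, are assumptions for illustration; factor plays the role of llama3_rope_factor.

```python
import math


def llama3_scale_inv_freq(inv_freq, factor=8.0, low_freq_factor=1.0,
                          high_freq_factor=4.0, original_max_position=8192):
    """Rescale inverse frequencies according to their wavelength."""
    scaled = []
    for f in inv_freq:
        wavelen = 2 * math.pi / f
        # Number of full cycles that fit in the original context window,
        # mapped to [0, 1]: 1 keeps the frequency, 0 divides it by factor.
        smooth = (original_max_position / wavelen - low_freq_factor) / (
            high_freq_factor - low_freq_factor)
        smooth = min(1.0, max(0.0, smooth))
        scaled.append((1.0 - smooth) * f / factor + smooth * f)
    return scaled


mid_freq = 2.5 * 2 * math.pi / 8192      # 2.5 cycles fit in the old context
scaled = llama3_scale_inv_freq([1.0, mid_freq, 1e-5])
```

With these defaults, a frequency that completes at least four cycles in the original window is kept, one that completes less than a single cycle is divided by the factor, and anything in between is blended.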
SamplerLayer
File: myvllm/layers/sampler.py
Sampling converts raw logits into discrete token IDs. MiniVLLM uses the Gumbel-max trick, which is mathematically equivalent to sampling from the softmax distribution but avoids the need for an explicit torch.multinomial call.
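The Gumbel-max trick can be sketched in plain Python: add independent Gumbel noise to each temperature-scaled logit and take the argmax. The function name and the use of the random module are assumptions for illustration, and the sketch assumes temperature > 0.

```python
import math
import random


def gumbel_max_sample(logits, temperature, rng):
    """Return an index distributed as softmax(logits / temperature)."""
    best_idx, best_val = -1, -math.inf
    for i, logit in enumerate(logits):
        u = max(rng.random(), 1e-12)          # keep u strictly inside (0, 1)
        gumbel = -math.log(-math.log(u))      # standard Gumbel noise
        val = logit / temperature + gumbel
        if val > best_val:
            best_idx, best_val = i, val
    return best_idx


token = gumbel_max_sample([2.0, 1.0, 0.1], temperature=0.8,
                          rng=random.Random(0))
```

Lower temperatures widen the gaps between scaled logits relative to the noise, so the highest logit wins more often.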
SamplerLayer is called only on rank 0 (the scheduler rank). Worker GPUs compute the model forward pass but do not sample.
The weight_loader pattern
Every nn.Parameter created by the parallel layer classes has a weight_loader attribute attached at construction time. The checkpoint loader in myvllm/utils/loader.py checks for this attribute before copying weights.
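The attribute check can be sketched as below. Param is a hypothetical stand-in for nn.Parameter, and default_weight_loader, make_sharded_loader, and load_checkpoint are illustrative names rather than the functions in loader.py.

```python
class Param:
    """Hypothetical stand-in for nn.Parameter: holds data, may carry a loader."""

    def __init__(self, data):
        self.data = data


def default_weight_loader(param, loaded_weight):
    """Fallback for unsharded parameters: copy the whole tensor."""
    param.data[:] = loaded_weight


def make_sharded_loader(tp_rank, tp_size):
    """Build a loader that copies only this rank's slice along dim=0."""
    def weight_loader(param, loaded_weight):
        shard = len(loaded_weight) // tp_size
        param.data[:] = loaded_weight[tp_rank * shard:(tp_rank + 1) * shard]
    return weight_loader


def load_checkpoint(params, checkpoint):
    """Use the parameter's own loader when present, else the default copy."""
    for name, loaded_weight in checkpoint.items():
        param = params[name]
        loader = getattr(param, "weight_loader", default_weight_loader)
        loader(param, loaded_weight)


norm = Param([0.0, 0.0])                       # no loader: copied whole
proj = Param([0.0, 0.0])                       # sharded across 2 ranks
proj.weight_loader = make_sharded_loader(tp_rank=1, tp_size=2)

load_checkpoint(
    {"norm.weight": norm, "proj.weight": proj},
    {"norm.weight": [1.0, 1.0], "proj.weight": [10.0, 11.0, 12.0, 13.0]},
)
```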
A new layer type only needs to supply its own weight_loader; the loading loop is unchanged.