All layers are exported from myvllm.layers and are designed for tensor-parallel inference with torch.distributed. Each layer that performs tensor parallelism reads the process group rank and world size from torch.distributed at construction time.
Activations
SiLU-gated MLP activations
Normalization
Fused RMSNorm with residual add
Linear
Tensor-parallel column and row splits
Embeddings
Vocab-parallel lookup and LM head
RoPE
Rotary positional embeddings
Sampler
Temperature sampling
SiluAndMul
myvllm.layers.activation.SiluAndMul
A fused activation layer used in MLP blocks. Expects the gate and up projections to have been concatenated along the last dimension (as produced by MergedColumnParallelLinear). Internally it splits the input in half and applies SiLU to the first chunk, then multiplies element-wise by the second chunk.
The forward method is compiled with torch.compile.
forward
| Argument | Shape | Description |
|---|---|---|
| x | (..., 2 * d) | Concatenated gate + up projection output |
| returns | (..., d) | SiLU-gated activations |
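The split-and-gate arithmetic described above can be sketched in plain Python. This is a minimal, dependency-free illustration rather than the library's compiled implementation: the helper names `silu` and `silu_and_mul` are hypothetical, and a single 1-D list stands in for the last dimension of the tensor.

```python
import math

def silu(v: float) -> float:
    """SiLU (swish): v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def silu_and_mul(x: list[float]) -> list[float]:
    """Split x into [gate | up] halves and return silu(gate) * up."""
    if len(x) % 2 != 0:
        raise ValueError("last dimension must be even (2 * d)")
    d = len(x) // 2
    gate, up = x[:d], x[d:]
    return [silu(g) * u for g, u in zip(gate, up)]

# gate = [0.0, 1.0], up = [2.0, 3.0]; the output has d = 2 elements.
out = silu_and_mul([0.0, 1.0, 2.0, 3.0])
```

The output is half the width of the input, which matches the `(..., 2 * d)` to `(..., d)` shapes in the table.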
LayerNorm
myvllm.layers.layernorm.LayerNorm
Root Mean Square Layer Normalization (RMSNorm). Supports an optional fused residual add that adds the residual to x before normalizing and returns both the normalized output and the updated residual, eliminating a separate addition operation in the decoder layer.
The rms_forward method is compiled with torch.compile.
Constructor
Initial scale (γ) parameter of shape (hidden_size,). Copied and wrapped in nn.Parameter so that it participates in gradient computation and checkpoint loading.
Small constant added inside the square root for numerical stability.
forward
When residual is None, returns the normalized tensor directly.
When residual is provided, fuses the residual add:
- Computes x_new = x + residual.
- Normalizes x_new with RMSNorm.
- Returns (normalized, x_new); the second element is the updated residual for the next sub-layer.
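The fused path above can be sketched for a single hidden vector as follows. This is an illustrative plain-Python version, not the library's code: the function names are hypothetical and the default `eps=1e-6` is an assumption, since the source does not state a default.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm over one hidden vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def add_rms_norm(x, residual, weight, eps=1e-6):
    """Fused path: add the residual first, then normalize the sum."""
    x_new = [a + b for a, b in zip(x, residual)]
    return rms_norm(x_new, weight, eps), x_new

# x_new = [3.0, 4.0]; the normalized vector has a mean square close to 1.
normalized, new_residual = add_rms_norm([1.0, 2.0], [2.0, 2.0], weight=[1.0, 1.0])
```

Returning `x_new` alongside the normalized output lets the caller pass it straight into the next sub-layer as its residual.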
Normalized tensor, same shape as x.
Updated residual x + residual_in. Only returned when residual is passed as input.
Linear layers
All linear layers inherit from LinearBase and implement a weight_loader method that knows how to extract the correct shard from a full pre-trained checkpoint tensor and copy it into the local (already-sharded) parameter buffer.
weight_loader pattern
When loading a checkpoint, iterate over the model’s named parameters and call the custom loader if present. For layers that pack several matrices into one weight (MergedColumnParallelLinear, QKVColumnParallelLinear), the loader accepts an extra argument that identifies which sub-matrix the checkpoint tensor belongs to. See packed_module_mapping in the model classes for how names are resolved.
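The loading loop described above might look roughly like the sketch below. Everything here is a stand-in chosen for illustration: `Param`, `default_weight_loader`, and `make_shard_loader` are hypothetical names, plain lists replace tensors, and a dict replaces the model's named parameters. Only the idea of looking up a per-parameter `weight_loader` and falling back to a plain copy comes from the text.

```python
class Param:
    """Stand-in for a parameter; real code would use torch.nn.Parameter."""
    def __init__(self, size):
        self.data = [0.0] * size

def default_weight_loader(param, loaded_weight):
    """Fallback for replicated parameters: copy the checkpoint tensor as-is."""
    param.data[:] = loaded_weight

def make_shard_loader(tp_rank, tp_size):
    """Hypothetical custom loader: keep only this rank's contiguous slice."""
    def loader(param, loaded_weight):
        shard = len(loaded_weight) // tp_size
        param.data[:] = loaded_weight[tp_rank * shard:(tp_rank + 1) * shard]
    return loader

# Rank 1 of 2: "proj.weight" is sharded, "norm.weight" is replicated.
params = {"proj.weight": Param(2), "norm.weight": Param(4)}
params["proj.weight"].weight_loader = make_shard_loader(tp_rank=1, tp_size=2)

checkpoint = {
    "proj.weight": [1.0, 2.0, 3.0, 4.0],
    "norm.weight": [1.0, 1.0, 1.0, 1.0],
}
for name, param in params.items():
    # Use the parameter's own loader when it has one, else the default copy.
    loader = getattr(param, "weight_loader", default_weight_loader)
    loader(param, checkpoint[name])
```

Attaching the loader to the parameter keeps the loading loop generic: it never needs to know which layers are sharded.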
ColumnParallelLinear
myvllm.layers.linear.ColumnParallelLinear
Splits the output dimension across tensor-parallel ranks. Each GPU holds output_size / tp_size output rows. No collective communication is needed in the forward pass — outputs are naturally sharded.
Full input feature dimension (replicated on all GPUs).
Full output feature dimension before sharding. Must be divisible by tp_size.
Whether to add a bias term.
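To see why no collective is needed, the sketch below simulates two ranks in one process with plain Python lists. The helpers `matvec` and `column_shard` are hypothetical, and the weight is assumed to be stored as (output_size, input_size), so splitting the output dimension means splitting rows.

```python
def matvec(weight, x):
    """y = W @ x for a row-major weight of shape (out_features, in_features)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def column_shard(weight, tp_rank, tp_size):
    """Rows of W owned by tp_rank when the output dimension is split."""
    rows = len(weight) // tp_size
    return weight[tp_rank * rows:(tp_rank + 1) * rows]

# output_size = 4, input_size = 2
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
x = [3.0, 5.0]  # replicated on every rank

# Each rank computes its own slice of the output independently.
per_rank = [matvec(column_shard(W, rank, 2), x) for rank in range(2)]
```

Concatenating the per-rank results in rank order reproduces the unsharded output, so each rank can hand its slice directly to the next sharded layer.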
RowParallelLinear
myvllm.layers.linear.RowParallelLinear
Splits the input dimension across tensor-parallel ranks. Each GPU holds input_size / tp_size input columns. An all_reduce is performed in the forward pass to sum the partial results and produce a replicated output.
Full input feature dimension before sharding. Must be divisible by tp_size.
Output feature dimension (replicated on all GPUs after all_reduce).
Whether to add a bias term.
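The partial-sum behaviour can be checked with a small single-process simulation. This is a sketch under the same assumptions as before (hypothetical helpers, weight stored as (output_size, input_size)); an element-wise sum over the per-rank results stands in for the all_reduce.

```python
def matvec(weight, x):
    """y = W @ x for a row-major weight of shape (out_features, in_features)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def row_shard(weight, tp_rank, tp_size):
    """Columns of W owned by tp_rank when the input dimension is split."""
    cols = len(weight[0]) // tp_size
    return [row[tp_rank * cols:(tp_rank + 1) * cols] for row in weight]

# output_size = 2, input_size = 4
W = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
x = [1.0, 1.0, 2.0, 2.0]
tp_size = 2

partials = []
for rank in range(tp_size):
    cols = len(x) // tp_size
    x_local = x[rank * cols:(rank + 1) * cols]  # each rank sees its input slice
    partials.append(matvec(row_shard(W, rank, tp_size), x_local))

# Stand-in for all_reduce(SUM): add the partial outputs element-wise.
reduced = [sum(vals) for vals in zip(*partials)]
```

Each partial result already has the full output width; it is simply missing the contributions from the other ranks' input columns, which the sum restores.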
MergedColumnParallelLinear
myvllm.layers.linear.MergedColumnParallelLinear
An extension of ColumnParallelLinear that stores multiple matrices (e.g., the gate and up projections of an MLP) as a single fused weight matrix. This lets both projections be computed in one F.linear call.
The weight_loader accepts a loaded_weight_id: int argument that specifies which sub-matrix (by index into output_sizes) the incoming checkpoint tensor corresponds to.
Input feature dimension.
List of output sizes for each merged sub-matrix. For a standard gate+up MLP: [intermediate_size, intermediate_size].
Whether to add a bias term.
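One plausible way for the loader to place a sub-matrix is sketched below. The layout is an assumption, not something the source confirms: sub-matrices are taken to sit back to back in the local fused weight in `output_sizes` order, each contributing `output_sizes[i] / tp_size` rows. Both helper names are hypothetical.

```python
def merged_shard_range(output_sizes, loaded_weight_id, tp_size):
    """Row range in the local fused weight filled by sub-matrix loaded_weight_id."""
    offset = sum(output_sizes[:loaded_weight_id]) // tp_size
    size = output_sizes[loaded_weight_id] // tp_size
    return offset, offset + size

def checkpoint_rows(output_size, tp_rank, tp_size):
    """Row range of the full checkpoint tensor that belongs to tp_rank."""
    size = output_size // tp_size
    return tp_rank * size, (tp_rank + 1) * size

output_sizes = [8, 8]  # gate and up with intermediate_size = 8
gate_dst = merged_shard_range(output_sizes, 0, tp_size=2)
up_dst = merged_shard_range(output_sizes, 1, tp_size=2)
up_src = checkpoint_rows(output_sizes[1], tp_rank=1, tp_size=2)
```

Under this layout, rank 1 would copy rows 4 to 8 of the checkpoint's up projection into rows 4 to 8 of its local fused weight, directly after its gate shard.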
QKVColumnParallelLinear
myvllm.layers.linear.QKVColumnParallelLinear
A specialized column-parallel linear that packs the Q, K, and V projections into a single weight matrix. Accommodates grouped-query attention by allowing num_kv_heads < num_heads, so K and V occupy fewer output rows than Q.
Per-GPU output size: head_size * (num_heads/tp_size + 2 * num_kv_heads/tp_size).
Hidden size of the model.
Dimension of each attention head.
Total number of query heads across all tensor-parallel ranks.
Total number of key/value heads. Defaults to num_heads.
Whether to add a bias term.
The weight_loader accepts a load_weight_id: str argument, one of 'q', 'k', or 'v', to route each checkpoint tensor to the correct offset within the fused weight.
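The per-GPU size formula and the routing by `load_weight_id` can be worked through with a small helper. The function `qkv_layout` is hypothetical, and the Q, then K, then V ordering inside the fused weight is an assumption; only the total size formula is taken from the text.

```python
def qkv_layout(head_size, num_heads, num_kv_heads, tp_size):
    """Row ranges of Q, K and V inside one rank's fused weight (assumed Q|K|V order)."""
    q_rows = head_size * num_heads // tp_size
    kv_rows = head_size * num_kv_heads // tp_size
    return {
        "q": (0, q_rows),
        "k": (q_rows, q_rows + kv_rows),
        "v": (q_rows + kv_rows, q_rows + 2 * kv_rows),
    }

# Grouped-query attention: 32 query heads share 8 key/value heads.
layout = qkv_layout(head_size=128, num_heads=32, num_kv_heads=8, tp_size=2)
per_gpu_output_size = layout["v"][1]
```

With these example values each rank holds 16 query heads and 4 key/value heads, so the K and V blocks are each a quarter the size of the Q block.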
VocabParallelEmbedding
myvllm.layers.embedding_head.VocabParallelEmbedding
An embedding table that partitions the vocabulary across tensor-parallel ranks. Each GPU owns ceil(num_embeddings / tp_size) token embeddings. Tokens outside a rank’s range contribute a zero vector; an all_reduce sums contributions so that every rank receives the correct full embedding.
True vocabulary size before padding.
Embedding dimension.
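The masked lookup followed by a sum can be simulated in one process as shown below. This is an illustrative sketch: `vocab_parallel_lookup` is a hypothetical helper, plain lists stand in for tensors, and an element-wise sum stands in for the all_reduce.

```python
import math

def vocab_parallel_lookup(token_id, table_shard, tp_rank, per_rank):
    """Return this rank's contribution: the embedding row, or zeros if out of range."""
    start = tp_rank * per_rank
    if start <= token_id < start + per_rank:
        return list(table_shard[token_id - start])
    return [0.0] * len(table_shard[0])

num_embeddings, embedding_dim, tp_size = 5, 2, 2
per_rank = math.ceil(num_embeddings / tp_size)  # 3 rows per rank

# Full table where token i maps to [i, 10 * i], followed by zeroed padding rows.
full = [[float(i), 10.0 * i] for i in range(num_embeddings)]
full += [[0.0] * embedding_dim] * (per_rank * tp_size - num_embeddings)
shards = [full[r * per_rank:(r + 1) * per_rank] for r in range(tp_size)]

token_id = 4
partials = [
    vocab_parallel_lookup(token_id, shards[r], r, per_rank) for r in range(tp_size)
]
# Stand-in for all_reduce(SUM): exactly one rank contributes a non-zero row.
embedding = [sum(vals) for vals in zip(*partials)]
```

Because only the owning rank returns a non-zero row, the sum equals that row and every rank ends up with the same embedding.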
The vocabulary is padded up to the next multiple of tp_size for even sharding. The padding rows are zeroed and do not affect output correctness.
ParallelLMHead
myvllm.layers.embedding_head.ParallelLMHead
Inherits from VocabParallelEmbedding and reuses the same weight matrix for the final logit projection (weight tying). In prefill mode it automatically selects only the last token of each sequence before computing logits, reducing unnecessary computation.
In a tensor-parallel setup each rank computes logits for its vocabulary shard. Rank 0 then gathers all shards via dist.gather and concatenates them before trimming to the true vocab size.
Vocabulary size.
Model hidden size.
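Two pieces of bookkeeping from the description above, last-token selection in prefill and the gather-then-trim step, can be sketched as follows. Both helper names are hypothetical, and the sketch assumes prefill tokens from all sequences are packed back to back, which the source does not spell out.

```python
def last_token_indices(seq_lens):
    """Index of each sequence's final token in a packed prefill batch."""
    indices, end = [], 0
    for length in seq_lens:
        end += length
        indices.append(end - 1)
    return indices

def gather_logits(shard_logits, vocab_size):
    """Concatenate per-rank logit shards in rank order and drop padding columns."""
    full = [value for shard in shard_logits for value in shard]
    return full[:vocab_size]

# Three prompts of lengths 3, 2 and 4 packed into 9 token positions.
last = last_token_indices([3, 2, 4])

# vocab_size = 5 and tp_size = 2: 3 logits per rank, the last one is padding.
shards = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.0]]
logits = gather_logits(shards, vocab_size=5)
```

Selecting the last positions first means the logit projection runs on one row per sequence instead of one row per prompt token.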
RotaryEmbedding
myvllm.layers.rotary_embedding.RotaryEmbedding
Computes Rotary Position Embeddings (RoPE). Precomputes a (max_position, rotary_embedding) cache of interleaved [cos, sin] values at construction time and looks up the relevant rows at inference time via the position indices.
Supports the Llama 3 long-context frequency scaling strategy (NTK-by-parts), enabled by setting is_llama3=True.
Constructor
RoPE base frequency (e.g., 10000 for Qwen3, 500000 for Llama 3).
Number of head dimensions to apply rotary embedding to. Typically equal to head_dim.
Maximum sequence length the cache is pre-computed for.
Enable Llama 3 NTK-by-parts frequency scaling for long-context support.
forward
Looks up the cos/sin values for the given positions and applies apply_rotary_pos_emb to both query and key.
Supports both varlen (total_tokens, num_heads, head_dim) and batched (B, seq_len, num_heads, head_dim) input shapes.
| Argument | Type | Description |
|---|---|---|
| positions | torch.Tensor | 1-D integer tensor of token positions |
| query | torch.Tensor | Query tensor |
| key | torch.Tensor | Key tensor |
| returns | (q_rotated, k_rotated) | Rotated query and key tensors, same shapes as inputs |
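The rotation itself can be illustrated for a single head vector. This sketch makes two assumptions that the source does not confirm: it pairs dimension i with dimension i + rotary_dim/2 (the "rotate-half" layout), and it recomputes the angles on the fly instead of reading a precomputed cache. The function names are hypothetical and the Llama 3 frequency scaling is not included.

```python
import math

def rope_angles(position, rotary_dim, base):
    """Angles position * base^(-2i / rotary_dim) for i in [0, rotary_dim / 2)."""
    return [position * base ** (-2.0 * i / rotary_dim) for i in range(rotary_dim // 2)]

def apply_rope(vec, position, base=10000.0):
    """Rotate one head vector, pairing dimension i with i + rotary_dim / 2."""
    half = len(vec) // 2
    out = [0.0] * len(vec)
    for i, theta in enumerate(rope_angles(position, len(vec), base)):
        cos, sin = math.cos(theta), math.sin(theta)
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * cos - x2 * sin
        out[i + half] = x2 * cos + x1 * sin
    return out

q = [1.0, 0.0, 0.0, 1.0]
q_rot = apply_rope(q, position=3)
```

Each pair is rotated by a plain 2-D rotation, so the vector's norm is unchanged and the dot product between a rotated query and a rotated key depends only on the difference of their positions.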
SamplerLayer
myvllm.layers.sampler.SamplerLayer
Applies temperature scaling to logits and samples the next token using Gumbel-max sampling (equivalent to categorical sampling but numerically efficient).
The forward method is compiled with torch.compile.
forward
| Argument | Shape | Description |
|---|---|---|
| logits | (batch_size, vocab_size) | Raw logits from the LM head |
| temperature | (batch_size,) | Per-sequence temperature values. Use 1.0 for no scaling |
| returns | (batch_size,) | Sampled token IDs |
- Divide logits by temperature: logits / temperature.
- Compute softmax probabilities.
- Sample via Gumbel-max: argmax(probs / Exponential(1)), which is equivalent to multinomial sampling.
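The three steps above can be sketched for a single sequence in plain Python. The function `sample_token` is a hypothetical name, `random.expovariate` stands in for the tensor-level exponential noise, and the small clamp on the noise is an added safeguard against division by zero rather than something stated in the source.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Temperature-scale, softmax, then pick argmax(probs / Exponential(1))."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)  # subtract the max for a numerically stable softmax
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Dividing by Exp(1) noise and taking the argmax draws index i with probability probs[i].
    noisy = [p / max(rng.expovariate(1.0), 1e-10) for p in probs]
    return max(range(len(noisy)), key=noisy.__getitem__)

rng = random.Random(0)
logits = [2.0, 1.0, 0.0]
counts = [0, 0, 0]
for _ in range(20000):
    counts[sample_token(logits, 1.0, rng)] += 1
```

The equivalence holds because E / p with E drawn from Exponential(1) is exponentially distributed with rate p, and the smallest of several independent exponentials falls on index i with probability proportional to its rate. Over many draws the empirical frequencies in `counts` should therefore approach the softmax probabilities.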