## Attention

File: `nanovllm/layers/attention.py`

The Attention module wraps FlashAttention and routes to the appropriate kernel depending on whether the current step is prefill or decode.

### KV cache writing — Triton kernel

Before attention is computed, the new key and value tensors are written to the paged KV cache using a custom Triton kernel. Each Triton program handles one token. The `slot_mapping` tensor maps token positions to flat slots in the paged KV cache; a slot value of -1 indicates a padding token and is skipped.

### forward()

- Prefill uses `flash_attn_varlen_func`, which handles variable-length sequences packed into a single batch tensor via cumulative sequence length arrays (`cu_seqlens_q`, `cu_seqlens_k`).
- Decode uses `flash_attn_with_kvcache`, which reads from the paged KV cache directly, one token per sequence.
- When prefix caching is active during prefill (`block_tables is not None`), `k` and `v` are replaced with the full cache tensors so that cached key/value entries are attended to.
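The cache write described above boils down to a gather/scatter over `slot_mapping`. A minimal pure-Python sketch of that indexing logic (the real kernel does this in Triton, one program per token; `write_to_kv_cache` and the toy shapes here are illustrative, not the actual kernel signature):

```python
def write_to_kv_cache(k, v, k_cache, v_cache, slot_mapping):
    """Copy each token's key/value row into its flat cache slot.

    k, v: per-token rows (one list per token).
    k_cache, v_cache: slot-indexed flat caches.
    slot_mapping: flat slot per token; -1 marks padding (skipped).
    """
    for token_idx, slot in enumerate(slot_mapping):
        if slot == -1:  # padding token: nothing to write
            continue
        k_cache[slot] = k[token_idx]
        v_cache[slot] = v[token_idx]

# Toy example: 3 tokens with head_dim=1; token 1 is padding.
k = [[1.0], [2.0], [3.0]]
v = [[10.0], [20.0], [30.0]]
k_cache = [[0.0] for _ in range(8)]
v_cache = [[0.0] for _ in range(8)]
slot_mapping = [5, -1, 2]
write_to_kv_cache(k, v, k_cache, v_cache, slot_mapping)
# slots 5 and 2 now hold tokens 0 and 2; token 1 was skipped
```

In the Triton version each of these loop iterations is an independent program, so writes for all tokens happen in parallel.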
## Linear Layers

File: `nanovllm/layers/linear.py`

All linear layers are tensor-parallel aware and implement a `weight_loader` hook used during model loading to shard weights correctly across ranks.

### QKVParallelLinear

A column-parallel linear layer that fuses the Q, K, and V projections into a single weight matrix. On each TP rank, only the shard corresponding to that rank's heads is stored. `weight_loader` shards Q, K, and V independently by shard ID (`"q"`, `"k"`, `"v"`).

### MergedColumnParallelLinear

Used for fused gate+up projections in the MLP. Accepts a list of `output_sizes` (one per merged sub-weight) and shards each sub-weight independently.

### RowParallelLinear

Used for output projections. Each rank holds a shard of the input dimension. Results are summed across ranks via `dist.all_reduce`. Bias is only added by rank 0 to avoid double-counting after the all-reduce.
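The QKV sharding arithmetic can be sketched in plain Python. This is an illustrative reconstruction of the offsets a `weight_loader` like QKVParallelLinear's must compute (the function name and shapes here are hypothetical; the real code slices torch tensors):

```python
def qkv_shard_copy(shard_id, full_weight, local_buf, rank, tp_size,
                   num_heads, num_kv_heads, head_dim):
    """Copy this rank's slice of a full Q, K, or V weight into the
    fused [q_shard; k_shard; v_shard] buffer held locally."""
    q_rows = num_heads // tp_size * head_dim      # Q rows per rank
    kv_rows = num_kv_heads // tp_size * head_dim  # K or V rows per rank
    sizes = {"q": q_rows, "k": kv_rows, "v": kv_rows}
    # where each shard lives inside the fused local buffer
    offsets = {"q": 0, "k": q_rows, "v": q_rows + kv_rows}
    n = sizes[shard_id]
    src = rank * n            # this rank's slice of the full weight
    dst = offsets[shard_id]
    local_buf[dst:dst + n] = full_weight[src:src + n]

# tp_size=2, 4 query heads, 2 KV heads, head_dim=1:
# rank 1 should end up with rows q2, q3, k1, v1.
local_buf = [None] * 4
for shard_id, full in [("q", ["q0", "q1", "q2", "q3"]),
                       ("k", ["k0", "k1"]),
                       ("v", ["v0", "v1"])]:
    qkv_shard_copy(shard_id, full, local_buf, rank=1, tp_size=2,
                   num_heads=4, num_kv_heads=2, head_dim=1)
```

Because Q and KV can have different head counts (grouped-query attention), the three shards are sized and offset independently, which is why the loader needs the per-shard ID.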
## RMSNorm

File: `nanovllm/layers/layernorm.py`

RMSNorm implements Root Mean Square Layer Normalization. It provides two `@torch.compile`-decorated paths: a standard forward pass and a fused residual+norm path that avoids a separate addition. When `residual` is passed, `forward` returns `(normed_x, updated_residual)`. This pattern allows the residual stream to be carried separately and added just before each norm, reducing memory bandwidth.
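The two paths can be sketched for a single vector in pure Python (the real module operates on torch tensors under `@torch.compile`; this is a minimal sketch of the math):

```python
import math

def rms_norm(x, weight, eps=1e-6, residual=None):
    """Sketch of RMSNorm's two paths.

    Plain path: normalize x by its root-mean-square and scale by weight.
    Fused path: add the residual first, return the normalized result
    together with the updated residual, mirroring the
    (normed_x, updated_residual) convention described above.
    """
    if residual is not None:
        x = [xi + ri for xi, ri in zip(x, residual)]
        residual = x  # updated residual stream, carried to the next layer
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    normed = [v / rms * w for v, w in zip(x, weight)]
    return (normed, residual) if residual is not None else normed
```

The fused path reads `x` and `residual` once and produces both outputs, instead of materializing the sum in one pass and normalizing it in another.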
## Rotary Embedding

File: `nanovllm/layers/rotary_embedding.py`

RotaryEmbedding precomputes a `cos_sin_cache` of shape `(max_position_embeddings, 1, rotary_dim)` at construction time and looks up the relevant entries by position index at runtime.

### get_rope() factory

A module-level LRU-cached factory function ensures only one RotaryEmbedding instance is created per unique set of parameters, so the same instance is reused across all attention layers with the same config. `rope_scaling` is accepted as a parameter for API compatibility but is not implemented; passing a non-None value raises an `AssertionError`.
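The factory pattern is just `functools.lru_cache` over the constructor parameters. A minimal sketch (the parameter names here are illustrative, not the exact signature):

```python
from functools import lru_cache

class RotaryEmbedding:
    def __init__(self, head_dim, rotary_dim, max_positions, base):
        # The real module precomputes a cos_sin_cache of shape
        # (max_position_embeddings, 1, rotary_dim) here.
        self.head_dim = head_dim

@lru_cache(maxsize=None)  # one instance per unique parameter tuple
def get_rope(head_dim, rotary_dim, max_positions, base, rope_scaling=None):
    # Accepted for API compatibility only; not implemented.
    assert rope_scaling is None, "rope_scaling is not implemented"
    return RotaryEmbedding(head_dim, rotary_dim, max_positions, base)
```

Because `lru_cache` keys on the argument tuple, every attention layer constructed with the same config gets the identical object, so the cos/sin cache is built and stored once per model rather than once per layer.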
## Activation — SiluAndMul

File: `nanovllm/layers/activation.py`

SiluAndMul implements the gated activation function used in the SwiGLU MLP variant. It splits the input in half along the last dimension, applies SiLU to the first half, and element-wise multiplies by the second half. This is used after `MergedColumnParallelLinear` projects the hidden state to `2 * intermediate_size`, producing both the gate and the value in a single matmul.
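For a single vector the op reduces to a few lines. A pure-Python sketch of the math (the real module does this on torch tensors with `F.silu`):

```python
import math

def silu_and_mul(x):
    """Split the last dimension in half: SiLU(gate) * up.

    silu(g) = g * sigmoid(g). The first half is the gate projection,
    the second half is the up projection, both produced by one fused matmul.
    """
    n = len(x) // 2
    gate, up = x[:n], x[n:]
    return [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]
```

Fusing gate and up into one matmul halves the number of GEMM launches in the MLP; the split here is free because it is just a view of the output.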
## Sampler

File: `nanovllm/layers/sampler.py`

Sampler converts the final logits tensor into sampled token IDs using temperature scaling and the Gumbel-max trick (equivalent to multinomial sampling): dividing probabilities by independent Exponential(1) samples and taking the argmax samples from the categorical distribution. `clamp_min_(1e-10)` prevents division by zero. The sampler only runs on TP rank 0; other ranks return None.
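The equivalence is easy to check empirically. A stdlib sketch of the exponential-races form of the trick (the real sampler does this with batched torch ops):

```python
import random
from collections import Counter

def sample_exponential_argmax(probs, rng):
    """Divide each probability by an i.i.d. Exponential(1) sample and
    take the argmax. p_i / E_i is maximal iff E_i / p_i is minimal, and
    E_i / p_i ~ Exp(p_i), so index i wins with probability p_i."""
    return max(range(len(probs)),
               key=lambda i: probs[i] / rng.expovariate(1.0))

rng = random.Random(0)
probs = [0.7, 0.2, 0.1]
n = 100_000
counts = Counter(sample_exponential_argmax(probs, rng) for _ in range(n))
freqs = [counts[i] / n for i in range(len(probs))]
# freqs should be close to [0.7, 0.2, 0.1]
```

Note the argument also goes through for unnormalized weights, which is why the sampler can apply it directly to softmax outputs without worrying about numerical drift in the normalization.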
## Embeddings and LM Head

File: `nanovllm/layers/embed_head.py`

### VocabParallelEmbedding

Shards the vocabulary embedding table across TP ranks. Each rank stores `num_embeddings // tp_size` rows. During the forward pass, tokens outside a rank's shard are masked to zero, and the results are summed via all_reduce.

### ParallelLMHead

Subclasses VocabParallelEmbedding and uses the same weight shard for the output projection (weight tying). During prefill, it first selects only the last token of each sequence (the positions that need logits) before computing the full-vocabulary linear projection. Logit shards are gathered to rank 0 via `dist.gather` and concatenated to produce the full vocabulary logits. Ranks 1…N return None.
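The mask-then-reduce pattern in VocabParallelEmbedding can be sketched in plain Python, simulating the ranks in-process (function names here are illustrative; the real code masks torch tensors and calls `dist.all_reduce`):

```python
def embed_on_rank(token_ids, table, rank, shard_size, dim):
    """One rank's partial embedding: look up tokens inside this rank's
    vocab shard, emit zeros for tokens that belong to other ranks."""
    lo = rank * shard_size
    out = []
    for t in token_ids:
        if lo <= t < lo + shard_size:
            out.append(table[t - lo])
        else:
            out.append([0.0] * dim)  # masked: another rank owns this token
    return out

def all_reduce_sum(partials):
    """Element-wise sum across ranks — the role of dist.all_reduce."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

# vocab_size=4, tp_size=2, dim=1: rank 0 owns tokens 0-1, rank 1 owns 2-3.
tokens = [0, 3, 1]
partials = [
    embed_on_rank(tokens, [[1.0], [2.0]], rank=0, shard_size=2, dim=1),
    embed_on_rank(tokens, [[3.0], [4.0]], rank=1, shard_size=2, dim=1),
]
embedded = all_reduce_sum(partials)
```

Each token's row is nonzero on exactly one rank, so the sum reconstructs the full embedding lookup without any rank ever holding the whole table.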