Configuration
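For quick reference, the values below can be bundled into a single configuration object. A minimal sketch in Python; the field names are illustrative assumptions (not necessarily the implementation's identifiers), while the values are SmolLM2-135M's:

```python
from dataclasses import dataclass

@dataclass
class SmolLM2Config:
    # Field names are illustrative; values are SmolLM2-135M's.
    vocab_size: int = 49152      # number of token embeddings
    hidden_size: int = 576       # transformer hidden state width
    num_layers: int = 30         # decoder blocks
    num_heads: int = 9           # query heads (GQA)
    num_kv_heads: int = 3        # key/value heads (GQA)
    ffn_dim: int = 1536          # SwiGLU inner dimension
    rms_eps: float = 1e-5        # RMSNorm epsilon
    rope_base: float = 10000.0   # RoPE base frequency

cfg = SmolLM2Config()
# Each head works with head_dim = hidden_size / num_heads = 64, and each
# KV head is shared by num_heads / num_kv_heads = 3 query heads.
assert cfg.hidden_size % cfg.num_heads == 0
assert cfg.num_heads % cfg.num_kv_heads == 0
```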
- Vocabulary size — number of token embeddings. SmolLM2-135M uses 49152.
- Hidden size — dimensionality of the transformer hidden state. SmolLM2-135M uses 576.
- Number of layers — number of transformer decoder blocks. SmolLM2-135M uses 30.
- Query heads — number of query heads in grouped-query attention (GQA). SmolLM2-135M uses 9.
- KV heads — number of key/value heads (fewer than query heads for GQA). SmolLM2-135M uses 3.
- FFN inner dimension — inner dimension of the SwiGLU feed-forward network. SmolLM2-135M uses 1536.
- RMSNorm epsilon — epsilon for RMSNorm numerical stability. SmolLM2-135M uses 1e-5.
- RoPE base — base frequency for Rotary Position Embeddings (RoPE). SmolLM2-135M uses 10000.0.

Architecture
Each transformer block follows the standard LLaMA-style decoder design:

- Pre-attention RMSNorm — normalizes the input before the attention block
- Grouped-query attention — projects Q/K/V, applies RoPE, then runs causal attention with fewer KV heads than Q heads
- Residual connection — adds attention output back to the input
- Post-attention RMSNorm — normalizes before the FFN
- SwiGLU FFN — gate and up projections followed by element-wise SwiGLU, then a down projection
- Residual connection — adds FFN output back
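The block structure above can be sketched in numpy. This is an illustrative reference, not the implementation's actual code: RoPE is omitted for brevity, and the weight-dict keys are made-up names.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-5):
    # Divide each row by its root-mean-square, then apply a learned scale.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # Gate and up projections, element-wise SwiGLU, then down projection.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def causal_gqa(x, wq, wk, wv, wo, n_heads, n_kv):
    # Grouped-query causal attention (RoPE omitted for brevity).
    T, D = x.shape
    hd = D // n_heads
    q = (x @ wq).reshape(T, n_heads, hd)
    k = (x @ wk).reshape(T, n_kv, hd)
    v = (x @ wv).reshape(T, n_kv, hd)
    # Broadcast each KV head across its group of query heads.
    k = np.repeat(k, n_heads // n_kv, axis=1)
    v = np.repeat(v, n_heads // n_kv, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(hd)
    scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    out = np.einsum("hqk,khd->qhd", probs, v).reshape(T, D)
    return out @ wo

def decoder_block(x, p, n_heads, n_kv):
    # Pre-norm attention with residual, then pre-norm SwiGLU FFN with residual.
    h = x + causal_gqa(rms_norm(x, p["norm1"]),
                       p["wq"], p["wk"], p["wv"], p["wo"], n_heads, n_kv)
    return h + swiglu_ffn(rms_norm(h, p["norm2"]),
                          p["w_gate"], p["w_up"], p["w_down"])
```

Note the two residual adds wrap the normalized sub-layers, matching the pre-norm ordering listed above.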
Building the inference graph
The `build_graph` function constructs the full forward pass for a given sequence length. Its input:

- `"token_ids"` — U32 tensor of shape `[seq_len]`
Loading HuggingFace weights
Weight names follow the HuggingFace safetensors convention. Linear layer weights are stored transposed in HuggingFace format (`[out, in]`) and must be transposed when loading.
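A minimal sketch of that transposition in Python with numpy; the `load_linear` helper and the example weight name are illustrative, and with a real checkpoint the tensor dict would come from safetensors (e.g. `from safetensors.numpy import load_file`):

```python
import numpy as np

def load_linear(tensors: dict, name: str) -> np.ndarray:
    # HuggingFace stores nn.Linear weights as [out_features, in_features];
    # transpose so that `x @ w` maps [..., in] -> [..., out].
    return np.ascontiguousarray(tensors[name].T)

# In-memory dict standing in for a loaded safetensors file.
tensors = {"model.layers.0.self_attn.q_proj.weight":
           np.zeros((576, 576), dtype=np.float32)}
w = load_linear(tensors, "model.layers.0.self_attn.q_proj.weight")
```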
`lm_head.weight` is often weight-tied to `model.embed_tokens.weight` in SmolLM2. If `lm_head.weight` is absent from the checkpoint, load `model.embed_tokens.weight` transposed and use it for both.

Prefill and decode graphs
For autoregressive generation with KV cache, use the prefill and decode graph builders. The decode graph takes:

- `"token_ids"` — U32 tensor of shape `[1]` (the current token)
- `"kv_pos"` — U32 tensor of shape `[1]` (number of already-cached positions)
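What a decode step does with `kv_pos` can be sketched in numpy. This is a hedged illustration of the general KV-cache technique, not the actual graph builders: the new token's K/V are written at index `kv_pos`, and attention runs over only the `kv_pos + 1` cached positions, which makes the causal mask implicit.

```python
import numpy as np

def decode_step_attend(q, k_new, v_new, k_cache, v_cache, kv_pos):
    # q: [n_heads, head_dim] for the single current token.
    # k_new/v_new: [n_kv_heads, head_dim]; caches: [n_kv_heads, max_len, head_dim].
    # kv_pos: number of already-cached positions (= the current token's index).
    k_cache[:, kv_pos] = k_new
    v_cache[:, kv_pos] = v_new
    n_heads, hd = q.shape
    n_kv = k_cache.shape[0]
    group = n_heads // n_kv
    # Attend over cached positions only; broadcast KV heads across query groups.
    k = np.repeat(k_cache[:, : kv_pos + 1], group, axis=0)  # [n_heads, t+1, hd]
    v = np.repeat(v_cache[:, : kv_pos + 1], group, axis=0)
    scores = np.einsum("hd,htd->ht", q, k) / np.sqrt(hd)
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return np.einsum("ht,htd->hd", probs, v)  # [n_heads, head_dim]
```

Prefill fills the cache for the whole prompt in one pass; decode then repeats this single-token step, incrementing `kv_pos` each iteration.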