Configuration
SmolVLM2 uses two separate configs, one for the vision encoder and one for the text decoder, wrapped in `SmolVLM2Config`.
VisionConfig
- Input image size in pixels (square). SmolVLM2-500M uses 512.
- Patch size in pixels (square). SmolVLM2-500M uses 16, giving 32×32 = 1024 patches.
- Hidden dimensionality of the vision transformer. SmolVLM2-500M uses 768.
- Number of attention heads in the vision transformer. SmolVLM2-500M uses 12.
- Number of vision transformer layers. SmolVLM2-500M uses 12.
- Vision MLP intermediate size. SmolVLM2-500M uses 3072 (4× hidden).
- Layer norm epsilon. SmolVLM2-500M uses 1e-6.

TextConfig
- Vocabulary size. SmolVLM2-500M uses 49280.
- Text decoder hidden dimensionality. SmolVLM2-500M uses 960.
- Number of text decoder layers. SmolVLM2-500M uses 32.
- Number of query heads in GQA. SmolVLM2-500M uses 15.
- Number of KV heads. SmolVLM2-500M uses 5 (each KV head is shared by 3 query heads).
- SwiGLU FFN intermediate size. SmolVLM2-500M uses 2560.
- RMSNorm epsilon. SmolVLM2-500M uses 1e-5.
- RoPE base frequency. SmolVLM2-500M uses 100000.0.

SmolVLM2Config
- Vision encoder configuration (a `VisionConfig`).
- Text decoder configuration (a `TextConfig`).
- Pixel shuffle scale factor for spatial downsampling before the connector projection. SmolVLM2-500M uses 2.

Architecture
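Collecting the values above, here is a minimal Rust sketch of the three configs. The struct and field names are illustrative assumptions, not this crate's actual API; the values are the SmolVLM2-500M settings listed above.

```rust
#![allow(dead_code)]

// Illustrative config structs; field names are assumptions.
struct VisionConfig {
    image_size: usize,        // 512
    patch_size: usize,        // 16
    hidden_size: usize,       // 768
    num_heads: usize,         // 12
    num_layers: usize,        // 12
    intermediate_size: usize, // 3072 (4x hidden)
    layer_norm_eps: f32,      // 1e-6
}

struct TextConfig {
    vocab_size: usize,        // 49280
    hidden_size: usize,       // 960
    num_layers: usize,        // 32
    num_query_heads: usize,   // 15
    num_kv_heads: usize,      // 5
    intermediate_size: usize, // 2560
    rms_norm_eps: f32,        // 1e-5
    rope_base: f32,           // 100000.0
}

struct SmolVLM2Config {
    vision: VisionConfig,
    text: TextConfig,
    scale_factor: usize,      // 2
}

fn main() {
    let vision = VisionConfig {
        image_size: 512, patch_size: 16, hidden_size: 768,
        num_heads: 12, num_layers: 12, intermediate_size: 3072,
        layer_norm_eps: 1e-6,
    };
    // 512 / 16 = 32 patches per side -> 1024 patches total.
    assert_eq!((vision.image_size / vision.patch_size).pow(2), 1024);
}
```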
Vision encoder (SigLIP ViT)
The image is divided into `patch_size × patch_size` patches. Each patch is linearly projected into the vision hidden space. A standard transformer encoder (LayerNorm + full attention + LayerNorm + MLP) processes all patches in parallel.

Pixel shuffle connector
Vision features are spatially downsampled by `scale_factor` via pixel shuffle, then projected into the text decoder's embedding dimension with a linear layer.

Building the graph
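To make the connector's downsampling concrete before wiring the graph, here is a standalone sketch of the space-to-depth rearrangement it performs: an `(h, w, c)` feature grid becomes `(h/s, w/s, c·s²)`, trading spatial resolution for channel depth. The ordering of the `s×s` sub-positions within the channel dimension is an assumption; real implementations vary.

```rust
/// Space-to-depth rearrangement used by a pixel-shuffle connector.
/// Input is a flat row-major (h, w, c) grid; output is (h/s, w/s, c*s*s).
fn pixel_shuffle(feat: &[f32], h: usize, w: usize, c: usize, s: usize) -> Vec<f32> {
    assert!(h % s == 0 && w % s == 0);
    let (ow, oc) = (w / s, c * s * s);
    let mut out = vec![0.0; (h / s) * ow * oc];
    for y in 0..h {
        for x in 0..w {
            for ch in 0..c {
                let (oy, ox) = (y / s, x / s);
                // Which of the s*s sub-positions this pixel occupies
                // (sub-position ordering is an illustrative choice).
                let sub = (y % s) * s + (x % s);
                let och = sub * c + ch;
                out[(oy * ow + ox) * oc + och] = feat[(y * w + x) * c + ch];
            }
        }
    }
    out
}

fn main() {
    // Tiny demo: a 4x4 grid with 2 channels, scale factor 2.
    let (h, w, c, s) = (4, 4, 2, 2);
    let feat: Vec<f32> = (0..(h * w * c)).map(|i| i as f32).collect();
    let out = pixel_shuffle(&feat, h, w, c, s);
    assert_eq!(out.len(), (h / s) * (w / s) * (c * s * s)); // 2*2*8 = 32
    // Pixel (y=1, x=0) lands in output cell (0, 0) at channel offset sub*c = 4.
    assert_eq!(out[4], feat[8]);
}
```

With the 500M settings (32×32 patch grid, 768 channels, scale factor 2), this turns 1024 patch features into 256 tokens of width 3072, which the linear layer then projects down to the text hidden size of 960.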
- `"image_patches"` — F32 tensor of shape `[num_patches, patch_dim]`, where `num_patches = (image_size/patch_size)²` and `patch_dim = 3 * patch_size²`
- `"vision_features_shuffled"` — F32 tensor of shape `[num_vision_tokens, connector_input_dim]`, the pixel-shuffled vision features after preprocessing
- `"combined_embeds"` — F32 tensor of shape `[num_vision_tokens + text_seq_len, hidden_size]`, the concatenated vision and text embeddings
- `"token_ids"` — U32 tensor of shape `[text_seq_len]`
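As a sanity check, the shape arithmetic for these inputs under the SmolVLM2-500M settings works out as follows (`text_seq_len` here is an arbitrary example prompt length, not a model constant):

```rust
fn main() {
    // SmolVLM2-500M values from the configuration section.
    let (image_size, patch_size) = (512_usize, 16_usize);
    let (vision_hidden, scale_factor) = (768_usize, 2_usize);
    let text_seq_len = 10_usize; // arbitrary example prompt length

    // "image_patches": [num_patches, patch_dim]
    let num_patches = (image_size / patch_size).pow(2); // 32^2 = 1024
    let patch_dim = 3 * patch_size * patch_size;        // 3 * 16^2 = 768
    assert_eq!((num_patches, patch_dim), (1024, 768));

    // "vision_features_shuffled": [num_vision_tokens, connector_input_dim]
    let num_vision_tokens = num_patches / (scale_factor * scale_factor);    // 256
    let connector_input_dim = vision_hidden * scale_factor * scale_factor;  // 3072
    assert_eq!((num_vision_tokens, connector_input_dim), (256, 3072));

    // "combined_embeds": [num_vision_tokens + text_seq_len, hidden_size]
    println!("combined_embeds rows = {}", num_vision_tokens + text_seq_len); // 266
}
```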
Call `build_vision_encoder(&mut g, &config.vision, num_patches)` separately to build just the vision encoder subgraph.
Key parameter names
Text decoder parameters follow the pattern: `model.text_model.layers.{i}.`
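Per-layer parameter names are produced by substituting the layer index into this pattern. A small illustration; the `.input_layernorm.weight` suffix is an assumed example, not necessarily one of this checkpoint's actual tensor keys:

```rust
fn main() {
    // Expand the per-layer pattern for the first two decoder layers.
    // The suffix is illustrative; actual key names depend on the checkpoint.
    for i in 0..2 {
        let name = format!("model.text_model.layers.{i}.input_layernorm.weight");
        println!("{name}");
    }
}
```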