SmolVLM2 is a compact vision-language model. It processes image patches through a SigLIP-style vision transformer and projects the resulting features into the text model's embedding space via a pixel shuffle connector; the text decoder then attends to these features through cross-attention layers.

Configuration

SmolVLM2 uses two separate configs — one for the vision encoder and one for the text decoder — wrapped in SmolVLM2Config.

VisionConfig

  • image_size (usize): Input image size in pixels (square). SmolVLM2-500M uses 512.
  • patch_size (usize): Patch size in pixels (square). SmolVLM2-500M uses 16, giving 32×32 = 1024 patches.
  • hidden_size (usize): Hidden dimensionality of the vision transformer. SmolVLM2-500M uses 768.
  • num_attention_heads (u32): Number of attention heads in the vision transformer. SmolVLM2-500M uses 12.
  • num_hidden_layers (usize): Number of vision transformer layers. SmolVLM2-500M uses 12.
  • intermediate_size (usize): Vision MLP intermediate size. SmolVLM2-500M uses 3072 (4× hidden).
  • layer_norm_eps (f32): Layer norm epsilon. SmolVLM2-500M uses 1e-6.
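The values above pin down the vision encoder's input dimensions. A quick arithmetic check of the derived quantities for the 500M preset (plain Rust, no crate dependency):

```rust
// Derived vision-encoder dimensions for SmolVLM2-500M,
// computed from the config values listed above.
fn main() {
    let image_size: usize = 512;
    let patch_size: usize = 16;
    let hidden_size: usize = 768;
    let intermediate_size: usize = 3072;

    // Patches per side, total patch count, and flattened RGB patch length.
    let patches_per_side = image_size / patch_size;
    let num_patches = patches_per_side * patches_per_side;
    let patch_dim = 3 * patch_size * patch_size;

    assert_eq!(patches_per_side, 32);
    assert_eq!(num_patches, 1024);
    assert_eq!(patch_dim, 768);
    // MLP expansion is 4x hidden, as noted for intermediate_size.
    assert_eq!(intermediate_size, 4 * hidden_size);
    println!("{num_patches} patches of dim {patch_dim}");
}
```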

TextConfig

  • vocab_size (usize): Vocabulary size. SmolVLM2-500M uses 49280.
  • hidden_size (usize): Text decoder hidden dimensionality. SmolVLM2-500M uses 960.
  • num_hidden_layers (usize): Number of text decoder layers. SmolVLM2-500M uses 32.
  • num_attention_heads (u32): Number of query heads in grouped-query attention (GQA). SmolVLM2-500M uses 15.
  • num_key_value_heads (u32): Number of KV heads. SmolVLM2-500M uses 5.
  • intermediate_size (usize): SwiGLU FFN intermediate size. SmolVLM2-500M uses 2560.
  • rms_norm_eps (f32): RMSNorm epsilon. SmolVLM2-500M uses 1e-5.
  • rope_theta (f32): RoPE base frequency. SmolVLM2-500M uses 100000.0.
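The attention sizes above imply a per-head dimension and a GQA group size. A small sanity check, assuming the usual LLaMA-style convention that head_dim = hidden_size / num_attention_heads (an assumption; the crate may define it differently):

```rust
// GQA shape arithmetic for the SmolVLM2-500M text decoder.
fn main() {
    let hidden_size: usize = 960;
    let num_attention_heads: usize = 15;
    let num_key_value_heads: usize = 5;

    // Per-head dimension, and how many query heads share each KV head.
    let head_dim = hidden_size / num_attention_heads;
    let gqa_group = num_attention_heads / num_key_value_heads;

    assert_eq!(head_dim, 64);
    assert_eq!(gqa_group, 3);
    println!("head_dim = {head_dim}; {gqa_group} query heads per KV head");
}
```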

SmolVLM2Config

  • vision (VisionConfig): Vision encoder configuration.
  • text (TextConfig): Text decoder configuration.
  • scale_factor (usize): Pixel shuffle scale factor for spatial downsampling before the connector projection. SmolVLM2-500M uses 2.
Use the built-in preset:
use meganeura::models::smolvlm2::SmolVLM2Config;

let config = SmolVLM2Config::smolvlm2_500m();

Architecture

1. Vision encoder (SigLIP ViT)

The image is divided into patch_size × patch_size patches. Each patch is linearly projected into the vision hidden space. A standard transformer encoder (LayerNorm + full attention + LayerNorm + MLP) processes all patches in parallel.
2. Pixel shuffle connector

Vision features are spatially downsampled by scale_factor via pixel shuffle, then projected into the text decoder’s embedding dimension with a linear layer.
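The space-to-depth rearrangement behind this step can be sketched as follows. This is a minimal illustration assuming a flat row-major [h][w][c] layout for the patch-feature grid; the crate's actual implementation and memory layout may differ:

```rust
// Space-to-depth pixel shuffle: an (h, w, c) feature grid becomes
// (h/s, w/s, c*s*s), trading spatial resolution for channel depth.
// Layout assumption: row-major [h][w][c] in a flat Vec<f32>.
fn pixel_shuffle(x: &[f32], h: usize, w: usize, c: usize, s: usize)
    -> (Vec<f32>, usize, usize, usize)
{
    assert!(h % s == 0 && w % s == 0);
    let (oh, ow, oc) = (h / s, w / s, c * s * s);
    let mut out = vec![0.0; oh * ow * oc];
    for i in 0..oh {
        for j in 0..ow {
            for di in 0..s {
                for dj in 0..s {
                    for ch in 0..c {
                        // Each s x s spatial neighborhood is stacked into channels.
                        let src = ((i * s + di) * w + (j * s + dj)) * c + ch;
                        let dst = (i * ow + j) * oc + (di * s + dj) * c + ch;
                        out[dst] = x[src];
                    }
                }
            }
        }
    }
    (out, oh, ow, oc)
}

fn main() {
    // SmolVLM2-500M: a 32x32 grid of 768-dim features with scale_factor 2
    // yields 16x16 = 256 vision tokens of dim 3072 before the projection.
    let (h, w, c, s) = (32, 32, 768, 2);
    let x = vec![0.0f32; h * w * c];
    let (_out, oh, ow, oc) = pixel_shuffle(&x, h, w, c, s);
    assert_eq!((oh, ow, oc), (16, 16, 3072));
    assert_eq!(oh * ow, 256); // num_vision_tokens
}
```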
3. Text decoder with cross-attention

The text decoder uses LLaMA-3 style transformer blocks. Image-conditioned layers include cross-attention over the projected vision features in addition to causal self-attention over the text tokens.

Building the graph

use meganeura::{Graph, build_inference_session};
use meganeura::models::smolvlm2::{SmolVLM2Config, build_graph};

let config = SmolVLM2Config::smolvlm2_500m();
let text_seq_len = 128;

let mut g = Graph::new();
let logits = build_graph(&mut g, &config, text_seq_len);
g.set_outputs(vec![logits]);

let session = build_inference_session(&g);
The graph expects the following inputs:
  • "image_patches" — F32 tensor of shape [num_patches, patch_dim] where num_patches = (image_size/patch_size)² and patch_dim = 3 * patch_size²
  • "vision_features_shuffled" — F32 tensor of shape [num_vision_tokens, connector_input_dim] — pixel-shuffled vision features after preprocessing
  • "combined_embeds" — F32 tensor of shape [num_vision_tokens + text_seq_len, hidden_size] — concatenated vision and text embeddings
  • "token_ids" — U32 tensor of shape [text_seq_len]
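Concretely, for the 500M preset with text_seq_len = 128, the input shapes work out as follows. The derivations num_vision_tokens = num_patches / scale_factor² and connector_input_dim = vision hidden × scale_factor² follow from the pixel shuffle description above, and combined_embeds is assumed to use the text decoder's hidden size:

```rust
// Input-shape arithmetic for SmolVLM2-500M with text_seq_len = 128,
// derived from the config tables above.
fn main() {
    let (image_size, patch_size, vision_hidden) = (512usize, 16usize, 768usize);
    let (text_hidden, scale_factor) = (960usize, 2usize);
    let text_seq_len = 128usize;

    let num_patches = (image_size / patch_size).pow(2);   // image_patches rows
    let patch_dim = 3 * patch_size * patch_size;          // image_patches cols
    let num_vision_tokens = num_patches / (scale_factor * scale_factor);
    let connector_input_dim = vision_hidden * scale_factor * scale_factor;

    assert_eq!((num_patches, patch_dim), (1024, 768));
    assert_eq!((num_vision_tokens, connector_input_dim), (256, 3072));
    // combined_embeds: [num_vision_tokens + text_seq_len, text_hidden]
    assert_eq!((num_vision_tokens + text_seq_len, text_hidden), (384, 960));
}
```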
You can also call build_vision_encoder(&mut g, &config.vision, num_patches) separately to build just the vision encoder subgraph.

Key parameter names

Vision encoder parameters follow the pattern:
model.vision_model.encoder.layers.{i}.self_attn.q_proj.weight
model.vision_model.encoder.layers.{i}.layer_norm1.weight
model.vision_model.encoder.layers.{i}.mlp.fc1.weight
Text decoder parameters mirror the SmolLM2 naming under model.text_model.layers.{i}.
Use examples/smolvlm2.rs in the repository for a complete weight-loading and inference walkthrough including image preprocessing.
