SmolVLM2 is a compact vision-language model. It processes image patches through a SigLIP-style vision transformer and projects the resulting features into the text model's embedding space via a pixel shuffle connector; the text decoder then attends to these features through cross-attention layers.

Configuration

SmolVLM2 uses two separate configs — one for the vision encoder and one for the text decoder — wrapped in SmolVLM2Config.

VisionConfig

  • image_size (usize): Input image size in pixels (square). SmolVLM2-500M uses 512.
  • patch_size (usize): Patch size in pixels (square). SmolVLM2-500M uses 16, giving 32×32 = 1024 patches.
  • hidden_size (usize): Hidden dimensionality of the vision transformer. SmolVLM2-500M uses 768.
  • num_attention_heads (u32): Number of attention heads in the vision transformer. SmolVLM2-500M uses 12.
  • num_hidden_layers (usize): Number of vision transformer layers. SmolVLM2-500M uses 12.
  • intermediate_size (usize): Vision MLP intermediate size. SmolVLM2-500M uses 3072 (4× hidden).
  • layer_norm_eps (f32): Layer norm epsilon. SmolVLM2-500M uses 1e-6.
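The values above pin down the vision encoder's input dimensions. A quick arithmetic check of the derived quantities for the 500M preset (plain Rust, no crate dependency):

```rust
// Derived vision-encoder dimensions for SmolVLM2-500M,
// computed from the config values listed above.
fn main() {
    let image_size: usize = 512;
    let patch_size: usize = 16;
    let hidden_size: usize = 768;
    let intermediate_size: usize = 3072;

    // Patches per side, total patch count, and flattened RGB patch length.
    let patches_per_side = image_size / patch_size;
    let num_patches = patches_per_side * patches_per_side;
    let patch_dim = 3 * patch_size * patch_size;

    assert_eq!(patches_per_side, 32);
    assert_eq!(num_patches, 1024);
    assert_eq!(patch_dim, 768);
    // MLP expansion is 4x hidden, as noted for intermediate_size.
    assert_eq!(intermediate_size, 4 * hidden_size);
    println!("{num_patches} patches of dim {patch_dim}");
}
```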

TextConfig

  • vocab_size (usize): Vocabulary size. SmolVLM2-500M uses 49280.
  • hidden_size (usize): Text decoder hidden dimensionality. SmolVLM2-500M uses 960.
  • num_hidden_layers (usize): Number of text decoder layers. SmolVLM2-500M uses 32.
  • num_attention_heads (u32): Number of query heads in grouped-query attention (GQA). SmolVLM2-500M uses 15.
  • num_key_value_heads (u32): Number of KV heads. SmolVLM2-500M uses 5.
  • intermediate_size (usize): SwiGLU FFN intermediate size. SmolVLM2-500M uses 2560.
  • rms_norm_eps (f32): RMSNorm epsilon. SmolVLM2-500M uses 1e-5.
  • rope_theta (f32): RoPE base frequency. SmolVLM2-500M uses 100000.0.
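The attention sizes above imply a per-head dimension and a GQA group size. A small sanity check, assuming the usual LLaMA-style convention that head_dim = hidden_size / num_attention_heads (an assumption; the crate may define it differently):

```rust
// GQA shape arithmetic for the SmolVLM2-500M text decoder.
fn main() {
    let hidden_size: usize = 960;
    let num_attention_heads: usize = 15;
    let num_key_value_heads: usize = 5;

    // Per-head dimension, and how many query heads share each KV head.
    let head_dim = hidden_size / num_attention_heads;
    let gqa_group = num_attention_heads / num_key_value_heads;

    assert_eq!(head_dim, 64);
    assert_eq!(gqa_group, 3);
    println!("head_dim = {head_dim}; {gqa_group} query heads per KV head");
}
```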

SmolVLM2Config

  • vision (VisionConfig): Vision encoder configuration.
  • text (TextConfig): Text decoder configuration.
  • scale_factor (usize): Pixel shuffle scale factor for spatial downsampling before the connector projection. SmolVLM2-500M uses 2.
Use the built-in preset:
use meganeura::models::smolvlm2::SmolVLM2Config;

let config = SmolVLM2Config::smolvlm2_500m();

Architecture

1. Vision encoder (SigLIP ViT)

The image is divided into patch_size × patch_size patches. Each patch is linearly projected into the vision hidden space. A standard transformer encoder (LayerNorm + full attention + LayerNorm + MLP) processes all patches in parallel.
2. Pixel shuffle connector

Vision features are spatially downsampled by scale_factor via pixel shuffle, then projected into the text decoder’s embedding dimension with a linear layer.
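The space-to-depth rearrangement behind this step can be sketched as follows. This is a minimal illustration assuming a flat row-major [h][w][c] layout for the patch-feature grid; the crate's actual implementation and memory layout may differ:

```rust
// Space-to-depth pixel shuffle: an (h, w, c) feature grid becomes
// (h/s, w/s, c*s*s), trading spatial resolution for channel depth.
// Layout assumption: row-major [h][w][c] in a flat Vec<f32>.
fn pixel_shuffle(x: &[f32], h: usize, w: usize, c: usize, s: usize)
    -> (Vec<f32>, usize, usize, usize)
{
    assert!(h % s == 0 && w % s == 0);
    let (oh, ow, oc) = (h / s, w / s, c * s * s);
    let mut out = vec![0.0; oh * ow * oc];
    for i in 0..oh {
        for j in 0..ow {
            for di in 0..s {
                for dj in 0..s {
                    for ch in 0..c {
                        // Each s x s spatial neighborhood is stacked into channels.
                        let src = ((i * s + di) * w + (j * s + dj)) * c + ch;
                        let dst = (i * ow + j) * oc + (di * s + dj) * c + ch;
                        out[dst] = x[src];
                    }
                }
            }
        }
    }
    (out, oh, ow, oc)
}

fn main() {
    // SmolVLM2-500M: a 32x32 grid of 768-dim features with scale_factor 2
    // yields 16x16 = 256 vision tokens of dim 3072 before the projection.
    let (h, w, c, s) = (32, 32, 768, 2);
    let x = vec![0.0f32; h * w * c];
    let (_out, oh, ow, oc) = pixel_shuffle(&x, h, w, c, s);
    assert_eq!((oh, ow, oc), (16, 16, 3072));
    assert_eq!(oh * ow, 256); // num_vision_tokens
}
```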
3. Text decoder with cross-attention

The text decoder uses LLaMA-3 style transformer blocks. Image-conditioned layers include cross-attention over the projected vision features in addition to causal self-attention over the text tokens.

Building the graph

use meganeura::{Graph, build_inference_session};
use meganeura::models::smolvlm2::{SmolVLM2Config, build_graph};

let config = SmolVLM2Config::smolvlm2_500m();
let text_seq_len = 128;

let mut g = Graph::new();
let logits = build_graph(&mut g, &config, text_seq_len);
g.set_outputs(vec![logits]);

let session = build_inference_session(&g);
The graph expects the following inputs:
  • "image_patches" — F32 tensor of shape [num_patches, patch_dim] where num_patches = (image_size/patch_size)² and patch_dim = 3 * patch_size²
  • "vision_features_shuffled" — F32 tensor of shape [num_vision_tokens, connector_input_dim] — pixel-shuffled vision features after preprocessing
  • "combined_embeds" — F32 tensor of shape [num_vision_tokens + text_seq_len, hidden_size] — concatenated vision and text embeddings
  • "token_ids" — U32 tensor of shape [text_seq_len]
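Concretely, for the 500M preset with text_seq_len = 128, the input shapes work out as follows. The derivations num_vision_tokens = num_patches / scale_factor² and connector_input_dim = vision hidden × scale_factor² follow from the pixel shuffle description above, and combined_embeds is assumed to use the text decoder's hidden size:

```rust
// Input-shape arithmetic for SmolVLM2-500M with text_seq_len = 128,
// derived from the config tables above.
fn main() {
    let (image_size, patch_size, vision_hidden) = (512usize, 16usize, 768usize);
    let (text_hidden, scale_factor) = (960usize, 2usize);
    let text_seq_len = 128usize;

    let num_patches = (image_size / patch_size).pow(2);   // image_patches rows
    let patch_dim = 3 * patch_size * patch_size;          // image_patches cols
    let num_vision_tokens = num_patches / (scale_factor * scale_factor);
    let connector_input_dim = vision_hidden * scale_factor * scale_factor;

    assert_eq!((num_patches, patch_dim), (1024, 768));
    assert_eq!((num_vision_tokens, connector_input_dim), (256, 3072));
    // combined_embeds: [num_vision_tokens + text_seq_len, text_hidden]
    assert_eq!((num_vision_tokens + text_seq_len, text_hidden), (384, 960));
}
```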
You can also call build_vision_encoder(&mut g, &config.vision, num_patches) separately to build just the vision encoder subgraph.

Key parameter names

Vision encoder parameters follow the pattern:
model.vision_model.encoder.layers.{i}.self_attn.q_proj.weight
model.vision_model.encoder.layers.{i}.layer_norm1.weight
model.vision_model.encoder.layers.{i}.mlp.fc1.weight
Text decoder parameters mirror the SmolLM2 naming under model.text_model.layers.{i}.
Use examples/smolvlm2.rs in the repository for a complete weight-loading and inference walkthrough including image preprocessing.
