Transformers, originally developed for NLP, have reshaped computer vision. The Vision Transformer (ViT) treats an image as a sequence of patches and processes them with standard self-attention, rivaling and often exceeding CNN performance on large-scale benchmarks.
Self-attention mechanism
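The core of the transformer is the scaled dot-product attention operation:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where $Q$ (queries), $K$ (keys), and $V$ (values) are linear projections of the input sequence. The scaling by $1/\sqrt{d_k}$, where $d_k$ is the key dimension, prevents dot products from growing large in magnitude and saturating the softmax.

Multi-head attention runs $h$ independent attention functions in parallel and concatenates the outputs:

$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(XW_i^Q, XW_i^K, XW_i^V)$$

where $X$ is the input sequence. This lets the model jointly attend to information from different representation subspaces.
Transformer encoder block
Each encoder layer consists of:
- Multi-head self-attention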
- Add & LayerNorm (residual connection)
- Feed-forward network (two linear layers with GELU)
- Add & LayerNorm
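The attention operation and the encoder layer listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the weights are random, projection biases and the learnable LayerNorm scale and shift are omitted, and all names (`attention`, `multi_head_attention`, `encoder_layer`, the parameter keys) are chosen here for illustration. It follows the post-norm ordering of the list; many ViT implementations instead apply LayerNorm before each sub-layer.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention; q, k, v have shape (..., seq_len, d_k)."""
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); all weight matrices: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):  # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    heads, _ = attention(split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v))
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate the heads
    return concat @ w_o

def layer_norm(x, eps=1e-6):
    # Simplification: no learnable scale or shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def encoder_layer(x, p, num_heads):
    """Attention -> Add & LayerNorm -> feed-forward -> Add & LayerNorm."""
    x = layer_norm(x + multi_head_attention(x, p["w_q"], p["w_k"], p["w_v"], p["w_o"], num_heads))
    ffn = gelu(x @ p["w_1"] + p["b_1"]) @ p["w_2"] + p["b_2"]
    return layer_norm(x + ffn)

rng = np.random.default_rng(0)
seq_len, d_model, d_ff, num_heads = 5, 8, 32, 2
shapes = {
    "w_q": (d_model, d_model), "w_k": (d_model, d_model),
    "w_v": (d_model, d_model), "w_o": (d_model, d_model),
    "w_1": (d_model, d_ff), "b_1": (d_ff,),
    "w_2": (d_ff, d_model), "b_2": (d_model,),
}
params = {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}

x = rng.normal(size=(seq_len, d_model))
out = encoder_layer(x, params, num_heads)
print(out.shape)  # (5, 8): the layer preserves the sequence shape
```

Because the output has the same shape as the input, layers of this kind can be stacked to any depth.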
Vision Transformer (ViT)
ViT adapts the transformer encoder to images through three steps:
Patchify
Divide the $H \times W$ image into non-overlapping patches of size $P \times P$. This produces $N = HW/P^2$ patches. Standard ViT-B/16 uses $P = 16$ on $224 \times 224$ images, giving $N = 196$ patches.
Embed
Flatten each patch to a vector and project it linearly to dimension $D$. Prepend a learnable [CLS] token whose final representation is used for classification. Add positional embeddings (learned 1D or 2D) to preserve spatial information.
Encode
Pass the resulting sequence of $N + 1$ tokens through the transformer encoder. The final [CLS] representation feeds the classification head.

ViT requires large pretraining datasets (such as JFT-300M or ImageNet-21k) to outperform CNNs. On smaller datasets, convolutional inductive biases (translation equivariance, locality) give CNNs an advantage. Hybrid models combine CNN feature extractors with transformer encoders.
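The patchify and embed steps for ViT-B/16 can be checked with a short NumPy sketch. The projection matrix, [CLS] token, and positional embeddings are learned parameters in a real model; here they are random or zero placeholders, and the function names `patchify` and `embed` are illustrative rather than taken from any library.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into (N, P*P*C) flattened, non-overlapping patches."""
    h, w, c = image.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "image size must be divisible by the patch size"
    patches = image.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4)      # (H/P, W/P, P, P, C)
    return patches.reshape(-1, p * p * c)           # (N, P*P*C), row-major patch order

def embed(patches, w_proj, cls_token, pos_embed):
    """Project patches to dimension D, prepend the [CLS] token, add positions."""
    tokens = patches @ w_proj                       # (N, D)
    tokens = np.concatenate([cls_token, tokens])    # (N + 1, D)
    return tokens + pos_embed

rng = np.random.default_rng(0)
H = W = 224
P, C, D = 16, 3, 768                                # ViT-B/16: 16x16 patches, width 768

image = rng.random((H, W, C))
patches = patchify(image, P)

# Placeholders for parameters that would normally be learned.
w_proj = rng.normal(scale=0.02, size=(P * P * C, D))
cls_token = np.zeros((1, D))
pos_embed = rng.normal(scale=0.02, size=(patches.shape[0] + 1, D))

tokens = embed(patches, w_proj, cls_token, pos_embed)
print(patches.shape)  # (196, 768): N = 224 * 224 / 16**2 = 196
print(tokens.shape)   # (197, 768): 196 patches plus the [CLS] token
```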
ViT inference with HuggingFace
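A minimal sketch of ViT inference with the `transformers` library follows. It assumes `torch`, `transformers`, and `Pillow` are installed and uses the public `google/vit-base-patch16-224` checkpoint (ViT-B/16 fine-tuned on ImageNet-1k). The helper names `load_vit`, `load_image`, and `classify` are illustrative, not part of the library.

```python
import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

CHECKPOINT = "google/vit-base-patch16-224"

def load_vit(checkpoint=CHECKPOINT):
    """Download (or load from the local cache) the model and its image processor."""
    processor = ViTImageProcessor.from_pretrained(checkpoint)
    model = ViTForImageClassification.from_pretrained(checkpoint).eval()
    return model, processor

def load_image(path):
    """Open an image file and convert it to RGB."""
    return Image.open(path).convert("RGB")

def classify(image, model, processor, top_k=5):
    """Return the top_k (label, probability) pairs for a PIL image."""
    inputs = processor(images=image, return_tensors="pt")   # resize and normalize
    with torch.no_grad():
        logits = model(**inputs).logits                     # shape (1, num_labels)
    probs = logits.softmax(dim=-1)[0]
    top = probs.topk(top_k)
    return [
        (model.config.id2label[idx], prob)
        for prob, idx in zip(top.values.tolist(), top.indices.tolist())
    ]
```

A typical call would be `model, processor = load_vit()` followed by `classify(load_image(path), model, processor)`, where `path` points to an image of your choice. The first call downloads the checkpoint.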
CLIP: contrastive image-text pretraining
CLIP (Contrastive Language-Image Pretraining, OpenAI 2021) trains an image encoder and a text encoder jointly on 400 million (image, text) pairs from the internet. The objective pulls matching pairs together and pushes non-matching pairs apart in a shared embedding space.
Training objective
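For a batch of $N$ (image, text) pairs, CLIP maximizes the cosine similarity of the $N$ correct pairs while minimizing similarity for the $N^2 - N$ incorrect pairs:

$$\mathcal{L} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp(I_i \cdot T_i / \tau)}{\sum_{j=1}^{N}\exp(I_i \cdot T_j / \tau)} + \log\frac{\exp(I_i \cdot T_i / \tau)}{\sum_{j=1}^{N}\exp(I_j \cdot T_i / \tau)}\right]$$

where $I_i$ and $T_i$ are the $\ell_2$-normalized image and text embeddings, and $\tau$ is a learned temperature.
Zero-shot classification with CLIP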
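Zero-shot classification reuses the shared embedding space: each class name is turned into a text prompt, and the image is assigned to the prompt with the highest similarity. The sketch below assumes `torch` and `transformers` are installed and uses the public `openai/clip-vit-base-patch32` checkpoint. The helper names and the prompt template are illustrative choices.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

CHECKPOINT = "openai/clip-vit-base-patch32"

def load_clip(checkpoint=CHECKPOINT):
    """Download (or load from the local cache) CLIP and its processor."""
    model = CLIPModel.from_pretrained(checkpoint).eval()
    processor = CLIPProcessor.from_pretrained(checkpoint)
    return model, processor

def zero_shot_classify(image, class_names, model, processor, template="a photo of a {}"):
    """Return a {class_name: probability} dict for one PIL image."""
    prompts = [template.format(name) for name in class_names]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image holds temperature-scaled cosine similarities,
    # shape (num_images, num_prompts); softmax turns them into class probabilities.
    probs = outputs.logits_per_image.softmax(dim=-1)[0]
    return dict(zip(class_names, probs.tolist()))
```

For example, after `model, processor = load_clip()`, calling `zero_shot_classify(image, ["cat", "dog", "car"], model, processor)` scores a PIL image against three classes that the model was never explicitly trained to classify.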
Stable Diffusion
Stable Diffusion combines three components:

| Component | Role |
|---|---|
| CLIP text encoder | Encodes the text prompt to a conditioning vector |
| UNet denoiser | Predicts noise at each diffusion step, conditioned on the text embedding |
| VAE decoder | Decodes the denoised latent to a full-resolution image |
HuggingFace Transformers for vision
The `transformers` library provides pretrained ViT, Swin Transformer, CLIP, and many other vision models with a unified API:
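One way to see the unified API is through the `Auto*` classes, which select the right architecture from the checkpoint name. The sketch below assumes `torch` and `transformers` are installed. The two checkpoint names are public models on the HuggingFace Hub, while `load_classifier` and `predict_label` are illustrative helpers.

```python
import torch
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Different architectures, same loading and inference code.
CHECKPOINTS = {
    "vit": "google/vit-base-patch16-224",
    "swin": "microsoft/swin-tiny-patch4-window7-224",
}

def load_classifier(checkpoint):
    """Load any image-classification checkpoint together with its preprocessor."""
    processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = AutoModelForImageClassification.from_pretrained(checkpoint).eval()
    return model, processor

def predict_label(image, model, processor):
    """Return the most likely label for a PIL image."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[logits.argmax(dim=-1).item()]
```

Switching from ViT to Swin then only requires passing a different entry of `CHECKPOINTS` to `load_classifier`. The inference code stays the same.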
Resources
HuggingFace Transformers Notebook
Course notebook covering vision transformers with the HuggingFace ecosystem.
Exercise E10: Transformers
Hands-on transformer exercise using ViT and CLIP models.
Video: Transformers from Scratch
In-depth tutorial building transformers from scratch by Umar Jamil.
Diffusion Models Blog
Accessible introduction to diffusion models including DDPM and Stable Diffusion.
