Overview
The CLIP class implements the core CLIP (Contrastive Language-Image Pre-Training) model architecture. It combines a vision encoder and a text encoder to learn aligned multimodal representations through contrastive learning.
The model outputs normalized image and text features that can be compared in a shared embedding space using cosine similarity.
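As a minimal sketch of that comparison, the NumPy snippet below uses random arrays as stand-ins for image and text features (they are not produced by the model). Once each row is L2-normalized, a plain matrix product gives the cosine similarity of every image-text pair.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in features: 2 images and 3 texts in a 512-dimensional space.
image_features = rng.normal(size=(2, 512))
text_features = rng.normal(size=(3, 512))


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


image_features = l2_normalize(image_features)
text_features = l2_normalize(text_features)

# For unit-norm rows the dot product equals the cosine similarity.
similarity = image_features @ text_features.T  # shape (2, 3)
```

Each entry of `similarity` lies in [-1, 1]; row i ranks the three texts against image i.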
Class Definition
Initialization Parameters
- embed_dim: Dimensionality of the joint embedding space for image and text features.
- vision_cfg: Configuration object for the vision encoder. Controls architecture (ViT, ResNet, or timm model), layer depth, width, and attention settings.
- text_cfg: Configuration object for the text encoder. Specifies transformer architecture, vocabulary size, context length, and pooling strategy.
- quick_gelu: Use QuickGELU activation (as in the original OpenAI models) instead of standard GELU. QuickGELU is less memory efficient but maintains compatibility with OpenAI checkpoints.
- init_logit_scale: Initial value for the learned temperature parameter (logit scale), which controls the sharpness of the similarity distribution.
- init_logit_bias: Initial value for an optional learnable bias term added to the logits. When None, no bias is used.
- nonscalar_logit_scale: If True, logit_scale has shape [1] instead of []. Some training frameworks require explicit dimensions.
- cast_dtype: Precision for model computations (e.g., torch.float16, torch.bfloat16). Used for mixed precision training.
- output_dict: If True, forward() returns a dictionary with named outputs. If False, it returns a tuple.
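To illustrate what the logit scale does, the sketch below assumes the value is stored in log space and exponentiated before use, with `np.log(1 / 0.07)` as the starting point (the temperature reported for the original CLIP; treat it as an assumption for this class). The similarity scores are made up for the example.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Assumed initial value, stored in log space.
init_logit_scale = np.log(1 / 0.07)
scale = np.exp(init_logit_scale)  # multiplier applied to cosine similarities

# Made-up cosine similarities between one image and three captions.
similarities = np.array([0.30, 0.25, 0.10])

unscaled_probs = softmax(similarities)
scaled_probs = softmax(scale * similarities)
```

Multiplying by the scale widens the gaps between scores, so `scaled_probs` puts more mass on the best match than `unscaled_probs` does.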
Attributes
- visual: Vision encoder module (VisionTransformer, ModifiedResNet, or TimmModel)
- transformer: Text transformer encoder
- token_embedding: Text token embedding layer
- positional_embedding: Learned positional embeddings for text
- ln_final: Final layer normalization for text features
- text_projection: Projection matrix from text features to joint embedding space
- logit_scale: Learned temperature parameter, stored in log space (the model exponentiates the stored value before scaling the logits)
- logit_bias: Optional learned bias (if init_logit_bias is not None)
- context_length: Maximum text sequence length
- vocab_size: Size of text vocabulary
Key Methods
encode_image
- image: Image tensor of shape (batch_size, channels, height, width)
- normalize: If True, L2-normalizes the output features
Returns: tensor of shape (batch_size, embed_dim)
encode_text
- text: Tokenized text tensor of shape (batch_size, context_length)
- normalize: If True, L2-normalizes the output features
Returns: tensor of shape (batch_size, embed_dim)
get_logits
- image: Image tensor
- text: Tokenized text tensor
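The computation behind the logits can be sketched in NumPy as follows. `sketch_get_logits` is a hypothetical stand-in, not the library method: it assumes the features are already L2-normalized, that the scale is stored in log space, and that the result is the image-to-text logit matrix together with its transpose.

```python
import numpy as np


def sketch_get_logits(image_features, text_features, logit_scale, logit_bias=None):
    """Hypothetical stand-in: scaled similarity logits from normalized features."""
    # exp() turns the log-space scale into a multiplier on the similarities.
    logits_per_image = np.exp(logit_scale) * image_features @ text_features.T
    if logit_bias is not None:
        logits_per_image = logits_per_image + logit_bias
    # Text-to-image logits are the transpose of image-to-text logits.
    return logits_per_image, logits_per_image.T


# Two unit-norm image features that match two unit-norm text features exactly.
image_features = np.eye(2, 4)
text_features = np.eye(2, 4)

logits_per_image, logits_per_text = sketch_get_logits(
    image_features, text_features, logit_scale=np.log(10.0)
)
```

With a scale of 10, matching pairs receive a logit of 10 and mismatched pairs receive 0.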
forward
- image: Optional image tensor
- text: Optional tokenized text tensor
- If output_dict=True: Dictionary with keys image_features, text_features, logit_scale, and optionally logit_bias
- If output_dict=False: Tuple of (image_features, text_features, logit_scale) or (image_features, text_features, logit_scale, logit_bias)
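Code that has to work with either return form can normalize it first. The helper below is a hypothetical convenience, not part of open_clip; it relies only on the key names and tuple order listed above, and uses placeholder values instead of tensors.

```python
from typing import Any, Dict, Tuple, Union


def unpack_clip_output(output: Union[Dict[str, Any], Tuple[Any, ...]]) -> Dict[str, Any]:
    """Return a dict with the same keys regardless of the output_dict setting."""
    if isinstance(output, dict):
        return dict(output)
    keys = ("image_features", "text_features", "logit_scale", "logit_bias")
    # zip stops at the shorter sequence, so a 3-tuple simply omits logit_bias.
    return dict(zip(keys, output))


# Placeholder values stand in for tensors.
from_tuple = unpack_clip_output(("img", "txt", 14.3))
from_dict = unpack_clip_output(
    {"image_features": "img", "text_features": "txt", "logit_scale": 14.3}
)
```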
forward_intermediates
lock_image_tower
- unlocked_groups: Number of layer groups to keep trainable (from the end)
- freeze_bn_stats: If True, freezes batch normalization statistics
lock_text_tower
- unlocked_layers: Number of transformer layers to keep trainable (from the end)
- freeze_layer_norm: If True, freezes layer normalization parameters
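The idea of keeping only the last few layers trainable can be sketched with plain PyTorch. This is a generic pattern on a toy layer stack, not open_clip's implementation; `freeze_all_but_last` is a hypothetical helper and it ignores batch norm statistics and layer norm handling.

```python
import torch.nn as nn


def freeze_all_but_last(layers: nn.ModuleList, unlocked: int) -> None:
    """Disable gradients for every layer except the last `unlocked` ones."""
    cutoff = len(layers) - unlocked
    for index, layer in enumerate(layers):
        trainable = index >= cutoff
        for param in layer.parameters():
            param.requires_grad = trainable


# Toy stand-in for a stack of transformer blocks.
layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(6)])
freeze_all_but_last(layers, unlocked=2)

trainable_flags = [
    all(p.requires_grad for p in layer.parameters()) for layer in layers
]
```

After the call, only the final two layers in `trainable_flags` are marked trainable.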
set_grad_checkpointing
Usage Example
Fine-tuning Example
Related
- CustomTextCLIP - Variant with separately built text tower
- CoCa - Contrastive Captioner model
- ClipLoss - Contrastive loss function for CLIP
