The `laft.clip` module provides an enhanced interface for loading and using OpenCLIP models, with improved tokenization and encoding capabilities.
load_clip

Loads a CLIP model with preprocessing transforms.

Parameters:
- Model name, in one of two formats:
  - Model only: `"ViT-B-32"` (uses default pretrained weights)
  - Model with pretrained tag: `"ViT-B-32:laion2b_s34b_b79k"` (specific checkpoint)
- Device to load the model on.
- Directory to cache downloaded model checkpoints. If `None`, uses the default cache location.
- Additional arguments passed to `open_clip.create_model_and_transforms()`.

Returns:
- The loaded CLIP model with enhanced encoding methods.
- An image preprocessing function that transforms PIL Images into tensors suitable for the model.
Usage

Load and use various CLIP models.

Supported Models

Common CLIP architectures:
- `ViT-B-32`, `ViT-B-16`, `ViT-L-14` - Vision Transformer models
- `RN50`, `RN101` - ResNet-based models
- `coca_ViT-L-14` - CoCa (Contrastive Captioners) models

Common pretrained tags:
- `:openai` - original OpenAI CLIP weights
- `:laion2b_s34b_b79k` - trained on LAION-2B
- `:laion400m_e32` - trained on LAION-400M
CLIP Model Interface

The returned CLIP model provides enhanced encoding methods.

encode_image

Encodes images into embedding vectors.

Parameters:
- Preprocessed image tensor with shape:
  - `[batch_size, 3, height, width]` for batches
  - `[3, height, width]` for single images (will be unsqueezed)
- Normalize flag: if `True`, normalizes embeddings to unit length (L2 normalization).

Returns image embeddings with shape `[batch_size, embedding_dim]`.

Usage
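The unsqueeze and L2-normalization behaviors described above can be sketched in plain Python, with shapes as tuples and embeddings as lists. This models the documented semantics only; it is not the library's implementation, and the normalize flag's actual parameter name is not specified here.

```python
import math

def ensure_batched(shape: tuple[int, ...]) -> tuple[int, ...]:
    # A single image [3, H, W] gains a leading batch dimension -> [1, 3, H, W]
    return shape if len(shape) == 4 else (1, *shape)

def l2_normalize(vec: list[float]) -> list[float]:
    # Scale the embedding to unit length, as the normalize flag would
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

print(ensure_batched((3, 224, 224)))  # (1, 3, 224, 224)
print(l2_normalize([3.0, 4.0]))       # [0.6, 0.8]
```

Unit-length embeddings make cosine similarity a plain dot product, which is why normalization is typically enabled before comparing image and text embeddings.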
encode_text

Encodes text prompts into embedding vectors, with support for ensemble encoding.

text: `IntTensor | LongTensor | Sequence[IntTensor | LongTensor] | Sequence[str] | Sequence[Sequence[str]]`

Text input in various formats:
- Tokenized tensors: pre-tokenized text as `IntTensor` or `LongTensor`
- String list: `["a photo of a cat", "a photo of a dog"]`
- Ensemble list: `[["a cat", "a feline"], ["a dog", "a canine"]]` - each inner list is averaged into one embedding

Normalize flag: if `True`, normalizes embeddings to unit length.

Returns text embeddings with shape:
- `[num_texts, embedding_dim]` for simple inputs
- `[num_ensembles, embedding_dim]` for ensemble inputs (each ensemble averaged)
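The ensemble behavior (each inner list of prompts averaged into a single embedding) can be sketched in plain Python. This assumes the common prompt-ensembling recipe of mean-then-renormalize; whether `laft.clip` re-normalizes after averaging is an assumption here, not stated in the docs above.

```python
import math

def average_ensemble(member_embeddings: list[list[float]]) -> list[float]:
    """Average one ensemble's prompt embeddings, then re-normalize.

    Illustrative sketch of prompt ensembling; the re-normalization
    step is an assumption about laft.clip's behavior.
    """
    dim = len(member_embeddings[0])
    n = len(member_embeddings)
    # Component-wise mean over the ensemble members
    mean = [sum(vec[i] for vec in member_embeddings) / n for i in range(dim)]
    # Re-normalize so the averaged embedding is unit length again
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]

# Two prompts for the same concept collapse into one unit-length embedding
print(average_ensemble([[1.0, 0.0], [0.0, 1.0]]))
```

With this recipe, an ensemble input of shape `[num_ensembles][num_prompts]` still yields `[num_ensembles, embedding_dim]`, matching the return shape documented above.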
