Overview
The zero_shot_metadata module provides pre-defined text templates and class names for zero-shot image classification. These templates are used with build_zero_shot_classifier to create robust classifiers without training.
Source: src/open_clip/zero_shot_metadata.py
Template Collections
OPENAI_IMAGENET_TEMPLATES
OPENAI_IMAGENET_TEMPLATES: Tuple[Callable[[str], str], ...]
A comprehensive collection of 80 text templates derived from OpenAI’s CLIP research. These templates provide diverse contextual variations to improve classification robustness.
Examples of templates:
lambda c: f'a photo of a {c}.'
lambda c: f'a bad photo of a {c}.'
lambda c: f'a photo of many {c}.'
lambda c: f'a sculpture of a {c}.'
lambda c: f'a low resolution photo of the {c}.'
lambda c: f'a rendering of a {c}.'
lambda c: f'graffiti of a {c}.'
lambda c: f'a cropped photo of the {c}.'
lambda c: f'a bright photo of a {c}.'
lambda c: f'a dark photo of the {c}.'
lambda c: f'a black and white photo of the {c}.'
lambda c: f'a painting of the {c}.'
lambda c: f'a {c} in a video game.'
lambda c: f'itap of a {c}.'  # "itap" abbreviates "I took a picture"
Total: 80 templates covering various visual styles, conditions, and contexts.
Source: src/open_clip/zero_shot_metadata.py:2
SIMPLE_IMAGENET_TEMPLATES
SIMPLE_IMAGENET_TEMPLATES: Tuple[Callable[[str], str], ...]
A smaller, curated subset of 7 templates from the OpenAI CLIP Prompt Engineering notebook. This provides a good balance between accuracy and computational efficiency.
Templates:
SIMPLE_IMAGENET_TEMPLATES = (
    lambda c: f'itap of a {c}.',
    lambda c: f'a bad photo of the {c}.',
    lambda c: f'a origami {c}.',
    lambda c: f'a photo of the large {c}.',
    lambda c: f'a {c} in a video game.',
    lambda c: f'art of the {c}.',
    lambda c: f'a photo of the small {c}.',
)
Source: src/open_clip/zero_shot_metadata.py:88
Reference: OpenAI CLIP Prompt Engineering Notebook
Class Names
IMAGENET_CLASSNAMES
IMAGENET_CLASSNAMES: Tuple[str, ...]
Complete list of 1,000 ImageNet class names in the standard ImageNet-1K order. These are human-readable labels corresponding to ImageNet synsets.
Examples:
"tench", "goldfish", "great white shark"
"tabby cat", "tiger cat", "Persian cat"
"golden retriever", "labrador retriever"
"laptop computer", "desktop computer"
"pizza", "cheeseburger", "ice cream"
Total: 1,000 class names covering animals, objects, vehicles, food, and more.
Source: src/open_clip/zero_shot_metadata.py:99
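Because the tuple follows the ImageNet-1K order, a predicted class index maps directly to its label, and a reverse lookup takes one dictionary. A minimal sketch follows. It uses a three-element stand-in tuple (an assumption for illustration, built from the first example labels above) rather than the real IMAGENET_CLASSNAMES, so it runs without open_clip installed:

```python
# Stand-in for IMAGENET_CLASSNAMES (illustrative only); labels come from the examples above.
classnames = ("tench", "goldfish", "great white shark")

# Index -> label: the position in the tuple is the class index.
predicted_index = 1
label = classnames[predicted_index]

# Label -> index: build the reverse mapping once and reuse it.
name_to_index = {name: i for i, name in enumerate(classnames)}

print(label)
print(name_to_index["great white shark"])
```

With the real tuple, the same two lookups apply unchanged. If a label appeared more than once, the dictionary would keep only the last index for it.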
Usage Examples
Using SIMPLE_IMAGENET_TEMPLATES
import open_clip
from open_clip.zero_shot_classifier import build_zero_shot_classifier
from open_clip.zero_shot_metadata import SIMPLE_IMAGENET_TEMPLATES
# Load model
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')
tokenizer = open_clip.get_tokenizer('ViT-B-32')
model = model.to('cuda')
# Define custom classes with simple templates
my_classes = ['cat', 'dog', 'car', 'airplane']
classifier = build_zero_shot_classifier(
    model,
    tokenizer,
    my_classes,
    SIMPLE_IMAGENET_TEMPLATES,
    device='cuda'
)
print(f"Created classifier with {len(SIMPLE_IMAGENET_TEMPLATES)} templates")
# Output: Created classifier with 7 templates
Full ImageNet Classification
import torch
import open_clip
from PIL import Image
from open_clip.zero_shot_classifier import build_zero_shot_classifier
from open_clip.zero_shot_metadata import (
    IMAGENET_CLASSNAMES,
    OPENAI_IMAGENET_TEMPLATES,
)
# Setup model
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14',
    pretrained='openai'
)
tokenizer = open_clip.get_tokenizer('ViT-L-14')
device = 'cuda'
model = model.to(device)
# Build full ImageNet classifier
print(f"Building classifier for {len(IMAGENET_CLASSNAMES)} classes...")
classifier = build_zero_shot_classifier(
    model,
    tokenizer,
    IMAGENET_CLASSNAMES,
    OPENAI_IMAGENET_TEMPLATES,
    num_classes_per_batch=50,
    device=device,
    use_tqdm=True
)
print(f"Classifier shape: {classifier.shape}")  # (768, 1000) for ViT-L-14
# Classify an image
image = preprocess(Image.open('cat.jpg')).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = model.encode_image(image, normalize=True)
    # Scaled cosine similarity between the image and every class embedding
    logits = 100.0 * image_features @ classifier
    probs = logits.softmax(dim=-1)
# Get top 5 predictions
top5_probs, top5_indices = probs[0].topk(5)
for i, (prob, idx) in enumerate(zip(top5_probs, top5_indices)):
    # Convert the index tensor to a plain int before indexing the tuple
    print(f"{i+1}. {IMAGENET_CLASSNAMES[idx.item()]}: {prob.item():.2%}")
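The classification step above is a scaled cosine similarity followed by a softmax. The sketch below reproduces it in plain Python on made-up 2-D vectors to show how the 100.0 scale factor sharpens the resulting probabilities. The vectors and the `softmax` helper are illustrative assumptions, not part of open_clip.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up unit-length vectors: one image embedding and two class embeddings.
image_feature = [1.0, 0.0]
class_embeddings = [[0.8, 0.6], [0.6, 0.8]]

# Cosine similarity reduces to a dot product because every vector has unit length.
similarities = [
    sum(a * b for a, b in zip(image_feature, emb)) for emb in class_embeddings
]

unscaled = softmax(similarities)
scaled = softmax([100.0 * s for s in similarities])

print([round(p, 3) for p in unscaled])
print([round(p, 3) for p in scaled])
```

Without scaling, the two classes come out nearly tied even though their similarities differ. With scaling, the better match takes almost all of the probability mass. The value 100.0 is taken from the example above; whether it suits another model depends on how that model was trained.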
Comparing Template Sets
import torch
import open_clip
from open_clip.zero_shot_classifier import build_zero_shot_classifier
from open_clip.zero_shot_metadata import (
    SIMPLE_IMAGENET_TEMPLATES,
    OPENAI_IMAGENET_TEMPLATES,
)
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')
tokenizer = open_clip.get_tokenizer('ViT-B-32')
model = model.to('cuda')
classes = ['cat', 'dog', 'bird']
# Simple templates (7 templates)
simple_classifier = build_zero_shot_classifier(
    model, tokenizer, classes, SIMPLE_IMAGENET_TEMPLATES, device='cuda'
)
# Full templates (80 templates)
full_classifier = build_zero_shot_classifier(
    model, tokenizer, classes, OPENAI_IMAGENET_TEMPLATES, device='cuda'
)
print(f"Simple templates: {len(SIMPLE_IMAGENET_TEMPLATES)}")
print(f"Full templates: {len(OPENAI_IMAGENET_TEMPLATES)}")
# Simple templates: 7
# Full templates: 80
Custom Classes with Pre-defined Templates
import open_clip
from open_clip.zero_shot_classifier import build_zero_shot_classifier
from open_clip.zero_shot_metadata import SIMPLE_IMAGENET_TEMPLATES
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-16', pretrained='openai')
tokenizer = open_clip.get_tokenizer('ViT-B-16')
# Move the model to the same device the classifier is built on
model = model.to('cuda')
# Domain-specific classes with general templates
medical_classes = [
    'chest x-ray',
    'brain MRI',
    'ultrasound',
    'CT scan'
]
classifier = build_zero_shot_classifier(
    model,
    tokenizer,
    medical_classes,
    SIMPLE_IMAGENET_TEMPLATES,  # The same templates can be applied to any class names
    device='cuda'
)
Creating Custom Template Variants
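import open_clip
from open_clip.zero_shot_classifier import build_zero_shot_classifier
from open_clip.zero_shot_metadata import SIMPLE_IMAGENET_TEMPLATES
# Load a model and tokenizer (same setup as the earlier examples)
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')
tokenizer = open_clip.get_tokenizer('ViT-B-32')
model = model.to('cuda')
classnames = ['cat', 'dog', 'bird']
# Convert to list for modification
custom_templates = list(SIMPLE_IMAGENET_TEMPLATES)
# Add domain-specific templates
custom_templates.extend([
    lambda c: f'a high quality photo of a {c}.',
    lambda c: f'a professional photo of a {c}.',
])
print(f"Total templates: {len(custom_templates)}")
# Total templates: 9
classifier = build_zero_shot_classifier(
    model,
    tokenizer,
    classnames,
    custom_templates,
    device='cuda'
)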
Inspecting Template Output
from open_clip.zero_shot_metadata import SIMPLE_IMAGENET_TEMPLATES
# See what prompts are generated for a class
class_name = "cat"
print(f"Prompts for '{class_name}':")
for i, template in enumerate(SIMPLE_IMAGENET_TEMPLATES, 1):
    print(f"{i}. {template(class_name)}")
# Output:
# Prompts for 'cat':
# 1. itap of a cat.
# 2. a bad photo of the cat.
# 3. a origami cat.
# 4. a photo of the large cat.
# 5. a cat in a video game.
# 6. art of the cat.
# 7. a photo of the small cat.
Template Design Notes
Why Multiple Templates?
Using multiple templates improves classification robustness by:
- Handling ambiguity - Different phrasings capture different aspects of a concept
- Averaging out noise - Multiple templates reduce sensitivity to specific wording
- Covering variations - Templates account for different visual presentations (size, quality, style)
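The averaging idea can be sketched without a real model. The example below uses a toy, deterministic text encoder (`toy_encode`, a hypothetical stand-in for a CLIP text tower, not an open_clip function). It assumes the common ensembling recipe: embed every prompt, normalize, average per class, and normalize again.

```python
import hashlib
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def toy_encode(text, dim=8):
    """Hypothetical stand-in for a text encoder: hashes text into a unit vector."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return normalize([b - 127.5 for b in digest[:dim]])

def ensemble_embedding(classname, templates):
    """Average the normalized prompt embeddings for one class, then renormalize."""
    embeddings = [toy_encode(t(classname)) for t in templates]
    mean = [sum(column) / len(embeddings) for column in zip(*embeddings)]
    return normalize(mean)

templates = (
    lambda c: f'a photo of a {c}.',
    lambda c: f'a bad photo of the {c}.',
    lambda c: f'art of the {c}.',
)

cat_vector = ensemble_embedding('cat', templates)
print(len(cat_vector))
```

Under this recipe each class ends up with a single vector whatever the number of templates, so adding templates changes the cost of building the classifier but not its size.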
Template Selection
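- SIMPLE_IMAGENET_TEMPLATES: Use when the classifier must be built quickly and good accuracy is sufficient (7 templates)
- OPENAI_IMAGENET_TEMPLATES: Use when accuracy matters most and the extra text-encoding time is acceptable (80 templates)
- Custom templates: Create domain-specific templates for specialized applications
# Text-encoding cost scales with the number of templates
num_prompts = len(classnames) * len(templates)
# Example:
# 1000 classes × 7 templates = 7,000 text encodings
# 1000 classes × 80 templates = 80,000 text encodings
This cost is paid once, when the classifier is built. The prompt embeddings for each class are averaged into a single vector, so the classifier has the same shape for either template set and per-image inference cost does not depend on the number of templates. SIMPLE_IMAGENET_TEMPLATES is a reasonable default when classifiers are rebuilt often or cover many classes.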
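To size a job before running it, the arithmetic above can be wrapped in a small helper. This is a sketch: `estimate_text_encoding` is a hypothetical name, and it assumes that each batch encodes every template for `num_classes_per_batch` classes in a single forward pass.

```python
import math

def estimate_text_encoding(num_classes, num_templates, num_classes_per_batch=50):
    """Return (total prompts, number of batches, prompts in a full batch)."""
    total_prompts = num_classes * num_templates
    num_batches = math.ceil(num_classes / num_classes_per_batch)
    # A full batch holds every template for up to num_classes_per_batch classes.
    prompts_per_batch = min(num_classes, num_classes_per_batch) * num_templates
    return total_prompts, num_batches, prompts_per_batch

for label, n_templates in (("simple", 7), ("full", 80)):
    total, batches, per_batch = estimate_text_encoding(1000, n_templates)
    print(f"{label}: {total} prompts in {batches} batches of up to {per_batch}")
```

Under that assumption, the prompts-per-batch figure is the one to watch for memory: lowering `num_classes_per_batch` shrinks each forward pass without changing the total number of encodings.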