What is CoCa?
CoCa (Contrastive Captioner) is an extension of CLIP that combines:
Contrastive Learning: Standard CLIP image-text matching
Generative Captioning: Auto-regressive caption generation
This dual objective enables CoCa models to:
Perform zero-shot image classification (like CLIP)
Generate natural language captions for images
Achieve better representations through the combined training signal
Paper: CoCa: Contrastive Captioners are Image-Text Foundation Models
CoCa Architecture
CoCa adds a multimodal text decoder on top of the standard CLIP architecture:
[Image] → Image Encoder → Image Features →┬─ Contrastive Loss
                                          │
[Text]  → Text Encoder  → Text Features  ─┴─ Caption Loss
                                          │
                                          └─ Multimodal Decoder → Generated Caption
Key components:
Image Encoder: Same as CLIP (ViT, ResNet, etc.)
Unimodal Text Encoder: Encodes text for contrastive learning
Multimodal Text Decoder: Cross-attends to image features to generate captions
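To make the cross-attention step concrete, here is a minimal single-head sketch in plain Python. It is illustrative only: a real decoder applies learned query/key/value projections, multiple heads, and causal self-attention, none of which are shown, and the function names are made up for this example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(text_queries, image_tokens):
    """Each text position mixes image tokens, weighted by scaled dot-product similarity.

    Hypothetical simplification: queries, keys, and values are used as-is,
    without the learned projection matrices a real decoder applies.
    """
    dim = len(image_tokens[0])
    outputs = []
    for q in text_queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in image_tokens]
        weights = softmax(scores)
        mixed = [sum(w * tok[d] for w, tok in zip(weights, image_tokens)) for d in range(dim)]
        outputs.append(mixed)
    return outputs

image_tokens = [[1.0, 0.0], [0.0, 1.0]]   # two toy image patch embeddings
text_queries = [[4.0, 0.0], [0.0, 4.0]]   # two toy text positions
attended = cross_attend(text_queries, image_tokens)
print(attended)
```

Each text position ends up dominated by the image token it is most similar to, which is how the decoder conditions caption tokens on visual content.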
Available CoCa Models
OpenCLIP provides several CoCa model configurations:
Model Configs
| Model | Image Encoder | Text Encoder | Multimodal Decoder |
| --- | --- | --- | --- |
| coca_base | ViT-B/16 | Transformer | 12-layer Transformer |
| coca_ViT-B-32 | ViT-B/32 | Transformer | 12-layer Transformer |
| coca_ViT-L-14 | ViT-L/14 | Transformer | 12-layer Transformer |
| coca_roberta-ViT-B-32 | ViT-B/32 | RoBERTa | 12-layer Transformer |
Multimodal Decoder Configuration
Example configuration from coca_ViT-B-32:
"multimodal_cfg": {
    "context_length": 76,
    "vocab_size": 49408,
    "width": 512,
    "heads": 8,
    "layers": 12,
    "latent_dim": 512,
    "attn_pooler_heads": 8
}
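As a quick sanity check on a configuration like this, the sketch below parses the same fields and derives the per-head dimension. The surrounding braces and the `config_text` variable are added here only so the fragment parses as standalone JSON; this is not how OpenCLIP itself loads configs.

```python
import json

# Same fields as the multimodal_cfg fragment above, wrapped in braces so it parses as JSON.
config_text = """
{
    "multimodal_cfg": {
        "context_length": 76,
        "vocab_size": 49408,
        "width": 512,
        "heads": 8,
        "layers": 12,
        "latent_dim": 512,
        "attn_pooler_heads": 8
    }
}
"""

cfg = json.loads(config_text)["multimodal_cfg"]

# Multi-head attention splits the model width evenly across heads.
assert cfg["width"] % cfg["heads"] == 0, "width must be divisible by heads"
head_dim = cfg["width"] // cfg["heads"]

print(f"decoder layers: {cfg['layers']}, per-head dim: {head_dim}")
```

If you change `width` or `heads` in a custom config, the divisibility check is worth keeping in mind.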
Training CoCa from Scratch
Basic CoCa Training
Train CoCa with both contrastive and captioning objectives:
python -m open_clip_train.main \
--model coca_ViT-L-14 \
--train-data "/data/train.tar" \
--train-num-samples 10000000 \
--dataset-type webdataset \
--batch-size 128 \
--precision amp \
--workers 8 \
--lr 1e-3 \
--wd 0.1 \
--warmup 10000 \
--epochs 32 \
--coca-contrastive-loss-weight 1.0 \
--coca-caption-loss-weight 2.0 \
--report-to wandb
Loss weights:
--coca-contrastive-loss-weight 1.0: Weight for CLIP contrastive loss
--coca-caption-loss-weight 2.0: Weight for caption generation loss
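The two weights scale the two loss terms before they are summed. The following plain-Python sketch shows the idea on toy numbers. It assumes the total is a simple weighted sum, and the helper names are hypothetical rather than OpenCLIP functions.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against an integer class index."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

def contrastive_loss(sim):
    """Symmetric image-to-text and text-to-image loss; matching pairs lie on the diagonal."""
    n = len(sim)
    i2t = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    t2i = sum(cross_entropy([sim[r][j] for r in range(n)], j) for j in range(n)) / n
    return 0.5 * (i2t + t2i)

def caption_loss(token_logits, token_targets):
    """Mean next-token cross-entropy over caption positions."""
    losses = [cross_entropy(l, t) for l, t in zip(token_logits, token_targets)]
    return sum(losses) / len(losses)

def coca_total_loss(sim, token_logits, token_targets,
                    contrastive_weight=1.0, caption_weight=2.0):
    """Weighted sum of the two objectives (assumed form, for illustration)."""
    return (contrastive_weight * contrastive_loss(sim)
            + caption_weight * caption_loss(token_logits, token_targets))

# Toy batch: 2 image-text pairs, and 2 caption positions over a 3-token vocabulary.
sim = [[5.0, 0.0], [0.0, 5.0]]
token_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
token_targets = [0, 1]

total = coca_total_loss(sim, token_logits, token_targets)
print(f"total loss: {total:.4f}")
```

Setting `contrastive_weight=0` reduces the total to the captioning term alone, which is the setup used for captioning-only fine-tuning.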
Multi-GPU CoCa Training
torchrun --nproc_per_node 8 -m open_clip_train.main \
--model coca_ViT-L-14 \
--train-data "/data/laion400m/{00000..41455}.tar" \
--train-num-samples 400000000 \
--dataset-type webdataset \
--dataset-resampled \
--batch-size 128 \
--precision amp \
--grad-checkpointing \
--workers 8 \
--lr 5e-4 \
--wd 0.2 \
--warmup 10000 \
--epochs 32 \
--coca-contrastive-loss-weight 1.0 \
--coca-caption-loss-weight 2.0 \
--local-loss \
--gather-with-grad \
--report-to wandb
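It can help to work out the schedule this command implies. The sketch below assumes `--batch-size` is the per-GPU batch size and ignores any rounding the data loader applies per worker, so treat the step counts as estimates.

```python
gpus = 8                          # --nproc_per_node
per_gpu_batch = 128               # --batch-size (assumed to be per GPU)
samples_per_epoch = 400_000_000   # --train-num-samples
epochs = 32                       # --epochs
warmup_steps = 10_000             # --warmup

global_batch = gpus * per_gpu_batch
steps_per_epoch = samples_per_epoch // global_batch
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

print(f"global batch size: {global_batch}")
print(f"steps per epoch:   {steps_per_epoch}")
print(f"total steps:       {total_steps}")
print(f"warmup fraction:   {warmup_fraction:.4%}")
```

With these settings the warmup covers only a small fraction of training, so the learning rate spends almost the whole run in the decay phase.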
Fine-tuning CoCa
Fine-tuning on MSCOCO Captions
OpenCLIP provides a pretrained CoCa model that can be fine-tuned for captioning:
python -m open_clip_train.main \
--model coca_ViT-L-14 \
--pretrained laion2b_s13b_b90k \
--dataset-type csv \
--train-data "/data/mscoco/train2014.csv" \
--csv-img-key filepath \
--csv-caption-key title \
--csv-separator "\t" \
--warmup 1000 \
--batch-size 128 \
--lr 1e-5 \
--wd 0.1 \
--epochs 1 \
--workers 3 \
--coca-contrastive-loss-weight 0 \
--coca-caption-loss-weight 1 \
--report-to wandb \
--log-every-n-steps 100
Key changes for fine-tuning:
--pretrained laion2b_s13b_b90k: Start from pretrained weights
--lr 1e-5: Lower learning rate for fine-tuning
--epochs 1: Fine-tune for fewer epochs
--coca-contrastive-loss-weight 0: Disable contrastive loss (captioning only)
--coca-caption-loss-weight 1: Optimize the captioning loss only
Preparing MSCOCO Data
Create a CSV file with image paths and captions using CLIP_benchmark :
from clip_benchmark.datasets.builder import build_dataset
import pandas as pd
import os

root_path = "path/to/data/dir"  # Set this to your data directory

# Download and load MSCOCO
ds = build_dataset("mscoco_captions", root=root_path, split="train", task="captioning")
coco = ds.coco
imgs = coco.loadImgs(coco.getImgIds())

# Create CSV with all image-caption pairs
future_df = {"filepath": [], "title": []}
for img in imgs:
    caps = coco.imgToAnns[img["id"]]
    for cap in caps:
        future_df["filepath"].append(img["file_name"])
        future_df["title"].append(cap["caption"])

# Save to CSV
pd.DataFrame.from_dict(future_df).to_csv(
    os.path.join(root_path, "train2014.csv"),
    index=False,
    sep="\t"
)
This creates a tab-separated CSV:
filepath\ttitle
COCO_train2014_000000000009.jpg\tA person on a motorcycle on a dirt road
COCO_train2014_000000000009.jpg\tA man riding a motorcycle down a dirt road
...
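Before launching a run, it is worth checking that the file parses with the separator and column names you plan to pass on the command line. This is a small standard-library sketch: the in-memory `tsv_text` stands in for the real file, and in practice you would open `train2014.csv` instead.

```python
import csv
import io

# Stand-in for the contents of train2014.csv (tab-separated, header row first).
tsv_text = (
    "filepath\ttitle\n"
    "COCO_train2014_000000000009.jpg\tA person on a motorcycle on a dirt road\n"
    "COCO_train2014_000000000009.jpg\tA man riding a motorcycle down a dirt road\n"
)

reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
rows = list(reader)

# The column names must match --csv-img-key and --csv-caption-key.
assert reader.fieldnames == ["filepath", "title"]

# Group captions by image; MSCOCO has several captions per image.
captions_per_image = {}
for row in rows:
    captions_per_image.setdefault(row["filepath"], []).append(row["title"])

for path, captions in captions_per_image.items():
    print(f"{path}: {len(captions)} captions")
```

Each image appears once per caption, so the number of rows is larger than the number of images.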
Generating Captions with CoCa
Basic Caption Generation
import open_clip
import torch
from PIL import Image
# Load pretrained CoCa model
model, _, transform = open_clip.create_model_and_transforms(
    model_name="coca_ViT-L-14",
    pretrained="mscoco_finetuned_laion2B-s13B-b90k"
)

# Load and preprocess image
im = Image.open("cat.jpg").convert("RGB")
im = transform(im).unsqueeze(0)

# Generate caption
with torch.no_grad(), torch.cuda.amp.autocast():
    generated = model.generate(im)

# Decode and print
caption = open_clip.decode(generated[0]).split("<end_of_text>")[0].replace("<start_of_text>", "")
print(caption)
# Output: "a cat sitting on a windowsill"
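The decode-and-strip step recurs in every captioning example, so it can be convenient to wrap it in a helper. This is a minimal sketch: `clean_caption` is a hypothetical name, the `raw` string is a made-up stand-in for decoder output, and the final `strip()` is an addition to the one-liner above.

```python
def clean_caption(decoded: str) -> str:
    """Keep the text before the first end marker and drop the start marker."""
    text = decoded.split("<end_of_text>")[0]
    return text.replace("<start_of_text>", "").strip()

# Hypothetical decoder output, including padding after the end marker.
raw = "<start_of_text>a cat sitting on a windowsill <end_of_text>!!!!"
print(clean_caption(raw))
```

With the model loaded, this would be called as `clean_caption(open_clip.decode(generated[0]))`.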
Batch Caption Generation
import open_clip
import torch
from PIL import Image
model, _, transform = open_clip.create_model_and_transforms(
    model_name="coca_ViT-L-14",
    pretrained="mscoco_finetuned_laion2B-s13B-b90k"
)
model.eval()

# Load multiple images
images = [
    Image.open("cat.jpg").convert("RGB"),
    Image.open("dog.jpg").convert("RGB"),
    Image.open("car.jpg").convert("RGB"),
]

# Preprocess
images_tensor = torch.stack([transform(im) for im in images])

# Generate captions
with torch.no_grad(), torch.cuda.amp.autocast():
    generated = model.generate(images_tensor)

# Decode all captions
for i, gen in enumerate(generated):
    caption = open_clip.decode(gen).split("<end_of_text>")[0].replace("<start_of_text>", "")
    print(f"Image {i}: {caption}")
Advanced Generation Options
# Generate with custom parameters
with torch.no_grad(), torch.cuda.amp.autocast():
    generated = model.generate(
        im,
        generation_type="top_p",  # temperature and top_p apply to sampling, not the default beam search
        seq_len=77,               # Maximum sequence length
        temperature=1.0,          # Sampling temperature
        top_p=0.9,                # Nucleus sampling threshold
    )
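For intuition about what `temperature` and `top_p` control, here is a plain-Python sketch of nucleus sampling over one step's logits. It illustrates the general technique and may differ in detail from the logits processing the library uses; all function names here are made up.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}

def sample_token(logits, temperature=1.0, top_p=0.9, rng=random):
    """Sample one token id from the nucleus of the distribution."""
    filtered = top_p_filter(softmax(logits, temperature), top_p)
    tokens = list(filtered)
    return rng.choices(tokens, weights=[filtered[t] for t in tokens], k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]
nucleus = top_p_filter(probs, top_p=0.9)
print(sorted(nucleus))  # token ids that survive the filter
```

A smaller `top_p` shrinks the candidate set and makes captions more conservative, while a larger value allows more varied wording.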
CoCa vs CLIP
When to Use CoCa
✅ Use CoCa when:
You need both contrastive and generative capabilities
Image captioning is important for your application
You want richer image-text representations
You have data with detailed captions
When to Use CLIP
✅ Use CLIP when:
You only need contrastive learning (classification, retrieval)
Training speed is critical (CoCa is slower due to caption generation)
You have limited compute resources
Your captions are short or simple
Training Time Comparison
| Model | Architecture | Training Speed (relative) | Memory Usage (relative) |
| --- | --- | --- | --- |
| CLIP ViT-L/14 | Dual encoder | 1.0× | 1.0× |
| CoCa ViT-L/14 | Dual encoder + decoder | 0.6× | 1.4× |
CoCa is slower because every training step also runs the multimodal decoder and computes the captioning loss on top of the contrastive loss.
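A relative speed translates into wall-clock time by simple division. The sketch below uses the 0.6× figure from the table and a made-up 100-hour CLIP baseline, so the result is only a rough planning estimate.

```python
def estimated_hours(clip_hours: float, relative_speed: float) -> float:
    """Wall-clock time scales inversely with relative training speed."""
    return clip_hours / relative_speed

clip_hours = 100.0  # hypothetical CLIP ViT-L/14 run length
coca_hours = estimated_hours(clip_hours, relative_speed=0.6)
print(f"CoCa estimate: {coca_hours:.0f} hours")
```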
Example Training Configurations
Small-Scale CoCa Training
# CoCa ViT-B/32 on CC3M
torchrun --nproc_per_node 4 -m open_clip_train.main \
--model coca_ViT-B-32 \
--train-data "/data/cc3m/train-{0000..0331}.tar" \
--train-num-samples 3000000 \
--dataset-type webdataset \
--batch-size 256 \
--precision amp \
--workers 4 \
--lr 1e-3 \
--wd 0.1 \
--warmup 5000 \
--epochs 30 \
--coca-contrastive-loss-weight 1.0 \
--coca-caption-loss-weight 2.0 \
--report-to tensorboard
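To see how `--lr`, `--warmup`, and `--epochs` interact, the sketch below evaluates a linear-warmup, cosine-decay schedule at a few steps. It assumes a cosine schedule and a per-GPU `--batch-size`; the function is written for illustration and the step count is an estimate.

```python
import math

def cosine_lr_with_warmup(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress)) * base_lr

# Values from the CC3M command above; the step count is an estimate.
global_batch = 4 * 256                       # GPUs x per-GPU batch size
steps_per_epoch = 3_000_000 // global_batch
total_steps = steps_per_epoch * 30

for step in (0, 4_999, 5_000, total_steps // 2, total_steps - 1):
    lr = cosine_lr_with_warmup(step, base_lr=1e-3, warmup_steps=5_000, total_steps=total_steps)
    print(f"step {step:>6}: lr = {lr:.2e}")
```

On a small dataset the warmup takes a noticeable share of the run, which is one reason this configuration uses fewer warmup steps than the large-scale one.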
Large-Scale CoCa Training
# CoCa ViT-L/14 on LAION-400M
srun python -u src/open_clip_train/main.py \
--model coca_ViT-L-14 \
--train-data "/data/laion400m/{00000..41455}.tar" \
--train-num-samples 400000000 \
--dataset-type webdataset \
--dataset-resampled \
--batch-size 128 \
--precision amp \
--grad-checkpointing \
--workers 8 \
--lr 5e-4 \
--wd 0.2 \
--warmup 10000 \
--epochs 32 \
--coca-contrastive-loss-weight 1.0 \
--coca-caption-loss-weight 2.0 \
--local-loss \
--gather-with-grad \
--report-to wandb \
--remote-sync s3://bucket/coca-checkpoints
CoCa with RoBERTa Text Encoder
# Use RoBERTa for the unimodal text encoder
python -m open_clip_train.main \
--model coca_roberta-ViT-B-32 \
--train-data "/data/train.tar" \
--batch-size 256 \
--precision amp \
--workers 4 \
--lr 5e-4 \
--wd 0.1 \
--warmup 2000 \
--epochs 10 \
--lock-text \
--lock-text-unlocked-layers 10 \
--coca-contrastive-loss-weight 1.0 \
--coca-caption-loss-weight 2.0
Pretrained CoCa Models
OpenCLIP provides pretrained CoCa models:
import open_clip
# List available CoCa models
models = open_clip.list_pretrained()
coca_models = [m for m in models if 'coca' in m[0].lower()]
for model_name, pretrained in coca_models:
    print(f"{model_name}: {pretrained}")

# Load pretrained CoCa
model, _, preprocess = open_clip.create_model_and_transforms(
    'coca_ViT-L-14',
    pretrained='mscoco_finetuned_laion2B-s13B-b90k'
)
Available pretrained weights:
laion2b_s13b_b90k: Pretrained on LAION-2B
mscoco_finetuned_laion2B-s13B-b90k: LAION-2B pretraining + MSCOCO fine-tuning
Using CoCa for Multiple Tasks
Image Classification (Zero-Shot)
import open_clip
import torch
from PIL import Image
model, _, preprocess = open_clip.create_model_and_transforms(
    'coca_ViT-L-14',
    pretrained='mscoco_finetuned_laion2B-s13B-b90k'
)
tokenizer = open_clip.get_tokenizer('coca_ViT-L-14')

image = preprocess(Image.open("dog.jpg")).unsqueeze(0)
text = tokenizer(["a dog", "a cat", "a car"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Probabilities:", similarity)
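The last three lines of the block above (normalize, scale by 100, softmax) can be followed by hand on toy vectors. This plain-Python sketch uses made-up 3-dimensional embeddings in place of real model features.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def zero_shot_probs(image_feature, text_features, scale=100.0):
    """Softmax over scaled cosine similarities between one image and several prompts."""
    img = l2_normalize(image_feature)
    logits = [scale * sum(a * b for a, b in zip(img, l2_normalize(t))) for t in text_features]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 3-dimensional embeddings (made up for illustration).
image_feature = [0.9, 0.1, 0.0]
text_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
labels = ["a dog", "a cat", "a car"]

probs = zero_shot_probs(image_feature, text_features)
best = labels[probs.index(max(probs))]
print(best, [round(p, 4) for p in probs])
```

The factor of 100 sharpens the softmax considerably, so even modest differences in cosine similarity produce near one-hot probabilities.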
Image Captioning
# Same model, different usage
with torch.no_grad(), torch.cuda.amp.autocast():
    generated = model.generate(image)

caption = open_clip.decode(generated[0]).split("<end_of_text>")[0].replace("<start_of_text>", "")
print("Caption:", caption)
Image-Text Retrieval
# Encode multiple images and texts
images = torch.stack([preprocess(Image.open(f"image_{i}.jpg")) for i in range(10)])
texts = tokenizer([f"caption {i}" for i in range(100)])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(images)
    text_features = model.encode_text(texts)

    # Normalize
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Compute similarity matrix
    similarity = image_features @ text_features.T  # [10, 100]

    # Top-5 captions for each image
    top5 = similarity.topk(5, dim=-1)

print("Top 5 matches:", top5.indices)
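Once you have a similarity matrix, retrieval quality is commonly summarized as Recall@K. The following plain-Python sketch computes it on a toy matrix. The similarity values and ground-truth indices are made up, and the helper names are hypothetical.

```python
def top_k_indices(row, k):
    """Indices of the k largest scores in one row, best first."""
    return sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]

def recall_at_k(similarity, ground_truth, k):
    """Fraction of images whose correct caption index appears in the top k."""
    hits = sum(1 for row, gt in zip(similarity, ground_truth) if gt in top_k_indices(row, k))
    return hits / len(similarity)

# Toy similarity matrix: 3 images x 4 captions (values made up).
similarity = [
    [0.9, 0.1, 0.3, 0.2],
    [0.2, 0.4, 0.8, 0.1],
    [0.5, 0.6, 0.1, 0.7],
]
ground_truth = [0, 2, 1]  # hypothetical index of the correct caption for each image

for k in (1, 2):
    print(f"Recall@{k}: {recall_at_k(similarity, ground_truth, k):.2f}")
```

With real features, `similarity` would be the matrix computed above converted to nested lists, for example via `similarity.tolist()`.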
Tips for Training CoCa
Training tips:
Start with contrastive-only training, then add caption loss gradually
Use higher weight for caption loss (2.0 vs 1.0 for contrastive)
Fine-tune on high-quality caption datasets (MSCOCO) for best generation
Use gradient checkpointing for memory efficiency with large models
CoCa training is slower than CLIP training: expect roughly 40-60% of CLIP training speed
Common issues:
CoCa requires more memory due to the multimodal decoder
Distillation is not supported for CoCa models, and distillation cannot be combined with gradient accumulation
Caption quality depends heavily on training data quality
Very short captions may not benefit from the generative objective
Credits
CoCa implementation in OpenCLIP:
Next Steps
Training Overview: Learn about general CLIP training
Fine-tuning: Fine-tune CoCa models on custom datasets
Configuration: Explore all CoCa training parameters
Inference: Use pretrained CoCa for captioning and classification