CoCa Training - OpenCLIP

What is CoCa?

CoCa (Contrastive Captioner) is an extension of CLIP that combines:

Contrastive Learning: Standard CLIP image-text matching
Generative Captioning: Auto-regressive caption generation

This dual objective enables CoCa models to:

Perform zero-shot image classification (like CLIP)
Generate natural language captions for images
Achieve better representations through the combined training signal

Paper: CoCa: Contrastive Captioners are Image-Text Foundation Models
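
To make the contrastive half of the objective concrete, the following is a minimal pure-Python sketch of a symmetric image-text contrastive loss. It is not OpenCLIP's implementation: the function names, the toy two-dimensional embeddings, and the `logit_scale` value are assumptions chosen for illustration.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def contrastive_loss(image_emb, text_emb, logit_scale=100.0):
    """Symmetric cross-entropy over the image-text similarity matrix.

    image_emb[i] and text_emb[i] are assumed to be L2-normalized and to
    form the matching pair; every other combination acts as a negative.
    """
    n = len(image_emb)
    logits = [
        [logit_scale * sum(a * b for a, b in zip(img, txt)) for txt in text_emb]
        for img in image_emb
    ]
    # Image -> text direction: the correct text for image i is column i.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> image direction: transpose the logits and repeat.
    logits_t = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy embeddings (hypothetical): matched pairs versus deliberately swapped pairs.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = contrastive_loss(aligned, aligned)
loss_swapped = contrastive_loss(aligned, swapped)
print(loss_aligned, loss_swapped)
```

The loss is near zero when each image is most similar to its own caption and grows when the pairing is wrong, which is the signal the contrastive objective trains on.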

CoCa Architecture

CoCa adds a multimodal text decoder on top of the standard CLIP architecture:

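Contrastive path:
[Image] → Image Encoder → Image Features ─┐
                                          ├─→ Contrastive Loss
[Text]  → Text Encoder  → Text Features  ─┘

Captioning path:
Image Features ─(cross-attention)─┐
                                  ├─→ Multimodal Decoder → Generated Caption → Caption Loss
Text Encoder token outputs ───────┘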

Key components:

Image Encoder: Same as CLIP (ViT, ResNet, etc.)
Unimodal Text Encoder: Encodes text for contrastive learning
Multimodal Text Decoder: Cross-attends to image features to generate captions
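
The captioning objective on the multimodal decoder is a next-token prediction loss. As a rough sketch, assuming the decoder has already produced one score vector per caption position, the loss can be computed as below. The function name `caption_loss`, the `pad_id` value, and the toy four-token vocabulary are illustrative assumptions, not OpenCLIP code.

```python
import math

def caption_loss(logits, target_ids, pad_id=0):
    """Mean next-token cross-entropy, skipping padded positions.

    logits[t] holds the decoder's scores over the vocabulary at position t,
    computed from the ground-truth tokens before t (teacher forcing).
    target_ids[t] is the token the decoder should predict at position t.
    """
    total, count = 0.0, 0
    for scores, target in zip(logits, target_ids):
        if target == pad_id:
            continue  # padding carries no training signal
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]  # negative log-probability of the target
        count += 1
    return total / max(count, 1)

# Toy vocabulary of 4 tokens; the last target is padding and is ignored.
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
loss_uniform = caption_loss(uniform, [1, 2, 0])
print(loss_uniform)  # equals log(4) for uniform scores
```

Because the targets are known during training, all positions are scored in one pass; token-by-token generation is only needed at inference time.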

Available CoCa Models

OpenCLIP provides several CoCa model configurations:

Model Configs

Model	Image Encoder	Text Encoder	Multimodal Decoder
`coca_base`	ViT-B/16	Transformer	12-layer Transformer
`coca_ViT-B-32`	ViT-B/32	Transformer	12-layer Transformer
`coca_ViT-L-14`	ViT-L/14	Transformer	12-layer Transformer
`coca_roberta-ViT-B-32`	ViT-B/32	RoBERTa	12-layer Transformer

Multimodal Decoder Configuration

Example configuration from coca_ViT-B-32:

"multimodal_cfg": {
    "context_length": 76,
    "vocab_size": 49408,
    "width": 512,
    "heads": 8,
    "layers": 12,
    "latent_dim": 512,
    "attn_pooler_heads": 8
}
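
As a quick sanity check on a decoder configuration like the one above, the sketch below parses the fragment and derives the per-head dimension. The fragment is wrapped in outer braces so that it is valid JSON on its own; the divisibility check reflects the usual multi-head attention convention and is an assumption about what a valid config looks like, not a rule quoted from OpenCLIP.

```python
import json

# Fragment copied from the example above, wrapped in braces so it parses.
config_text = """
{
  "multimodal_cfg": {
    "context_length": 76,
    "vocab_size": 49408,
    "width": 512,
    "heads": 8,
    "layers": 12,
    "latent_dim": 512,
    "attn_pooler_heads": 8
  }
}
"""

cfg = json.loads(config_text)["multimodal_cfg"]

# Multi-head attention normally splits the width evenly across heads.
if cfg["width"] % cfg["heads"] != 0:
    raise ValueError("width must be divisible by heads")
head_dim = cfg["width"] // cfg["heads"]

print(f"{cfg['layers']} layers, {cfg['heads']} heads of dimension {head_dim}")
```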

Training CoCa from Scratch

Basic CoCa Training

Train CoCa with both contrastive and captioning objectives:

python -m open_clip_train.main \
    --model coca_ViT-L-14 \
    --train-data "/data/train.tar" \
    --train-num-samples 10000000 \
    --dataset-type webdataset \
    --batch-size 128 \
    --precision amp \
    --workers 8 \
    --lr 1e-3 \
    --wd 0.1 \
    --warmup 10000 \
    --epochs 32 \
    --coca-contrastive-loss-weight 1.0 \
    --coca-caption-loss-weight 2.0 \
    --report-to wandb

Loss weights:

--coca-contrastive-loss-weight 1.0: Weight for CLIP contrastive loss
--coca-caption-loss-weight 2.0: Weight for caption generation loss
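
Conceptually, the two flags scale each loss term before they are summed. A minimal sketch of that arithmetic follows; the function name and the loss values are placeholders invented for illustration, not measurements from a real run.

```python
def coca_total_loss(contrastive, caption, contrastive_weight=1.0, caption_weight=2.0):
    """Weighted sum of the two CoCa objectives."""
    return contrastive_weight * contrastive + caption_weight * caption

# Placeholder loss values for one training step (hypothetical numbers).
contrastive, caption = 0.8, 2.5

# Pretraining-style weighting: 1.0 * 0.8 + 2.0 * 2.5
pretrain_total = coca_total_loss(contrastive, caption)

# Captioning-only weighting: the contrastive term drops out entirely.
caption_only = coca_total_loss(contrastive, caption, contrastive_weight=0.0, caption_weight=1.0)

print(pretrain_total, caption_only)
```

Setting a weight to 0 removes that term from the total, which is how a captioning-only run can be expressed with the same two flags.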

Multi-GPU CoCa Training

torchrun --nproc_per_node 8 -m open_clip_train.main \
    --model coca_ViT-L-14 \
    --train-data "/data/laion400m/{00000..41455}.tar" \
    --train-num-samples 400000000 \
    --dataset-type webdataset \
    --dataset-resampled \
    --batch-size 128 \
    --precision amp \
    --grad-checkpointing \
    --workers 8 \
    --lr 5e-4 \
    --wd 0.2 \
    --warmup 10000 \
    --epochs 32 \
    --coca-contrastive-loss-weight 1.0 \
    --coca-caption-loss-weight 2.0 \
    --local-loss \
    --gather-with-grad \
    --report-to wandb
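
Since `--batch-size` is applied per GPU, the global batch size and the approximate number of optimizer steps per epoch follow from simple arithmetic. The helper names below are hypothetical, and the exact step count in a real run can differ slightly because of per-worker rounding in the data loader.

```python
import math

def global_batch_size(per_gpu_batch, gpus_per_node, nodes=1):
    """Total samples consumed per optimizer step across all processes."""
    return per_gpu_batch * gpus_per_node * nodes

def steps_per_epoch(num_samples, global_batch):
    """Approximate optimizer steps needed to see num_samples once."""
    return math.ceil(num_samples / global_batch)

# Values taken from the command above: 8 GPUs, batch size 128, 400M samples.
batch = global_batch_size(per_gpu_batch=128, gpus_per_node=8)
steps = steps_per_epoch(400_000_000, batch)

print(batch, steps)
```

Comparing `steps` with the `--warmup` value is a quick way to check that warmup covers only a small fraction of an epoch.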

Fine-tuning CoCa

Fine-tuning on MSCOCO Captions

OpenCLIP provides a pretrained CoCa model that can be fine-tuned for captioning:

python -m open_clip_train.main \
    --model coca_ViT-L-14 \
    --pretrained laion2b_s13b_b90k \
    --dataset-type csv \
    --train-data "/data/mscoco/train2014.csv" \
    --csv-img-key filepath \
    --csv-caption-key title \
    --csv-separator "\t" \
    --warmup 1000 \
    --batch-size 128 \
    --lr 1e-5 \
    --wd 0.1 \
    --epochs 1 \
    --workers 3 \
    --coca-contrastive-loss-weight 0 \
    --coca-caption-loss-weight 1 \
    --report-to wandb \
    --log-every-n-steps 100

Key changes for fine-tuning:

--pretrained laion2b_s13b_b90k: Start from pretrained weights
--lr 1e-5: Lower learning rate for fine-tuning
--epochs 1: Fine-tune for fewer epochs
--coca-contrastive-loss-weight 0: Disable contrastive loss (captioning only)
--coca-caption-loss-weight 1: Optimize the captioning loss only (no layers are frozen; all unlocked parameters still receive gradients from it)

Preparing MSCOCO Data

Create a CSV file with image paths and captions using CLIP_benchmark:

from clip_benchmark.datasets.builder import build_dataset
import pandas as pd
import os

root_path = "path/to/data/dir"  # Set this to your data directory

# Download and load MSCOCO
ds = build_dataset("mscoco_captions", root=root_path, split="train", task="captioning")
coco = ds.coco
imgs = coco.loadImgs(coco.getImgIds())

# Create CSV with all image-caption pairs
future_df = {"filepath": [], "title": []}
for img in imgs:
    caps = coco.imgToAnns[img["id"]]
    for cap in caps:
        future_df["filepath"].append(img["file_name"])
        future_df["title"].append(cap["caption"])

# Save to CSV
pd.DataFrame.from_dict(future_df).to_csv(
    os.path.join(root_path, "train2014.csv"),
    index=False,
    sep="\t"
)

This creates a tab-separated CSV:

filepath\ttitle
COCO_train2014_000000000009.jpg\tA person on a motorcycle on a dirt road
COCO_train2014_000000000009.jpg\tA man riding a motorcycle down a dirt road
...
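
Before starting a training run, it can help to confirm that the file has the two columns named by `--csv-img-key` and `--csv-caption-key` and no empty fields. The checker below is a small sketch using only the standard library; `check_caption_csv` is a hypothetical helper, and the in-memory sample stands in for the real file.

```python
import csv
import io

def check_caption_csv(handle, img_key="filepath", caption_key="title"):
    """Return the number of valid rows; raise ValueError on a malformed file."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = {img_key, caption_key} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = 0
    # Data starts on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        if not row[img_key] or not row[caption_key]:
            raise ValueError(f"empty field on line {line_no}")
        rows += 1
    return rows

# In-memory sample mirroring the rows shown above.
sample = io.StringIO(
    "filepath\ttitle\n"
    "COCO_train2014_000000000009.jpg\tA person on a motorcycle on a dirt road\n"
    "COCO_train2014_000000000009.jpg\tA man riding a motorcycle down a dirt road\n"
)
n_rows = check_caption_csv(sample)
print(n_rows)
```

For a file on disk, pass `open(path, newline="", encoding="utf-8")` instead of the `StringIO` object.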

Generating Captions with CoCa

Basic Caption Generation

import open_clip
import torch
from PIL import Image

# Load pretrained CoCa model and switch to inference mode
model, _, transform = open_clip.create_model_and_transforms(
    model_name="coca_ViT-L-14",
    pretrained="mscoco_finetuned_laion2B-s13B-b90k"
)
model.eval()

# Load and preprocess image
im = Image.open("cat.jpg").convert("RGB")
im = transform(im).unsqueeze(0)

# Generate caption
with torch.no_grad(), torch.cuda.amp.autocast():
    generated = model.generate(im)

# Decode and print
caption = open_clip.decode(generated[0]).split("<end_of_text>")[0].replace("<start_of_text>", "")
print(caption)
# Example output (varies by image): "a cat sitting on a windowsill"
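
The decode-and-strip step above is repeated in several examples, so it may be convenient to wrap it in a helper. The sketch below assumes the decoded string uses the `<start_of_text>` and `<end_of_text>` markers shown above; `clean_caption` is a hypothetical name, and the `raw` string is a made-up stand-in for real decoder output.

```python
def clean_caption(decoded: str) -> str:
    """Strip special tokens and extra whitespace from a decoded caption."""
    # Keep only the text before the first end-of-text marker.
    text = decoded.split("<end_of_text>")[0]
    # Drop the start-of-text marker wherever it appears.
    text = text.replace("<start_of_text>", "")
    # Collapse runs of whitespace and trim the ends.
    return " ".join(text.split())

# Hypothetical decoder output, including trailing padding tokens.
raw = "<start_of_text>a cat sitting on a windowsill <end_of_text><end_of_text>"
cleaned = clean_caption(raw)
print(cleaned)
```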

Batch Caption Generation

import open_clip
import torch
from PIL import Image

model, _, transform = open_clip.create_model_and_transforms(
    model_name="coca_ViT-L-14",
    pretrained="mscoco_finetuned_laion2B-s13B-b90k"
)
model.eval()

# Load multiple images
images = [
    Image.open("cat.jpg").convert("RGB"),
    Image.open("dog.jpg").convert("RGB"),
    Image.open("car.jpg").convert("RGB"),
]

# Preprocess
images_tensor = torch.stack([transform(im) for im in images])

# Generate captions
with torch.no_grad(), torch.cuda.amp.autocast():
    generated = model.generate(images_tensor)

# Decode all captions
for i, gen in enumerate(generated):
    caption = open_clip.decode(gen).split("<end_of_text>")[0].replace("<start_of_text>", "")
    print(f"Image {i}: {caption}")

Advanced Generation Options

# Generate with custom sampling parameters
with torch.no_grad(), torch.cuda.amp.autocast():
    generated = model.generate(
        im,
        generation_type="top_p",  # Sampling mode; the default beam search does not use top_p
        seq_len=77,               # Maximum sequence length
        temperature=1.0,          # Sampling temperature
        top_p=0.9,                # Nucleus sampling threshold
    )
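
To show what the `temperature` and `top_p` parameters control, here is a self-contained sketch of nucleus (top-p) sampling over a single step's logits. It illustrates the general technique and is not OpenCLIP's internal implementation; the function names and toy logits are assumptions.

```python
import math
import random

def top_p_filter(logits, top_p=0.9, temperature=1.0):
    """Return {token_id: prob} for the smallest set whose mass reaches top_p."""
    # Temperature rescales the logits: below 1 sharpens, above 1 flattens.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Walk tokens from most to least likely until the mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}

def sample_token(logits, top_p=0.9, temperature=1.0, rng=random):
    """Draw one token id from the nucleus."""
    nucleus = top_p_filter(logits, top_p, temperature)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]

# Toy distribution over 4 tokens with probabilities 0.5, 0.3, 0.15, 0.05.
logits = [math.log(0.5), math.log(0.3), math.log(0.15), math.log(0.05)]
nucleus = top_p_filter(logits, top_p=0.9)
token = sample_token(logits, top_p=0.9, rng=random.Random(0))
print(sorted(nucleus), token)
```

With `top_p=0.9` the least likely token is excluded, so sampling never picks it; a lower `top_p` narrows the candidate set further.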

CoCa vs CLIP

When to Use CoCa

✅ Use CoCa when:

You need both contrastive and generative capabilities
Image captioning is important for your application
You want richer image-text representations
You have data with detailed captions

When to Use CLIP

✅ Use CLIP when:

You only need contrastive learning (classification, retrieval)
Training speed is critical (CoCa is slower because every step also runs the multimodal decoder)
You have limited compute resources
Your captions are short or simple

Training Time Comparison

Model	Architecture	Training Speed (relative)	Memory Usage (relative)
CLIP ViT-L/14	Dual encoder	1.0×	1.0×
CoCa ViT-L/14	Dual encoder + decoder	0.6×	1.4×

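CoCa is slower because each training step also runs a forward and backward pass through the multimodal decoder. Training uses teacher forcing, so captions are not generated token by token during training.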

Example Training Configurations

Small-Scale CoCa Training

# CoCa ViT-B/32 on CC3M
torchrun --nproc_per_node 4 -m open_clip_train.main \
    --model coca_ViT-B-32 \
    --train-data "/data/cc3m/train-{0000..0331}.tar" \
    --train-num-samples 3000000 \
    --dataset-type webdataset \
    --batch-size 256 \
    --precision amp \
    --workers 4 \
    --lr 1e-3 \
    --wd 0.1 \
    --warmup 5000 \
    --epochs 30 \
    --coca-contrastive-loss-weight 1.0 \
    --coca-caption-loss-weight 2.0 \
    --report-to tensorboard

Large-Scale CoCa Training

# CoCa ViT-L/14 on LAION-400M
srun python -u src/open_clip_train/main.py \
    --model coca_ViT-L-14 \
    --train-data "/data/laion400m/{00000..41455}.tar" \
    --train-num-samples 400000000 \
    --dataset-type webdataset \
    --dataset-resampled \
    --batch-size 128 \
    --precision amp \
    --grad-checkpointing \
    --workers 8 \
    --lr 5e-4 \
    --wd 0.2 \
    --warmup 10000 \
    --epochs 32 \
    --coca-contrastive-loss-weight 1.0 \
    --coca-caption-loss-weight 2.0 \
    --local-loss \
    --gather-with-grad \
    --report-to wandb \
    --remote-sync s3://bucket/coca-checkpoints

CoCa with RoBERTa Text Encoder

# Use RoBERTa for the unimodal text encoder
python -m open_clip_train.main \
    --model coca_roberta-ViT-B-32 \
    --train-data "/data/train.tar" \
    --batch-size 256 \
    --precision amp \
    --workers 4 \
    --lr 5e-4 \
    --wd 0.1 \
    --warmup 2000 \
    --epochs 10 \
    --lock-text \
    --lock-text-unlocked-layers 10 \
    --coca-contrastive-loss-weight 1.0 \
    --coca-caption-loss-weight 2.0

Pretrained CoCa Models

OpenCLIP provides pretrained CoCa models:

import open_clip

# List available CoCa models
models = open_clip.list_pretrained()
coca_models = [m for m in models if 'coca' in m[0].lower()]
for model_name, pretrained in coca_models:
    print(f"{model_name}: {pretrained}")

# Load pretrained CoCa
model, _, preprocess = open_clip.create_model_and_transforms(
    'coca_ViT-L-14',
    pretrained='mscoco_finetuned_laion2B-s13B-b90k'
)

Available pretrained weights:

laion2b_s13b_b90k: Pretrained on LAION-2B
mscoco_finetuned_laion2B-s13B-b90k: LAION-2B pretraining + MSCOCO fine-tuning

Using CoCa for Multiple Tasks

Image Classification (Zero-Shot)

import open_clip
import torch
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    'coca_ViT-L-14',
    pretrained='mscoco_finetuned_laion2B-s13B-b90k'
)
tokenizer = open_clip.get_tokenizer('coca_ViT-L-14')

image = preprocess(Image.open("dog.jpg")).unsqueeze(0)
text = tokenizer(["a dog", "a cat", "a car"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    
print("Probabilities:", similarity)
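
The normalization, scaling, and softmax steps in the zero-shot example can be reproduced on plain lists to see how the probabilities arise. The sketch below uses made-up three-dimensional vectors in place of real embeddings; the function names are hypothetical, and the scale of 100 mirrors the constant used above.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def zero_shot_probs(image_vec, text_vecs, scale=100.0):
    """Softmax over scaled cosine similarities between one image and N texts."""
    img = l2_normalize(image_vec)
    logits = [
        scale * sum(a * b for a, b in zip(img, l2_normalize(t)))
        for t in text_vecs
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy embeddings (hypothetical): the image points mostly along the first axis.
image_vec = [0.9, 0.1, 0.0]
text_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
probs = zero_shot_probs(image_vec, text_vecs)
best = max(range(len(probs)), key=probs.__getitem__)
print(best, probs)
```

The large scale factor makes the softmax sharply peaked, so even modest differences in cosine similarity yield a confident prediction.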

Image Captioning

# Same model, different usage
with torch.no_grad(), torch.cuda.amp.autocast():
    generated = model.generate(image)

caption = open_clip.decode(generated[0]).split("<end_of_text>")[0].replace("<start_of_text>", "")
print("Caption:", caption)

Image-Text Retrieval

# Encode multiple images and texts
images = torch.stack([preprocess(Image.open(f"image_{i}.jpg")) for i in range(10)])
texts = tokenizer([f"caption {i}" for i in range(100)])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(images)
    text_features = model.encode_text(texts)
    
    # Normalize
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    
    # Compute similarity matrix
    similarity = image_features @ text_features.T  # [10, 100]
    
    # Top-5 captions for each image
    top5 = similarity.topk(5, dim=-1)
    print("Top 5 matches:", top5.indices)
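
Once a similarity matrix is available, retrieval quality is commonly summarized with recall@k. The sketch below computes it on a small hand-written matrix; the helper names, the similarity values, and the ground-truth indices are all invented for illustration.

```python
def top_k_indices(row, k):
    """Indices of the k largest values in a row, best first."""
    return sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]

def recall_at_k(similarity, ground_truth, k):
    """Fraction of queries whose correct item appears in the top k."""
    hits = sum(
        1
        for i, row in enumerate(similarity)
        if ground_truth[i] in top_k_indices(row, k)
    )
    return hits / len(similarity)

# Toy image-to-text similarity matrix: 3 images x 4 captions (hypothetical).
similarity = [
    [0.9, 0.2, 0.1, 0.3],
    [0.4, 0.3, 0.8, 0.1],
    [0.2, 0.5, 0.1, 0.6],
]
ground_truth = [0, 2, 1]  # index of the correct caption for each image

r1 = recall_at_k(similarity, ground_truth, k=1)
r2 = recall_at_k(similarity, ground_truth, k=2)
print(r1, r2)
```

Here the third image's correct caption ranks second, so it counts as a miss at k=1 and a hit at k=2.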

Tips for Training CoCa

Training tips:

Start with contrastive-only training, then add caption loss gradually
Use higher weight for caption loss (2.0 vs 1.0 for contrastive)
Fine-tune on high-quality caption datasets (MSCOCO) for best generation
Use gradient checkpointing for memory efficiency with large models
CoCa trains more slowly than CLIP because of the extra decoder pass; expect roughly 40-60% of CLIP training speed
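
One way to read the first tip is as a schedule that ramps the caption loss weight from zero up to its final value. The sketch below shows a linear ramp; `caption_weight_at` and `ramp_steps` are hypothetical names. The flags shown in this guide set fixed weights, so applying a schedule like this would require changes to the training loop.

```python
def caption_weight_at(step, ramp_steps, final_weight=2.0):
    """Linearly ramp the caption loss weight from 0 to final_weight."""
    if ramp_steps <= 0:
        return final_weight  # no ramp requested
    return final_weight * min(step / ramp_steps, 1.0)

# Weight at the start, midpoint, end of the ramp, and well past it.
schedule = [caption_weight_at(s, ramp_steps=10_000) for s in (0, 5_000, 10_000, 20_000)]
print(schedule)
```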

Common issues:

CoCa requires more memory due to the multimodal decoder
Gradient accumulation is not compatible with distillation for CoCa
Caption quality depends heavily on training data quality
Very short captions may not benefit from the generative objective

Credits

CoCa implementation in OpenCLIP:

Initial implementation: lucidrains
Adaptation to OpenCLIP: gpucce
Training: iejMac

Next Steps

Training Overview: Learn about general CLIP training
Fine-tuning: Fine-tune CoCa models on custom datasets
Configuration: Explore all CoCa training parameters
Inference: Use pretrained CoCa for captioning and classification
