OpenCLIP supports custom model architectures through JSON configuration files and flexible model-building APIs. You can define custom vision and text encoders, or use a pre-trained HuggingFace model as the text encoder.

Model Configuration Files

Model architectures are defined in JSON configuration files located in src/open_clip/model_configs/. Each config file specifies the model’s architecture parameters.

Basic Model Config Structure

{
    "embed_dim": 512,
    "vision_cfg": {
        "image_size": 224,
        "layers": 12,
        "width": 768,
        "patch_size": 32
    },
    "text_cfg": {
        "context_length": 77,
        "vocab_size": 49408,
        "width": 512,
        "heads": 8,
        "layers": 12
    }
}

Key Parameters

  • embed_dim: Dimension of the joint embedding space into which image and text features are projected
  • vision_cfg: Configuration for the vision encoder
    • image_size: Input image resolution in pixels
    • layers: Number of transformer layers
    • width: Hidden dimension size
    • patch_size: Side length, in pixels, of each square image patch (Vision Transformer only)
  • text_cfg: Configuration for the text encoder
    • context_length: Maximum text sequence length
    • vocab_size: Size of the vocabulary
    • width: Hidden dimension size
    • heads: Number of attention heads
    • layers: Number of transformer layers
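
The relationship between image_size, patch_size, and the length of the sequence the vision transformer processes can be checked with a few lines of arithmetic. The sketch below is illustrative and uses only the standard library; vit_patch_grid is a hypothetical helper, not part of OpenCLIP, and it assumes square images, square patches, and a single class token.

```python
def vit_patch_grid(image_size: int, patch_size: int) -> dict:
    """Return patch-grid statistics for a square image split into square patches."""
    # This sketch rejects sizes that do not tile evenly, purely for simplicity.
    if image_size % patch_size != 0:
        raise ValueError("image_size should be divisible by patch_size")
    side = image_size // patch_size
    num_patches = side * side
    return {
        "grid": (side, side),
        "num_patches": num_patches,
        # Assumption: one class token is prepended to the patch tokens.
        "sequence_length": num_patches + 1,
    }


# Values taken from the basic config shown above.
vision_cfg = {"image_size": 224, "layers": 12, "width": 768, "patch_size": 32}
stats = vit_patch_grid(vision_cfg["image_size"], vision_cfg["patch_size"])
print(stats)
```

Halving patch_size to 16 quadruples the number of patches, which is the main reason smaller patches cost more compute at the same image_size.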

Adding Custom Model Configs

Register your own model configurations with the add_model_config() function. The config file name without the .json extension becomes the model name:
import open_clip
from pathlib import Path

# Add a directory containing model config JSON files
open_clip.add_model_config(Path('/path/to/model_configs/'))

# Or add a single config file
open_clip.add_model_config(Path('/path/to/my_model.json'))

# Now you can use your custom model
model, _, preprocess = open_clip.create_model_and_transforms(
    'my_model',
    pretrained=None
)
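
Before registering a file, it can help to confirm that it contains the three top-level keys used throughout this page (embed_dim, vision_cfg, text_cfg) and to see which model name it would produce. The following is a minimal standard-library sketch; check_config_file is a hypothetical helper, and the assumption that the model name equals the file stem is based on the example above, where my_model.json is loaded as 'my_model'.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = ("embed_dim", "vision_cfg", "text_cfg")


def check_config_file(path: Path) -> str:
    """Validate a model config file and return the name it would register under."""
    with open(path, "r") as f:
        cfg = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in cfg]
    if missing:
        raise ValueError(f"{path.name} is missing keys: {missing}")
    # Assumption: the registered model name is the file name without '.json'.
    return path.stem


# Usage: write a small config to a temporary directory and check it.
with tempfile.TemporaryDirectory() as tmp:
    cfg_path = Path(tmp) / "my_model.json"
    cfg_path.write_text(json.dumps({
        "embed_dim": 512,
        "vision_cfg": {"image_size": 224, "layers": 12, "width": 768, "patch_size": 32},
        "text_cfg": {"context_length": 77, "vocab_size": 49408,
                     "width": 512, "heads": 8, "layers": 12},
    }))
    model_name = check_config_file(cfg_path)
    print(model_name)
```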

Using HuggingFace Models as Text Encoders

OpenCLIP can use a HuggingFace transformer model as the text encoder, which lets you start from a pre-trained or multilingual language model.

HuggingFace Text Encoder Config

{
    "embed_dim": 512,
    "quick_gelu": true,
    "vision_cfg": {
        "image_size": 224,
        "layers": 12,
        "width": 768,
        "patch_size": 32
    },
    "text_cfg": {
        "hf_model_name": "roberta-base",
        "hf_tokenizer_name": "roberta-base",
        "hf_pooler_type": "mean_pooler"
    }
}
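
A config like the one above can also be derived from an existing config by replacing only the text_cfg block. This sketch works on plain dictionaries with the standard library; with_hf_text_encoder is a hypothetical helper, and the key names are the ones shown in the example config.

```python
import copy
import json


def with_hf_text_encoder(base_cfg: dict, hf_name: str,
                         pooler: str = "mean_pooler") -> dict:
    """Return a copy of base_cfg whose text tower is a HuggingFace model."""
    cfg = copy.deepcopy(base_cfg)  # leave the caller's config untouched
    cfg["text_cfg"] = {
        "hf_model_name": hf_name,
        # Assumption: the model and its tokenizer share the same hub name.
        "hf_tokenizer_name": hf_name,
        "hf_pooler_type": pooler,
    }
    return cfg


base = {
    "embed_dim": 512,
    "vision_cfg": {"image_size": 224, "layers": 12, "width": 768, "patch_size": 32},
    "text_cfg": {"context_length": 77, "vocab_size": 49408,
                 "width": 512, "heads": 8, "layers": 12},
}
hf_cfg = with_hf_text_encoder(base, "roberta-base")
print(json.dumps(hf_cfg, indent=4))
```

Other top-level keys, such as quick_gelu, are carried over unchanged from the base config, so set them there if you need them.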

Training with HuggingFace Text Encoder

When training with a HuggingFace model as the text encoder, use the --hf-tokenizer-name parameter to specify the tokenizer:
python -m open_clip_train.main \
    --model "roberta-ViT-B-32" \
    --hf-tokenizer-name "roberta-base" \
    --train-data "/path/to/train_data.tar" \
    --dataset-type webdataset \
    --batch-size 256 \
    --epochs 10 \
    --lr 5e-4

Freezing and Unfreezing Layers

You can control which layers of the text encoder are trainable:
python -m open_clip_train.main \
    --model "roberta-ViT-B-32" \
    --hf-tokenizer-name "roberta-base" \
    --lock-text \
    --lock-text-unlocked-layers 10 \
    --train-data "/path/to/train_data.tar" \
    --batch-size 256 \
    --epochs 10
Parameters:
  • --lock-text: Freeze the entire text encoder
  • --lock-text-unlocked-layers N: Leave the last N layer groups unfrozen for fine-tuning
  • --lock-text-freeze-layer-norm: Also freeze the LayerNorm parameters in locked layers (LayerNorm keeps no running statistics, so this affects its learnable scale and bias)
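
To make the effect of --lock-text-unlocked-layers concrete, the sketch below lists which layer groups would stay frozen for a 12-layer encoder such as roberta-base. It is a simplified model, not OpenCLIP's own code: it assumes the groups are the embedding block followed by each transformer layer in order, and that the last N groups are left trainable. The function and group names are hypothetical.

```python
def split_layer_groups(num_layers: int, unlocked: int):
    """Return (frozen, trainable) layer-group names for a locked text encoder."""
    # Assumption: group 0 is the embedding block, followed by the encoder layers.
    groups = ["embeddings"] + [f"layer_{i}" for i in range(num_layers)]
    if unlocked <= 0:
        # Nothing unlocked: every group stays frozen.
        return groups, []
    unlocked = min(unlocked, len(groups))
    return groups[:-unlocked], groups[-unlocked:]


frozen, trainable = split_layer_groups(num_layers=12, unlocked=10)
print("frozen:", frozen)
print("trainable:", trainable)
```

Under these assumptions, unlocking 10 groups of a 12-layer encoder leaves only the embeddings and the first two layers frozen.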

Custom Vision Architectures

OpenCLIP supports several vision encoder architectures, selected by the keys set in vision_cfg:

Vision Transformer (ViT)

Standard Vision Transformer configuration:
{
    "vision_cfg": {
        "image_size": 224,
        "layers": 12,
        "width": 768,
        "patch_size": 16,
        "head_width": 64
    }
}

ConvNeXt

ConvNeXt and other timm models can serve as the vision encoder by setting timm_model_name:
{
    "vision_cfg": {
        "timm_model_name": "convnext_base",
        "timm_model_pretrained": true,
        "timm_pool": "avg",
        "timm_proj": "linear",
        "image_size": 256
    }
}
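
The two configs above differ in which keys they set, and that is what separates a native Vision Transformer from a timm backbone. The sketch below mirrors that distinction with a small standard-library helper; vision_tower_kind is hypothetical, and treating the presence of timm_model_name as the deciding factor is an assumption drawn from the examples on this page.

```python
def vision_tower_kind(vision_cfg: dict) -> str:
    """Classify a vision_cfg block as a timm backbone or a native ViT."""
    # Assumption: a non-empty timm_model_name selects a timm backbone.
    if vision_cfg.get("timm_model_name"):
        return "timm"
    # Otherwise expect the ViT-style keys used in the examples above.
    for key in ("layers", "width", "patch_size"):
        if key not in vision_cfg:
            raise KeyError(f"ViT vision_cfg is missing '{key}'")
    return "vit"


vit_cfg = {"image_size": 224, "layers": 12, "width": 768,
           "patch_size": 16, "head_width": 64}
convnext_cfg = {"timm_model_name": "convnext_base", "timm_model_pretrained": True,
                "timm_pool": "avg", "timm_proj": "linear", "image_size": 256}

print(vision_tower_kind(vit_cfg))
print(vision_tower_kind(convnext_cfg))
```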

Creating Models Programmatically

You can also create custom models directly in Python:
import open_clip
import json
from pathlib import Path

# Define custom config
custom_config = {
    "embed_dim": 768,
    "vision_cfg": {
        "image_size": 256,
        "layers": 16,
        "width": 1024,
        "patch_size": 16
    },
    "text_cfg": {
        "context_length": 77,
        "vocab_size": 49408,
        "width": 768,
        "heads": 12,
        "layers": 16
    }
}

# Save config to file
config_path = Path('custom_model.json')
with open(config_path, 'w') as f:
    json.dump(custom_config, f, indent=2)

# Add config and create model
open_clip.add_model_config(config_path)
model, _, preprocess = open_clip.create_model_and_transforms(
    'custom_model',
    pretrained=None
)
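
A quick pre-flight check can catch dimension mistakes before a model is built. The sketch below verifies that each tower's width divides evenly into attention heads, as multi-head attention requires. check_head_dims is a hypothetical helper; it assumes the vision tower derives its head count from head_width (defaulting to 64 in this sketch), while the text tower sets heads directly, as in the configs on this page.

```python
def check_head_dims(cfg: dict) -> dict:
    """Check that tower widths split evenly across attention heads."""
    vision = cfg["vision_cfg"]
    text = cfg["text_cfg"]
    # Assumption: vision heads = width // head_width, with head_width 64 if unset.
    head_width = vision.get("head_width", 64)
    if vision["width"] % head_width != 0:
        raise ValueError("vision width must be a multiple of head_width")
    if text["width"] % text["heads"] != 0:
        raise ValueError("text width must be a multiple of heads")
    return {
        "vision_heads": vision["width"] // head_width,
        "text_head_dim": text["width"] // text["heads"],
    }


# Same values as the custom config defined above.
custom_config = {
    "embed_dim": 768,
    "vision_cfg": {"image_size": 256, "layers": 16, "width": 1024, "patch_size": 16},
    "text_cfg": {"context_length": 77, "vocab_size": 49408,
                 "width": 768, "heads": 12, "layers": 16},
}
dims = check_head_dims(custom_config)
print(dims)
```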

Available Model Configs

To see all available model configurations:
import open_clip

# List all available model architectures
models = open_clip.list_models()
print(models)

Best Practices

  1. Embed Dimension: embed_dim is a single top-level value shared by both towers; each tower projects its output to this size, so the vision and text widths may differ
  2. Model Naming: Use descriptive names that indicate architecture (e.g., roberta-ViT-B-32)
  3. Configuration Testing: Test custom configs with small datasets before full training
  4. Pre-trained Weights: When using HuggingFace models, leverage their pre-trained weights for better initialization
  5. Layer Freezing: Start with more frozen layers and gradually unfreeze for fine-tuning

Example: Training Custom Model

A complete example that trains a custom model with a RoBERTa text encoder:
python -m open_clip_train.main \
    --train-data "pipe:aws s3 cp s3://bucket/data/{00000..00329}.tar -" \
    --train-num-samples 3000000 \
    --dataset-type webdataset \
    --batch-size 256 \
    --warmup 2000 \
    --epochs 10 \
    --lr 5e-4 \
    --precision amp \
    --workers 6 \
    --model "roberta-ViT-B-32" \
    --hf-tokenizer-name "roberta-base" \
    --lock-text \
    --lock-text-unlocked-layers 10 \
    --name "custom-clip" \
    --report-to "tensorboard"
This configuration:
  • Uses RoBERTa as the text encoder
  • Locks the RoBERTa text encoder but leaves its last 10 layer groups trainable
  • Streams WebDataset shards from S3
  • Uses automatic mixed precision for efficiency
  • Reports metrics to TensorBoard
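
To put the --warmup value in context, the number of optimizer steps implied by the command can be estimated from the sample count and batch size. This is a rough sketch: training_steps is a hypothetical helper, it assumes --batch-size is the per-process batch size and that training runs on a single process unless world_size says otherwise, and it ignores any per-worker rounding the data loader may apply.

```python
import math


def training_steps(num_samples: int, batch_size: int, epochs: int,
                   world_size: int = 1):
    """Estimate (steps_per_epoch, total_steps) for a training run."""
    # Assumption: the global batch is the per-process batch times the process count.
    global_batch = batch_size * world_size
    steps_per_epoch = math.ceil(num_samples / global_batch)
    return steps_per_epoch, steps_per_epoch * epochs


# Values from the command above: 3,000,000 samples, batch size 256, 10 epochs.
steps_per_epoch, total_steps = training_steps(3_000_000, 256, 10)
warmup_fraction = 2000 / total_steps
print(steps_per_epoch, total_steps, f"{warmup_fraction:.2%}")
```

Under these assumptions, the 2000 warmup steps cover a little under 2% of the run; with more processes the total step count shrinks and the same warmup takes a larger share.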
