Overview

Converts one or more text strings into a tensor of token IDs suitable for CLIP text encoders. Uses the default SimpleTokenizer, which applies byte-pair encoding (BPE).

Function Signature

def tokenize(
    texts: Union[str, List[str]], 
    context_length: int = 77
) -> torch.LongTensor

Parameters

texts
Union[str, List[str]]
required
Input text string or list of text strings to tokenize. Text is automatically cleaned and normalized.
context_length
int
default:"77"
Length of every output sequence, including the start-of-text and end-of-text tokens. Longer sequences are truncated and shorter ones are zero-padded. The default of 77 is the standard CLIP context length.

Returns

tokens
torch.LongTensor
2D tensor of token IDs with shape [batch_size, context_length]. Each sequence includes:
  • Start-of-text token (position 0)
  • Encoded text tokens
  • End-of-text token
  • Zero padding (if sequence is shorter than context_length)

Examples

Basic tokenization

import open_clip

# Tokenize single text
text = "a photo of a cat"
tokens = open_clip.tokenize(text)
print(tokens.shape)  # torch.Size([1, 77])
print(tokens[0, :10])  # First 10 tokens

Batch tokenization

# Tokenize multiple texts
texts = [
    "a photo of a cat",
    "a photo of a dog",
    "a photo of a bird"
]
tokens = open_clip.tokenize(texts)
print(tokens.shape)  # torch.Size([3, 77])

Custom context length

# Request a longer context; the text encoder consuming these tokens must be configured for the same length
long_text = "a very detailed description with many words"
tokens = open_clip.tokenize(long_text, context_length=128)
print(tokens.shape)  # torch.Size([1, 128])

Complete inference example

import torch
import open_clip

# Load model
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
model.eval()

# Prepare text
texts = ["a cat", "a dog", "a bird"]
text_tokens = open_clip.tokenize(texts)

# Encode text
with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)

print(text_features.shape)  # torch.Size([3, 512])

Handle long text with truncation

# Long text is automatically truncated
long_description = " ".join(["word"] * 100)
tokens = open_clip.tokenize(long_description, context_length=77)

# After truncation, the final position holds the EOT token
eot_token_id = 49407
print(f"Last token is EOT: {tokens[0, -1].item() == eot_token_id}")

Token Structure

Each tokenized sequence has the following structure:
[SOT] [token_1] [token_2] ... [token_n] [EOT] [PAD] [PAD] ...
  • SOT: Start-of-text token (ID: 49406)
  • EOT: End-of-text token (ID: 49407)
  • PAD: Zero padding (ID: 0)
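The layout above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: `layout_sequence` is a hypothetical helper, and the text token IDs passed to it are arbitrary placeholders rather than real BPE output. Only the SOT, EOT, and PAD IDs come from this page.

```python
from typing import List

SOT_ID = 49406  # start-of-text ID, as listed above
EOT_ID = 49407  # end-of-text ID, as listed above
PAD_ID = 0      # padding ID, as listed above


def layout_sequence(text_ids: List[int], context_length: int = 77) -> List[int]:
    """Hypothetical helper: wrap already-encoded IDs in SOT/EOT, then truncate or pad."""
    ids = [SOT_ID] + list(text_ids) + [EOT_ID]
    if len(ids) > context_length:
        ids = ids[:context_length]
        ids[-1] = EOT_ID  # keep EOT in the final slot after truncation
    return ids + [PAD_ID] * (context_length - len(ids))


# Placeholder IDs (not real BPE output), with a small context length for readability
short = layout_sequence([320, 1125], context_length=8)
truncated = layout_sequence(list(range(1000, 1020)), context_length=8)

print(short)      # SOT, two text IDs, EOT, then padding
print(truncated)  # SOT, six text IDs, EOT in the last slot
```

Every row has exactly `context_length` entries, which is what lets a batch be stacked into a single 2D tensor.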

Text Preprocessing

The tokenizer automatically applies:
  1. Basic cleaning: Fixes text encoding issues with ftfy
  2. HTML unescaping: Decodes HTML entities
  3. Whitespace normalization: Collapses runs of whitespace into single spaces and strips leading and trailing whitespace
  4. Lowercasing: Converts text to lowercase (default behavior)
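The steps above can be approximated with the standard library alone. This is a rough sketch under stated assumptions: `clean_text` is a hypothetical name, the ftfy encoding-repair step is omitted so the example has no third-party dependency, and the real tokenizer may differ in detail.

```python
import html
import re


def clean_text(text: str) -> str:
    """Approximate steps 2-4 above; the ftfy step (1) is intentionally left out."""
    text = html.unescape(text)                # decode HTML entities such as &amp;
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace runs, trim the ends
    return text.lower()                       # lowercase


cleaned = clean_text("  A photo of a &lt;CAT&gt; &amp; a   DOG\n")
print(cleaned)
```

Because of this normalization, inputs that differ only in case or spacing should be expected to produce the same token IDs.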

Notes

  • This function uses a module-level SimpleTokenizer instance
  • For custom tokenizers (HuggingFace, SigLIP), use get_tokenizer() instead
  • Sequences longer than context_length are truncated, with EOT token placed at the last position
  • Empty or very short texts still produce valid token sequences with SOT and EOT tokens
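The last two notes suggest a simple way to inspect a tokenized row. The sketch below works on plain Python lists (for a real tensor, a row such as `tokens[0].tolist()` could be passed in). Both helper names are hypothetical, and the example rows are hand-written with placeholder text IDs rather than produced by the tokenizer.

```python
from typing import List

EOT_ID = 49407  # end-of-text ID from the Token Structure section


def content_length(row: List[int]) -> int:
    """Hypothetical helper: number of positions up to and including the first EOT."""
    return row.index(EOT_ID) + 1


def fills_context(row: List[int]) -> bool:
    """True when EOT is in the final slot: the text fit exactly or was truncated."""
    return row[-1] == EOT_ID


# Hand-written rows with context_length=6 (placeholder text IDs)
empty_text_row = [49406, 49407, 0, 0, 0, 0]        # empty input: only SOT and EOT
full_row = [49406, 1000, 1001, 1002, 1003, 49407]  # EOT in the last position

print(content_length(empty_text_row), fills_context(empty_text_row))
print(content_length(full_row), fills_context(full_row))
```

A row alone cannot distinguish an exact fit from a truncation; detecting truncation reliably would need the untruncated token count.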
