tokenize

Overview

Converts a text string or a list of text strings into a tensor of token IDs suitable for CLIP text encoders. Uses the default SimpleTokenizer, which applies byte-pair encoding (BPE).

Function Signature

def tokenize(
    texts: Union[str, List[str]], 
    context_length: int = 77
) -> torch.LongTensor

Parameters

texts

Union[str, List[str]]

required

Input text string or list of text strings to tokenize. Text is automatically cleaned and normalized.

context_length

int

default: 77

Maximum sequence length in tokens, counting the start-of-text and end-of-text tokens. Longer sequences are truncated. Default is 77 (standard for CLIP).

Returns

tokens

torch.LongTensor

2D tensor of token IDs with shape [batch_size, context_length]. Each sequence includes:

Start-of-text token (position 0)
Encoded text tokens
End-of-text token
Zero padding (if sequence is shorter than context_length)
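
The layout above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: `pack_tokens` is a hypothetical helper, its input is assumed to be a list of already BPE-encoded IDs, the sample IDs are arbitrary placeholders, and the special-token IDs are the ones listed under Token Structure.

```python
from typing import List

SOT_ID = 49406  # start-of-text ID, as listed under Token Structure
EOT_ID = 49407  # end-of-text ID, as listed under Token Structure
PAD_ID = 0      # padding value


def pack_tokens(encoded: List[int], context_length: int = 77) -> List[int]:
    """Hypothetical helper: wrap, truncate, and pad one sequence of token IDs."""
    row = [SOT_ID] + list(encoded) + [EOT_ID]
    if len(row) > context_length:
        # Truncate, then overwrite the final slot so the row still ends in EOT.
        row = row[:context_length]
        row[-1] = EOT_ID
    # Right-pad with zeros up to the fixed length.
    return row + [PAD_ID] * (context_length - len(row))


short_row = pack_tokens([320, 1125], context_length=6)
long_row = pack_tokens(list(range(1, 11)), context_length=6)
print(short_row)  # [49406, 320, 1125, 49407, 0, 0]
print(long_row)   # [49406, 1, 2, 3, 4, 49407]
```

Stacking one such row per input text gives the [batch_size, context_length] shape described above.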

Examples

Basic tokenization

import open_clip

# Tokenize single text
text = "a photo of a cat"
tokens = open_clip.tokenize(text)
print(tokens.shape)  # torch.Size([1, 77])
print(tokens[0, :10])  # First 10 tokens

Batch tokenization

# Tokenize multiple texts
texts = [
    "a photo of a cat",
    "a photo of a dog",
    "a photo of a bird"
]
tokens = open_clip.tokenize(texts)
print(tokens.shape)  # torch.Size([3, 77])

Custom context length

# Use a longer context length; the text encoder must support it (standard CLIP models expect 77)
long_text = "a very detailed description with many words"
tokens = open_clip.tokenize(long_text, context_length=128)
print(tokens.shape)  # torch.Size([1, 128])

Complete inference example

import torch
import open_clip
from PIL import Image

# Load model
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
model.eval()

# Prepare text
texts = ["a cat", "a dog", "a bird"]
text_tokens = open_clip.tokenize(texts)

# Encode text
with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)

print(text_features.shape)  # torch.Size([3, 512])

Handle long text with truncation

# Long text is automatically truncated
long_description = " ".join(["word"] * 100)
tokens = open_clip.tokenize(long_description, context_length=77)

# When a sequence is truncated, EOT is written into the final position
eot_token_id = 49407
print(f"Last token is EOT: {tokens[0, -1].item() == eot_token_id}")

Token Structure

Each tokenized sequence has the following structure:

[SOT] [token_1] [token_2] ... [token_n] [EOT] [PAD] [PAD] ...

SOT: Start-of-text token (ID: 49406)
EOT: End-of-text token (ID: 49407)
PAD: Zero padding (ID: 0)
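
Given the IDs listed above, EOT is the largest value in a row and padding is the smallest, so the EOT slot can be found as the position of the row's maximum. The sketch below shows this in plain Python. `eot_index` and `content_tokens` are hypothetical helpers, the sample IDs are placeholders, and the approach assumes no token ID exceeds 49407.

```python
from typing import List

EOT_ID = 49407  # end-of-text ID from the list above


def eot_index(row: List[int]) -> int:
    """Hypothetical helper: index of the first maximum value, i.e. the EOT slot."""
    return max(range(len(row)), key=row.__getitem__)


def content_tokens(row: List[int]) -> List[int]:
    """Return the IDs strictly between SOT and EOT."""
    return row[1:eot_index(row)]


row = [49406, 320, 1125, 539, 49407, 0, 0, 0]
print(eot_index(row))       # 4
print(content_tokens(row))  # [320, 1125, 539]
```

On the tensor returned by tokenize, `tokens.argmax(dim=-1)` should give the same positions for a whole batch at once.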

Text Preprocessing

The tokenizer automatically applies:

Basic cleaning: Fixes text encoding issues with ftfy
HTML unescaping: Decodes HTML entities
Whitespace normalization: Removes extra whitespace
Lowercasing: Converts text to lowercase (default behavior)
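
A rough approximation of these steps, using only the standard library, might look like the following. `clean_text` is a hypothetical stand-in rather than the tokenizer's own code: the ftfy pass is omitted because ftfy is a third-party package, and the order of the remaining steps is an assumption.

```python
import html
import re


def clean_text(text: str) -> str:
    """Hypothetical stand-in for the cleaning steps; the ftfy pass is omitted."""
    text = html.unescape(text)                # decode HTML entities such as &amp;
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace runs, trim ends
    return text.lower()                       # lowercase


cleaned = clean_text("  A photo   of\n Tom &amp; Jerry ")
print(cleaned)  # a photo of tom & jerry
```

Because of the lowercasing step, inputs that differ only in case produce the same token IDs.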

Notes

This function uses a module-level SimpleTokenizer instance.
For custom tokenizers (HuggingFace, SigLIP), use get_tokenizer() instead.
Sequences longer than context_length are truncated, and the EOT token is written into the last position.
Empty or very short texts still produce valid sequences containing the SOT and EOT tokens.
