Chroma uses embedding functions to convert your documents and queries into vector representations. You can use built-in embedding functions or create your own.

Default Embedding Function

If you don’t specify an embedding function, Chroma uses the default ONNX MiniLM-L6-v2 model:
import chromadb

client = chromadb.Client()

# Uses default embedding function
collection = client.create_collection(name="my_collection")

# Add documents - embeddings are generated automatically
collection.add(
    documents=["This is a document", "This is another document"],
    ids=["id1", "id2"]
)
The default embedding function creates 384-dimensional vectors using the all-MiniLM-L6-v2 model.

Built-in Embedding Functions

OpenAI Embeddings

Use OpenAI’s embedding models:
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

openai_ef = OpenAIEmbeddingFunction(
    api_key="your-api-key",
    model_name="text-embedding-ada-002"
)

collection = client.create_collection(
    name="openai_collection",
    embedding_function=openai_ef
)
Available OpenAI models:
  • text-embedding-ada-002 - Legacy model, 1536 dimensions
  • text-embedding-3-small - Faster and cheaper, 1536 dimensions
  • text-embedding-3-large - Highest quality, 3072 dimensions

Cohere Embeddings

Use Cohere’s embedding models:
from chromadb.utils.embedding_functions import CohereEmbeddingFunction

cohere_ef = CohereEmbeddingFunction(
    api_key="your-api-key",
    model_name="embed-english-v3.0"
)

collection = client.create_collection(
    name="cohere_collection",
    embedding_function=cohere_ef
)

Hugging Face Embeddings

Use any Sentence Transformers model:
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

ef = SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

collection = client.create_collection(
    name="hf_collection",
    embedding_function=ef
)
Popular models:
  • all-MiniLM-L6-v2 - Fast, 384 dimensions
  • all-mpnet-base-v2 - High quality, 768 dimensions
  • paraphrase-multilingual-MiniLM-L12-v2 - Multilingual support

Instructor Embeddings

Task-specific embeddings with instructions:
from chromadb.utils.embedding_functions import InstructorEmbeddingFunction

instructor_ef = InstructorEmbeddingFunction(
    model_name="hkunlp/instructor-base",
    instruction="Represent the document for retrieval: ",
    device="cuda"  # or "cpu"
)

collection = client.create_collection(
    name="instructor_collection",
    embedding_function=instructor_ef
)

Google Gemini Embeddings

from chromadb.utils.embedding_functions import GoogleGenerativeAiEmbeddingFunction

gemini_ef = GoogleGenerativeAiEmbeddingFunction(
    api_key="your-api-key",
    model_name="models/embedding-001"
)

Amazon Bedrock Embeddings

from chromadb.utils.embedding_functions import AmazonBedrockEmbeddingFunction

bedrock_ef = AmazonBedrockEmbeddingFunction(
    aws_access_key_id="your-access-key",
    aws_secret_access_key="your-secret-key",
    aws_region_name="us-east-1",
    model_id="amazon.titan-embed-text-v1"
)

Custom Embedding Functions

Create your own embedding function by implementing the EmbeddingFunction protocol:
from chromadb.api.types import EmbeddingFunction, Documents
from typing import List

class MyEmbeddingFunction(EmbeddingFunction[Documents]):
    def __call__(self, input: Documents) -> List[List[float]]:
        # Your embedding logic here
        embeddings = []
        for doc in input:
            # Example: simple character-based embedding
            embedding = [float(ord(c)) for c in doc[:10].ljust(10, ' ')]
            embeddings.append(embedding)
        return embeddings
    
    @staticmethod
    def name() -> str:
        return "my_custom_function"
    
    @staticmethod
    def build_from_config(config: dict) -> "MyEmbeddingFunction":
        return MyEmbeddingFunction()
    
    def get_config(self) -> dict:
        return {}

# Use your custom function
my_ef = MyEmbeddingFunction()
collection = client.create_collection(
    name="custom_collection",
    embedding_function=my_ef
)
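The character-based logic above can be sanity-checked without Chroma. A minimal standalone sketch of the same idea:

```python
from typing import List

def char_embedding(doc: str, dim: int = 10) -> List[float]:
    # Pad or truncate to a fixed width so every vector has the same length
    padded = doc[:dim].ljust(dim, ' ')
    return [float(ord(c)) for c in padded]

docs = ["This is a document", "hi"]
embeddings = [char_embedding(d) for d in docs]

# Every embedding has the same, fixed dimensionality
assert all(len(e) == 10 for e in embeddings)
```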

Custom Function Requirements

Your embedding function must:
  1. Implement __call__ - Takes Documents and returns List[List[float]]
  2. Implement name() - Returns a unique identifier
  3. Implement build_from_config() - Recreates function from config
  4. Implement get_config() - Returns serializable configuration
  5. Return consistent dimensions - All embeddings must have same length
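One way to enforce requirement 5 is a small validation helper that checks dimensions before returning. This is a sketch; validate_dimensions is not part of Chroma:

```python
from typing import List

def validate_dimensions(embeddings: List[List[float]]) -> List[List[float]]:
    # All embeddings must share the dimensionality of the first one
    if embeddings:
        dim = len(embeddings[0])
        for i, e in enumerate(embeddings):
            if len(e) != dim:
                raise ValueError(
                    f"embedding {i} has {len(e)} dimensions, expected {dim}"
                )
    return embeddings

validate_dimensions([[1.0, 2.0], [3.0, 4.0]])  # passes
```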

Advanced Custom Function

from chromadb.api.types import EmbeddingFunction, Documents
from typing import List
import requests

class RemoteEmbeddingFunction(EmbeddingFunction[Documents]):
    def __init__(self, api_url: str, api_key: str):
        self._api_url = api_url
        self._api_key = api_key
    
    def __call__(self, input: Documents) -> List[List[float]]:
        response = requests.post(
            self._api_url,
            headers={"Authorization": f"Bearer {self._api_key}"},
            json={"texts": input}
        )
        response.raise_for_status()  # surface HTTP errors early
        return response.json()["embeddings"]
    
    @staticmethod
    def name() -> str:
        return "remote_embedding_function"
    
    @staticmethod
    def build_from_config(config: dict) -> "RemoteEmbeddingFunction":
        return RemoteEmbeddingFunction(
            api_url=config["api_url"],
            api_key=config["api_key"]
        )
    
    def get_config(self) -> dict:
        return {
            "api_url": self._api_url,
            "api_key": self._api_key
        }

Embedding Function Configuration

Distance Metrics

Different embedding functions work best with different distance metrics:
collection = client.create_collection(
    name="my_collection",
    embedding_function=openai_ef,
    metadata={"hnsw:space": "cosine"}  # or "l2" or "ip"
)
Distance metrics:
  • cosine - Cosine similarity (default, recommended for most models)
  • l2 - Euclidean distance
  • ip - Inner product
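The three metrics can be computed by hand to build intuition. This is a pure-Python sketch, not Chroma's internal implementation; note that Chroma reports distances, so cosine and inner product are returned as 1 minus the similarity, and "l2" is the squared Euclidean distance:

```python
import math
from typing import List

def cosine_distance(a: List[float], b: List[float]) -> float:
    # 1 - cosine similarity: 0 for identical directions, 2 for opposite
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def l2_distance(a: List[float], b: List[float]) -> float:
    # Squared Euclidean distance
    return sum((x - y) ** 2 for x, y in zip(a, b))

def inner_product_distance(a: List[float], b: List[float]) -> float:
    return 1.0 - sum(x * y for x, y in zip(a, b))
```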

Query vs. Document Embeddings

Some embedding functions produce different embeddings for queries than for documents (asymmetric retrieval):
class AsymmetricEmbeddingFunction(EmbeddingFunction[Documents]):
    def __call__(self, input: Documents) -> List[List[float]]:
        # Embed documents
        return self._embed_documents(input)
    
    def embed_query(self, input: Documents) -> List[List[float]]:
        # Different embedding for queries
        return self._embed_query(input)
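Models such as the E5 family implement this asymmetry by prefixing the text before embedding. A minimal sketch of that pattern; the prefixes follow the E5 convention, and the toy embedder is illustrative only:

```python
from typing import Callable, List

def make_asymmetric(embed: Callable[[str], List[float]]):
    # E5-style convention: different prefixes for passages and queries
    def embed_documents(docs: List[str]) -> List[List[float]]:
        return [embed("passage: " + d) for d in docs]

    def embed_queries(queries: List[str]) -> List[List[float]]:
        return [embed("query: " + q) for q in queries]

    return embed_documents, embed_queries

# Toy embedder: one dimension holding the text length (illustrative only)
toy = lambda text: [float(len(text))]
embed_docs, embed_queries = make_asymmetric(toy)
```

The same underlying model is used for both sides; only the prefix changes, so documents and queries still land in a shared vector space.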

Working with Multiple Modalities

Multimodal Embeddings

For images and text:
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
from chromadb.utils.data_loaders import ImageLoader

clip_ef = OpenCLIPEmbeddingFunction()
image_loader = ImageLoader()

collection = client.create_collection(
    name="multimodal_collection",
    embedding_function=clip_ef,
    data_loader=image_loader
)

# Add images by URI
collection.add(
    ids=["img1"],
    uris=["path/to/image.jpg"]
)

# Query with text
results = collection.query(
    query_texts=["a photo of a cat"],
    n_results=5
)

Performance Considerations

Batch Processing

Embedding functions process documents in batches:
# Efficient: batch processing
collection.add(
    documents=[...],  # Large list processed in batches
    ids=[...]
)

# Inefficient: one at a time
for doc, id in zip(documents, ids):
    collection.add(documents=[doc], ids=[id])
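The difference matters because each add triggers at least one embedding call. A toy counter makes the cost visible; this is a mock embedder, not Chroma's API:

```python
from typing import List

class CountingEmbedder:
    """Toy embedder that counts how many times it is invoked."""
    def __init__(self):
        self.calls = 0

    def __call__(self, input: List[str]) -> List[List[float]]:
        self.calls += 1  # one "network round-trip" per call
        return [[float(len(d))] for d in input]

docs = [f"doc {i}" for i in range(100)]

batched = CountingEmbedder()
batched(docs)            # one call covers all 100 documents

one_by_one = CountingEmbedder()
for d in docs:
    one_by_one([d])      # 100 separate calls

assert batched.calls == 1 and one_by_one.calls == 100
```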

Caching

Implement caching in custom functions:
from functools import lru_cache
from typing import List

class CachedEmbeddingFunction(EmbeddingFunction[Documents]):
    # Note: lru_cache on a method includes `self` in the cache key
    # and keeps the instance alive for the lifetime of the cache.
    @lru_cache(maxsize=1000)
    def _embed_single(self, text: str) -> List[float]:
        # Expensive embedding operation (placeholder)
        return compute_embedding(text)
    
    def __call__(self, input: Documents) -> List[List[float]]:
        return [self._embed_single(doc) for doc in input]
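An explicit per-instance dict sidesteps the lru_cache-on-method subtlety entirely. A sketch, with a trivial stand-in for the real model call:

```python
from typing import Dict, List

class DictCachedEmbedder:
    def __init__(self):
        # Per-instance cache: text -> embedding
        self._cache: Dict[str, List[float]] = {}
        self.misses = 0

    def _compute_embedding(self, text: str) -> List[float]:
        # Stand-in for the expensive model call
        self.misses += 1
        return [float(len(text))]

    def __call__(self, input: List[str]) -> List[List[float]]:
        out = []
        for doc in input:
            if doc not in self._cache:
                self._cache[doc] = self._compute_embedding(doc)
            out.append(self._cache[doc])
        return out

ef = DictCachedEmbedder()
ef(["a", "b", "a"])
assert ef.misses == 2  # "a" computed once, then served from cache
```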

Troubleshooting

Dimension Mismatch

A collection's dimensionality is fixed when the first embeddings are added. If you see a dimension mismatch error, make sure every embedding your function returns has the same length, and that you query the collection with the same embedding function it was created with.

API Rate Limits

from chromadb.api.types import EmbeddingFunction, Documents
from tenacity import retry, wait_exponential
from typing import List

class RateLimitedEmbeddingFunction(EmbeddingFunction[Documents]):
    @retry(wait=wait_exponential(multiplier=1, min=4, max=60))
    def __call__(self, input: Documents) -> List[List[float]]:
        # API call with automatic retries on failure
        return api_call(input)  # api_call: placeholder for your provider's endpoint
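If you prefer not to depend on tenacity, the same exponential backoff can be written by hand. A sketch; flaky_api simulates a rate-limited endpoint that fails twice before succeeding:

```python
import time
from typing import Callable, List

def with_retries(fn: Callable[[], List[List[float]]],
                 max_attempts: int = 5,
                 base_delay: float = 0.01) -> List[List[float]]:
    # Retry with exponential backoff: base_delay * 2**attempt
    for attempt in range(max_attempts):
        try:
            return fn()
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Simulated endpoint: raises twice, then returns an embedding
state = {"failures_left": 2}
def flaky_api():
    if state["failures_left"] > 0:
        state["failures_left"] -= 1
        raise RuntimeError("429 Too Many Requests")
    return [[0.1, 0.2]]

assert with_retries(flaky_api) == [[0.1, 0.2]]
```

In production you would catch your client library's specific rate-limit exception rather than a bare RuntimeError.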

Memory Issues

For large batches, process in chunks:
def __call__(self, input: Documents) -> List[List[float]]:
    batch_size = 100
    embeddings = []
    for i in range(0, len(input), batch_size):
        batch = input[i:i + batch_size]
        embeddings.extend(self._embed_batch(batch))
    return embeddings
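The chunking pattern above can be verified as a standalone function. A sketch; the per-document embedding here is a trivial stand-in for self._embed_batch:

```python
from typing import List

def embed_in_chunks(input: List[str], batch_size: int = 100) -> List[List[float]]:
    embeddings: List[List[float]] = []
    for i in range(0, len(input), batch_size):
        batch = input[i:i + batch_size]
        # Trivial stand-in for self._embed_batch(batch)
        embeddings.extend([[float(len(d))] for d in batch])
    return embeddings

docs = [str(i) for i in range(250)]
# 250 documents -> chunks of 100, 100, 50; output length matches input
assert len(embed_in_chunks(docs)) == 250
```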
