Chroma uses embedding functions to convert your documents and queries into vector representations. You can use built-in embedding functions or create your own.
Default Embedding Function
If you don’t specify an embedding function, Chroma uses its default: the all-MiniLM-L6-v2 model, run locally through ONNX Runtime:
import chromadb
client = chromadb.Client()
# Uses default embedding function
collection = client.create_collection(name="my_collection")
# Add documents - embeddings are generated automatically
collection.add(
    documents=["This is a document", "This is another document"],
    ids=["id1", "id2"]
)
The default embedding function creates 384-dimensional vectors using the all-MiniLM-L6-v2 model.
Built-in Embedding Functions
OpenAI Embeddings
Use OpenAI’s embedding models:
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
openai_ef = OpenAIEmbeddingFunction(
    api_key="your-api-key",
    model_name="text-embedding-ada-002"
)
collection = client.create_collection(
    name="openai_collection",
    embedding_function=openai_ef
)
Available OpenAI models:
text-embedding-ada-002 - Older generation model, 1536 dimensions
text-embedding-3-small - Smaller, faster, 1536 dimensions
text-embedding-3-large - Highest quality, 3072 dimensions
Cohere Embeddings
Use Cohere’s embedding models:
from chromadb.utils.embedding_functions import CohereEmbeddingFunction
cohere_ef = CohereEmbeddingFunction(
    api_key="your-api-key",
    model_name="embed-english-v3.0"
)
collection = client.create_collection(
    name="cohere_collection",
    embedding_function=cohere_ef
)
Hugging Face Embeddings
Use any Sentence Transformers model from the Hugging Face Hub. The model is downloaded and run locally:
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
ef = SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
collection = client.create_collection(
    name="hf_collection",
    embedding_function=ef
)
Popular models:
all-MiniLM-L6-v2 - Fast, 384 dimensions
all-mpnet-base-v2 - High quality, 768 dimensions
paraphrase-multilingual-MiniLM-L12-v2 - Multilingual support
Instructor Embeddings
Task-specific embeddings with instructions:
from chromadb.utils.embedding_functions import InstructorEmbeddingFunction
instructor_ef = InstructorEmbeddingFunction(
    model_name="hkunlp/instructor-base",
    instruction="Represent the document for retrieval: ",
    device="cuda"  # or "cpu"
)
collection = client.create_collection(
    name="instructor_collection",
    embedding_function=instructor_ef
)
Google Gemini Embeddings
Use Google’s Gemini embedding models:
from chromadb.utils.embedding_functions import GoogleGenerativeAiEmbeddingFunction
gemini_ef = GoogleGenerativeAiEmbeddingFunction(
    api_key="your-api-key",
    model_name="models/embedding-001"
)
Amazon Bedrock Embeddings
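Use Amazon Bedrock embedding models through a boto3 session:
import boto3
from chromadb.utils.embedding_functions import AmazonBedrockEmbeddingFunction
# The function takes a boto3 session rather than raw credentials
session = boto3.Session(
    aws_access_key_id="your-access-key",
    aws_secret_access_key="your-secret-key",
    region_name="us-east-1"
)
bedrock_ef = AmazonBedrockEmbeddingFunction(
    session=session,
    model_name="amazon.titan-embed-text-v1"
)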
Custom Embedding Functions
Create your own embedding function by implementing the EmbeddingFunction protocol:
from typing import List
from chromadb.api.types import EmbeddingFunction, Documents
class MyEmbeddingFunction(EmbeddingFunction[Documents]):
    def __call__(self, input: Documents) -> List[List[float]]:
        # Your embedding logic here
        embeddings = []
        for doc in input:
            # Example: simple character-based embedding (always 10 values)
            embedding = [float(ord(c)) for c in doc[:10].ljust(10, ' ')]
            embeddings.append(embedding)
        return embeddings
    @staticmethod
    def name() -> str:
        return "my_custom_function"
    @staticmethod
    def build_from_config(config: dict) -> "MyEmbeddingFunction":
        return MyEmbeddingFunction()
    def get_config(self) -> dict:
        return {}
# Use your custom function
my_ef = MyEmbeddingFunction()
collection = client.create_collection(
    name="custom_collection",
    embedding_function=my_ef
)
Custom Function Requirements
Your embedding function must:
- Implement __call__ - Takes Documents and returns List[List[float]]
- Implement name() - Returns a unique identifier
- Implement build_from_config() - Recreates function from config
- Implement get_config() - Returns serializable configuration
- Return consistent dimensions - All embeddings must have same length
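The last requirement is easy to break by accident, so it can help to check it before handing vectors to a collection. The following is a minimal sketch in plain Python; check_embeddings and toy_embed are hypothetical helpers written for this illustration, not part of chromadb.

```python
from typing import List, Sequence


def check_embeddings(embeddings: Sequence[Sequence[float]]) -> int:
    """Return the shared dimension, or raise ValueError if lengths differ."""
    if not embeddings:
        raise ValueError("no embeddings returned")
    dim = len(embeddings[0])
    for i, vec in enumerate(embeddings):
        if len(vec) != dim:
            raise ValueError(
                f"embedding {i} has length {len(vec)}, expected {dim}"
            )
    return dim


def toy_embed(docs: List[str]) -> List[List[float]]:
    # Toy character-based scheme: first 10 characters, padded with spaces
    return [[float(ord(c)) for c in d[:10].ljust(10, " ")] for d in docs]


# Inputs of different lengths still produce vectors of one dimension
dim = check_embeddings(toy_embed(["short", "a much longer document"]))
print(dim)
```

Calling a check like this at the end of __call__ during development surfaces a ragged result at its source instead of at insert time.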
Advanced Custom Function
from typing import List
import requests
from chromadb.api.types import EmbeddingFunction, Documents
class RemoteEmbeddingFunction(EmbeddingFunction[Documents]):
    def __init__(self, api_url: str, api_key: str):
        self._api_url = api_url
        self._api_key = api_key
    def __call__(self, input: Documents) -> List[List[float]]:
        response = requests.post(
            self._api_url,
            headers={"Authorization": f"Bearer {self._api_key}"},
            json={"texts": input},
            timeout=30
        )
        # Fail on HTTP errors instead of parsing an error body
        response.raise_for_status()
        return response.json()["embeddings"]
    @staticmethod
    def name() -> str:
        return "remote_embedding_function"
    @staticmethod
    def build_from_config(config: dict) -> "RemoteEmbeddingFunction":
        return RemoteEmbeddingFunction(
            api_url=config["api_url"],
            api_key=config["api_key"]
        )
    def get_config(self) -> dict:
        # Caution: the returned config may be stored with the collection
        return {
            "api_url": self._api_url,
            "api_key": self._api_key
        }
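Because get_config() above returns the API key itself, the key ends up wherever the config is serialized. One alternative, sketched below under stated assumptions, is to store only the name of an environment variable and resolve the secret at call time. EnvKeyConfig and the variable name MY_EMBEDDING_API_KEY are hypothetical, and the class is plain Python to keep the sketch dependency-free; in practice the same pattern would live inside an EmbeddingFunction subclass.

```python
import os
from typing import Dict


class EnvKeyConfig:
    """Hypothetical helper: keeps the name of an env var, not the key itself."""

    def __init__(self, api_url: str, api_key_env_var: str = "MY_EMBEDDING_API_KEY"):
        self._api_url = api_url
        self._api_key_env_var = api_key_env_var

    def api_key(self) -> str:
        # Resolve the secret only when it is needed
        key = os.environ.get(self._api_key_env_var)
        if key is None:
            raise RuntimeError(
                f"environment variable {self._api_key_env_var} is not set"
            )
        return key

    def get_config(self) -> Dict[str, str]:
        # Only non-secret values are serialized
        return {
            "api_url": self._api_url,
            "api_key_env_var": self._api_key_env_var,
        }

    @staticmethod
    def build_from_config(config: Dict[str, str]) -> "EnvKeyConfig":
        return EnvKeyConfig(config["api_url"], config["api_key_env_var"])


os.environ["MY_EMBEDDING_API_KEY"] = "secret-value"  # for demonstration only
original = EnvKeyConfig("https://example.invalid/embed")
restored = EnvKeyConfig.build_from_config(original.get_config())
```

The round trip through get_config() and build_from_config() preserves everything needed to rebuild the function, while the secret stays in the process environment.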
Embedding Function Configuration
Distance Metrics
Different embedding functions work best with different distance metrics:
collection = client.create_collection(
    name="my_collection",
    embedding_function=openai_ef,
    metadata={"hnsw:space": "cosine"}  # or "l2" or "ip"
)
Distance metrics:
l2 - Squared Euclidean (L2) distance (default)
cosine - Cosine distance (recommended for most text embedding models)
ip - Inner product distance
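To make the difference between the three metrics concrete, here is a small plain-Python sketch. It assumes the definitions given in Chroma's documentation (squared L2, one minus the inner product, one minus cosine similarity); the function names are illustrative and not part of chromadb.

```python
import math
from typing import Sequence


def l2(a: Sequence[float], b: Sequence[float]) -> float:
    # Squared Euclidean distance
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ip(a: Sequence[float], b: Sequence[float]) -> float:
    # One minus the inner product; can be negative for long vectors
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    # One minus cosine similarity; ignores vector length
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, twice as long
c = [0.0, 1.0]  # orthogonal to a

for name, fn in (("l2", l2), ("ip", ip), ("cosine", cosine)):
    print(name, fn(a, b), fn(a, c))
```

Only cosine treats a and b as identical, which is why it is a common choice for models whose outputs are not normalized. For unit-length vectors the three metrics produce the same ranking.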
Query vs. Document Embeddings
Some embedding functions support different embeddings for queries vs documents:
class AsymmetricEmbeddingFunction(EmbeddingFunction[Documents]):
    # _embed_documents and _embed_query are placeholders for your model calls
    def __call__(self, input: Documents) -> List[List[float]]:
        # Embed documents
        return self._embed_documents(input)
    def embed_query(self, input: Documents) -> List[List[float]]:
        # Different embedding for queries
        return self._embed_query(input)
Working with Multiple Modalities
Multimodal Embeddings
For images and text:
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
from chromadb.utils.data_loaders import ImageLoader
clip_ef = OpenCLIPEmbeddingFunction()
image_loader = ImageLoader()
collection = client.create_collection(
    name="multimodal_collection",
    embedding_function=clip_ef,
    data_loader=image_loader
)
# Add images by URI
collection.add(
    ids=["img1"],
    uris=["path/to/image.jpg"]
)
# Query with text
results = collection.query(
    query_texts=["a photo of a cat"],
    n_results=5
)
Batch Processing
Embedding functions process documents in batches:
# Efficient: batch processing
collection.add(
    documents=[...],  # Large list processed in batches
    ids=[...]
)
# Inefficient: one at a time
for doc, doc_id in zip(documents, ids):
    collection.add(documents=[doc], ids=[doc_id])
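Between one call per document and one call for everything there is a middle ground: fixed-size batches, which keep each request bounded for very large datasets. The sketch below shows one way to slice ids and documents together. iter_batches is a hypothetical helper, and the batch size of 3 is arbitrary and chosen only to keep the example small.

```python
from typing import Iterator, List, Sequence, Tuple


def iter_batches(
    ids: Sequence[str],
    documents: Sequence[str],
    batch_size: int,
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield aligned (ids, documents) slices of at most batch_size items."""
    if len(ids) != len(documents):
        raise ValueError("ids and documents must have the same length")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        yield list(ids[start:end]), list(documents[start:end])


ids = [f"id{i}" for i in range(7)]
documents = [f"document {i}" for i in range(7)]
batches = list(iter_batches(ids, documents, batch_size=3))

for batch_ids, batch_docs in batches:
    # In a real script: collection.add(ids=batch_ids, documents=batch_docs)
    print(len(batch_ids), batch_ids[0])
```

Slicing both lists with the same indices keeps every id paired with its document, including in the final, shorter batch.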
Caching
Implement caching in custom functions:
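from functools import lru_cache
from typing import List
from chromadb.api.types import EmbeddingFunction, Documents
class CachedEmbeddingFunction(EmbeddingFunction[Documents]):
    # Note: lru_cache on a method keys on (self, text) and keeps the instance alive
    @lru_cache(maxsize=1000)
    def _embed_single(self, text: str) -> List[float]:
        # Expensive embedding operation; compute_embedding is a placeholder
        return compute_embedding(text)
    def __call__(self, input: Documents) -> List[List[float]]:
        return [self._embed_single(doc) for doc in input]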
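Caching one text at a time gives up batching, since every miss becomes a separate model or API call. A possible alternative, sketched here in plain Python, keeps a dictionary keyed by text and sends all misses in a single batch. EmbeddingCache and fake_embed are hypothetical names, and the cache is unbounded, so a production version would need an eviction policy.

```python
from typing import Callable, Dict, List


class EmbeddingCache:
    """Hypothetical wrapper: embeds only texts not seen before, in one batch."""

    def __init__(self, embed_batch: Callable[[List[str]], List[List[float]]]):
        self._embed_batch = embed_batch
        self._cache: Dict[str, List[float]] = {}

    def __call__(self, texts: List[str]) -> List[List[float]]:
        # Collect each uncached text once, preserving first-seen order
        missing = list(dict.fromkeys(t for t in texts if t not in self._cache))
        if missing:
            for text, vec in zip(missing, self._embed_batch(missing)):
                self._cache[text] = vec
        # Return copies so callers cannot mutate cached vectors
        return [list(self._cache[t]) for t in texts]


calls: List[List[str]] = []


def fake_embed(batch: List[str]) -> List[List[float]]:
    # Stand-in for an expensive model call; records what it was asked to embed
    calls.append(list(batch))
    return [[float(len(t))] for t in batch]


cached = EmbeddingCache(fake_embed)
first = cached(["a", "bb", "a"])
second = cached(["bb", "ccc"])
```

The wrapper's __call__ could serve as the body of an embedding function's __call__, with fake_embed replaced by the real batch call.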
Troubleshooting
Dimension Mismatch
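Error: embedding dimensions don't match. This typically happens when a collection is used with a different embedding function or model than the one it was created with, or when a custom function returns vectors of varying length.
Solution: use the same embedding function for every add and query on a collection, and ensure all embeddings have the same dimension.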
API Rate Limits
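from typing import List
from chromadb.api.types import EmbeddingFunction, Documents
from tenacity import retry, stop_after_attempt, wait_exponential
class RateLimitedEmbeddingFunction(EmbeddingFunction[Documents]):
    # Back off exponentially (4s to 60s) and give up after 5 attempts
    @retry(wait=wait_exponential(multiplier=1, min=4, max=60), stop=stop_after_attempt(5))
    def __call__(self, input: Documents) -> List[List[float]]:
        # API call with automatic retries; api_call is a placeholder
        return api_call(input)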
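For readers who prefer not to add a dependency, the retry behavior can be approximated with the standard library. This is a simplified sketch of exponential backoff, not a reimplementation of tenacity. call_with_backoff and flaky are hypothetical, and the sleep function is injectable so the example runs without waiting.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    *,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, doubling the wait after each failure, up to max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            # Out of attempts: let the last error propagate
            if attempt == max_attempts:
                raise
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
    raise ValueError("max_attempts must be at least 1")


attempts = {"n": 0}


def flaky() -> List[List[float]]:
    # Fails twice, then succeeds, imitating a temporary rate limit
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("rate limited")
    return [[0.1, 0.2]]


delays: List[float] = []
result = call_with_backoff(flaky, sleep=delays.append)
```

Catching every Exception keeps the sketch short. A real implementation would retry only on the provider's rate-limit or transient network errors so that genuine bugs fail immediately.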
Memory Issues
For large batches, process in chunks:
def __call__(self, input: Documents) -> List[List[float]]:
    batch_size = 100
    embeddings = []
    for i in range(0, len(input), batch_size):
        batch = input[i:i + batch_size]
        embeddings.extend(self._embed_batch(batch))
    return embeddings