What are Embeddings?
Embeddings transform text, images, or other data into numerical vectors (lists of numbers) that capture semantic meaning. Documents with similar meanings will have similar vector representations.

Why Embeddings Matter
By Analogy
An embedding represents the essence of a document. This enables documents and queries with the same essence to be “near” each other and therefore easy to find.
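"Nearness" here means vector similarity, most often cosine similarity. A standalone sketch with hand-made toy vectors (the numbers are illustrative, not real model output):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" (illustrative values only).
cat = [0.9, 0.8, 0.1, 0.0]
kitten = [0.85, 0.75, 0.2, 0.05]
submarine = [0.0, 0.1, 0.9, 0.8]

print(cosine_similarity(cat, kitten))     # high: similar meaning
print(cosine_similarity(cat, submarine))  # low: unrelated meaning
```

Documents with the same essence score near 1.0; unrelated ones score much lower, which is what makes similarity search work.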
Literal Explanation
Embedding something turns it from image/text/audio into a list of numbers. This process makes documents “understandable” to a machine learning model.
Technical Definition
An embedding is the latent-space position of a document at a layer of a deep neural network. For models trained specifically to embed data, this is the last layer.
Example: Photo Search
If you search your photos for “famous bridge in San Francisco”, Chroma:
- Embeds the query text into a vector
- Compares it to the embeddings of your photos and their metadata
- Returns photos of the Golden Gate Bridge
Automatic Embedding
Chroma handles embeddings automatically. When you add documents without providing embeddings, Chroma will embed them for you.

Default Embedding Function
By default, Chroma uses the Sentence Transformers all-MiniLM-L6-v2 model via ONNX:
- Runs locally (no API calls)
- Produces 384-dimensional embeddings
- Works well for general-purpose text
Custom Embedding Functions
You can use custom embedding functions for better performance or domain-specific needs:

OpenAI Embeddings
Cohere Embeddings
Sentence Transformers
Providing Your Own Embeddings
You can provide pre-computed embeddings when adding data.

Embedding Types
Chroma supports different embedding formats:

Dense Vectors
Standard numerical vectors (the most common format).

Sparse Vectors
Efficient representation for high-dimensional sparse data. Sparse vectors must satisfy:
- Indices must be non-negative integers
- Indices must be sorted in ascending order
- No duplicate indices allowed
- indices, values, and labels (if provided) must have the same length
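The constraints above can be checked in a few lines. This is a standalone validator sketch for illustration; Chroma performs its own validation internally:

```python
def validate_sparse(indices, values, labels=None):
    # Indices must be non-negative integers.
    if any((not isinstance(i, int)) or i < 0 for i in indices):
        raise ValueError("indices must be non-negative integers")
    # Strictly ascending order rules out duplicates and unsorted input.
    if any(a >= b for a, b in zip(indices, indices[1:])):
        raise ValueError("indices must be sorted ascending with no duplicates")
    # indices, values, and labels (if provided) must have the same length.
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    if labels is not None and len(labels) != len(values):
        raise ValueError("labels must have the same length as values")

validate_sparse([0, 7, 42], [0.5, 1.2, 0.3])  # OK
# validate_sparse([7, 0], [1.0, 2.0])         # raises: not sorted
```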
Multi-Modal Embeddings
Chroma supports embedding different types of data:

Text Documents
Images
URIs (URLs or file paths)
Embedding Function Interface
Create custom embedding functions by implementing the EmbeddingFunction protocol:
- __call__(): Takes documents and returns embeddings
- name(): Returns a unique identifier
- get_config(): Returns serializable configuration
- build_from_config(): Reconstructs from configuration
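A toy class with that shape, kept free of chromadb imports so it runs standalone (a real implementation would subclass chromadb's EmbeddingFunction; the hashing "model" here is purely illustrative):

```python
import hashlib

class ToyEmbeddingFunction:
    """Illustrative embedding function following the protocol's shape."""

    def __init__(self, dim=8):
        self.dim = dim

    def __call__(self, input):
        # Takes documents, returns one fixed-size vector per document.
        # (A real function would call a model; hashing is a stand-in.)
        out = []
        for doc in input:
            digest = hashlib.sha256(doc.encode()).digest()
            out.append([b / 255 for b in digest[: self.dim]])
        return out

    def name(self):
        return "toy-hash-embedder"  # unique identifier

    def get_config(self):
        return {"dim": self.dim}  # serializable configuration

    @classmethod
    def build_from_config(cls, config):
        return cls(**config)  # reconstruct from configuration

ef = ToyEmbeddingFunction(dim=4)
vectors = ef(["hello", "world"])
print(len(vectors), len(vectors[0]))  # 2 4
```

get_config() and build_from_config() let Chroma persist the collection's embedding configuration and rebuild the function later.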
Query Embeddings vs Document Embeddings
Some models produce different embeddings for queries than for documents. Embedding functions for such models can implement embed_query(), which Chroma uses when you query with text.
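Models like E5 prepend different instruction prefixes to passages and queries, so the same string embeds differently depending on its role. A toy sketch of that split (the prefixes and hashing model are illustrative stand-ins, not a real Chroma API):

```python
import hashlib

def _hash_embed(text, dim=4):
    # Stand-in for a real model: deterministic bytes -> small vector.
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:dim]]

class AsymmetricEmbedder:
    def __call__(self, input):
        # Document path: used when adding data.
        return [_hash_embed("passage: " + doc) for doc in input]

    def embed_query(self, input):
        # Query path: same text, different prefix, different vector.
        return [_hash_embed("query: " + q) for q in input]

ef = AsymmetricEmbedder()
doc_vec = ef(["golden gate bridge"])[0]
query_vec = ef.embed_query(["golden gate bridge"])[0]
print(doc_vec == query_vec)  # False: same text, different role
```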
Best Practices
Match embedding dimensions
All embeddings in a collection must have the same dimensions:
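A quick standalone check makes the failure mode visible; Chroma itself rejects mismatched dimensions when you add data:

```python
def check_dimensions(embeddings):
    # All vectors in a collection must share one dimensionality.
    dims = {len(e) for e in embeddings}
    if len(dims) != 1:
        raise ValueError(f"mixed embedding dimensions: {sorted(dims)}")
    return dims.pop()

print(check_dimensions([[0.1, 0.2], [0.3, 0.4]]))  # 2
# check_dimensions([[0.1, 0.2], [0.3, 0.4, 0.5]])  # raises ValueError
```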
Choose appropriate models
Select embedding models based on your data:
- General text: all-MiniLM-L6-v2 (default)
- High quality: text-embedding-3-large (OpenAI)
- Multilingual: embed-multilingual-v3.0 (Cohere)
- Code: text-embedding-ada-002 (OpenAI)
Be consistent with embedding functions
Use the same embedding function for both adding and querying:
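Why this matters, sketched with two toy embedders: a query embedded by a different function than the documents lands in an unrelated region of the space, so nearest-neighbor results become meaningless (the hashing "models" below are illustrative):

```python
import hashlib
import math

def make_embedder(salt, dim=4):
    # Two different "models": same text, different vector spaces.
    def embed(text):
        digest = hashlib.sha256((salt + text).encode()).digest()
        return [b / 255 for b in digest[:dim]]
    return embed

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

ef_used_to_add = make_embedder("model-a")
ef_other = make_embedder("model-b")

doc_vec = ef_used_to_add("golden gate bridge")

# Same embedding function: the identical text lands exactly on the document.
print(euclidean(doc_vec, ef_used_to_add("golden gate bridge")))  # 0.0
# Different embedding function: the "same" query is far from the document.
print(euclidean(doc_vec, ef_other("golden gate bridge")) > 0)    # True
```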
Next Steps
Metadata
Learn about metadata and filtering
Querying
Query your embeddings with similarity search
OpenAI Embeddings
Use OpenAI embedding models
Embedding Functions
Learn about embedding functions