Overview
The create_embeddings.py script processes pre-chunked text data and generates embeddings using OpenAI’s embedding models. The generated embeddings are stored in a ChromaDB vector database for efficient similarity search and retrieval in RAG systems.
Location
Usage
What It Does
The script performs the following steps:
- Load Chunks: Reads pre-processed document chunks from data/chunks/chunks_final.json
- Initialize Embeddings Model: Creates an OpenAI embeddings instance using text-embedding-3-small
- Generate Embeddings: Processes all chunk contents and generates vector embeddings
- Store in ChromaDB: Persists embeddings and metadata to data/embeddings/chroma_db
- Verify Storage: Confirms the number of vectors stored in the database
Configuration
The script uses the following default configuration:

| Parameter | Value | Description |
|---|---|---|
| Chunks File | data/chunks/chunks_final.json | Input file containing document chunks |
| Database Directory | data/embeddings/chroma_db | Output directory for ChromaDB |
| Collection Name | guia_embarazo_parto | ChromaDB collection name |
| Embedding Model | text-embedding-3-small | OpenAI embedding model |
Requirements
Environment Variables
The script requires an OpenAI API key, set as OPENAI_API_KEY in a .env file.
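For illustration, such a file would typically hold a single line of the form OPENAI_API_KEY=<your key>. The sketch below shows one way to fail early when the key is absent. It uses only the standard library. The helper name `require_api_key` is hypothetical and not taken from the script, which presumably loads the file with python-dotenv before reading the variable.

```python
import os


def require_api_key(env=None):
    """Return the OpenAI API key, or raise if it is missing or blank.

    `require_api_key` is a hypothetical helper, not part of the script.
    `env` defaults to the process environment; pass a dict to test it.
    """
    env = os.environ if env is None else env
    key = (env.get("OPENAI_API_KEY") or "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not configured; add it to your .env file"
        )
    return key
```

Calling this before any embedding request turns a missing key into a clear error message rather than a failure deep inside the API client.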
Dependencies
- langchain-openai: For OpenAI embeddings
- langchain-community: For ChromaDB vector store
- python-dotenv: For environment variable management
Input Format
The input JSON file (chunks_final.json) should contain an array of chunk objects, each carrying the chunk’s content and its associated metadata.
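A minimal sketch of what such a file might hold, assuming each chunk is an object with `content` and `metadata` keys. The key names and the metadata fields shown (`page`, `section`) are assumptions inferred from the function descriptions on this page, not a confirmed schema. The small validator is likewise hypothetical.

```python
import json

# Hypothetical contents of chunks_final.json; key names are assumptions.
example_json = """
[
  {
    "content": "Text of the first chunk.",
    "metadata": {"page": 1, "section": "Introduction"}
  },
  {
    "content": "Text of the second chunk.",
    "metadata": {"page": 2, "section": "Introduction"}
  }
]
"""


def is_valid_chunk(chunk):
    """Check for a non-empty string `content` and a dict `metadata`."""
    return (
        isinstance(chunk, dict)
        and isinstance(chunk.get("content"), str)
        and chunk["content"].strip() != ""
        and isinstance(chunk.get("metadata"), dict)
    )


chunks = json.loads(example_json)
# Indices of chunks that do not match the assumed structure.
invalid = [i for i, c in enumerate(chunks) if not is_valid_chunk(c)]
print(f"{len(chunks)} chunks, {len(invalid)} invalid")
```

Running a check like this before embedding helps avoid spending API calls on malformed input.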
Output
The script creates a ChromaDB database at data/embeddings/chroma_db with:
- Vectors: Embeddings for each chunk’s content
- Metadata: Associated metadata for each chunk (page number, section info, etc.)
- Collection: Named guia_embarazo_parto
Example Output
Functions
load_chunks(file_path)
Loads chunk data from a JSON file.
Parameters:
file_path (Path): Path to the JSON file containing chunks
Returns:
list or None: List of chunk dictionaries if successful, None if failed
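A function with this contract could look roughly like the sketch below. It is not the script's actual source. The messages printed on failure and the extra check that the top-level JSON value is an array are assumptions added for illustration.

```python
import json
from pathlib import Path


def load_chunks(file_path):
    """Load chunk dictionaries from a JSON file; return None on failure."""
    try:
        with Path(file_path).open(encoding="utf-8") as f:
            data = json.load(f)
    except FileNotFoundError:
        print(f"Chunks file not found: {file_path}")
        return None
    except json.JSONDecodeError as exc:
        print(f"Invalid JSON in {file_path}: {exc}")
        return None
    # Assumption: the file must hold a JSON array of chunk objects.
    if not isinstance(data, list):
        print(f"Expected a JSON array in {file_path}")
        return None
    return data
```

Returning None instead of raising lets the caller decide whether to stop or report the problem, which matches the documented return type.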
create_and_store_embeddings(chunks_data)
Creates embeddings using OpenAI and stores them in ChromaDB.
Parameters:
chunks_data (list): List of chunk dictionaries with content and metadata
Process:
- Extracts content and metadata from chunks
- Initializes OpenAI embeddings model
- Creates ChromaDB database from texts
- Persists database to disk
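The steps above can be sketched as follows. This assumes the `Chroma` wrapper from langchain_community and `OpenAIEmbeddings` from langchain_openai, which match the listed dependencies; the script's real code may differ. The `split_chunks` helper, the constant names, and the `content` / `metadata` keys are illustrative assumptions. Relative paths are used here for brevity.

```python
from pathlib import Path

# Values taken from the configuration table on this page.
DB_DIR = Path("data/embeddings/chroma_db")
COLLECTION_NAME = "guia_embarazo_parto"
EMBEDDING_MODEL = "text-embedding-3-small"


def split_chunks(chunks_data):
    """Separate chunk dictionaries into parallel lists of texts and metadata.

    Assumes each chunk has a "content" key and an optional "metadata" key.
    """
    texts = [chunk["content"] for chunk in chunks_data]
    metadatas = [chunk.get("metadata", {}) for chunk in chunks_data]
    return texts, metadatas


def create_and_store_embeddings(chunks_data):
    """Embed the chunk texts and store them in a persistent Chroma collection."""
    # Imported lazily so split_chunks works without these packages installed.
    from langchain_community.vectorstores import Chroma
    from langchain_openai import OpenAIEmbeddings

    texts, metadatas = split_chunks(chunks_data)
    # OpenAIEmbeddings reads OPENAI_API_KEY from the environment.
    embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)
    return Chroma.from_texts(
        texts=texts,
        embedding=embeddings,
        metadatas=metadatas,
        collection_name=COLLECTION_NAME,
        persist_directory=str(DB_DIR),
    )
```

Keeping the text and metadata extraction in its own function makes that part easy to test without network access or an API key.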
main()
Main execution function that orchestrates the embedding creation process.
Error Handling
The script handles several error conditions:
- File Not Found: If chunks_final.json doesn’t exist
- Invalid JSON: If the chunks file contains malformed JSON
- API Key Missing: If OPENAI_API_KEY is not configured
- ChromaDB Errors: Issues during database creation or persistence
Notes
- The script builds absolute paths from its own location, so it behaves the same regardless of the current working directory
- All paths are resolved relative to the project root
- The embedding model text-embedding-3-small provides a good balance of quality and cost
- ChromaDB automatically persists data when using persist_directory
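The path handling described in the first two notes can be sketched as below. The function name `resolve_paths` and the `levels_up` parameter are hypothetical. How many directories separate the script from the project root depends on the repository layout, which this page does not state.

```python
from pathlib import Path


def resolve_paths(script_file, levels_up=1):
    """Build absolute input/output paths from a script's location.

    `levels_up` is the number of directories between the script's folder
    and the project root (an assumption; adjust to the real layout).
    """
    script_dir = Path(script_file).resolve().parent
    if levels_up > 0:
        project_root = script_dir.parents[levels_up - 1]
    else:
        project_root = script_dir
    return {
        "chunks_file": project_root / "data" / "chunks" / "chunks_final.json",
        "db_dir": project_root / "data" / "embeddings" / "chroma_db",
    }
```

Inside the script this would typically be called as `resolve_paths(__file__)`, so the same files are found whether the script is launched from the project root or from another directory.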
Related
- Chunking Process: How document chunks are created
- RAG Systems: Systems that use these embeddings for retrieval
