Introduction
Vector databases store and query high-dimensional embeddings efficiently, enabling:
Semantic search
Retrieval-Augmented Generation (RAG)
Recommendation systems
Similarity detection
Anomaly detection
Why Vector Databases?
Semantic Search: Find similar items based on meaning, not just keywords
RAG Systems: Retrieve relevant context for LLM prompts
Scalability: Efficiently search billions of vectors
Real-time: Low-latency queries for production systems
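To make "similar based on meaning" concrete, here is a minimal sketch of what a vector search computes: rank stored embeddings by cosine similarity to a query embedding. The three-dimensional vectors and the names `toy_embeddings` and `top_k` are invented for illustration; real embedding models produce hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, items, k=2):
    """Return the k (name, score) pairs most similar to the query vector."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy "embeddings" (assumption: not produced by any real model).
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

nearest = top_k([1.0, 0.0, 0.0], toy_embeddings, k=2)
print([name for name, _ in nearest])
```

A vector database performs the same ranking, but uses indexes to avoid comparing the query against every stored vector.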
LanceDB
LanceDB is an embedded vector database designed for AI applications.
Key Features
Embedded: No separate server required
Serverless: Works with cloud storage (S3, GCS)
Format: Built on the Lance columnar format
Versioned: Built-in versioning and time travel
Multi-modal: Store vectors, text, and images together
Installation
uv sync
# or
pip install lancedb sentence-transformers datasets
Building a RAG Application
Create a CLI application for semantic search over SQL questions.
Create Vector Database
vector-db/rag_cli_application.py
import random

import lancedb
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

MODEL_NAME = "paraphrase-MiniLM-L3-v2"


def create_new_vector_db(
    table_name: str = "my-rag-app",
    number_of_documents: int = 1000,
    uri=".lancedb",
):
    # Load dataset
    dataset = load_dataset("b-mc2/sql-create-context")
    docs = random.sample(list(dataset["train"]), k=number_of_documents)
    # Generate embeddings
    model = SentenceTransformer(MODEL_NAME)
    texts = [doc["question"] for doc in docs]
    embeddings = model.encode(texts, show_progress_bar=True)
    # Prepare data
    data = [
        {
            "id": idx,
            "text": texts[idx],
            "vector": embeddings[idx],
            "answer": docs[idx]["answer"],
            "question": docs[idx]["question"],
            "context": docs[idx]["context"],
        }
        for idx in range(len(texts))
    ]
    # Create database and index
    db = lancedb.connect(uri)
    lance_table = db.create_table(table_name, data=data)
    lance_table.create_index()
    print(f"Created table '{table_name}' with {number_of_documents} documents")
Components:
load_dataset: Load text-to-SQL dataset
SentenceTransformer: Generate embeddings
lancedb.connect: Create database connection
create_table: Store vectors and metadata
create_index: Build ANN index for fast search
Query Vector Database
vector-db/rag_cli_application.py
def query_existing_vector_db(
    query: str = "What was ARR last year?",
    table_name: str = "my-rag-app",
    top_n: int = 1,
    uri=".lancedb",
):
    # Encode query
    model = SentenceTransformer(MODEL_NAME)
    query_embedding = model.encode(query)
    # Open database
    db = lancedb.connect(uri)
    tbl = db.open_table(table_name)
    # Search
    results = tbl.search(query_embedding).limit(top_n).to_list()
    # Display results
    print("Search results:")
    for result in results:
        print("\n--- RESULT ---")
        print(f"Answer: {result['answer']}")
        print(f"Context: {result['context']}")
        print(f"Question: {result['question']}")
CLI Usage
Create database
python vector-db/rag_cli_application.py create-new-vector-db \
--table-name test \
--number-of-documents 300
Query database
python vector-db/rag_cli_application.py query-existing-vector-db \
--query "complex query" \
--table-name test
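The page does not show how the two functions are exposed as commands. One possible wiring is sketched below with the standard-library `argparse`; this is an assumption, and the original script may well use a different CLI library. The option names and defaults mirror the function signatures above; `build_parser` is a hypothetical helper.

```python
import argparse


def build_parser():
    """Build a parser with one sub-command per function (illustrative only)."""
    parser = argparse.ArgumentParser(prog="rag_cli_application.py")
    sub = parser.add_subparsers(dest="command", required=True)

    create = sub.add_parser("create-new-vector-db")
    create.add_argument("--table-name", default="my-rag-app")
    create.add_argument("--number-of-documents", type=int, default=1000)
    create.add_argument("--uri", default=".lancedb")

    query = sub.add_parser("query-existing-vector-db")
    query.add_argument("--query", default="What was ARR last year?")
    query.add_argument("--table-name", default="my-rag-app")
    query.add_argument("--top-n", type=int, default=1)
    query.add_argument("--uri", default=".lancedb")
    return parser


# Parse the same arguments as the "Create database" command above.
args = build_parser().parse_args(
    ["create-new-vector-db", "--table-name", "test", "--number-of-documents", "300"]
)
print(args.command, args.table_name, args.number_of_documents, args.uri)
```

In a real script, the parsed values would then be passed to `create_new_vector_db` or `query_existing_vector_db` depending on `args.command`.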
Architecture
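LanceDB stores each table as a directory in the Lance columnar format. A simplified, illustrative layout (actual file names vary by version):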
.lancedb/
├── my-rag-app.lance/
│ ├── data/
│ │ ├── chunk-0.lance
│ │ └── chunk-1.lance
│ ├── index/
│ │ └── ivf_pq.idx
│ └── manifest.json
Storage Diagram
Benefits:
Columnar storage for analytics
Efficient compression
Fast filtering on metadata
Version control built-in
Indexing
LanceDB supports multiple index types:
IVF-PQ (Inverted File with Product Quantization)
Best for:
Large datasets (>100k vectors)
Approximate nearest neighbor (ANN)
Trade-off: speed vs accuracy
table.create_index(
    metric="L2",
    num_partitions=256,
    num_sub_vectors=96,
)
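The effect of `num_sub_vectors` can be worked out with simple arithmetic. Product quantization splits each vector into equal-sized sub-vectors and stores one short code per sub-vector. The sketch below assumes float32 inputs and 8-bit codes (a common choice, but an assumption here); `pq_summary` is a hypothetical helper, and a real index also stores codebooks and partition data that are not counted.

```python
def pq_summary(dim, num_sub_vectors, bits_per_code=8):
    """Per-vector storage before and after product quantization."""
    if dim % num_sub_vectors != 0:
        raise ValueError("dim must be divisible by num_sub_vectors")
    raw_bytes = dim * 4  # float32 = 4 bytes per dimension
    code_bytes = num_sub_vectors * bits_per_code // 8
    return {
        "sub_vector_dim": dim // num_sub_vectors,
        "raw_bytes": raw_bytes,
        "code_bytes": code_bytes,
        "compression": raw_bytes / code_bytes,
    }


# 384 dimensions matches the MiniLM models used on this page.
summary = pq_summary(dim=384, num_sub_vectors=96)
print(summary)
```

The divisibility check matters in practice: 96 sub-vectors fit a 384-dimensional model, but a 1024-dimensional model would need a different value such as 64 or 128.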
Brute-force search
Best for:
Small datasets (fewer than 100k vectors)
Exact nearest neighbor
Maximum accuracy required
# No index needed; without one, the search is an exhaustive (flat) scan
results = table.search(query_embedding).limit(10).to_list()
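The speed-versus-accuracy trade-off between the two approaches can be seen in a toy inverted-file (IVF) index. This sketch is not LanceDB's implementation: the centroids are fixed by hand rather than learned with k-means, there is no quantization, and all names (`build_ivf`, `ivf_search`, `nprobes`) are illustrative.

```python
def l2_sq(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(query, vectors, k):
    """Exact search: compare the query against every vector."""
    return sorted(range(len(vectors)), key=lambda i: l2_sq(query, vectors[i]))[:k]


def build_ivf(vectors, centroids):
    """Assign each vector id to the partition of its nearest centroid."""
    partitions = {c: [] for c in range(len(centroids))}
    for i, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2_sq(vec, centroids[c]))
        partitions[nearest].append(i)
    return partitions


def ivf_search(query, vectors, centroids, partitions, k, nprobes=1):
    """Approximate search: scan only the nprobes partitions closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2_sq(query, centroids[c]))[:nprobes]
    candidates = [i for c in probe for i in partitions[c]]
    return sorted(candidates, key=lambda i: l2_sq(query, vectors[i]))[:k]


centroids = [[0.0, 0.0], [10.0, 0.0]]
vectors = [[1.0, 0.0], [2.0, 0.0], [4.9, 0.0], [5.2, 0.0], [9.0, 0.0], [8.0, 0.0]]
partitions = build_ivf(vectors, centroids)
query = [4.95, 0.0]

exact = flat_search(query, vectors, k=2)
approx_1 = ivf_search(query, vectors, centroids, partitions, k=2, nprobes=1)
approx_2 = ivf_search(query, vectors, centroids, partitions, k=2, nprobes=2)
print(exact, approx_1, approx_2)
```

With one probed partition the search misses a true neighbor that sits just across the partition boundary; probing both partitions recovers the exact result at the cost of scanning every vector.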
Advanced Queries
Filtering
Combine vector search with SQL-like filters:
# Vector search + metadata filter
# Combine conditions in one expression; a second .where() call may replace the first
results = (
    table
    .search(query_embedding)
    .where("category = 'product' AND price < 100")
    .limit(10)
    .to_list()
)
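Whether the filter runs before or after the vector search changes the result. The pure-Python sketch below illustrates the difference on invented rows; it is not LanceDB code, and which order a given LanceDB version applies by default should be checked in its documentation.

```python
def l2_sq(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical rows with metadata and 2-dimensional vectors.
rows = [
    {"id": 1, "category": "product", "price": 50, "vector": [0.0, 0.0]},
    {"id": 2, "category": "article", "price": 0, "vector": [0.1, 0.0]},
    {"id": 3, "category": "product", "price": 150, "vector": [0.2, 0.0]},
    {"id": 4, "category": "product", "price": 80, "vector": [5.0, 5.0]},
]


def matches(row):
    """Equivalent of: category = 'product' AND price < 100."""
    return row["category"] == "product" and row["price"] < 100


def prefilter_search(query, rows, k):
    """Filter first, then rank the surviving rows by distance."""
    kept = [r for r in rows if matches(r)]
    return [r["id"] for r in sorted(kept, key=lambda r: l2_sq(query, r["vector"]))[:k]]


def postfilter_search(query, rows, k):
    """Rank all rows, keep the top k, then drop rows that fail the filter."""
    top = sorted(rows, key=lambda r: l2_sq(query, r["vector"]))[:k]
    return [r["id"] for r in top if matches(r)]


query = [0.0, 0.0]
pre = prefilter_search(query, rows, k=2)
post = postfilter_search(query, rows, k=2)
print(pre, post)
```

Post-filtering can return fewer than `k` rows because non-matching neighbors occupy slots in the top `k`; pre-filtering always fills the result when enough rows match, but it has to evaluate the filter over more rows.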
Hybrid Search
Combine full-text and vector search:
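# Create a full-text search (FTS) index on the text column
table.create_fts_index("text")
# Hybrid query: pass the vector and the text query explicitly
# (builder style from recent LanceDB releases; check the API of your version)
results = (
    table
    .search(query_type="hybrid")
    .vector(query_embedding)
    .text("machine learning")
    .limit(10)
    .to_list()
)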
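A hybrid search has to merge two ranked lists, one from the vector index and one from the full-text index. One widely used merging method is reciprocal rank fusion (RRF), sketched below. Whether a given LanceDB version uses RRF by default is not stated in this page, so treat this as an illustration of the idea; the document ids and the constant `k=60` are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per id."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["d1", "d2", "d3"]   # hypothetical vector-search ranking
keyword_hits = ["d3", "d1", "d4"]  # hypothetical full-text ranking
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Documents that appear in both lists rise to the top, and RRF needs only ranks, so the two searches do not need comparable score scales.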
Reranking
Improve results with cross-encoder reranking:
from sentence_transformers import CrossEncoder

# Initial retrieval
results = table.search(query_embedding).limit(100).to_list()
# Rerank (query is the raw query string, not its embedding)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, r["text"]) for r in results])
# Sort by reranking scores
reranked = sorted(zip(results, scores), key=lambda x: x[1], reverse=True)
top_results = [r[0] for r in reranked[:10]]
Embedding Models
Model Selection
MiniLM: Fast, lightweight. 384 dims, ~80MB. Good for: high throughput
SBERT: Balanced. 768 dims, ~400MB. Good for: general purpose
BGE: High accuracy. 1024 dims, ~1GB. Good for: quality-critical applications
OpenAI: State of the art. 1536 dims, hosted API. Good for: best results
Embedding Code
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer("all-MiniLM-L6-v2")
# Single text
embedding = model.encode("example text")
# Batch encoding
texts = ["text 1", "text 2", "text 3"]
embeddings = model.encode(
    texts,
    batch_size=32,
    show_progress_bar=True,
    normalize_embeddings=True,  # L2 normalization
)
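The reason L2 normalization is useful: once vectors have unit length, their dot product equals their cosine similarity, so the cheaper dot product can be used for ranking. A small numeric check, using made-up two-dimensional vectors:

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


a = [3.0, 4.0]
b = [4.0, 3.0]

# Cosine similarity computed on the raw vectors.
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Plain dot product computed on the normalized vectors.
a_unit, b_unit = l2_normalize(a), l2_normalize(b)
dot_of_units = dot(a_unit, b_unit)

print(cosine, dot_of_units)
```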
Production RAG Pipeline
from typing import List, Dict

import lancedb
from sentence_transformers import SentenceTransformer
from openai import OpenAI


class RAGSystem:
    def __init__(self, db_path: str, table_name: str):
        self.db = lancedb.connect(db_path)
        self.table = self.db.open_table(table_name)
        self.encoder = SentenceTransformer("all-MiniLM-L6-v2")
        self.llm = OpenAI()

    def retrieve(self, query: str, top_k: int = 5) -> List[Dict]:
        """Retrieve relevant documents"""
        query_emb = self.encoder.encode(query)
        results = self.table.search(query_emb).limit(top_k).to_list()
        return results

    def generate(self, query: str, context: List[Dict]) -> str:
        """Generate answer using LLM"""
        context_str = "\n\n".join([doc["text"] for doc in context])
        prompt = f"""Answer the question based on the context below.

Context:
{context_str}

Question: {query}

Answer:"""
        response = self.llm.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    def query(self, question: str) -> Dict:
        """Full RAG pipeline"""
        # Retrieve
        context = self.retrieve(question)
        # Generate
        answer = self.generate(question, context)
        return {
            "answer": answer,
            "sources": context,
        }
Process multiple texts together:
# Slow
embeddings = [model.encode(text) for text in texts]
# Fast
embeddings = model.encode(texts, batch_size=32)
Batching typically gives a 5-10x speedup, depending on hardware and text length
Adjust index parameters:
table.create_index(
    metric="cosine",
    num_partitions=256,  # More = faster but less accurate
    num_sub_vectors=96,  # Fewer = smaller index but less accurate
    accelerator="cuda",  # Use GPU if available
)
Cache frequent queries:
from functools import lru_cache

@lru_cache(maxsize=1000)
def get_embedding(text: str):
    return model.encode(text)
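The caching behavior can be checked without loading a model. In the sketch below, `fake_encode` is a hypothetical stand-in for `model.encode` that counts how often it runs. The cached function returns a tuple, a design choice made here so that callers cannot mutate a shared cached value in place.

```python
from functools import lru_cache

calls = {"count": 0}


def fake_encode(text):
    """Stand-in for model.encode (assumption): returns a tiny deterministic vector."""
    calls["count"] += 1
    return [float(len(text)), float(sum(map(ord, text)) % 7)]


@lru_cache(maxsize=1000)
def get_embedding(text: str):
    # Tuples are immutable, so the cached value cannot be altered by callers.
    return tuple(fake_encode(text))


first = get_embedding("hello")
second = get_embedding("hello")  # served from the cache, encoder not called
info = get_embedding.cache_info()
print(calls["count"], info.hits, info.misses)
```

The cache key is the exact string, so "Hello" and "hello" are cached separately; normalizing queries before the lookup raises the hit rate.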
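Reduce embedding precision:
import numpy as np
# FP32 -> FP16 (halves memory)
embeddings_fp16 = embeddings.astype(np.float16)
# Or use int8 (assumes L2-normalized embeddings, so every value lies in [-1, 1])
embeddings_int8 = np.round(embeddings * 127).astype(np.int8)
FP16 gives a 50% memory reduction with minimal accuracy loss; int8 gives 75% at a somewhat higher accuracy cost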
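The int8 scheme above can be traced by hand on a small unit-length vector. This pure-Python sketch uses the same scale factor of 127 and assumes inputs in [-1, 1]; `quantize_int8` and `dequantize_int8` are illustrative names, and the 384-dimension figure matches the MiniLM models used on this page.

```python
def quantize_int8(vec, scale=127):
    """Map floats in [-1, 1] to integers in [-127, 127]."""
    return [max(-127, min(127, round(x * scale))) for x in vec]


def dequantize_int8(codes, scale=127):
    """Approximate reconstruction of the original floats."""
    return [c / scale for c in codes]


vec = [0.6, -0.8, 0.0]  # already unit length
codes = quantize_int8(vec)
restored = dequantize_int8(codes)
max_error = max(abs(x - y) for x, y in zip(vec, restored))

# Storage per 384-dimensional vector at each precision, in bytes.
bytes_per_vector = {"fp32": 4 * 384, "fp16": 2 * 384, "int8": 1 * 384}
print(codes, max_error, bytes_per_vector)
```

The rounding error per component is bounded by half a quantization step, 0.5 / 127, which is why ranking quality usually degrades only slightly.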
Alternatives
Chroma
Weaviate
Pinecone
Qdrant
# Chroma
import chromadb

client = chromadb.Client()
collection = client.create_collection("docs")
collection.add(
    documents=["doc1", "doc2"],
    embeddings=[[1, 2, 3], [4, 5, 6]],
    ids=["id1", "id2"],
)
results = collection.query(
    query_embeddings=[[1, 2, 3]],
    n_results=2,  # at most the number of stored documents
)
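# Weaviate (weaviate-client v3 API; v4 replaces weaviate.Client with weaviate.connect_to_local())
import weaviate

client = weaviate.Client("http://localhost:8080")
client.data_object.create(
    data_object={"text": "example"},
    class_name="Document",
)
# near-text search requires a text vectorizer module enabled on the server
results = client.query.get(
    "Document", ["text"]
).with_near_text({"concepts": ["query"]}).do()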
# Pinecone (client v3+; older releases used pinecone.init(api_key=...))
from pinecone import Pinecone

pc = Pinecone(api_key="...")
index = pc.Index("my-index")
index.upsert([
    ("id1", [1, 2, 3], {"text": "doc1"}),
    ("id2", [4, 5, 6], {"text": "doc2"}),
])
results = index.query(
    vector=[1, 2, 3],
    top_k=10,
)
# Qdrant (assumes a collection "docs" with 3-dimensional vectors already exists)
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient("localhost", port=6333)
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=1, vector=[1, 2, 3], payload={"text": "doc1"}),
        PointStruct(id=2, vector=[4, 5, 6], payload={"text": "doc2"}),
    ],
)
results = client.search(
    collection_name="docs",
    query_vector=[1, 2, 3],
    limit=10,
)
Best Practices
Chunk Size
Target 200-500 tokens per chunk
Use semantic chunking
Maintain context overlap
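A minimal sketch of fixed-size chunking with overlap follows. It splits on whole words as a stand-in for tokens and uses plain sliding windows, so it illustrates the overlap idea only, not semantic chunking. The function name and the small sizes are chosen for readability; the guidance above targets 200-500 tokens per chunk.

```python
def chunk_words(words, chunk_size, overlap):
    """Split a word list into windows of chunk_size that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


words = [f"w{i}" for i in range(10)]
chunks = chunk_words(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk starts with the last word of the previous one, so a sentence that straddles a boundary is still partly visible in both chunks.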
Metadata
Store source, date, author
Enable filtering by metadata
Index filterable fields
Monitoring
Track query latency
Monitor recall quality
Log user feedback
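Recall quality for an approximate index can be measured by comparing its results with the exact results of a flat scan over the same queries. The sketch below computes recall@k for a single query; the document ids are invented, and in practice the value would be averaged over a sample of real queries.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the first k retrieved ids."""
    if not relevant:
        raise ValueError("relevant must not be empty")
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


approx = ["d1", "d7", "d3", "d9"]  # hypothetical ANN index results, best first
exact = ["d1", "d3", "d5"]         # hypothetical exact top-3 from a flat scan

score = recall_at_k(approx, exact, k=3)
print(round(score, 3))
```

A drop in this number after changing index parameters or the embedding model is a signal to retune before users notice worse results.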
Versioning
Version embeddings
Track model changes
Enable rollback