Vector Databases - ML in Production Practice

Introduction

Vector databases store and query high-dimensional embeddings efficiently, enabling:

Semantic search
Retrieval-Augmented Generation (RAG)
Recommendation systems
Similarity detection
Anomaly detection

Why Vector Databases?

Semantic Search

Find similar items based on meaning, not just keywords

RAG Systems

Retrieve relevant context for LLM prompts

Scalability

Efficiently search billions of vectors

Real-time

Low-latency queries for production systems
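
To make "similar by meaning" concrete: embeddings are usually compared with a distance or similarity measure such as cosine similarity. The sketch below is illustrative only. It uses hand-made 3-dimensional vectors (hypothetical values, not real model output) and a brute-force scan; a vector database performs the same kind of ranking at scale, typically with an index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; real models produce hundreds of dimensions
documents = {
    "how to reset a password": [0.9, 0.1, 0.0],
    "forgot my login credentials": [0.8, 0.3, 0.1],
    "best pizza in town": [0.0, 0.2, 0.9],
}

def search(query_vector, top_n=2):
    """Rank every document by similarity to the query and keep the best top_n."""
    scored = [
        (cosine_similarity(query_vector, vec), text)
        for text, vec in documents.items()
    ]
    scored.sort(reverse=True)
    return [text for _, text in scored[:top_n]]

# A query vector that points in roughly the same direction as the two login-related texts
query = [0.85, 0.2, 0.05]
print(search(query))
```

The two login-related texts rank ahead of the unrelated one even though they share no keywords, which is the behavior semantic search relies on.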

LanceDB

LanceDB is an embedded vector database designed for AI applications.

Key Features

Embedded: No separate server required
Serverless: Works with cloud storage (S3, GCS)
Format: Built on Lance columnar format
Versioned: Built-in versioning and time travel
Multi-modal: Store vectors, text, images together

Installation

uv sync
# or
pip install lancedb sentence-transformers datasets

Building a RAG Application

Create a CLI application for semantic search over SQL questions.

Create Vector Database

vector-db/rag_cli_application.py

import random
import lancedb
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

MODEL_NAME = "paraphrase-MiniLM-L3-v2"

def create_new_vector_db(
    table_name: str = "my-rag-app",
    number_of_documents: int = 1000,
    uri=".lancedb"
):
    # Load dataset
    dataset = load_dataset("b-mc2/sql-create-context")
    docs = random.sample(list(dataset["train"]), k=number_of_documents)
    
    # Generate embeddings
    model = SentenceTransformer(MODEL_NAME)
    texts = [doc["question"] for doc in docs]
    embeddings = model.encode(texts, show_progress_bar=True)
    
    # Prepare data
    data = [
        {
            "id": idx,
            "text": texts[idx],
            "vector": embeddings[idx],
            "answer": docs[idx]["answer"],
            "question": docs[idx]["question"],
            "context": docs[idx]["context"],
        }
        for idx in range(len(texts))
    ]
    
    # Create database and index
    db = lancedb.connect(uri)
    lance_table = db.create_table(table_name, data=data)
    lance_table.create_index()
    
    print(f"Created table '{table_name}' with {number_of_documents} documents")

Components:

load_dataset: Load text-to-SQL dataset
SentenceTransformer: Generate embeddings
lancedb.connect: Create database connection
create_table: Store vectors and metadata
create_index: Build ANN index for fast search

Query Vector Database

vector-db/rag_cli_application.py

def query_existing_vector_db(
    query: str = "What was ARR last year?",
    table_name: str = "my-rag-app",
    top_n: int = 1,
    uri=".lancedb"
):
    # Encode query
    model = SentenceTransformer(MODEL_NAME)
    query_embedding = model.encode(query)
    
    # Open database
    db = lancedb.connect(uri)
    tbl = db.open_table(table_name)
    
    # Search
    results = tbl.search(query_embedding).limit(top_n).to_list()
    
    # Display results
    print("Search results:")
    for result in results:
        print("\n--- RESULT ---")
        print(f"Answer: {result['answer']}")
        print(f"Context: {result['context']}")
        print(f"Question: {result['question']}")

CLI Usage

Create database

python vector-db/rag_cli_application.py create-new-vector-db \
  --table-name test \
  --number-of-documents 300

Query database

python vector-db/rag_cli_application.py query-existing-vector-db \
  --query "complex query" \
  --table-name test
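
The commands above imply two subcommands whose flags mirror the function parameters, but the source does not show how the CLI is wired. The sketch below is one possible approach using the standard-library argparse module; the stub functions stand in for the real ones, and the `--top-n` and `--uri` flags are assumptions derived from the function signatures.

```python
import argparse

# Stubs standing in for the real functions defined earlier (assumption)
def create_new_vector_db(table_name, number_of_documents, uri):
    return f"create {table_name} ({number_of_documents} docs) at {uri}"

def query_existing_vector_db(query, table_name, top_n, uri):
    return f"query {table_name!r} for {query!r} (top {top_n}) at {uri}"

def build_parser():
    parser = argparse.ArgumentParser(prog="rag_cli_application.py")
    sub = parser.add_subparsers(dest="command", required=True)

    create = sub.add_parser("create-new-vector-db")
    create.add_argument("--table-name", default="my-rag-app")
    create.add_argument("--number-of-documents", type=int, default=1000)
    create.add_argument("--uri", default=".lancedb")

    query = sub.add_parser("query-existing-vector-db")
    query.add_argument("--query", default="What was ARR last year?")
    query.add_argument("--table-name", default="my-rag-app")
    query.add_argument("--top-n", type=int, default=1)
    query.add_argument("--uri", default=".lancedb")
    return parser

def main(argv):
    """Parse argv (a list of strings) and dispatch to the matching function."""
    args = build_parser().parse_args(argv)
    if args.command == "create-new-vector-db":
        return create_new_vector_db(args.table_name, args.number_of_documents, args.uri)
    return query_existing_vector_db(args.query, args.table_name, args.top_n, args.uri)

print(main(["create-new-vector-db", "--table-name", "test", "--number-of-documents", "300"]))
```

In a real script, `main(sys.argv[1:])` would be called from the entry point instead of a hard-coded argument list.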

Architecture

Storage Format

LanceDB uses the Lance columnar format:

.lancedb/
├── my-rag-app.lance/
│   ├── data/
│   │   ├── chunk-0.lance
│   │   └── chunk-1.lance
│   ├── index/
│   │   └── ivf_pq.idx
│   └── manifest.json

Benefits:

Columnar storage for analytics
Efficient compression
Fast filtering on metadata
Version control built-in

Indexing

LanceDB supports multiple index types:

IVF-PQ

Inverted File with Product Quantization. Best for:

Large datasets (>100k vectors)
Approximate nearest neighbor (ANN)
Trade-off: speed vs accuracy

table.create_index(
    metric="L2",
    num_partitions=256,
    num_sub_vectors=96
)

Flat

Brute-force search. Best for:

Small datasets (less than 100k vectors)
Exact nearest neighbor
Maximum accuracy required

# No index needed, uses flat search
results = table.search(query).limit(10)
Advanced Queries

Filtering

Combine vector search with SQL-like filters:

# Vector search + metadata filter
results = (
    table
    .search(query_embedding)
    .where("category = 'product' AND price < 100")
    .limit(10)
    .to_list()
)
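
Whether a filter is applied before or after the nearest-neighbor step changes the results. The sketch below is a plain-Python illustration with made-up rows, not LanceDB code; the `prefilter` flag here is a parameter of this toy function only.

```python
def vector_search(rows, query, k, predicate=None, prefilter=True):
    """Toy nearest-neighbor search with an optional metadata predicate."""
    def dist(row):
        return sum((a - b) ** 2 for a, b in zip(row["vector"], query))

    if predicate is None:
        return sorted(rows, key=dist)[:k]
    if prefilter:
        # Filter first, then rank: returns up to k matching rows
        return sorted((r for r in rows if predicate(r)), key=dist)[:k]
    # Rank first, then filter: may return fewer than k rows
    top = sorted(rows, key=dist)[:k]
    return [r for r in top if predicate(r)]

# Hypothetical rows for illustration
rows = [
    {"id": 1, "vector": [0.0, 0.1], "category": "service", "price": 20},
    {"id": 2, "vector": [0.1, 0.0], "category": "product", "price": 250},
    {"id": 3, "vector": [0.5, 0.5], "category": "product", "price": 40},
    {"id": 4, "vector": [0.9, 0.8], "category": "product", "price": 60},
]

def cheap_product(row):
    return row["category"] == "product" and row["price"] < 100

query = [0.0, 0.0]
pre = vector_search(rows, query, k=2, predicate=cheap_product, prefilter=True)
post = vector_search(rows, query, k=2, predicate=cheap_product, prefilter=False)
print("prefilter: ", [r["id"] for r in pre])
print("postfilter:", [r["id"] for r in post])
```

Post-filtering returns nothing here because the two nearest rows both fail the predicate, so it is worth checking which mode a given library and version uses by default.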

Hybrid Search

Combine full-text and vector search:

# Create FTS index
table.create_fts_index("text")

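# Hybrid query: combine a vector query with a keyword query
# (assumes a LanceDB release that supports query_type="hybrid")
results = (
    table
    .search(query_type="hybrid")
    .vector(query_embedding)
    .text("machine learning")
    .limit(10)
    .to_list()
)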
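
Hybrid search has to merge two ranked lists, one from vector search and one from full-text search. One common way to do this is reciprocal rank fusion (RRF); the sketch below shows the idea with hypothetical document ids and is not a claim about which fusion method any particular LanceDB version uses.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical result lists, best match first
vector_ranking = ["d3", "d1", "d2"]
keyword_ranking = ["d1", "d4", "d3"]

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)
```

Documents that appear near the top of both lists rise to the top of the fused list, while documents found by only one method are kept but ranked lower.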

Reranking

Improve results with cross-encoder reranking:

from sentence_transformers import CrossEncoder

# Initial retrieval
results = table.search(query_embedding).limit(100).to_list()

# Rerank
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = reranker.predict([(query, r['text']) for r in results])

# Sort by reranking scores
reranked = sorted(zip(results, scores), key=lambda x: x[1], reverse=True)
top_results = [r[0] for r in reranked[:10]]

Embedding Models

Model Selection

MiniLM

Fast, lightweight
384 dims, ~80MB
Good for: High throughput

SBERT

Balanced
768 dims, ~400MB
Good for: General purpose

BGE

High accuracy
1024 dims, ~1GB
Good for: Quality-critical

OpenAI

State-of-the-art
1536 dims, API
Good for: Best results
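
Embedding dimension also drives storage cost. As a rough back-of-the-envelope sketch, the raw size of uncompressed float32 vectors is the number of vectors times the dimension times 4 bytes; index overhead and metadata are not included, and the helper name is hypothetical.

```python
def raw_vector_size_gb(num_vectors, dims, bytes_per_value=4):
    """Size of the raw vectors alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_vectors * dims * bytes_per_value / 1e9

# Dimensions taken from the model overview above
for name, dims in [("MiniLM", 384), ("SBERT", 768), ("BGE", 1024), ("OpenAI", 1536)]:
    size = raw_vector_size_gb(1_000_000, dims)
    print(f"{name}: {size:.2f} GB per million float32 vectors")
```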

Embedding Code

from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Single text
embedding = model.encode("example text")

# Batch encoding
texts = ["text 1", "text 2", "text 3"]
embeddings = model.encode(
    texts,
    batch_size=32,
    show_progress_bar=True,
    normalize_embeddings=True  # L2 normalization
)
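
The reason L2 normalization is useful: once vectors have unit length, their dot product equals their cosine similarity, so a cheaper dot-product metric gives the same ranking. A minimal plain-Python check with made-up vectors:

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [1.0, 2.0]
na, nb = l2_normalize(a), l2_normalize(b)

# Dot product of the normalized vectors matches cosine similarity of the originals
print(math.isclose(dot(na, nb), cosine(a, b)))
```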

Production RAG Pipeline

from typing import List, Dict
import lancedb
from sentence_transformers import SentenceTransformer
from openai import OpenAI

class RAGSystem:
    def __init__(self, db_path: str, table_name: str):
        self.db = lancedb.connect(db_path)
        self.table = self.db.open_table(table_name)
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.llm = OpenAI()
    
    def retrieve(self, query: str, top_k: int = 5) -> List[Dict]:
        """Retrieve relevant documents"""
        query_emb = self.encoder.encode(query)
        results = self.table.search(query_emb).limit(top_k).to_list()
        return results
    
    def generate(self, query: str, context: List[Dict]) -> str:
        """Generate answer using LLM"""
        context_str = "\n\n".join([doc['text'] for doc in context])
        
        prompt = f"""Answer the question based on the context below.
        
Context:
{context_str}

Question: {query}

Answer:"""
        
        response = self.llm.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
        
        return response.choices[0].message.content
    
    def query(self, question: str) -> Dict:
        """Full RAG pipeline"""
        # Retrieve
        context = self.retrieve(question)
        
        # Generate
        answer = self.generate(question, context)
        
        return {
            "answer": answer,
            "sources": context
        }

Performance Optimization

Batch Encoding

Process multiple texts together:

# Slow
embeddings = [model.encode(text) for text in texts]

# Fast
embeddings = model.encode(texts, batch_size=32)

Batching typically gives a 5-10x speedup, though the exact gain depends on hardware, model, and text length
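
What `batch_size` does can be sketched without a model: the texts are split into fixed-size groups and each group is encoded in one call. In the sketch below `fake_encode_batch` is a placeholder for a real model call, used only to count invocations.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def fake_encode_batch(texts):
    # Placeholder for a real model call (assumption); returns one vector per text
    return [[float(len(t))] for t in texts]

texts = [f"text {i}" for i in range(70)]

embeddings = []
calls = 0
for batch in batched(texts, 32):
    embeddings.extend(fake_encode_batch(batch))
    calls += 1

# 70 texts with batch_size=32 need 3 model calls instead of 70
print(calls, len(embeddings))
```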

Index Tuning

Adjust index parameters:

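table.create_index(
    metric="cosine",
    num_partitions=256,  # More = fewer vectors scanned per probe (faster), but lower recall unless more partitions are probed
    num_sub_vectors=96,  # More = larger index but more accurate; must divide the vector dimension
    accelerator="cuda"   # Train the index on GPU if available
)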

Caching

Cache frequent queries:

from functools import lru_cache

@lru_cache(maxsize=1000)
def get_embedding(text: str):
    return model.encode(text)
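
The effect of the cache can be observed with `cache_info()`. The sketch below replaces the model with a counting stub (`fake_encode` is hypothetical) and returns a tuple, since a cached mutable array would be shared between all callers.

```python
from functools import lru_cache

calls = {"encode": 0}

def fake_encode(text):
    # Stand-in for model.encode (assumption); counts how often it runs
    calls["encode"] += 1
    return [float(ord(c)) for c in text[:4]]

@lru_cache(maxsize=1000)
def get_embedding(text: str):
    # Return an immutable tuple so callers cannot modify the cached value
    return tuple(fake_encode(text))

get_embedding("hello")
get_embedding("hello")  # served from the cache
get_embedding("world")

info = get_embedding.cache_info()
print(info.hits, info.misses, calls["encode"])
```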

Quantization

Reduce embedding precision:

import numpy as np

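# FP32 -> FP16 (halves memory)
embeddings_fp16 = embeddings.astype(np.float16)

# Or use int8 (a quarter of FP32 memory); assumes the embeddings are
# L2-normalized so that every value lies in [-1, 1]
embeddings_int8 = np.round(embeddings * 127).astype(np.int8)

FP16 gives a 50% memory reduction and int8 a 75% reduction, usually with minimal accuracy loss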
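
The precision cost of int8 scaling can be checked with a round trip. This plain-Python sketch assumes values in [-1, 1] (for example, L2-normalized embeddings) and rounds to the nearest integer, which bounds the per-value error by 0.5/127; the helper names are hypothetical.

```python
def quantize_int8(vec):
    """Map floats in [-1, 1] to integers in [-127, 127]."""
    return [max(-127, min(127, round(x * 127))) for x in vec]

def dequantize_int8(quantized):
    """Map the integers back to approximate floats."""
    return [v / 127 for v in quantized]

vec = [0.6, -0.8, 0.0]  # already unit length
quantized = quantize_int8(vec)
restored = dequantize_int8(quantized)

max_error = max(abs(a - b) for a, b in zip(vec, restored))
print(quantized, max_error)
```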

Alternatives

Chroma
Weaviate
Pinecone
Qdrant

Chroma

import chromadb

client = chromadb.Client()
collection = client.create_collection("docs")

collection.add(
    documents=["doc1", "doc2"],
    embeddings=[[1,2,3], [4,5,6]],
    ids=["id1", "id2"]
)

results = collection.query(
    query_embeddings=[[1,2,3]],
    n_results=10
)

Weaviate

import weaviate

client = weaviate.Client("http://localhost:8080")

client.data_object.create(
    data_object={"text": "example"},
    class_name="Document"
)

results = client.query.get(
    "Document", ["text"]
).with_near_text({"concepts": ["query"]}).do()

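Pinecone

from pinecone import Pinecone

# Client-object API; the older module-level pinecone.init() was removed in v3 of the client
pc = Pinecone(api_key="...")
index = pc.Index("my-index")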

index.upsert([
    ("id1", [1,2,3], {"text": "doc1"}),
    ("id2", [4,5,6], {"text": "doc2"})
])

results = index.query(
    vector=[1,2,3],
    top_k=10
)

Qdrant

from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient("localhost", port=6333)

client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=1, vector=[1,2,3], payload={"text": "doc1"}),
        PointStruct(id=2, vector=[4,5,6], payload={"text": "doc2"})
    ]
)

results = client.search(
    collection_name="docs",
    query_vector=[1,2,3],
    limit=10
)

Best Practices

Chunk Size

Target 200-500 tokens per chunk
Use semantic chunking
Maintain context overlap
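
A minimal sketch of fixed-size chunking with overlap follows. It counts whitespace-separated words as a rough proxy for tokens (an assumption; a real pipeline would use the embedding model's tokenizer), and it does not attempt semantic chunking.

```python
def chunk_words(text, chunk_size=300, overlap=50):
    """Split text into chunks of chunk_size words, sharing overlap words between neighbors."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

# Small numbers make the overlap easy to see
text = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(text, chunk_size=4, overlap=1)
print(chunks)
```

Each chunk starts with the last word of the previous one, so a sentence that straddles a boundary keeps some of its context in both chunks.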

Metadata

Store source, date, author
Enable filtering by metadata
Index filterable fields

Monitoring

Track query latency
Monitor recall quality
Log user feedback
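
Recall quality is commonly tracked as recall@k: the fraction of the true nearest neighbors (for example, from an exact flat search) that the approximate index also returns. A minimal sketch with hypothetical result ids:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant ids that appear in the first k retrieved ids."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

exact_ids = [4, 2, 3]    # ground truth, e.g. from brute-force search
approx_ids = [2, 3, 7]   # what the approximate index returned

recall = recall_at_k(approx_ids, exact_ids, k=3)
print(f"recall@3 = {recall:.2f}")
```

Computing this periodically on a sample of production queries shows whether index parameters or data growth are degrading result quality.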

Versioning

Version embeddings
Track model changes
Enable rollback

Next Steps

Learn about Data Labeling with Argilla
Complete Practice Tasks to apply your knowledge
