Production RAG pipelines need cost controls. VectorDB provides cost-optimized strategies that reduce API calls, minimize token usage, and enable budget-aware retrieval without sacrificing quality.
Cost breakdown
Typical per-query costs for a RAG pipeline (example rates):
Query embedding: $0.0001 (API-based embedder)
Sparse embedding: $0.0000 (local TF-IDF)
Vector search: $0.0000 (included in DB pricing)
Reranking (API): $0.0020 (per 1000 docs)
LLM generation: $0.0050 (per 1000 tokens)
─────────────────────────────
Total per query: ~$0.0071
At 1M queries/month: $7,100/month
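The arithmetic behind these figures can be reproduced with a short sketch. The rates are the example figures from the breakdown above; the dictionary keys and the `per_query_cost` and `monthly_cost` helpers are illustrative names, not part of VectorDB.

```python
# Example per-query rates copied from the breakdown above (USD).
PER_QUERY_COSTS = {
    "query_embedding": 0.0001,
    "sparse_embedding": 0.0,
    "vector_search": 0.0,
    "reranking": 0.0020,   # assumes about 1000 docs reranked per query
    "generation": 0.0050,  # assumes about 1000 tokens generated per query
}


def per_query_cost(costs: dict[str, float]) -> float:
    """Sum the component costs of a single query."""
    return sum(costs.values())


def monthly_cost(costs: dict[str, float], queries_per_month: int) -> float:
    """Scale the per-query cost to a monthly query volume."""
    return per_query_cost(costs) * queries_per_month


if __name__ == "__main__":
    print(f"Per query: ${per_query_cost(PER_QUERY_COSTS):.4f}")
    print(f"Per month: ${monthly_cost(PER_QUERY_COSTS, 1_000_000):,.0f}")
```

Swapping in real provider rates and measured volumes turns the same helper into a quick what-if calculator for each strategy below.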
Cost-optimized strategies
1. Hybrid search with local sparse embeddings
Use local TF-IDF for sparse embeddings instead of API-based models:
Haystack
# Standard approach (API cost for both dense and sparse)
from vectordb.haystack.hybrid_indexing import MilvusHybridSearchPipeline

pipeline = MilvusHybridSearchPipeline("configs/milvus_triviaqa.yaml")
result = pipeline.run(query="machine learning", top_k=10)
Savings:
Dense embedding: $0.0001 (API)
Sparse embedding: $0.0000 (local)
50% reduction in embedding costs compared with calling an API for both embeddings ($0.0002 down to $0.0001 per query)
2. Optional generation
Allow retrieval-only mode to skip LLM generation when not needed:
from vectordb.langchain.cost_optimized_rag.search import (
    ChromaCostOptimizedRAGSearchPipeline
)

pipeline = ChromaCostOptimizedRAGSearchPipeline("config.yaml")
query = "photosynthesis"

# Retrieval only (no LLM cost)
result = pipeline.search(query=query, top_k=5)
print(result["documents"])  # Just documents, no answer

# With generation (LLM cost incurred).
# `user_needs_answer` and `llm` are placeholders supplied by the application.
if user_needs_answer:
    answer = llm.generate(query, result["documents"])
Savings:
Skip generation for queries where users only browse documents (60% of queries in this example)
60% reduction in generation costs under that assumption
3. Batch processing
Embed and search multiple queries in a single batch:
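from vectordb.langchain.utils import ConfigLoader, EmbedderHelper

# Build the embedder from the pipeline configuration
config = ConfigLoader.load("config.yaml")
embedder = EmbedderHelper.create_embedder(config)

# Batch embedding (lower API cost)
queries = [
    "What is AI?",
    "Explain neural networks",
    "How does backpropagation work?"
]

# Single API call for all queries
embeddings = embedder.embed_documents(queries)

# Search with each precomputed embedding.
# `vector_db` is a placeholder for an already-initialized vector database client.
results = [
    vector_db.search(emb, top_k=10)
    for emb in embeddings
]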
Savings:
Batch embedding: 70% cheaper than individual calls
Reduced API overhead
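Embedding APIs usually cap the number of inputs per request, so large query sets need to be split into fixed-size batches. A minimal sketch, assuming a hypothetical `embed_fn` callable that takes a list of strings and returns one vector per string (for example, an embedder's `embed_documents` method). The `chunked` and `embed_in_batches` helpers and the default `batch_size=64` are illustrative choices, not part of VectorDB.

```python
from typing import Callable, Iterator, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 64,
) -> list[list[float]]:
    """Embed `texts` with one `embed_fn` call per batch, preserving input order."""
    vectors: list[list[float]] = []
    for batch in chunked(texts, batch_size):
        vectors.extend(embed_fn(list(batch)))
    return vectors
```

The number of API requests drops from one per text to one per batch, which is where the overhead savings come from.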
4. Result caching
Cache frequent queries to avoid repeated searches:
from collections import OrderedDict

class CachedRAGPipeline:
    def __init__(self, pipeline, maxsize=1000):
        self.pipeline = pipeline
        self.maxsize = maxsize
        self.cache = OrderedDict()  # insertion order doubles as recency order

    def search(self, query: str, top_k: int = 10):
        cache_key = (query, top_k)
        if cache_key in self.cache:
            self.cache.move_to_end(cache_key)  # mark as most recently used
            return self.cache[cache_key]
        result = self.pipeline.search(query, top_k)
        self.cache[cache_key] = result
        if len(self.cache) > self.maxsize:
            self.cache.popitem(last=False)  # evict the least recently used entry
        return result
Savings:
30-40% cache hit rate for typical applications
30-40% reduction in total costs
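An in-memory cache can also expire entries so that stale results are not served indefinitely. A minimal sketch of a time-to-live (TTL) cache, assuming results may be reused for a fixed number of seconds. The `TTLCache` class, its `clock` parameter, and the `cached_search` helper are illustrative and not part of VectorDB; `pipeline` stands for any object with a `search(query, top_k)` method.

```python
import time
from typing import Any, Callable, Hashable


class TTLCache:
    """Dictionary-backed cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: dict[Hashable, tuple[float, Any]] = {}

    def get(self, key: Hashable, default: Any = None) -> Any:
        entry = self._store.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # drop the expired entry lazily on read
            return default
        return value

    def set(self, key: Hashable, value: Any) -> None:
        self._store[key] = (self.clock() + self.ttl_seconds, value)


def cached_search(pipeline, cache: TTLCache, query: str, top_k: int = 10):
    """Return a fresh cached result if present, otherwise search and store it."""
    key = (query, top_k)
    result = cache.get(key)
    if result is None:
        result = pipeline.search(query, top_k)
        cache.set(key, result)
    return result
```

A shorter TTL keeps results closer to the current index contents at the price of a lower hit rate, so the value is worth tuning per application.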
5. Pre-filtering before retrieval
Narrow the search space with metadata filters so that fewer documents are retrieved and reranked:
# Unfiltered search (processes all documents)
results = pipeline.search("machine learning", top_k=100)

# Pre-filtered search (smaller search space)
results = pipeline.search(
    "machine learning",
    top_k=10,
    filters={
        "category": "technology",
        "date": {"$gte": "2023-01-01"}
    }
)
Savings:
Smaller result sets reduce reranking costs
Faster searches reduce compute costs
Configuration
pinecone:
  api_key: ${PINECONE_API_KEY}
  index_name: cost-optimized
  namespace: default
  dimension: 384
embedding:
  provider: sentence_transformers  # Local, zero cost
  model: all-MiniLM-L6-v2
search:
  rrf_k: 60
  cache_enabled: true
  cache_ttl: 3600  # 1 hour
llm:
  provider: groq  # Cost-effective generation
  model: llama-3.3-70b-versatile
  api_key: ${GROQ_API_KEY}
  optional: true  # Allow skipping generation
reranking:
  enabled: false  # Disable for cost savings
  # Or use local cross-encoder:
  # enabled: true
  # type: cross_encoder
  # model: BAAI/bge-reranker-base
Implementation: Pinecone cost-optimized pipeline
Here’s how the LangChain cost-optimized pipeline reduces costs:
from vectordb.databases.pinecone import PineconeVectorDB
from vectordb.langchain.utils import (
    ConfigLoader,
    EmbedderHelper,
    SparseEmbedder,
    ResultMerger,
    RAGHelper
)

class PineconeCostOptimizedRAGSearchPipeline:
    def __init__(self, config_or_path):
        self.config = ConfigLoader.load(config_or_path)

        # Dense embedder (API-based, required)
        self.dense_embedder = EmbedderHelper.create_embedder(self.config)

        # Sparse embedder (local, zero cost)
        self.sparse_embedder = SparseEmbedder()

        # Pinecone connection
        self.db = PineconeVectorDB(
            api_key=self.config["pinecone"]["api_key"],
            index_name=self.config["pinecone"]["index_name"]
        )
        # Dense vector size, needed for the placeholder vector in sparse queries
        self.dimension = self.config["pinecone"]["dimension"]

        # Optional LLM (can be disabled to save cost)
        self.llm = RAGHelper.create_llm(self.config)

    def search(self, query, top_k=10, filters=None):
        # Generate embeddings
        # Dense: 1 API call
        dense_query_embedding = EmbedderHelper.embed_query(
            self.dense_embedder, query
        )
        # Sparse: local, zero cost
        sparse_query_embedding = self.sparse_embedder.embed_query(query)

        # Execute dual searches
        dense_docs = self.db.query(
            vector=dense_query_embedding,
            top_k=top_k,
            filter=filters
        )
        sparse_docs = self.db.query_with_sparse(
            vector=[0.0] * self.dimension,  # Placeholder
            sparse_vector=sparse_query_embedding,
            top_k=top_k,
            filter=filters
        )

        # Fuse results (local RRF, zero cost)
        merged = ResultMerger.merge_and_deduplicate(
            [dense_docs, sparse_docs],
            method="rrf",
            weights=[0.5, 0.5]
        )
        result = {"documents": merged[:top_k], "query": query}

        # Optional generation (controlled by config)
        if self.llm is not None:
            answer = RAGHelper.generate(self.llm, query, merged[:top_k])
            result["answer"] = answer
        return result
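The fusion step relies on reciprocal rank fusion (RRF), which scores each document by summing weight / (k + rank) over the result lists it appears in. A minimal standalone sketch, assuming each result list is an ordered list of document IDs. It illustrates the technique and is not the `ResultMerger` implementation; the `reciprocal_rank_fusion` name and the sample IDs are hypothetical, and the default `k=60` mirrors the `rrf_k` value in the configuration above.

```python
from collections import defaultdict
from typing import Optional


def reciprocal_rank_fusion(
    result_lists: list[list[str]],
    k: int = 60,
    weights: Optional[list[float]] = None,
) -> list[str]:
    """Fuse ranked lists of document IDs into one ranking (best first)."""
    if weights is None:
        weights = [1.0] * len(result_lists)
    if len(weights) != len(result_lists):
        raise ValueError("weights must match the number of result lists")

    scores: dict[str, float] = defaultdict(float)
    for ranked_ids, weight in zip(result_lists, weights):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            # Documents ranked highly in several lists accumulate the most score.
            scores[doc_id] += weight / (k + rank)

    # Sort by descending score; ties break on the ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    dense = ["doc_a", "doc_b", "doc_c"]
    sparse = ["doc_b", "doc_d", "doc_a"]
    print(reciprocal_rank_fusion([dense, sparse], weights=[0.5, 0.5]))
```

In this example `doc_b` ranks first because it sits near the top of both lists, ahead of `doc_a`, which leads only the dense list. The whole computation runs locally, so fusion adds no API cost.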
Local sparse embedding
The SparseEmbedder uses TF-IDF locally:
from sklearn.feature_extraction.text import TfidfVectorizer

class SparseEmbedder:
    def __init__(self, max_features=5000):
        self.vectorizer = TfidfVectorizer(
            max_features=max_features,
            stop_words="english"
        )

    def embed_query(self, query: str) -> dict:
        # Fit and transform query (local operation).
        # Note: fitting on the query alone assigns indices from the query's own
        # vocabulary, and raises ValueError if the query contains only stop words.
        tfidf_matrix = self.vectorizer.fit_transform([query])
        # Convert to sparse vector format for Pinecone
        indices = tfidf_matrix.indices.tolist()
        values = tfidf_matrix.data.tolist()
        return {"indices": indices, "values": values}

    def embed_documents(self, texts: list[str]) -> list[dict]:
        # Batch fit and transform
        tfidf_matrix = self.vectorizer.fit_transform(texts)
        sparse_vectors = []
        for i in range(len(texts)):
            row = tfidf_matrix[i]
            sparse_vectors.append({
                "indices": row.indices.tolist(),
                "values": row.data.tolist()
            })
        return sparse_vectors
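Because `fit_transform` relearns the vocabulary on every call, a query embedded on its own gets indices that need not line up with those of the indexed documents. One common remedy is to learn the vocabulary from the corpus once and reuse it for queries. The following is a simplified, dependency-free sketch of that idea, not the library's implementation: `FittedTfidfEmbedder` and `tokenize` are hypothetical names, and the weighting omits stop-word removal and L2 normalization.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and extract runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


class FittedTfidfEmbedder:
    """Learns vocabulary and IDF weights from a corpus once, then reuses them."""

    def __init__(self, corpus: list[str]):
        doc_tokens = [set(tokenize(doc)) for doc in corpus]
        vocabulary = sorted(set().union(*doc_tokens))
        self.index = {term: i for i, term in enumerate(vocabulary)}
        n_docs = len(corpus)
        # Smoothed IDF: terms that occur in fewer documents get larger weights.
        self.idf = {
            term: math.log((1 + n_docs) / (1 + sum(term in d for d in doc_tokens))) + 1
            for term in vocabulary
        }

    def embed(self, text: str) -> dict:
        """Return a sparse vector; terms outside the fitted vocabulary are dropped."""
        counts = Counter(t for t in tokenize(text) if t in self.index)
        pairs = sorted((self.index[t], n * self.idf[t]) for t, n in counts.items())
        return {
            "indices": [i for i, _ in pairs],
            "values": [v for _, v in pairs],
        }


if __name__ == "__main__":
    corpus = ["neural networks learn features", "plants use photosynthesis"]
    embedder = FittedTfidfEmbedder(corpus)
    print(embedder.embed(corpus[0]))
    print(embedder.embed("how do neural networks work"))
```

The query shares the indices for "neural" and "networks" with the first document, so sparse similarity is computed over matching dimensions. With scikit-learn, the equivalent is to call `fit` on the corpus and `transform` on queries.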
Cost monitoring
Track API usage and estimated costs:
class CostTracker:
    def __init__(self):
        self.embedding_calls = 0
        self.reranked_docs = 0
        self.generation_tokens = 0

    def track_embedding(self, num_queries=1):
        self.embedding_calls += num_queries

    def track_reranking(self, num_docs=0):
        self.reranked_docs += num_docs

    def track_generation(self, tokens=0):
        self.generation_tokens += tokens

    def estimate_cost(self):
        # Pricing (example rates)
        embedding_cost = self.embedding_calls * 0.0001
        reranking_cost = (self.reranked_docs / 1000) * 0.002  # $0.0020 per 1000 docs
        generation_cost = (self.generation_tokens / 1000) * 0.005  # $0.0050 per 1000 tokens
        return {
            "embedding": embedding_cost,
            "reranking": reranking_cost,
            "generation": generation_cost,
            "total": embedding_cost + reranking_cost + generation_cost
        }

# Usage
tracker = CostTracker()
for query in queries:
    tracker.track_embedding()
    result = pipeline.search(query)
    # Whitespace word count is a rough proxy for tokens; retrieval-only
    # results carry no "answer" key, so default to an empty string.
    tracker.track_generation(len(result.get("answer", "").split()))
print(tracker.estimate_cost())
Output:
{
    "embedding": 0.010,
    "reranking": 0.000,
    "generation": 0.125,
    "total": 0.135
}
Comparison: standard vs. cost-optimized
Standard RAG
Per query:
Dense embedding (API): $0.0001
Sparse embedding (API): $0.0001
Reranking (Cohere): $0.0020
Generation (GPT-4): $0.0150
Total: $0.0172 per query
At 1M queries/month: $17,200/month
Cost-optimized RAG
Per query:
Dense embedding (API): $0.0001
Sparse embedding (local): $0.0000
Reranking (local cross-encoder): $0.0000
Generation (Llama 3.3 on Groq): $0.0005
Total: $0.0006 per query
At 1M queries/month: $600/month
Roughly 28x cost reduction ($17,200 vs. $600 per month)
Budget controls
Implement hard limits on costs:
from datetime import datetime

class BudgetExceededError(Exception):
    """Raised when a query would push spending past the daily budget."""

class BudgetController:
    def __init__(self, daily_budget=100.0):
        self.daily_budget = daily_budget
        self.daily_spend = 0.0
        self.last_reset = datetime.now().date()

    def check_budget(self, estimated_cost):
        # Reset counter if new day
        today = datetime.now().date()
        if today > self.last_reset:
            self.daily_spend = 0.0
            self.last_reset = today
        # Check if query would exceed budget
        if self.daily_spend + estimated_cost > self.daily_budget:
            raise BudgetExceededError(
                f"Daily budget ${self.daily_budget} would be exceeded"
            )
        self.daily_spend += estimated_cost

    def get_remaining_budget(self):
        return self.daily_budget - self.daily_spend

# Usage (`pipeline` and `query` come from the surrounding application)
budget = BudgetController(daily_budget=50.0)
try:
    budget.check_budget(0.01)  # Estimated cost for this query
    result = pipeline.search(query)
except BudgetExceededError:
    # Return cached result or error message
    result = {"error": "Daily budget exceeded"}
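Rejecting a query outright is not the only option when the budget runs low. A minimal sketch of a budget-aware fallback that downgrades to retrieval-only mode when the remaining budget cannot cover generation. The `choose_mode` function, its mode names, and the cost figures in the example are illustrative assumptions, not part of VectorDB.

```python
def choose_mode(remaining_budget: float,
                retrieval_cost: float,
                generation_cost: float) -> str:
    """Pick the richest mode that the remaining budget can pay for."""
    if remaining_budget >= retrieval_cost + generation_cost:
        return "retrieve_and_generate"
    if remaining_budget >= retrieval_cost:
        return "retrieve_only"  # skip the LLM call and return documents only
    return "reject"             # not even retrieval is affordable


if __name__ == "__main__":
    # Example figures: $0.0001 for retrieval, $0.0050 for generation.
    for remaining in (1.0, 0.004, 0.0):
        print(remaining, choose_mode(remaining, 0.0001, 0.0050))
```

Fed with the value from `get_remaining_budget()`, a check like this lets the pipeline keep serving documents after the generation budget is spent.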
Best practices
Use local embedders: SentenceTransformers models run locally with zero API cost. Quality is comparable to API embedders for most use cases.
Cache aggressively: 30-40% of queries are repeats. An LRU cache with a 1-hour TTL reduces costs significantly.
Skip unnecessary steps: Don’t rerank if initial retrieval quality is high. Don’t generate if users just need documents.
Batch when possible: Batch embedding reduces API overhead by 70%. Use it for background indexing and bulk queries.
Monitor and optimize: Track cost per query. Identify expensive operations and optimize hot paths.
Choose cost-effective LLMs: Groq’s Llama 3.3 is 30x cheaper than GPT-4 with comparable quality for most RAG tasks.