Contextual compression reduces the token count of retrieved documents by filtering or extracting only the most relevant content. This addresses LLM context window limitations and reduces generation costs.
The problem with raw retrieval
Standard RAG retrieves top-k documents and passes them to the LLM. This creates issues:
Token limits: Long documents may exceed LLM context windows
Irrelevant content: Retrieved documents often contain off-topic sections
Cost: More tokens mean higher API costs for generation
Quality: Irrelevant content can distract the LLM from the answer
Contextual compression solves this by over-fetching documents, then compressing them to only relevant passages.
Compression strategies
Reranking-based compression
Uses cross-encoder models to score document relevance, then filters to the top-k most relevant.
from vectordb.haystack.contextual_compression.search import (
    PineconeContextualCompressionSearchPipeline,
)

pipeline = PineconeContextualCompressionSearchPipeline(
    "configs/pinecone_compression.yaml"
)
results = pipeline.search(
    query="What is quantum entanglement?",
    top_k=5,  # Returns 5 documents after compression
)
for doc in results["documents"]:
    print(doc.content[:200])
How it works:
Retrieve top_k * 2 documents (over-fetch)
Score each document with cross-encoder
Return only top-k highest-scoring documents
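The three steps above can be sketched without any external service. This is a minimal illustration, not the library's implementation: `overlap_score` is a hypothetical stand-in for a cross-encoder (it only measures word overlap), and `fake_retrieve` is an assumed retriever that returns unranked candidates.

```python
from typing import Callable, List


def overlap_score(query: str, text: str) -> float:
    """Toy relevance score: fraction of query words found in the text.

    Stand-in for a real cross-encoder, used only to keep the sketch runnable.
    """
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


def rerank_compress(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    top_k: int,
    score: Callable[[str, str], float] = overlap_score,
) -> List[str]:
    # Step 1: over-fetch twice as many candidates as we intend to return.
    candidates = retrieve(query, top_k * 2)
    # Step 2: score every candidate against the query.
    scored = [(score(query, text), text) for text in candidates]
    # Step 3: keep only the top_k highest-scoring candidates.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]


corpus = [
    "quantum entanglement links the states of two particles",
    "a recipe for sourdough bread",
    "entanglement is a quantum phenomenon",
    "football results from the weekend",
]


def fake_retrieve(query: str, k: int) -> List[str]:
    # Hypothetical retriever: returns the first k corpus entries unranked.
    return corpus[:k]


top = rerank_compress("quantum entanglement", fake_retrieve, top_k=2)
```

Swapping `overlap_score` for a real cross-encoder's scoring call leaves the over-fetch and filter logic unchanged, which is why the scorer is passed in as a parameter.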
LLM extraction
Uses an LLM to extract only relevant passages from retrieved documents.
from langchain_core.documents import Document
from langchain_groq import ChatGroq

from vectordb.langchain.components import ContextCompressor

llm = ChatGroq(model="llama-3.3-70b-versatile")
compressor = ContextCompressor(mode="llm_extraction", llm=llm)
documents = [
    Document(page_content="Long document with relevant and irrelevant sections..."),
    Document(page_content="Another document with mixed content..."),
]
compressed = compressor.compress(
    query="What is photosynthesis?",
    documents=documents,
)
# Returns a single document with extracted passages
print(compressed[0].page_content)
How it works:
Retrieve top_k documents
LLM reads all documents and extracts relevant passages
Returns extracted content (higher compression ratio)
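The extraction flow above might look roughly like the following. This is a sketch under stated assumptions: the prompt wording, `build_extraction_prompt`, and `llm_extract` are hypothetical names, and `fake_llm` is a stand-in for a chat model so the example runs offline. It does not reflect how `ContextCompressor` is implemented.

```python
from typing import Callable, List


def build_extraction_prompt(query: str, documents: List[str]) -> str:
    """Combine all retrieved documents into one extraction prompt."""
    numbered = "\n\n".join(
        f"[Document {i}]\n{text}" for i, text in enumerate(documents, start=1)
    )
    return (
        "Extract only the passages that help answer the question. "
        "Copy them verbatim and omit everything else.\n\n"
        f"Question: {query}\n\n{numbered}\n\nRelevant passages:"
    )


def llm_extract(query: str, documents: List[str], llm: Callable[[str], str]) -> str:
    # The LLM sees every document at once and returns one condensed text.
    return llm(build_extraction_prompt(query, documents)).strip()


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a chat model: keeps lines mentioning "light".
    body = prompt.split("Relevant passages:")[0]
    return "\n".join(line for line in body.splitlines() if "light" in line)


docs = [
    "Plants convert light into chemical energy.\nThe author was born in 1950.",
    "Chlorophyll absorbs light.\nTickets cost five dollars.",
]
extracted = llm_extract("What is photosynthesis?", docs, fake_llm)
```

Because all documents go into a single prompt, the output is one condensed text, not one text per document, which matches the single-document result described above.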
Configuration
Two modes are available: reranking mode and LLM extraction mode. A reranking mode configuration:
compression:
  type: reranking
  reranker:
    type: cohere
    api_key: ${COHERE_API_KEY}
    model: rerank-english-v3.0
    top_k: 5
pinecone:
  api_key: ${PINECONE_API_KEY}
  index_name: documents
  namespace: default
embedding:
  provider: sentence_transformers
  model: all-MiniLM-L6-v2
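The configuration refers to secrets through `${...}` placeholders. How the library resolves them is not shown here; the sketch below is one possible approach using only the standard library. `expand_env` is a hypothetical helper, and the dictionary is a hand-written subset of the configuration whose nesting is assumed.

```python
import os


def expand_env(value):
    """Recursively replace ${VAR} placeholders in strings, dicts and lists."""
    if isinstance(value, str):
        return os.path.expandvars(value)
    if isinstance(value, dict):
        return {key: expand_env(item) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item) for item in value]
    return value  # Numbers, booleans and None pass through unchanged


os.environ["COHERE_API_KEY"] = "demo-key"  # Placeholder value for the example only
config = {
    "compression": {
        "type": "reranking",
        "reranker": {"type": "cohere", "api_key": "${COHERE_API_KEY}", "top_k": 5},
    }
}
resolved = expand_env(config)
```

Note that `os.path.expandvars` leaves unknown variables untouched, so a missing key shows up as a literal `${...}` string and is worth validating before calling an API.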
Available rerankers
VectorDB supports multiple reranking backends:
Cohere
reranker:
  type: cohere
  api_key: ${COHERE_API_KEY}
  model: rerank-english-v3.0
  top_k: 5
Best for: Production use, high quality
Cost: API-based, per-request pricing
Cross-encoder
reranker:
  type: cross_encoder
  model: BAAI/bge-reranker-v2-m3
  top_k: 5
Best for: Local deployment, zero API cost
Trade-off: Requires GPU for speed
Voyage AI
reranker:
  type: voyage
  api_key: ${VOYAGE_API_KEY}
  model: rerank-2
  top_k: 5
Best for: Long documents (up to 32k tokens)
Cost: API-based
BGE
reranker:
  type: bge
  model: BAAI/bge-reranker-v2-m3
  top_k: 5
Best for: Multilingual content (100+ languages)
Deployment: Local or API
Compression metrics
The Haystack implementation tracks compression effectiveness:
from vectordb.haystack.contextual_compression.compression_utils import (
    TokenCounter,
)

# original_docs and compressed_docs are the document lists before and after compression
# Estimate tokens before and after compression
original_tokens = sum(
    TokenCounter.estimate_tokens(doc.content) for doc in original_docs
)
compressed_tokens = sum(
    TokenCounter.estimate_tokens(doc.content) for doc in compressed_docs
)
compression_ratio = compressed_tokens / original_tokens
tokens_saved = original_tokens - compressed_tokens
print(f"Compression ratio: {compression_ratio:.1%}")
print(f"Tokens saved: {tokens_saved:,}")
Example output:
Compression ratio: 30.5%
Tokens saved: 2,847
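To make the arithmetic concrete, here is a small self-contained sketch. The ratio is compressed size divided by original size, so a lower value means stronger compression: a ratio of 30.5% corresponds to a 69.5% reduction. `compression_stats` is a hypothetical helper, and the token counts are hand-picked to reproduce the example output, not measurements.

```python
def compression_stats(original_tokens: int, compressed_tokens: int) -> dict:
    """Return the compression ratio and the number of tokens saved."""
    if original_tokens <= 0:
        # Avoid dividing by zero when nothing was retrieved.
        return {"ratio": 0.0, "tokens_saved": 0}
    return {
        "ratio": compressed_tokens / original_tokens,
        "tokens_saved": original_tokens - compressed_tokens,
    }


# Illustrative counts, chosen by hand; not measurements from the library.
stats = compression_stats(original_tokens=4096, compressed_tokens=1249)
summary = f"{stats['ratio']:.1%} of original size, {stats['tokens_saved']:,} tokens saved"
```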
Implementation example
Here’s how Pinecone compression works under the hood (LangChain):
from langchain_groq import ChatGroq

from vectordb.databases.pinecone import PineconeVectorDB
from vectordb.langchain.components import ContextCompressor
# RerankerHelper is assumed to live alongside the other helpers
from vectordb.langchain.utils import EmbedderHelper, RAGHelper, RerankerHelper


class PineconeContextualCompressionSearchPipeline:
    def __init__(self, config):
        self.embedder = EmbedderHelper.create_embedder(config)
        self.db = PineconeVectorDB(
            api_key=config["pinecone"]["api_key"],
            index_name=config["pinecone"]["index_name"],
        )
        # Namespace used by search(); read from the pinecone config section
        self.namespace = config["pinecone"].get("namespace", "default")
        # Initialize compressor based on mode
        if config["compression"]["mode"] == "reranking":
            reranker = RerankerHelper.create_reranker(config)
            self.compressor = ContextCompressor(
                mode="reranking",
                reranker=reranker,
            )
        else:
            llm = ChatGroq(model=config["compression"]["llm"]["model"])
            self.compressor = ContextCompressor(
                mode="llm_extraction",
                llm=llm,
            )

    def search(self, query, top_k=10):
        # Step 1: Over-fetch documents
        query_embedding = EmbedderHelper.embed_query(self.embedder, query)
        retrieved = self.db.query(
            query_embedding=query_embedding,
            top_k=top_k * 2,  # Over-fetch
            namespace=self.namespace,
        )
        # Step 2: Compress using reranker or LLM
        compressed = self.compressor.compress(
            query=query,
            documents=retrieved,
            top_k=top_k,
        )
        return {"documents": compressed, "query": query}
Reranking algorithms
The compression utilities module documents different reranking approaches:
# Cross-encoder reranking (from compression_utils.py)
# Architecture: Joint encoding of query+document pairs
# How it works:
# 1. Concatenate query and document with [SEP] token
# 2. Pass through transformer encoder (BERT-like)
# 3. Output layer predicts relevance score (0-1)
# Benefits: Captures query-document interactions directly
# Trade-offs: Slower than bi-encoders (one forward pass per query-document pair, so O(n) passes for n candidates)
# Cohere API reranking
# Architecture: Cloud-based neural reranking service
# How it works:
# 1. Send query + batch of documents to Cohere API
# 2. Cohere's model computes relevance scores server-side
# 3. Returns ranked list with relevance scores (0-1)
# Benefits: No local GPU needed; constantly updated models
# Trade-offs: API latency, rate limits, cost per request
Cost comparison
Reranking (Cohere API)
Pros: Fast (~100ms), high quality, no local GPU
Cons: $2 per 1,000 queries (each query is billed as one search covering up to 100 documents)
Best for: Production with moderate query volume
Reranking (local cross-encoder)
Pros: Zero API cost, data stays local
Cons: Requires GPU, slower on CPU
Best for: High query volume or privacy requirements
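A quick calculation helps decide between the two options. The sketch below assumes flat per-query pricing at the $2 per 1,000 queries quoted above and a 30-day month. The GPU cost passed to `break_even_queries_per_day` is a hypothetical input to supply yourself, not a figure from this page.

```python
def monthly_rerank_cost(
    queries_per_day: int, price_per_1000: float = 2.0, days: int = 30
) -> float:
    """Estimate monthly API spend in dollars for per-query reranking pricing."""
    return queries_per_day * days * price_per_1000 / 1000


def break_even_queries_per_day(
    gpu_monthly_cost: float, price_per_1000: float = 2.0, days: int = 30
) -> float:
    """Daily query volume at which API spend equals a fixed monthly GPU cost."""
    return gpu_monthly_cost * 1000 / (price_per_1000 * days)


api_cost = monthly_rerank_cost(queries_per_day=10_000)
# Hypothetical GPU budget of $300 per month, for illustration only.
break_even = break_even_queries_per_day(gpu_monthly_cost=300.0)
```

With these illustrative inputs, 10,000 queries per day costs $600 per month through the API, and a $300 per month GPU breaks even at 5,000 queries per day.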
When to use compression
Use reranking when
Documents are moderately long (500-2000 tokens)
You need fast compression (under 100ms)
Quality matters more than compression ratio
You want to preserve full document text
Use LLM extraction when
Documents are very long (>2000 tokens)
You need maximum compression (50-80% reduction)
Latency is acceptable (~500ms)
Extracted passages are sufficient for answers
Skip compression when
Documents are already short (under 500 tokens)
You have sufficient context window
Generation cost is not a concern
You need complete document text for citations
Combine both
First: Rerank to filter irrelevant docs (fast)
Second: LLM extract passages from top docs (quality)
Result: Best of both - high quality, maximum compression
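The rules of thumb in this section can be collapsed into a small decision helper. This is one possible reading of the guidance, not part of the library: `choose_strategy` is a hypothetical function, and falling back to reranking for very long documents under a tight latency budget is an interpretation.

```python
def choose_strategy(avg_doc_tokens: int, latency_budget_ms: int) -> str:
    """Map the rules of thumb in this section to a strategy name."""
    if avg_doc_tokens < 500:
        return "none"  # Documents are already short
    if avg_doc_tokens > 2000 and latency_budget_ms >= 500:
        return "llm_extraction"  # Very long documents, latency is acceptable
    return "reranking"  # Moderately long documents or a tight latency budget


choices = [
    choose_strategy(300, 1000),
    choose_strategy(1200, 100),
    choose_strategy(5000, 800),
    choose_strategy(5000, 100),
]
```

The helper ignores cost and citation requirements, which the lists above also mention, so treat its answer as a starting point.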