
Overview

PolyChat-AI includes RAG (Retrieval-Augmented Generation) capabilities to enhance AI responses with relevant context from your conversation history. Using local embeddings, the system retrieves semantically similar previous messages to provide better, more contextual responses.
Privacy-First: All RAG processing happens locally in your browser using WebAssembly. No conversation data is sent to external services for embedding generation.

How It Works

Architecture

When RAG is enabled, each new user message is embedded locally, embeddings are computed for the conversation history, the most semantically similar messages are selected via cosine similarity, and those messages are prepended (in chronological order) to the prompt sent to the AI.

Technical Stack

Embeddings Model

all-MiniLM-L6-v2 by Sentence Transformers
  • 384-dimensional vectors
  • Optimized for semantic similarity
  • Fast inference in browser

Framework

@xenova/transformers
  • Transformers.js for browser ML
  • WebAssembly acceleration
  • No external API calls

Implementation

Core RAG Service

From src/services/ragService.ts:
import { pipeline, cos_sim, FeatureExtractionPipeline } from '@xenova/transformers';
import type { Message } from '../types';

// Singleton class to ensure we only load the model once
class EmbeddingService {
  private static instance: FeatureExtractionPipeline | null = null;

  static async getInstance() {
    if (this.instance === null) {
      // Load the embedding model (happens once per session)
      this.instance = (await pipeline(
        'feature-extraction',
        'Xenova/all-MiniLM-L6-v2'
      )) as FeatureExtractionPipeline;
    }
    return this.instance;
  }
}

Embedding Generation

// Function to calculate embeddings for a batch of texts
async function getEmbeddings(texts: string[]): Promise<number[][]> {
  const extractor = await EmbeddingService.getInstance();
  
  const embeddings = await Promise.all(
    texts.map(async (text) => {
      const output = await extractor(text, { 
        pooling: 'mean',    // Mean pooling over tokens
        normalize: true     // L2 normalization
      });
      // The pipeline returns a [1, 384] tensor; take its single row
      // so each text maps to a flat number[] vector
      return output.tolist()[0] as number[];
    })
  );
  
  return embeddings;
}

Context Retrieval

// Main function to get relevant context
export async function getRelevantContext(
  query: string,              // Current user message
  history: Message[],         // All previous messages
  maxMessages: number = 5     // Number of messages to retrieve
): Promise<Message[]> {
  if (history.length === 0) {
    return [];
  }

  // Prepare texts for embedding
  const queryText = query;
  const historyTexts = history.map(
    (msg) => `${msg.role}: ${msg.content}`
  );

  // Generate embeddings
  const [queryEmbedding, ...historyEmbeddings] = 
    await getEmbeddings([queryText, ...historyTexts]);

  // Calculate similarities using cosine similarity
  const similarities = historyEmbeddings.map((histEmbedding) =>
    cos_sim(queryEmbedding, histEmbedding)
  );

  // Get indices of top N most similar messages
  const topIndices = similarities
    .map((similarity, index) => ({ similarity, index }))
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, maxMessages)
    .map((item) => item.index);

  // Retrieve the most relevant messages and sort chronologically
  const relevantMessages = topIndices
    .map((index) => history[index])
    .sort((a, b) => a.timestamp.getTime() - b.timestamp.getTime());

  return relevantMessages;
}

Features

Semantic Search

Not just keyword matching - understands meaning:
Query: "How do I optimize database performance?"

Retrieved messages (by semantic similarity):
1. "Ways to speed up SQL queries" (similarity: 0.85)
2. "Database indexing best practices" (similarity: 0.82)
3. "Improving query response time" (similarity: 0.78)

Note: No exact keyword match required!
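The top-N selection behind this ranking can be illustrated with toy scores (the similarity values below are hand-picked for illustration, not real model output):

```typescript
// Rank candidate messages by similarity and keep the top N.
// Scores here are hypothetical; in the real service they come from
// cos_sim over all-MiniLM-L6-v2 embeddings.
type Scored = { text: string; similarity: number };

function topN(candidates: Scored[], n: number): Scored[] {
  return [...candidates]
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, n);
}

const candidates: Scored[] = [
  { text: 'Ways to speed up SQL queries', similarity: 0.85 },
  { text: 'My favourite pizza toppings', similarity: 0.12 },
  { text: 'Database indexing best practices', similarity: 0.82 },
  { text: 'Improving query response time', similarity: 0.78 },
];

// Keeps the three database-related messages and drops the off-topic
// one, even though none of them shares the literal word "optimize".
const top3 = topN(candidates, 3);
```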

Local Processing

Complete privacy - everything runs in your browser:
// Model loading and inference happens locally
const extractor = await pipeline(
  'feature-extraction',
  'Xenova/all-MiniLM-L6-v2',
  {
    // Model is downloaded once and cached in browser
    // Subsequent loads are instant
    progress_callback: (progress) => {
      console.log(`Loading: ${progress.progress}%`);
    }
  }
);
Benefits:
  • No API calls for embeddings
  • No data leaves your device
  • No additional costs
  • Works offline (after initial model download)
  • Fast inference (~50ms per message)

Smart Context Selection

Retrieves up to 5 most relevant messages:
const relevantContext = await getRelevantContext(
  currentMessage,
  conversationHistory,
  5  // maxMessages - configurable
);

// Messages are selected by semantic similarity (top N),
// then returned in chronological order for natural context flow
Why 5 messages?
  • Balance between context and token usage
  • Enough context for most conversations
  • Prevents context window overflow
  • Configurable if needed

Configuration

Enable/Disable RAG

Settings → Advanced → RAG (Context Enhancement)

[ ] Enable RAG for enhanced context
When to Enable:
  • ✅ Long conversations with multiple topics
  • ✅ Need to reference earlier discussions
  • ✅ Complex problem-solving over time
  • ✅ Want AI to remember context automatically
When to Disable:
  • ❌ Short, simple queries
  • ❌ Each message is independent
  • ❌ Browser performance concerns
  • ❌ Want faster response times

Performance Considerations

Initial Load

~25MB model download (one-time)
  • Cached in browser
  • Only on first use
  • Automatic background loading

Inference Speed

~50ms per message
  • Fast enough for real-time
  • Minimal impact on UX
  • WebAssembly accelerated

Memory Usage

~100MB additional RAM
  • Model in memory
  • Embeddings cached
  • Acceptable for modern browsers

Context Quality

Significantly better responses
  • Relevant history included
  • Coherent long conversations
  • Better understanding

Use Cases

1. Long Technical Discussions

1. Problem Introduction

“I’m having issues with my React app’s performance”

2. Diagnosis

“It seems to lag when scrolling through lists”

AI gets context about the React app and scrolling issues.

3. Solution Exploration

“I tried using useMemo but it didn’t help”

RAG retrieves previous messages about React and performance.

4. Follow-up Question

“What was that virtualization library you mentioned?”

RAG finds the earlier message mentioning react-window, even if it was 20 messages ago.

2. Project Planning

Message 1: "We need to build a user authentication system"
Message 5: "Let's use JWT tokens for sessions"
Message 12: "Should we implement OAuth for social login?"
Message 20: "What database should we use for user data?"

Message 30: "Remind me what we decided about authentication?"

RAG retrieves:
- Message 1 (authentication system)
- Message 5 (JWT tokens decision)  
- Message 12 (OAuth consideration)

AI: "Based on our earlier discussion, we decided to build a JWT-based
authentication system with OAuth for social login..."

3. Code Review Across Sessions

// Earlier conversation (Session 1)
"Here's my API endpoint code: [code snippet]"
"The issue is rate limiting isn't working"

// Later conversation (Session 2 - different day)
"I want to add caching to that API endpoint we reviewed"

RAG retrieves:
- Original API code
- Rate limiting discussion
- Provides context for caching implementation

Advanced Usage

Adjusting Number of Retrieved Messages

Modify in your code:
import { getRelevantContext } from './services/ragService';

// Get more context for complex topics
const context = await getRelevantContext(
  userMessage,
  conversationHistory,
  10  // Retrieve 10 instead of default 5
);

// Get less context for simple queries
const lightContext = await getRelevantContext(
  userMessage,
  conversationHistory,
  3  // Just top 3 most relevant
);

Similarity Threshold

Filter by minimum similarity. Note that getRelevantContext as implemented returns only the messages, not their scores, so applying a threshold requires exposing the similarity values from the service (the getRelevantContextWithScores name below is hypothetical):
// Hypothetical variant that returns { message, similarity } pairs
const scored = await getRelevantContextWithScores(query, history, 5);

// Filter by similarity threshold
const highQualityContext = scored
  .filter(({ similarity }) => similarity > 0.7)  // Only include if >70% similar
  .map(({ message }) => message);

Custom Embeddings

For specialized domains, you could swap the model:
// Example: Use a different embedding model
class CustomEmbeddingService {
  static async getInstance() {
    return await pipeline(
      'feature-extraction',
      'your-org/specialized-model'  // Domain-specific model
    );
  }
}

Best Practices

Ideal Scenarios:
  • Conversations spanning multiple sessions
  • Complex problem-solving requiring history
  • Technical support or debugging
  • Project planning and decision tracking
  • Learning sessions with progressive topics
Not Necessary For:
  • Single-question queries
  • Independent tasks
  • Template-based conversations
  • Quick factual questions
Reduce Initial Load Time:
  • Model loads on first RAG usage
  • Pre-load if you know you’ll need it
  • Cache is persistent across sessions
Manage Memory:
  • Disable RAG for simple conversations
  • Clear old conversations periodically
  • Close unused tabs
Improve Accuracy:
  • Use clear, descriptive messages
  • Keep conversations focused
  • Start new chats for different topics
Model Constraints:
  • Works best with English text
  • Limited to conversation history (no external docs)
  • Semantic similarity is probabilistic
  • May retrieve unexpected matches
Context Window:
  • Only retrieves top 5 messages by default
  • Very old messages may not be retrieved
  • Token limits still apply to final prompt
Processing Time:
  • Adds ~50-100ms per message
  • Acceptable for most use cases
  • May be noticeable on slow devices
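The pre-loading tip above can be sketched as a load-once helper. The makePreloader name is hypothetical, and the commented-out usage assumes the pipeline call from the implementation section:

```typescript
// Generic load-once helper: the first call starts the (async) load,
// later calls reuse the same promise, and a failed load is cleared
// so the next call can retry.
function makePreloader<T>(load: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | null = null;
  return () => {
    if (pending === null) {
      pending = load().catch((err) => {
        pending = null; // allow a retry on the next call
        throw err;
      });
    }
    return pending;
  };
}

// Hypothetical usage at app start (fire-and-forget warm-up):
// const preloadModel = makePreloader(() =>
//   pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2'));
// void preloadModel();
```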

Technical Details

Model Information

all-MiniLM-L6-v2:
  • Size: ~25MB
  • Dimensions: 384
  • Max Sequence Length: 256 tokens
  • Performance: 50-100ms per embedding
  • Accuracy: 0.85+ on semantic similarity tasks

Cosine Similarity

// How similarity is calculated
const cosineSimilarity = (vecA: number[], vecB: number[]): number => {
  const dotProduct = vecA.reduce((sum, a, i) => sum + a * vecB[i], 0);
  const magnitudeA = Math.sqrt(vecA.reduce((sum, a) => sum + a * a, 0));
  const magnitudeB = Math.sqrt(vecB.reduce((sum, b) => sum + b * b, 0));
  return dotProduct / (magnitudeA * magnitudeB);
};

// Returns value between -1 and 1:
// 1.0 = identical
// 0.0 = unrelated
// -1.0 = opposite (rare in practice)
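Because the embeddings are generated with normalize: true, every vector has unit length, so cosine similarity reduces to a plain dot product. A small sketch with toy 2-D vectors:

```typescript
// For unit-length vectors |a| = |b| = 1, so cos_sim(a, b) = a · b.
const dot = (a: number[], b: number[]): number =>
  a.reduce((sum, x, i) => sum + x * b[i], 0);

const normalize = (v: number[]): number[] => {
  const mag = Math.sqrt(dot(v, v));
  return v.map((x) => x / mag);
};

const a = normalize([3, 4]); // [0.6, 0.8]
const b = normalize([4, 3]); // [0.8, 0.6]

// dot(a, b) = 0.6*0.8 + 0.8*0.6 = 0.96, which equals the full
// cosine similarity of the original vectors [3, 4] and [4, 3].
```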

Integration with Chat

// Simplified flow in chat hook

const sendMessage = async (message: string) => {
  let contextMessages = [];
  
  // If RAG is enabled
  if (settings.ragEnabled) {
    // Get relevant context
    contextMessages = await getRelevantContext(
      message,
      conversationHistory,
      5
    );
  }
  
  // Build final message array
  const messages = [
    ...contextMessages,        // Relevant history
    { role: 'user', content: message }  // Current message
  ];
  
  // Send to AI
  const response = await streamAIResponse(
    messages,
    apiKey,
    model,
    onChunk,
    systemPrompt
  );
};

Future Enhancements

These features are planned for future releases:
  • Document Upload: Embed and search your own documents
  • Cross-Conversation Search: Search across all conversations
  • Custom Embedding Models: Use specialized domain models
  • Hybrid Search: Combine semantic + keyword search
  • Context Visualization: See which messages were retrieved and why

Troubleshooting

Slow First Use

Cause: Model download (~25MB)
Solutions:
  • Wait for initial download (one-time)
  • Model is cached for future sessions
  • Subsequent uses are instant

Expected Message Not Retrieved

Possible Causes:
  • Messages are semantically different than expected
  • Other messages are more similar
  • Message is beyond top 5 results
Solutions:
  • Use more specific language
  • Increase maxMessages parameter
  • Check similarity scores in console (if debugging)

Performance Problems

Symptoms: Lag, high memory usage
Solutions:
  • Disable RAG for simple conversations
  • Clear old conversations
  • Close other tabs
  • Use a more powerful device

RAG Doesn't Seem to Work

Checklist:
  • Is RAG enabled in settings?
  • Is there conversation history?
  • Check browser console for errors
  • Try refreshing the page
  • Clear browser cache if model is corrupted
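For inspecting similarity scores, the service as written does not expose them, so a small debug helper is needed. The rankWithScores name is hypothetical; in real use the vectors would come from getEmbeddings:

```typescript
// Debug helper: pair each history label with its cosine similarity to
// the query and sort best-first, so scores can be logged and inspected.
// Plug in real embeddings from getEmbeddings when debugging.
function rankWithScores(
  queryEmbedding: number[],
  historyEmbeddings: number[][],
  labels: string[],
): { label: string; similarity: number }[] {
  const dot = (a: number[], b: number[]) =>
    a.reduce((s, x, i) => s + x * b[i], 0);
  const mag = (v: number[]) => Math.sqrt(dot(v, v));
  const cos = (a: number[], b: number[]) => dot(a, b) / (mag(a) * mag(b));

  return historyEmbeddings
    .map((e, i) => ({ label: labels[i], similarity: cos(queryEmbedding, e) }))
    .sort((x, y) => y.similarity - x.similarity);
}

// Example with toy vectors:
const ranked = rankWithScores([1, 0], [[0, 1], [1, 0]], ['msg A', 'msg B']);
ranked.forEach(({ label, similarity }) =>
  console.log(`${label}: ${similarity.toFixed(2)}`));
```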

Summary

PolyChat-AI’s RAG implementation provides:
  ✅ Privacy-first - local embeddings
  ✅ Semantic search - beyond keywords
  ✅ Automatic context enhancement
  ✅ Zero cost - no API calls
  ✅ Fast inference - ~50ms per message
  ✅ Easy to use - toggle in settings
Built on:
  • @xenova/transformers for browser ML
  • all-MiniLM-L6-v2 embedding model
  • Cosine similarity for relevance
  • Smart context selection (top 5 messages)

