Overview
PolyChat-AI includes RAG (Retrieval-Augmented Generation) capabilities to enhance AI responses with relevant context from your conversation history. Using local embeddings, the system retrieves semantically similar previous messages to provide better, more contextual responses.
Privacy-First: All RAG processing happens locally in your browser using WebAssembly. No conversation data is sent to external services for embedding generation.
How It Works
Architecture
Technical Stack
Embeddings Model
all-MiniLM-L6-v2 by Sentence Transformers
- 384-dimensional vectors
- Optimized for semantic similarity
- Fast inference in browser
Framework
@xenova/transformers
- Transformers.js for browser ML
- WebAssembly acceleration
- No external API calls
Implementation
Core RAG Service
From src/services/ragService.ts:
Embedding Generation
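A minimal sketch of what embedding generation could look like with the Transformers.js `pipeline` API. The names (`createEmbedder`, `embedText`) are illustrative, not necessarily the identifiers used in `ragService.ts`; the pipeline loader is injected here so the snippet stays self-contained, while the real service would import `pipeline` from `@xenova/transformers` directly.

```typescript
// Shape of a Transformers.js feature-extraction pipeline (simplified).
type Extractor = (
  text: string,
  opts: { pooling: 'mean'; normalize: boolean },
) => Promise<{ data: Float32Array }>;

// Factory that creates the pipeline, e.g. pipeline('feature-extraction', model).
type PipelineFactory = (task: 'feature-extraction', model: string) => Promise<Extractor>;

function createEmbedder(pipeline: PipelineFactory) {
  let extractorPromise: Promise<Extractor> | null = null;

  return async function embedText(text: string): Promise<number[]> {
    // Lazily create the pipeline; the ~25MB model downloads once, then is cached.
    extractorPromise ??= pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
    const extractor = await extractorPromise;
    // Mean pooling collapses token vectors into one sentence vector;
    // normalization makes cosine similarity reduce to a dot product.
    const output = await extractor(text, { pooling: 'mean', normalize: true });
    return Array.from(output.data); // 384-dimensional unit vector
  };
}
```

The lazy singleton means the first call pays the model-load cost and every later call reuses the same pipeline.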
Context Retrieval
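A hedged sketch of the retrieval step: score every stored message against the query embedding with cosine similarity and keep the top matches. `EmbeddedMessage` and `retrieveContext` are illustrative names, not necessarily the actual `ragService.ts` API.

```typescript
interface EmbeddedMessage {
  role: 'user' | 'assistant';
  content: string;
  embedding: number[]; // 384-dim vector from all-MiniLM-L6-v2
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank history by similarity to the query and keep the top maxMessages.
function retrieveContext(
  queryEmbedding: number[],
  history: EmbeddedMessage[],
  maxMessages = 5,
): EmbeddedMessage[] {
  return history
    .map((message) => ({ message, score: cosineSimilarity(queryEmbedding, message.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, maxMessages)
    .map((scored) => scored.message);
}
```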
Features
Semantic Similarity Search
Not just keyword matching - RAG understands meaning:
- Example 1: Synonyms
- Example 2: Context
Local Processing
Complete privacy - everything runs in your browser:
- No API calls for embeddings
- No data leaves your device
- No additional costs
- Works offline (after initial model download)
- Fast inference (~50ms per message)
Smart Context Selection
Retrieves up to 5 most relevant messages:
- Balance between context and token usage
- Enough context for most conversations
- Prevents context window overflow
- Configurable if needed
Configuration
Enable/Disable RAG
RAG can be toggled in settings. Enable it when:
- ✅ Long conversations with multiple topics
- ✅ Need to reference earlier discussions
- ✅ Complex problem-solving over time
- ✅ Want AI to remember context automatically
Disable it when:
- ❌ Short, simple queries
- ❌ Each message is independent
- ❌ Browser performance concerns
- ❌ Want faster response times
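As a sketch of how the toggle might gate the pipeline (the `ChatSettings` shape and function names are hypothetical, not the actual PolyChat-AI code):

```typescript
// Hypothetical settings shape; the real settings object may differ.
interface ChatSettings {
  ragEnabled: boolean;
}

// Prepend retrieved history to the outgoing message only when RAG is on.
function applyRagContext(settings: ChatSettings, userMessage: string, retrieved: string[]): string {
  if (!settings.ragEnabled || retrieved.length === 0) {
    return userMessage; // RAG off (or nothing relevant found): pass through unchanged
  }
  return `Relevant earlier messages:\n${retrieved.join('\n')}\n\n${userMessage}`;
}
```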
Performance Considerations
Initial Load
~25MB model download (one-time)
- Cached in browser
- Only on first use
- Automatic background loading
Inference Speed
~50ms per message
- Fast enough for real-time
- Minimal impact on UX
- WebAssembly accelerated
Memory Usage
~100MB additional RAM
- Model in memory
- Embeddings cached
- Acceptable for modern browsers
Context Quality
Significantly better responses
- Relevant history included
- Coherent long conversations
- Better understanding
Use Cases
1. Long Technical Discussions
Diagnosis
“It seems to lag when scrolling through lists”
AI gets context about the React app and scrolling issues.
Solution Exploration
“I tried using useMemo but it didn’t help”
RAG retrieves previous messages about React and performance.
2. Project Planning
3. Code Review Across Sessions
Advanced Usage
Adjusting Number of Retrieved Messages
Adjust the maxMessages parameter passed to the retrieval call.
Similarity Threshold
Filter retrieved messages by a minimum similarity score.
Custom Embeddings
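A hedged sketch combining these tweaks - retrieval count, a minimum-similarity cutoff, and the option of another model - under assumed names (`selectContext` and the constants are illustrative, not the actual `ragService.ts` API):

```typescript
// Hypothetical tuning knobs.
const MAX_MESSAGES = 5;     // how many messages to retrieve
const MIN_SIMILARITY = 0.3; // drop weak matches below this cosine score
// For a specialized domain you could also point the pipeline at a different
// Transformers.js-compatible checkpoint instead of 'Xenova/all-MiniLM-L6-v2'.

interface ScoredMessage {
  content: string;
  score: number; // cosine similarity against the query embedding
}

// Keep only sufficiently similar messages, best first, then cap the count.
function selectContext(
  scored: ScoredMessage[],
  maxMessages = MAX_MESSAGES,
  minSimilarity = MIN_SIMILARITY,
): ScoredMessage[] {
  return scored
    .filter((m) => m.score >= minSimilarity)
    .sort((a, b) => b.score - a.score)
    .slice(0, maxMessages);
}
```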
For specialized domains, you could swap in a different Transformers.js-compatible embedding model.
Best Practices
When to Use RAG
Ideal Scenarios:
- Conversations spanning multiple sessions
- Complex problem-solving requiring history
- Technical support or debugging
- Project planning and decision tracking
- Learning sessions with progressive topics
Less Ideal Scenarios:
- Single-question queries
- Independent tasks
- Template-based conversations
- Quick factual questions
Optimizing Performance
Reduce Initial Load Time:
- Model loads on first RAG usage
- Pre-load if you know you’ll need it
- Cache is persistent across sessions
Reduce Memory Usage:
- Disable RAG for simple conversations
- Clear old conversations periodically
- Close unused tabs
Improve Retrieval Quality:
- Use clear, descriptive messages
- Keep conversations focused
- Start new chats for different topics
Understanding Limitations
Model Constraints:
- Works best with English text
- Limited to conversation history (no external docs)
- Semantic similarity is probabilistic
- May retrieve unexpected matches
Retrieval Limits:
- Only retrieves top 5 messages by default
- Very old messages may not be retrieved
- Token limits still apply to final prompt
Latency:
- Adds ~50-100ms per message
- Acceptable for most use cases
- May be noticeable on slow devices
Technical Details
Model Information
all-MiniLM-L6-v2:
- Size: ~25MB
- Dimensions: 384
- Max Sequence Length: 256 tokens
- Performance: 50-100ms per embedding
- Accuracy: 0.85+ on semantic similarity tasks
Cosine Similarity
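Cosine similarity scores how closely two embedding vectors point in the same direction, independent of their magnitudes. For the 384-dimensional vectors used here:

```latex
\text{similarity}(A, B) = \frac{A \cdot B}{\|A\|\,\|B\|}
= \frac{\sum_{i=1}^{384} A_i B_i}{\sqrt{\sum_{i=1}^{384} A_i^2}\,\sqrt{\sum_{i=1}^{384} B_i^2}}
```

Scores range from -1 (opposite meaning) to 1 (same direction). Because the embeddings are already L2-normalized, this reduces to a plain dot product.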
Integration with Chat
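One plausible shape for the integration, sketched under assumptions: retrieved messages are injected as a system message ahead of the new user message (the `ChatMessage` shape and `buildChatMessages` name are illustrative, and the real wiring in PolyChat-AI may differ):

```typescript
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Build the outgoing message list: retrieved history becomes a system
// message so the model treats it as background, not as a new user turn.
function buildChatMessages(retrieved: string[], userMessage: string): ChatMessage[] {
  const messages: ChatMessage[] = [];
  if (retrieved.length > 0) {
    messages.push({
      role: 'system',
      content: `Relevant context from earlier in this conversation:\n${retrieved.join('\n')}`,
    });
  }
  messages.push({ role: 'user', content: userMessage });
  return messages;
}
```

Keeping the context in a single system message also makes it easy to respect the provider's token limits by trimming `retrieved` before the call.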
Future Enhancements
These features are planned for future releases:
- Document Upload: Embed and search your own documents
- Cross-Conversation Search: Search across all conversations
- Custom Embedding Models: Use specialized domain models
- Hybrid Search: Combine semantic + keyword search
- Context Visualization: See which messages were retrieved and why
Troubleshooting
RAG is slow on first use
Cause: Model download (~25MB)
Solution:
- Wait for initial download (one-time)
- Model is cached for future sessions
- Subsequent uses are instant
Not retrieving expected messages
Possible Causes:
- Messages are semantically different than expected
- Other messages are more similar
- Message is beyond top 5 results
Solutions:
- Use more specific language
- Increase maxMessages parameter
- Check similarity scores in console (if debugging)
Browser performance issues
Symptoms: Lag, high memory usage
Solutions:
- Disable RAG for simple conversations
- Clear old conversations
- Close other tabs
- Use a more powerful device
RAG not working at all
Checklist:
- Is RAG enabled in settings?
- Is there conversation history?
- Check browser console for errors
- Try refreshing the page
- Clear browser cache if model is corrupted
Summary
PolyChat-AI’s RAG implementation provides:
- ✅ Privacy-first local embeddings
- ✅ Semantic search beyond keywords
- ✅ Automatic context enhancement
- ✅ Zero cost - no API calls
- ✅ Fast inference - ~50ms per message
- ✅ Easy to use - toggle in settings
Built on:
- @xenova/transformers for browser ML
- all-MiniLM-L6-v2 embedding model
- Cosine similarity for relevance
- Smart context selection (top 5 messages)