The RAG (Retrieval-Augmented Generation) service provides semantic search capabilities using transformer-based embeddings to find relevant context from conversation history.

Overview

The service uses the Xenova/all-MiniLM-L6-v2 model for generating embeddings and performs cosine similarity calculations to retrieve the most relevant messages from chat history.

Core Function

getRelevantContext

Retrieves the most semantically relevant messages from conversation history based on a query.
getRelevantContext(
  query: string,
  history: Message[],
  maxMessages?: number
): Promise<Message[]>
Parameters:
  • query (string, required): The user query or current message to find relevant context for
  • history (Message[], required): Array of previous conversation messages to search through
  • maxMessages (number, default: 5): Maximum number of relevant messages to return

Returns:
  • relevantMessages (Message[]): Array of the most relevant messages, sorted chronologically
Example:
import { getRelevantContext } from './services/ragService';

const history = [
  {
    id: '1',
    role: 'user',
    content: 'What is machine learning?',
    timestamp: new Date('2024-01-01')
  },
  {
    id: '2',
    role: 'assistant',
    content: 'Machine learning is a subset of AI...',
    timestamp: new Date('2024-01-01')
  },
  // ... more messages
];

const relevantMsgs = await getRelevantContext(
  'How does neural network training work?',
  history,
  3
);

console.log(relevantMsgs);
// Returns top 3 most relevant messages about ML/training

How It Works

1. Embedding Generation

The service generates embeddings using a singleton pattern to ensure the model is loaded only once:
// Internal implementation
import { pipeline, FeatureExtractionPipeline } from '@xenova/transformers';

class EmbeddingService {
  private static instance: FeatureExtractionPipeline | null = null;

  static async getInstance() {
    if (this.instance === null) {
      this.instance = await pipeline(
        'feature-extraction',
        'Xenova/all-MiniLM-L6-v2'
      );
    }
    return this.instance;
  }
}

2. Semantic Similarity

Embeddings are compared using cosine similarity to find the most relevant messages:
  1. Query is converted to an embedding vector
  2. All history messages are converted to embeddings
  3. Cosine similarity is calculated between query and each message
  4. Top N messages with highest similarity scores are selected
  5. Results are sorted chronologically for natural conversation flow
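
The selection step is small enough to sketch in full. This is an illustrative sketch rather than the service's exact code (cosineSimilarity and selectTopN are hypothetical names); note that for normalized embeddings the cosine reduces to a plain dot product:
// Cosine similarity between two vectors; for unit-length (normalized)
// vectors the denominator is 1, so this is effectively a dot product
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score each message against the query, keep the top N,
// then restore chronological order for natural conversation flow
function selectTopN(
  queryEmbedding: number[],
  messageEmbeddings: number[][],
  history: Message[],
  maxMessages: number
): Message[] {
  return history
    .map((message, i) => ({
      message,
      score: cosineSimilarity(queryEmbedding, messageEmbeddings[i]),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, maxMessages)
    .sort((a, b) => a.message.timestamp.getTime() - b.message.timestamp.getTime())
    .map((entry) => entry.message);
}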

3. Performance Optimization

  • Singleton pattern: Model loaded once and reused
  • Batch processing: Embeddings generated in parallel
  • Mean pooling: Efficient vector representation
  • Normalized embeddings: Optimized similarity calculations

Internal Functions

getEmbeddings (Internal)

Generates embeddings for a batch of text inputs.
async function getEmbeddings(texts: string[]): Promise<number[][]>
This function:
  • Takes an array of text strings
  • Generates embeddings using the MiniLM model
  • Uses mean pooling for aggregation
  • Normalizes vectors for consistent similarity scores
  • Returns array of embedding vectors
Example workflow:
// Input texts
const texts = [
  'What is TypeScript?',
  'user: Explain React hooks',
  'assistant: React hooks are functions...'
];

// Generated embeddings (simplified)
const embeddings = await getEmbeddings(texts);
// Returns: [[0.1, 0.2, ...], [0.3, 0.1, ...], ...]
// Each embedding is a 384-dimensional vector
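
A possible implementation, sketched under the assumption that the service calls the Transformers.js pipeline directly (pooling and normalize are real Transformers.js options; the exact body is illustrative):
async function getEmbeddings(texts: string[]): Promise<number[][]> {
  // Reuse the singleton pipeline so the model is loaded only once
  const extractor = await EmbeddingService.getInstance();

  // One batched call for all texts; mean pooling collapses per-token
  // vectors into a single 384-dimensional vector per text, and
  // normalize: true scales each vector to unit length
  const output = await extractor(texts, { pooling: 'mean', normalize: true });

  // Convert the returned tensor into plain nested arrays
  return output.tolist();
}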

Use Cases

1. Context-Aware Responses

const userQuery = 'Can you elaborate on that?';
const context = await getRelevantContext(userQuery, conversationHistory, 3);

// Send context + query to AI model
const messagesWithContext = [...context, { role: 'user', content: userQuery }];
const response = await fetchAIResponse(messagesWithContext, apiKey, model);

2. Long Conversation Management

// For long conversations, retrieve only relevant parts
if (conversationHistory.length > 50) {
  const relevantContext = await getRelevantContext(
    currentQuery,
    conversationHistory,
    10 // Get top 10 relevant messages
  );
  
  // Use only relevant context instead of full history
  const response = await streamAIResponse(
    relevantContext,
    apiKey,
    model,
    onChunk
  );
}
3. Topic Search

// Find all messages related to a specific topic
const topicQuery = 'database optimization techniques';
const relatedMessages = await getRelevantContext(
  topicQuery,
  allMessages,
  20
);

// Display related discussion points
console.log('Related discussions:', relatedMessages);

Type Definitions

Message

interface Message {
  id: string;
  role: 'user' | 'assistant' | 'system';
  content: string | MessageContent[];
  timestamp: Date;
}
Note: When content is MessageContent[], the RAG service extracts only text content for embedding generation.
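
In code, that extraction might look like the following sketch. The MessageContent shape shown here is a hypothetical stand-in; the real type lives in the app's type definitions:
// Hypothetical MessageContent shape, for illustration only
type MessageContent =
  | { type: 'text'; text: string }
  | { type: 'image_url'; url: string };

// Flatten a message's content to plain text before embedding
function extractText(content: string | MessageContent[]): string {
  if (typeof content === 'string') return content;
  return content
    .flatMap((part) => (part.type === 'text' ? [part.text] : []))
    .join(' ');
}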

Model Information

Xenova/all-MiniLM-L6-v2

  • Type: Sentence transformer model
  • Embedding dimension: 384
  • Use case: Semantic similarity and search
  • Performance: Fast, lightweight, browser-compatible
  • Library: @xenova/transformers (Transformers.js)

Best Practices

  1. Limit maxMessages: Keep between 3 and 10 for an optimal context window
  2. Filter history: Remove system messages or metadata before searching
  3. Cache embeddings: Consider caching embeddings for frequently accessed messages (see the sketch after this list)
  4. Handle empty history: Check if history is empty before calling
if (history.length === 0) {
  // No context available
  return [];
}

const context = await getRelevantContext(query, history, 5);
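
For best practice 3, a minimal cache sketch keyed by message id. The wrapper below is hypothetical and assumes access to the internal getEmbeddings helper described above; if that helper is not exported, the same idea applies to whatever embedding entry point the app exposes:
// Hypothetical cache: skip re-embedding messages seen before
const embeddingCache = new Map<string, number[]>();

async function getCachedEmbeddings(messages: Message[]): Promise<number[][]> {
  // Embed only the messages that are not cached yet
  const missing = messages.filter((m) => !embeddingCache.has(m.id));
  if (missing.length > 0) {
    const fresh = await getEmbeddings(missing.map((m) => String(m.content)));
    missing.forEach((m, i) => embeddingCache.set(m.id, fresh[i]));
  }
  // Return embeddings in the original message order
  return messages.map((m) => embeddingCache.get(m.id)!);
}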

Error Handling

Wrap calls to the service in try/catch so the app can degrade gracefully if the model fails to load or embedding fails:
try {
  const context = await getRelevantContext(query, history);
  // Use context
} catch (error) {
  console.error('RAG service error:', error);
  // Fall back to the most recent messages instead
  const fallback = history.slice(-5);
  // ...and use fallback as the context
}
