Prism uses Google’s Gemini API for three distinct tasks: generating vector embeddings for documents and queries, producing streaming conversational responses, and describing images so they can be indexed and searched. All Gemini calls go through lib/gemini.ts using the @google/generative-ai SDK.
import { GoogleGenerativeAI } from '@google/generative-ai';
const genAI = new GoogleGenerativeAI(process.env.NEXT_PUBLIC_GEMINI_API_KEY || '');

Models

Prism uses two Gemini models. Each is instantiated separately because they have different roles and different generation configurations.

gemini-2.5-flash

Used for chat responses (streaming), RAG-grounded answers, image descriptions, and chat title generation

text-embedding-004

Used for embedding document chunks and search queries — produces 768-dimensional vectors

Chat model configuration

The chat model is initialized once at module level with fixed generation parameters:
const model = genAI.getGenerativeModel({
  model: 'gemini-2.5-flash',
  generationConfig: {
    temperature: 0.7,
    topP: 0.95,
    topK: 40,
    maxOutputTokens: 2048,
  },
});
| Parameter | Value | Effect |
| --- | --- | --- |
| `temperature` | 0.7 | Balanced creativity; not deterministic, not erratic |
| `topP` | 0.95 | Nucleus sampling; considers the top 95% of probability mass |
| `topK` | 40 | Limits each token selection to the 40 most probable tokens |
| `maxOutputTokens` | 2048 | Hard ceiling on response length |

Streaming chat responses

generateChatResponse

async function generateChatResponse(
  messages: ChatMessage[],
  onChunk?: (chunk: string) => void
): Promise<string>
The ChatMessage type is { role: 'user' | 'model'; parts: string }. The function handles two cases:

Single-turn: if there is only one message (or only one user message after filtering), it calls model.generateContentStream directly with the message text.

Multi-turn: for conversations with history, it:
  1. Passes all but the last message as history to model.startChat
  2. Sends the latest message via chat.sendMessageStream
In both cases the response is streamed token by token. If onChunk is provided, it is called with each text fragment as it arrives. The complete response text is also returned as the resolved promise value.
The function filters the message list before building history: it keeps all user messages and only model messages that directly follow a user message. This prevents malformed turn sequences from reaching the API.
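The filtering rule can be sketched as a pure function. This is a simplified sketch; the real logic lives inline in lib/gemini.ts and may differ in detail:

```typescript
type ChatMessage = { role: 'user' | 'model'; parts: string };

// Keep every user message; keep a model message only when the most
// recently kept message was a user turn. This drops leading model
// messages and consecutive model turns, which the API rejects.
function filterHistory(messages: ChatMessage[]): ChatMessage[] {
  const kept: ChatMessage[] = [];
  for (const msg of messages) {
    if (msg.role === 'user') {
      kept.push(msg);
    } else if (kept.length > 0 && kept[kept.length - 1].role === 'user') {
      kept.push(msg);
    }
  }
  return kept;
}
```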

How the chat route uses streaming

The /api/chat route wraps generateChatResponse in a ReadableStream and returns it with Content-Type: text/event-stream. The SSE event format is:
data: {"sources": [{"index": 1, "documentId": "...", "documentName": "...", "score": 0.72}]}\n\n
data: {"chunk": "Here is"}\n\n
data: {"chunk": " the answer..."}\n\n
data: [DONE]\n\n
Source metadata is sent before any text chunks so the client can render citation badges while the response is still streaming.
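On the client, each data: frame can be decoded with a small parser. This is a hypothetical sketch assuming only the three event shapes shown above:

```typescript
type SSEEvent =
  | { type: 'sources'; sources: unknown[] }
  | { type: 'chunk'; text: string }
  | { type: 'done' };

// Decode a single "data: ..." line from the /api/chat stream.
// Returns null for lines that are not data frames.
function parseSSELine(line: string): SSEEvent | null {
  if (!line.startsWith('data: ')) return null;
  const payload = line.slice('data: '.length);
  if (payload === '[DONE]') return { type: 'done' };
  const parsed = JSON.parse(payload);
  if ('sources' in parsed) return { type: 'sources', sources: parsed.sources };
  if ('chunk' in parsed) return { type: 'chunk', text: parsed.chunk };
  return null;
}
```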

RAG responses

generateRAGResponse

async function generateRAGResponse(
  query: string,
  documentContext: string[],
  onChunk?: (chunk: string) => void
): Promise<string>
This function constructs a grounded prompt and streams the response. The prompt template:
You are an AI assistant named PRISM built by Neurhack. You have access to the
user's Prism document library. Answer the question based ONLY on the provided
context. If the context doesn't contain enough information, say so clearly.
Always be accurate and cite which document section your answer comes from.

CONTEXT FROM DOCUMENTS:
{context joined by "\n\n---\n\n"}

USER QUESTION: {query}

ANSWER:
The /api/chat route uses generateChatResponse rather than generateRAGResponse directly — it injects the retrieved chunks into the user message text before calling the chat function. generateRAGResponse is available as a standalone utility for callers that want a simpler single-turn RAG interface without conversation history.
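The injection step can be sketched as follows. The helper name and exact wording are assumptions for illustration, not the route's actual code:

```typescript
// Hypothetical sketch: prepend retrieved chunks to the user's message
// text before it is passed to generateChatResponse. The separator
// mirrors the "\n\n---\n\n" join used by generateRAGResponse.
function injectContext(query: string, chunks: string[]): string {
  const context = chunks.join('\n\n---\n\n');
  return `CONTEXT FROM DOCUMENTS:\n${context}\n\nUSER QUESTION: ${query}`;
}
```

Because the context rides inside an ordinary user message, the multi-turn history machinery of generateChatResponse works unchanged.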

Embeddings

generateEmbedding

Converts a single text string into a 768-dimensional float vector:
async function generateEmbedding(text: string): Promise<number[]>
Used during search and chat to embed the user’s query before the Qdrant similarity lookup.

batchGenerateEmbeddings

Converts an array of text strings into an array of 768-dimensional vectors:
async function batchGenerateEmbeddings(texts: string[]): Promise<number[][]>
Used during document indexing after chunkText splits a document. The function processes texts in batches of 100, running each batch in parallel with Promise.all:
const batchSize = 100;
for (let i = 0; i < texts.length; i += batchSize) {
  const batch = texts.slice(i, i + batchSize);
  const batchResults = await Promise.all(
    batch.map((text) => embeddingModel.embedContent(text))
  );
  results.push(...batchResults.map((r) => r.embedding.values));
}
Because embeddings within a batch are requested in parallel, a document with 300 chunks completes in 3 sequential rounds of 100 concurrent requests, rather than 300 serial API calls.
Both embedding functions use text-embedding-004, instantiated separately from the chat model. The embedding model has no generation configuration — it always returns exactly 768 dimensions.

Image analysis

generateImageDescription

async function generateImageDescription(
  imageBuffer: Buffer,
  mimeType: string,
  maxRetries?: number   // default: 3
): Promise<string>
Images cannot be embedded directly. Instead, Prism uses gemini-2.5-flash (vision) to generate a text description of the image, then embeds that description. The model receives the image as base64-encoded inline data alongside a structured prompt:
Analyze this image in detail and provide a comprehensive description. Include:
1. Main subjects or objects in the image
2. Actions or activities taking place
3. Setting, background, and environment
4. Colors, lighting, and visual style
5. Any text, logos, or symbols visible
6. Overall mood or purpose of the image

Provide a clear, searchable description that would help someone find this image later.
The six-point structure ensures the description covers aspects a user might query from different angles — visual content, context, embedded text, and purpose.
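The inline-data payload can be sketched as below; buildImagePart is a hypothetical helper, but the part shape matches what the @google/generative-ai SDK accepts alongside a text prompt:

```typescript
// Convert a raw image buffer into the SDK's inline-data part:
// base64-encoded bytes plus the MIME type.
function buildImagePart(imageBuffer: Buffer, mimeType: string) {
  return {
    inlineData: {
      data: imageBuffer.toString('base64'),
      mimeType,
    },
  };
}
```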

Retry logic

The function retries on HTTP 503 (Service Unavailable) and 429 (Rate Limited) responses using exponential backoff:
| Attempt | Delay before retry |
| --- | --- |
| 1 → 2 | 2 seconds (2^1 × 1000 ms) |
| 2 → 3 | 4 seconds (2^2 × 1000 ms) |
| 3 (final) | Throws the error |
If all retries are exhausted, the indexing route falls back to a minimal metadata description ("Image file: {name}. Format: PNG. Uploaded on...") so the document is still indexed, just with reduced semantic richness.
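The backoff schedule and retry loop can be sketched generically. The real function inlines this logic; how the 503/429 status is read off the SDK's error object is an assumption left to the isRetryable callback:

```typescript
// Delay before retrying attempt n (1-based): 2^n × 1000 ms.
function backoffDelayMs(attempt: number): number {
  return Math.pow(2, attempt) * 1000;
}

// Generic retry wrapper: retry while isRetryable(err) is true,
// waiting backoffDelayMs(attempt) between attempts; the final
// attempt rethrows instead of waiting.
async function withRetries<T>(
  fn: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  maxRetries = 3
): Promise<T> {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === maxRetries || !isRetryable(err)) throw err;
      await new Promise<void>((r) => setTimeout(r, backoffDelayMs(attempt)));
    }
  }
  throw new Error('unreachable');
}
```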

Chat title generation

generateChatTitle

async function generateChatTitle(messages: any[]): Promise<string>
Generates a short title for a conversation. It collects the first three user messages, joins them, and sends the following prompt to gemini-2.5-flash:
Based on this conversation, generate a very short, descriptive title
(maximum 5 words, no quotes or special characters). Just return the title text:

{userMessages}
The response is trimmed, stripped of quotes, and truncated to 50 characters. If the call fails for any reason, the function returns 'New Chat' rather than throwing.
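The post-processing can be sketched as follows; cleanTitle is a hypothetical name, and stripping only surrounding quote characters is an assumption about the exact rule:

```typescript
// Trim whitespace, strip surrounding quotes, cap at 50 characters,
// matching the post-processing described above.
function cleanTitle(raw: string): string {
  const cleaned = raw.trim().replace(/^["']+|["']+$/g, '');
  return cleaned.slice(0, 50);
}
```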
