Prism uses Google’s Gemini API for three distinct tasks: generating vector embeddings for documents and queries, producing streaming conversational responses, and describing images so they can be indexed and searched. All Gemini calls go through lib/gemini.ts using the @google/generative-ai SDK.
import { GoogleGenerativeAI } from '@google/generative-ai';
const genAI = new GoogleGenerativeAI(process.env.NEXT_PUBLIC_GEMINI_API_KEY || '');

Models

Prism uses two Gemini models. Each is instantiated separately because they have different roles and different generation configurations.

gemini-2.5-flash

Used for chat responses (streaming), RAG-grounded answers, image descriptions, and chat title generation

text-embedding-004

Used for embedding document chunks and search queries — produces 768-dimensional vectors

Chat model configuration

The chat model is initialized once at module level with fixed generation parameters:
const model = genAI.getGenerativeModel({
  model: 'gemini-2.5-flash',
  generationConfig: {
    temperature: 0.7,
    topP: 0.95,
    topK: 40,
    maxOutputTokens: 2048,
  },
});
| Parameter | Value | Effect |
| --- | --- | --- |
| `temperature` | 0.7 | Balanced creativity; not deterministic, not erratic |
| `topP` | 0.95 | Nucleus sampling; considers the top 95% of probability mass |
| `topK` | 40 | Limits each token selection to the 40 most probable tokens |
| `maxOutputTokens` | 2048 | Hard ceiling on response length |

Streaming chat responses

generateChatResponse

async function generateChatResponse(
  messages: ChatMessage[],
  onChunk?: (chunk: string) => void
): Promise<string>
The ChatMessage type is { role: 'user' | 'model'; parts: string }. The function handles two cases:

Single-turn: if there is only one message (or only one user message after filtering), it calls model.generateContentStream directly with the message text.

Multi-turn: for conversations with history, it:
  1. Passes all but the last message as history to model.startChat
  2. Sends the latest message via chat.sendMessageStream
In both cases the response is streamed token by token. If onChunk is provided, it is called with each text fragment as it arrives. The complete response text is also returned as the resolved promise value.
The function filters the message list before building history: it keeps all user messages and only model messages that directly follow a user message. This prevents malformed turn sequences from reaching the API.
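The filtering rule can be sketched as a pure function. This is a simplified sketch; the real logic lives inline in lib/gemini.ts and may differ in detail:

```typescript
type ChatMessage = { role: 'user' | 'model'; parts: string };

// Keep every user message; keep a model message only when the most
// recently kept message was a user turn. This drops leading model
// messages and consecutive model turns, which the API rejects.
function filterHistory(messages: ChatMessage[]): ChatMessage[] {
  const kept: ChatMessage[] = [];
  for (const msg of messages) {
    if (msg.role === 'user') {
      kept.push(msg);
    } else if (kept.length > 0 && kept[kept.length - 1].role === 'user') {
      kept.push(msg);
    }
  }
  return kept;
}
```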

How the chat route uses streaming

The /api/chat route wraps generateChatResponse in a ReadableStream and returns it with Content-Type: text/event-stream. The SSE event format is:
data: {"sources": [{"index": 1, "documentId": "...", "documentName": "...", "score": 0.72}]}\n\n
data: {"chunk": "Here is"}\n\n
data: {"chunk": " the answer..."}\n\n
data: [DONE]\n\n
Source metadata is sent before any text chunks so the client can render citation badges while the response is still streaming.
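On the client, each data: frame can be decoded with a small parser. This is a hypothetical sketch assuming only the three event shapes shown above:

```typescript
type SSEEvent =
  | { type: 'sources'; sources: unknown[] }
  | { type: 'chunk'; text: string }
  | { type: 'done' };

// Decode a single "data: ..." line from the /api/chat stream.
// Returns null for lines that are not data frames.
function parseSSELine(line: string): SSEEvent | null {
  if (!line.startsWith('data: ')) return null;
  const payload = line.slice('data: '.length);
  if (payload === '[DONE]') return { type: 'done' };
  const parsed = JSON.parse(payload);
  if ('sources' in parsed) return { type: 'sources', sources: parsed.sources };
  if ('chunk' in parsed) return { type: 'chunk', text: parsed.chunk };
  return null;
}
```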

RAG responses

generateRAGResponse

async function generateRAGResponse(
  query: string,
  documentContext: string[],
  onChunk?: (chunk: string) => void
): Promise<string>
This function constructs a grounded prompt and streams the response. The prompt template:
You are an AI assistant named PRISM built by Neurhack. You have access to the
user's Prism document library. Answer the question based ONLY on the provided
context. If the context doesn't contain enough information, say so clearly.
Always be accurate and cite which document section your answer comes from.

CONTEXT FROM DOCUMENTS:
{context joined by "\n\n---\n\n"}

USER QUESTION: {query}

ANSWER:
The /api/chat route uses generateChatResponse rather than generateRAGResponse directly — it injects the retrieved chunks into the user message text before calling the chat function. generateRAGResponse is available as a standalone utility for callers that want a simpler single-turn RAG interface without conversation history.
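The injection step can be sketched as follows. The helper name and exact wording are assumptions for illustration, not the route's actual code:

```typescript
// Hypothetical sketch: prepend retrieved chunks to the user's message
// text before it is passed to generateChatResponse. The separator
// mirrors the "\n\n---\n\n" join used by generateRAGResponse.
function injectContext(query: string, chunks: string[]): string {
  const context = chunks.join('\n\n---\n\n');
  return `CONTEXT FROM DOCUMENTS:\n${context}\n\nUSER QUESTION: ${query}`;
}
```

Because the context rides inside an ordinary user message, the multi-turn history machinery of generateChatResponse works unchanged.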

Embeddings

generateEmbedding

Converts a single text string into a 768-dimensional float vector:
async function generateEmbedding(text: string): Promise<number[]>
Used during search and chat to embed the user’s query before the Qdrant similarity lookup.

batchGenerateEmbeddings

Converts an array of text strings into an array of 768-dimensional vectors:
async function batchGenerateEmbeddings(texts: string[]): Promise<number[][]>
Used during document indexing after chunkText splits a document. The function processes texts in batches of 100, running each batch in parallel with Promise.all:
const batchSize = 100;
for (let i = 0; i < texts.length; i += batchSize) {
  const batch = texts.slice(i, i + batchSize);
  const batchResults = await Promise.all(
    batch.map((text) => embeddingModel.embedContent(text))
  );
  results.push(...batchResults.map((r) => r.embedding.values));
}
Because embeddings within a batch are requested in parallel, a document with 300 chunks completes in 3 sequential rounds of 100 concurrent requests, rather than 300 serial API calls.
Both embedding functions use text-embedding-004, instantiated separately from the chat model. The embedding model has no generation configuration — it always returns exactly 768 dimensions.

Image analysis

generateImageDescription

async function generateImageDescription(
  imageBuffer: Buffer,
  mimeType: string,
  maxRetries?: number   // default: 3
): Promise<string>
Images cannot be embedded directly. Instead, Prism uses gemini-2.5-flash (vision) to generate a text description of the image, then embeds that description. The model receives the image as base64-encoded inline data alongside a structured prompt:
Analyze this image in detail and provide a comprehensive description. Include:
1. Main subjects or objects in the image
2. Actions or activities taking place
3. Setting, background, and environment
4. Colors, lighting, and visual style
5. Any text, logos, or symbols visible
6. Overall mood or purpose of the image

Provide a clear, searchable description that would help someone find this image later.
The six-point structure ensures the description covers aspects a user might query from different angles — visual content, context, embedded text, and purpose.
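The inline-data payload can be sketched as below; buildImagePart is a hypothetical helper, but the part shape matches what the @google/generative-ai SDK accepts alongside a text prompt:

```typescript
// Convert a raw image buffer into the SDK's inline-data part:
// base64-encoded bytes plus the MIME type.
function buildImagePart(imageBuffer: Buffer, mimeType: string) {
  return {
    inlineData: {
      data: imageBuffer.toString('base64'),
      mimeType,
    },
  };
}
```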

Retry logic

The function retries on HTTP 503 (Service Unavailable) and 429 (Rate Limited) responses using exponential backoff:
| Attempt | Delay before retry |
| --- | --- |
| 1 → 2 | 2 seconds (2^1 × 1000 ms) |
| 2 → 3 | 4 seconds (2^2 × 1000 ms) |
| 3 (final) | Throws the error |
If all retries are exhausted, the indexing route falls back to a minimal metadata description ("Image file: {name}. Format: PNG. Uploaded on...") so the document is still indexed, just with reduced semantic richness.
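The backoff schedule and retry loop can be sketched generically. The real function inlines this logic; how the 503/429 status is read off the SDK's error object is an assumption left to the isRetryable callback:

```typescript
// Delay before retrying attempt n (1-based): 2^n × 1000 ms.
function backoffDelayMs(attempt: number): number {
  return Math.pow(2, attempt) * 1000;
}

// Generic retry wrapper: retry while isRetryable(err) is true,
// waiting backoffDelayMs(attempt) between attempts; the final
// attempt rethrows instead of waiting.
async function withRetries<T>(
  fn: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  maxRetries = 3
): Promise<T> {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === maxRetries || !isRetryable(err)) throw err;
      await new Promise<void>((r) => setTimeout(r, backoffDelayMs(attempt)));
    }
  }
  throw new Error('unreachable');
}
```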

Chat title generation

generateChatTitle

async function generateChatTitle(messages: any[]): Promise<string>
Generates a short title for a conversation. It collects the first three user messages, joins them, and sends the following prompt to gemini-2.5-flash:
Based on this conversation, generate a very short, descriptive title
(maximum 5 words, no quotes or special characters). Just return the title text:

{userMessages}
The response is trimmed, stripped of quotes, and truncated to 50 characters. If the call fails for any reason, the function returns 'New Chat' rather than throwing.
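The post-processing can be sketched as follows; cleanTitle is a hypothetical name, and stripping only surrounding quote characters is an assumption about the exact rule:

```typescript
// Trim whitespace, strip surrounding quotes, cap at 50 characters,
// matching the post-processing described above.
function cleanTitle(raw: string): string {
  const cleaned = raw.trim().replace(/^["']+|["']+$/g, '');
  return cleaned.slice(0, 50);
}
```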
