Multimodal support

Prism handles more than text documents: you can upload images, source code in dozens of languages, PDFs, and Word documents. Every content type is processed into semantic chunks, embedded, and indexed, so everything is searchable by meaning and available as RAG context.

Supported content types

Category	Formats
Documents	PDF, DOCX, MD, TXT
Code	js, jsx, ts, tsx, py, java, cpp, c, h, hpp, cs, rb, go, rs, php, swift, kt, scala, r, css, scss, sass, html, xml, json, yaml, yml, sql, sh, bash, ps1, bat, cmake, dockerfile
Images	jpg, jpeg, png, gif, webp, bmp, svg
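As a rough illustration of how the table above could drive routing, the sketch below maps a filename to one of the three categories. The extension lists are copied from the table; the names `Category`, `EXTENSIONS`, and `categorize` are hypothetical, and treating an extensionless `Dockerfile` as code is an assumption rather than documented Prism behavior.

```typescript
type Category = "document" | "code" | "image";

// Extension lists copied from the table above. The constant name is illustrative.
const EXTENSIONS: Record<Category, readonly string[]> = {
  document: ["pdf", "docx", "md", "txt"],
  code: [
    "js", "jsx", "ts", "tsx", "py", "java", "cpp", "c", "h", "hpp", "cs", "rb",
    "go", "rs", "php", "swift", "kt", "scala", "r", "css", "scss", "sass",
    "html", "xml", "json", "yaml", "yml", "sql", "sh", "bash", "ps1", "bat",
    "cmake", "dockerfile",
  ],
  image: ["jpg", "jpeg", "png", "gif", "webp", "bmp", "svg"],
};

// Returns the category for a filename, or undefined for unsupported formats.
function categorize(filename: string): Category | undefined {
  const lower = filename.toLowerCase();
  const base = lower.slice(lower.lastIndexOf("/") + 1);
  // Assumption: a file with no dot (for example "Dockerfile") is matched by its whole name.
  const dot = base.lastIndexOf(".");
  const ext = dot >= 0 ? base.slice(dot + 1) : base;
  const categories: Category[] = ["document", "code", "image"];
  for (const category of categories) {
    if (EXTENSIONS[category].indexOf(ext) !== -1) return category;
  }
  return undefined;
}
```

A caller could reject the upload early when `categorize` returns `undefined`, before any extraction work starts.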

How each type is processed

PDF files are parsed page by page using pdf2json. Each page’s text content is extracted and concatenated with double newlines between pages. The resulting text is then split into overlapping chunks of up to 1,000 characters with a 200-character overlap, so context is not lost at chunk boundaries.
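The arithmetic behind the chunk size and overlap can be sketched as follows. This assumes a plain fixed-stride window (each chunk starts 1,000 − 200 = 800 characters after the previous one) and ignores any boundary adjustment; `joinPages` and `windowStarts` are hypothetical helper names, not Prism functions.

```typescript
const CHUNK_SIZE = 1000;
const CHUNK_OVERLAP = 200;

// Concatenate per-page text with a blank line between pages.
function joinPages(pages: string[]): string {
  return pages.join("\n\n");
}

// Start offsets of each chunk for a text of the given length,
// assuming a fixed stride of size - overlap characters.
function windowStarts(
  textLength: number,
  size: number = CHUNK_SIZE,
  overlap: number = CHUNK_OVERLAP,
): number[] {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const stride = size - overlap;
  const starts: number[] = [];
  for (let start = 0; start < textLength; start += stride) {
    starts.push(start);
    if (start + size >= textLength) break; // this window already reaches the end
  }
  return starts;
}

// A 2,500-character document yields chunks starting at 0, 800, and 1600.
const exampleStarts = windowStarts(2500);
```

Under these assumptions, characters 800 to 999 appear in both the first and second chunk, which is what keeps a sentence that straddles the boundary retrievable from either side.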

Scanned PDFs (image-only files with no embedded text layer) yield no text with this method. Run them through an OCR tool to add a text layer before uploading.

Word documents are processed using the mammoth library, which extracts raw text and discards formatting, along with headers, footers, and any tables that cannot be represented as plain text. The extracted content is then chunked and indexed the same way as PDFs.

Markdown (.md) and plain text (.txt) files are read as raw UTF-8 text. Markdown files use a header-aware chunking strategy: content is split at heading boundaries (#, ##, etc.) first, then further divided if individual sections exceed the chunk size. This keeps logical sections together.
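A minimal sketch of header-aware chunking, under stated assumptions: headings are ATX style (`#` through `######` followed by a space), oversized sections are cut at fixed offsets without overlap, and heading-like lines inside fenced code blocks are not special-cased. The function names are hypothetical, and the actual Prism splitter may differ on all of these points.

```typescript
// Split Markdown into sections, starting a new section at every heading line.
function splitByHeadings(markdown: string): string[] {
  const sections: string[] = [];
  let current: string[] = [];
  for (const line of markdown.split("\n")) {
    // An ATX heading is one to six '#' characters followed by whitespace.
    if (/^#{1,6}\s/.test(line) && current.length > 0) {
      sections.push(current.join("\n"));
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) sections.push(current.join("\n"));
  // Drop sections that contain only whitespace.
  return sections.filter((section) => section.trim().length > 0);
}

// Keep each section whole when it fits; otherwise cut it into maxChars pieces.
function chunkMarkdown(markdown: string, maxChars: number = 1000): string[] {
  const chunks: string[] = [];
  for (const section of splitByHeadings(markdown)) {
    for (let i = 0; i < section.length; i += maxChars) {
      chunks.push(section.slice(i, i + maxChars));
    }
  }
  return chunks;
}
```

Because the split happens at headings first, a short section is never merged with its neighbor, so a search hit on a chunk maps back to one logical section of the document.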

Images are handled differently from text-based files. Prism sends the image to Gemini Vision (gemini-2.5-flash) with a detailed analysis prompt. Gemini returns a textual description covering:

Main subjects or objects in the image
Actions or activities taking place
Setting, background, and environment
Colors, lighting, and visual style
Any visible text, logos, or symbols
Overall mood or purpose

This description becomes the text that gets chunked and embedded, which makes the image searchable by meaning even when it contains no extractable text. The description is also used as RAG context when you chat with an image.

The processing pipeline

Regardless of file type, all content goes through the same four-step pipeline:

Extract text

Text is extracted from the file using the appropriate method for its format. For images, Gemini Vision generates the description.

Chunk

The text is split into overlapping chunks. The default chunk size is 1,000 characters with a 200-character overlap. Sentence and paragraph boundaries are respected where possible to avoid splitting mid-thought.
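One way to respect boundaries "where possible" is to look for a paragraph break or sentence end in the back half of each window and cut there. The sketch below illustrates that idea; the search heuristic, the half-window threshold, and the name `chunkText` are assumptions, not the documented Prism algorithm.

```typescript
function chunkText(text: string, size: number = 1000, overlap: number = 200): string[] {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + size, text.length);
    if (end < text.length) {
      // Prefer a paragraph break, then a sentence end, but only in the
      // back half of the window so chunks do not become too small.
      const span = text.slice(start, end);
      const minCut = Math.floor(size / 2);
      const paragraphBreak = span.lastIndexOf("\n\n");
      const sentenceEnd = span.lastIndexOf(". ");
      if (paragraphBreak >= minCut) {
        end = start + paragraphBreak;
      } else if (sentenceEnd >= minCut) {
        end = start + sentenceEnd + 1; // keep the period with its sentence
      }
    }
    const chunk = text.slice(start, end).trim();
    if (chunk.length > 0) chunks.push(chunk);
    if (end >= text.length) break;
    // Step back by the overlap, but always move forward by at least one character.
    start = Math.max(end - overlap, start + 1);
  }
  return chunks;
}
```

When no suitable boundary exists in the window, the sketch falls back to a hard cut at the size limit, so no chunk exceeds `size` characters.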

Embed

Each chunk is converted to a 768-dimensional vector by the Gemini text-embedding-004 model. Chunks are processed in batches of up to 100 at a time.
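The batching step can be sketched independently of the embedding client. Here `embedBatch` is a hypothetical stand-in for whatever call Prism makes to the embedding model; it is synchronous only to keep the sketch short, whereas a real call would be an asynchronous network request.

```typescript
const EMBEDDING_DIMENSIONS = 768;
const MAX_BATCH_SIZE = 100;

// Split a list into consecutive batches of at most batchSize items.
function toBatches<T>(items: T[], batchSize: number = MAX_BATCH_SIZE): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Hypothetical embedding function: one vector per input text.
type EmbedBatch = (texts: string[]) => number[][];

// Embed all chunks batch by batch and verify the vector dimensionality.
function embedAll(chunks: string[], embedBatch: EmbedBatch): number[][] {
  const vectors: number[][] = [];
  for (const batch of toBatches(chunks)) {
    for (const vector of embedBatch(batch)) {
      if (vector.length !== EMBEDDING_DIMENSIONS) {
        throw new Error("expected a " + EMBEDDING_DIMENSIONS + "-dimensional vector");
      }
      vectors.push(vector);
    }
  }
  return vectors;
}
```

With these numbers, a document that produces 250 chunks is embedded in three requests of 100, 100, and 50 chunks.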

Index

The vectors are upserted into Qdrant with metadata including document name, type, category, chunk index, and upload date. All content is then searchable.
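A Qdrant point consists of an id, a vector, and a payload, and ids must be unsigned integers or UUIDs. The sketch below builds points carrying the metadata listed above. The payload field names, the `DocumentInfo` shape, and the sequential integer id scheme are all assumptions for illustration; the actual Prism schema is not shown on this page.

```typescript
// Hypothetical payload shape based on the metadata fields named above.
interface ChunkPayload {
  documentName: string;
  type: string;
  category: string;
  chunkIndex: number;
  uploadDate: string; // ISO 8601 timestamp
}

interface Point {
  id: number;
  vector: number[];
  payload: ChunkPayload;
}

interface DocumentInfo {
  name: string;
  type: string;
  category: string;
  uploadedAt: Date;
}

// Build one point per chunk vector. Ids are assigned sequentially from firstId,
// which is an illustrative scheme only.
function buildPoints(doc: DocumentInfo, vectors: number[][], firstId: number): Point[] {
  return vectors.map((vector, chunkIndex) => ({
    id: firstId + chunkIndex,
    vector,
    payload: {
      documentName: doc.name,
      type: doc.type,
      category: doc.category,
      chunkIndex,
      uploadDate: doc.uploadedAt.toISOString(),
    },
  }));
}
```

The resulting array is the kind of structure that would be sent as the `points` field of a Qdrant upsert request; storing `chunkIndex` in the payload lets a search hit be traced back to its position in the source document.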

Image search in practice

Because image descriptions cover visual attributes (subjects, colors, setting, mood, text, and logos), you can search for images using descriptive phrases:

“a dark background with white text showing a terminal output”
“a flowchart showing a login sequence”
“photo of a whiteboard with handwritten architecture diagram”

The same description is available as context when you chat, so you can ask questions like “what does the diagram in architecture.png show?” and get a substantive answer.

If Gemini Vision is temporarily unavailable during upload, Prism retries up to three times with exponential backoff before failing the image ingestion.
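The retry behavior can be sketched as a generic wrapper. This reads "retries up to three times" as one initial attempt plus up to three retries, and assumes a 1-second base delay that doubles each time; both readings are assumptions, and `withRetry` and `backoffDelays` are hypothetical names.

```typescript
// Delay before each retry: baseMs, 2 * baseMs, 4 * baseMs, ...
function backoffDelays(retries: number = 3, baseMs: number = 1000): number[] {
  const delays: number[] = [];
  for (let i = 0; i < retries; i++) delays.push(baseMs * Math.pow(2, i));
  return delays;
}

// Default sleep based on setTimeout; injectable so tests can skip real waiting.
function defaultSleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Run the operation, retrying after each delay. The final attempt is not
// wrapped, so its error propagates to the caller and the ingestion fails.
async function withRetry<T>(
  operation: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = defaultSleep,
  delays: number[] = backoffDelays(),
): Promise<T> {
  for (const delay of delays) {
    try {
      return await operation();
    } catch {
      await sleep(delay);
    }
  }
  return operation();
}
```

A production version would likely retry only on transient errors (timeouts, rate limits, 5xx responses) and fail immediately on errors such as an invalid image, but the page does not say which errors Prism treats as retryable.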
