Supported content types
| Category | Formats |
|---|---|
| Documents | PDF, DOCX, MD, TXT |
| Code | js, jsx, ts, tsx, py, java, cpp, c, h, hpp, cs, rb, go, rs, php, swift, kt, scala, r, css, scss, sass, html, xml, json, yaml, yml, sql, sh, bash, ps1, bat, cmake, dockerfile |
| Images | jpg, jpeg, png, gif, webp, bmp, svg |

How each type is processed
- PDF
- DOCX
- Markdown and plain text
- Code
- Images
PDF files are parsed page by page using pdf2json. Each page’s text content is extracted, and the pages are concatenated with double newlines between them. The resulting text is then split into overlapping chunks of up to 1,000 characters with a 200-character overlap, so context is not lost at chunk boundaries.
Scanned PDFs (image-only, no embedded text layer) cannot be extracted this way. Use an OCR tool to add a text layer before uploading.
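The page-joining step and the scanned-PDF caveat can be sketched roughly as follows. This is a minimal Python illustration, not the pdf2json API: `join_pages` and `looks_scanned` are hypothetical helpers, the input is assumed to be a list of per-page strings that a parser has already produced, and the "no text on any page" check is only one plausible heuristic for spotting an image-only file.

```python
def join_pages(page_texts: list[str]) -> str:
    """Concatenate per-page text with a double newline between pages."""
    return "\n\n".join(page_texts)


def looks_scanned(page_texts: list[str]) -> bool:
    """Heuristic (an assumption, not from the source): no page yielded any text."""
    return all(not text.strip() for text in page_texts)


# Hypothetical per-page text, as a parser might return it.
pages = ["Introduction\nThis report covers Q1.", "Results\nRevenue grew."]

if looks_scanned(pages):
    print("No text layer found; run OCR before uploading.")
else:
    document_text = join_pages(pages)
    print(document_text)
```

A file flagged this way would need an OCR pass before it can go through the rest of the pipeline.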
The processing pipeline
Regardless of file type, all content goes through the same downstream pipeline once text is extracted:
Extract text
Text is extracted from the file using the appropriate method for its format. For images, Gemini Vision generates a text description, which then serves as the image's extracted text.
Chunk
The text is split into overlapping chunks. The default chunk size is 1,000 characters with a 200-character overlap. Sentence and paragraph boundaries are respected where possible to avoid splitting mid-thought.
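One way to picture the chunking step is the sketch below. It uses the documented defaults (1,000 characters, 200-character overlap), but the function name `chunk_text`, the separator order (paragraph break first, then a period followed by a space), and the fallback to a hard cut are assumptions made for illustration. The source only says that boundaries are respected where possible.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters. Each cut prefers a
    paragraph break, then a sentence end, and otherwise falls back to a
    hard cut at chunk_size.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Only accept a boundary beyond the overlap zone, so every
            # iteration is guaranteed to move forward.
            window_start = start + overlap + 1
            for separator in ("\n\n", ". "):
                cut = text.rfind(separator, window_start, end)
                if cut != -1:
                    end = cut + len(separator)
                    break
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back so the next chunk repeats the tail of this one.
        start = end - overlap
    return chunks


sample = "First paragraph of notes.\n\nSecond paragraph with more detail. " * 40
for chunk in chunk_text(sample, chunk_size=200, overlap=40):
    print(len(chunk))
```

Because each chunk begins `overlap` characters before the previous one ended, a sentence that straddles a cut still appears whole in at least one chunk when it is shorter than the overlap.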
Embed
Each chunk is converted to a 768-dimensional vector by the Gemini text-embedding-004 model. Chunks are processed in batches of up to 100 at a time.
Image search in practice
Because image descriptions cover visual attributes (subjects, colors, setting, mood, text, and logos), you can search for images using descriptive phrases:
- “a dark background with white text showing a terminal output”
- “a flowchart showing a login sequence”
- “photo of a whiteboard with handwritten architecture diagram”
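The source does not say how a query phrase is matched against stored image descriptions. One common approach, assumed here purely for illustration, is to embed the query and rank the stored description vectors by cosine similarity. In the sketch below, `rank_images`, the file names, and the toy 3-dimensional vectors are all hypothetical stand-ins for real 768-dimensional embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_images(query_vector: list[float], image_vectors: dict[str, list[float]]) -> list[str]:
    """Return image names ordered from most to least similar to the query."""
    return sorted(
        image_vectors,
        key=lambda name: cosine_similarity(query_vector, image_vectors[name]),
        reverse=True,
    )


# Toy vectors standing in for embeddings of image descriptions (assumption).
image_vectors = {
    "terminal.png": [0.9, 0.1, 0.0],
    "flowchart.png": [0.1, 0.9, 0.1],
    "whiteboard.jpg": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a phrase describing terminal output.
query_vector = [0.8, 0.2, 0.1]
ranking = rank_images(query_vector, image_vectors)
print(ranking)
```

Under this assumption, an image is found because its generated description sits close to the query in embedding space, which is why phrases that describe what is visible tend to work better than file names.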