Document ingestion is how you teach AnythingLLM what to know. When you upload a file or submit a URL, a dedicated collector service parses the raw content, splits it into overlapping text chunks, generates vector embeddings for each chunk, and stores those embeddings in the configured vector database. During a chat, the user’s query is embedded using the same model, and the closest matching chunks are injected into the LLM’s context window alongside the conversation. This is Retrieval-Augmented Generation (RAG) in action.
Supported File Types
AnythingLLM’s collector service accepts a wide range of formats out of the box:
- Text & Markup: .txt, .md, .org, .adoc, .rst, .html
- Office Documents: .pdf, .docx, .pptx, .xlsx, .odt, .odp, .epub, .mbox
- Data Files: .csv, .json
- Audio & Video: .mp3, .wav, .mp4, .mpeg, .ogg, .oga, .opus, .m4a, .webm
- Images: .png, .jpg, .jpeg, .webp (OCR is applied automatically)
- Links: Web URLs, YouTube video URLs (transcript extraction)
Audio and video files are transcribed to text using the configured Speech-to-Text provider before chunking. The resulting transcript is what gets embedded and retrieved during chats.
The Ingestion Pipeline
Every document, whether it is a local file upload or a scraped URL, passes through the same pipeline:
Upload
The file is received by the AnythingLLM server and forwarded to the collector service (a separate process that handles all parsing). Files can be uploaded via the Document Manager in the UI or via the REST API.
Parse & Convert
The collector identifies the file type and routes it to the appropriate converter:
- PDFs → text extraction; falls back to OCR if no text layer is found
- DOCX, PPTX, ODT, ODP → LibreOffice-compatible MIME parser
- XLSX → spreadsheet-to-text converter
- Images (PNG, JPG, JPEG, WEBP) → Tesseract OCR
- Audio / video → Speech-to-Text transcription
- URLs → web scraper or YouTube transcript extractor
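The routing rules above can be sketched as a simple lookup. This is an illustration only, not the collector's actual code: the function name pick_converter and the converter labels are hypothetical, and formats whose converter is not described above fall through to a placeholder label.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

OFFICE = (".docx", ".pptx", ".odt", ".odp")
IMAGES = (".png", ".jpg", ".jpeg", ".webp")
MEDIA = (".mp3", ".wav", ".mp4", ".mpeg", ".ogg", ".oga", ".opus", ".m4a", ".webm")

# Hypothetical labels standing in for the real converters.
CONVERTERS = {".pdf": "pdf-text-then-ocr-fallback", ".xlsx": "spreadsheet-to-text"}
CONVERTERS.update({ext: "office-parser" for ext in OFFICE})
CONVERTERS.update({ext: "tesseract-ocr" for ext in IMAGES})
CONVERTERS.update({ext: "speech-to-text" for ext in MEDIA})


def pick_converter(source: str) -> str:
    """Return a converter label for a file name or URL."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        host = (parsed.hostname or "").lower()
        is_youtube = host == "youtu.be" or host == "youtube.com" or host.endswith(".youtube.com")
        return "youtube-transcript" if is_youtube else "web-scraper"
    extension = PurePosixPath(source).suffix.lower()
    # Formats not covered by the list above get a placeholder label.
    return CONVERTERS.get(extension, "not-described-here")
```

For instance, pick_converter("Scan.PNG") selects the OCR label because the extension is lower-cased before the lookup.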
Chunk
Extracted text is split into overlapping chunks. Chunk size and overlap are configured globally. Smaller chunks improve retrieval precision; larger chunks preserve more surrounding context per result.
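To make the size and overlap trade-off concrete, here is a minimal sliding-window sketch. It assumes whitespace-separated words as a stand-in for tokens; the function name chunk_text is hypothetical and this is not AnythingLLM's own splitter.

```python
def chunk_text(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be in [0, size)")
    words = text.split()
    if not words:
        return []
    step = size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while True:
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the window already reached the end of the text
        start += step
    return chunks
```

Each chunk repeats the last `overlap` words of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.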
Embed
Each chunk is sent to the configured embedding model (OpenAI text-embedding-3-small, a local model via Ollama, or another supported provider) and converted to a high-dimensional vector.
Store in Vector DB
Vectors and their associated metadata (source filename, page number, chunk index, etc.) are written to the vector database — LanceDB by default, or an external provider like Pinecone, Chroma, Weaviate, Qdrant, or Milvus.
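The embed, store, and retrieve steps can be illustrated with a toy in-memory version. Everything here is a stand-in chosen for readability: toy_embed is a bag-of-words counter rather than a real embedding model, and InMemoryStore is a plain list rather than LanceDB or any other vector database. All names are hypothetical.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: a sparse word-count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryStore:
    """Keeps (vector, text, metadata) rows and ranks them by cosine similarity."""

    def __init__(self) -> None:
        self.rows = []

    def add(self, text: str, metadata: dict) -> None:
        self.rows.append((toy_embed(text), text, metadata))

    def search(self, query: str, top_k: int = 2) -> list:
        query_vec = toy_embed(query)  # the query uses the same embedding as the chunks
        scored = [(cosine(query_vec, vec), text, meta) for vec, text, meta in self.rows]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:top_k]
```

A real deployment replaces both pieces, but the shape is the same: the query is embedded with the same model as the chunks, and the metadata stored next to each vector is what later identifies the source.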
OCR Support
For image-based PDFs and standalone image files (PNG, JPG, JPEG, WEBP), AnythingLLM uses Tesseract.js to extract text automatically. The OCR language defaults to English (eng). To recognize other languages, set the TARGET_OCR_LANG environment variable to a comma-separated list of Tesseract language codes. Language data is cached in storage/models/tesseract/ so subsequent runs are faster.
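As an illustration of the expected format, the sketch below reads a comma-separated TARGET_OCR_LANG value and falls back to eng when it is unset. The parsing function is hypothetical and only mirrors the behavior described above; eng, deu, and fra are standard Tesseract language codes used as sample values.

```python
import os


def ocr_languages(environ=os.environ) -> list[str]:
    """Parse TARGET_OCR_LANG into a list of codes, defaulting to English."""
    raw = environ.get("TARGET_OCR_LANG", "")
    codes = [code.strip() for code in raw.split(",") if code.strip()]
    return codes or ["eng"]


# Example: a value such as TARGET_OCR_LANG=eng,deu,fra yields three codes.
print(ocr_languages({"TARGET_OCR_LANG": "eng,deu,fra"}))
```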
Document Manager
The Document Manager is the central hub for all uploaded content. From there you can:
- Browse all documents organized in folders
- Move documents between folders
- Add documents to a workspace by selecting them and clicking Save & Embed
- Remove documents from a workspace without deleting the underlying file
- Pin documents so they are always included in context (bypassing the similarity search)
- Mark documents as watched so they are automatically re-embedded when their source changes
Watched / Synced Documents
Set a document’s watched flag to true and AnythingLLM will re-run the ingestion pipeline against the original source on a regular schedule (every hour by default). This is useful for web pages, RSS feeds, or internal wikis that change frequently.
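The scheduling rule can be pictured as a simple staleness check. This is a sketch under assumptions: the function and parameter names are hypothetical, and the real scheduler may track sync state differently.

```python
from datetime import datetime, timedelta
from typing import Optional

SYNC_INTERVAL = timedelta(hours=1)  # the default interval stated above


def is_due_for_resync(
    watched: bool,
    last_synced: Optional[datetime],
    now: datetime,
    interval: timedelta = SYNC_INTERVAL,
) -> bool:
    """Return True when a watched document should be ingested again."""
    if not watched:
        return False  # unwatched documents are never re-embedded automatically
    if last_synced is None:
        return True  # never synced before
    return now - last_synced >= interval
```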
Pinned Documents
Pinned documents bypass the similarity threshold entirely. Every chunk from a pinned document is injected into the context window on every turn. This is ideal for short reference documents that must always be present, such as a company FAQ, a product spec sheet, or a glossary.
Citations in Chat Responses
When a response is grounded in retrieved document chunks, AnythingLLM appends a Sources section to the response listing the document name, page number, and the chunk text used. Users can expand each source to read the exact excerpt the model referenced. Metadata stored with each chunk (title, page number, and the original chunkSource URI) is surfaced as part of the citation display.
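A rough sketch of how chunk metadata could be turned into citation lines follows. Only chunkSource is a field name taken from the description above; the other dictionary keys, the function name, and the output layout are assumptions made for illustration.

```python
def format_sources(chunks: list[dict]) -> list[str]:
    """Build one human-readable citation line per retrieved chunk."""
    lines = []
    for number, chunk in enumerate(chunks, start=1):
        page = chunk.get("page")
        location = f", page {page}" if page is not None else ""
        lines.append(
            f"[{number}] {chunk['title']}{location} ({chunk['chunkSource']}): {chunk['text']}"
        )
    return lines
```

Sources without a page number, such as a scraped web page, simply omit the page part of the line.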
API Reference
Upload a file
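The following is a hedged sketch of a file upload request built with the Python standard library. The endpoint path /v1/document/upload, the form field name file, the bearer-token header, and the base URL are all assumptions; confirm them against the API documentation served by your own instance before relying on them. The function only builds the request, so nothing is sent until you pass it to urllib.request.urlopen.

```python
import uuid
from urllib import request


def build_upload_file_request(
    base_url: str, api_key: str, filename: str, content: bytes
) -> request.Request:
    """Build (but do not send) a multipart/form-data upload request."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        # Assumed form field name: "file".
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        content,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return request.Request(
        url=f"{base_url.rstrip('/')}/v1/document/upload",  # assumed path
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# Hypothetical usage; sending requires a running instance and a valid key:
# req = build_upload_file_request("http://localhost:3001/api", "MY_KEY", "notes.txt", b"hello")
# with request.urlopen(req) as response:
#     print(response.status)
```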
Upload a URL or link
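Submitting a link might look like the sketch below. As with any example written outside the project, the endpoint path /v1/document/upload-link, the JSON body key link, the bearer-token header, and the base URL are assumptions to verify against your instance's API documentation. The function returns an unsent request object.

```python
import json
from urllib import request


def build_upload_link_request(base_url: str, api_key: str, link: str) -> request.Request:
    """Build (but do not send) a JSON request that submits a URL for ingestion."""
    body = json.dumps({"link": link}).encode("utf-8")  # assumed body shape
    return request.Request(
        url=f"{base_url.rstrip('/')}/v1/document/upload-link",  # assumed path
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )


# Hypothetical usage; sending requires a running instance and a valid key:
# req = build_upload_link_request("http://localhost:3001/api", "MY_KEY", "https://example.com")
# with request.urlopen(req) as response:
#     print(json.load(response))
```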
When ingesting large document sets (hundreds of PDFs or very long documents), consider tuning your chunk size and overlap settings before uploading. Smaller chunks produce more precise retrieval but require more embedding API calls and more vector storage. A chunk size of 1,000 tokens with 200-token overlap is a reasonable starting point for most use cases.
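To get a feel for the cost side of that trade-off, the number of chunks (and therefore embedding calls and stored vectors) can be estimated from the document length. This sketch assumes a plain sliding window that advances by size minus overlap; the function name is hypothetical and real splitters that respect sentence or paragraph boundaries will produce somewhat different counts.

```python
import math


def estimate_chunks(total_tokens: int, size: int = 1000, overlap: int = 200) -> int:
    """Estimate how many chunks a sliding window produces for a document."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be in [0, size)")
    if total_tokens <= 0:
        return 0
    if total_tokens <= size:
        return 1
    step = size - overlap  # new tokens covered by each additional chunk
    return math.ceil((total_tokens - overlap) / step)


print(estimate_chunks(10_000))  # 13 chunks at the suggested 1,000 / 200 setting
```

Halving the chunk size roughly doubles the count, which is where the extra embedding calls and vector storage come from.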