Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/Mintplex-Labs/anything-llm/llms.txt

Use this file to discover all available pages before exploring further.

Multimodal support in AnythingLLM means you can send images alongside text in any chat and receive responses that reason about visual content. A vision-capable model can read a screenshot of an error message, describe the contents of a chart, compare product images, or extract information from a photo of a document — all within the same workspace chat interface you already use for text-only conversations. AnythingLLM also ships with configurable Speech-to-Text (STT) and Text-to-Speech (TTS) providers so that voice interaction is possible alongside visual input.
Multimodal image understanding requires a vision-capable model to be configured. If your selected model does not support vision, image attachments will either be ignored or cause an API error. See the supported providers section below.

Supported Vision-Capable Providers

OpenAI’s gpt-4o and other GPT-4 series models with vision support accept images encoded as base64 data URLs or public HTTPS URLs. AnythingLLM sends images using the image_url content block format supported by the OpenAI API.Recommended model: gpt-4o
Claude 3 and Claude 3.5 / 3.7 models accept images as base64-encoded image content blocks. AnythingLLM extracts the base64 data from the attachment and sends it with the correct media_type (image/png, image/jpeg, etc.).Default model: claude-3-5-sonnet-20241022
Gemini models support multimodal inputs via image_url content blocks. Pass the image as a base64 data URL and Gemini will process it alongside the text prompt.
Any provider whose API follows the OpenAI multimodal message format (Groq, OpenRouter, generic OpenAI-compatible endpoints, etc.) will work as long as the underlying model supports vision. Check your provider’s documentation to confirm vision support before selecting a model.

Attaching Images in Chat

1

Select a vision-capable workspace

Make sure the workspace (or the global setting) is configured to use a model that supports vision. You can verify or change the model in Workspace Settings → LLM Provider.
2

Attach an image

In the chat input area, click the paperclip (attachment) icon and select an image file (PNG, JPG, JPEG, or WEBP). You can attach multiple images in a single message.
3

Send your message

Type your question about the image and press Send. The image is encoded as a base64 data URL and sent to the LLM alongside your text message.
4

Receive a visual response

The model analyzes the image and responds based on both the image content and any text query you provided. Responses can include descriptions, extracted text, data analysis, or answers to questions about the image.

OCR: Image-Based PDFs and Standalone Images

For document ingestion (as opposed to chat attachments), AnythingLLM uses Tesseract.js OCR to extract text from:
  • Image-only PDFs — PDFs with no embedded text layer (e.g., scanned documents)
  • Standalone image files — PNG, JPG, JPEG, WEBP
OCR runs automatically as part of the ingestion pipeline. No manual configuration is needed. The default OCR language is English (eng). Configure additional languages via the TARGET_OCR_LANG environment variable:
# Example: English + Spanish + French
TARGET_OCR_LANG=eng,spa,fra
Language model data is cached under storage/models/tesseract/ on first use.

Speech-to-Text (STT)

STT converts spoken audio into text so you can dictate messages or upload audio files for transcription.
ProviderEnvironment VariableNotes
Native (Xenova Whisper)STT_PROVIDER=nativeRuns locally. Default model: Xenova/whisper-small. Change with WHISPER_MODEL_PREF.
OpenAI WhisperSTT_PROVIDER=openaiUses the OpenAI API. Configure model with STT_OPEN_AI_MODEL.
DeepgramSTT_PROVIDER=deepgramRequires STT_DEEPGRAM_API_KEY. Configure model with STT_DEEPGRAM_MODEL.
GroqSTT_PROVIDER=groqRequires STT_GROQ_API_KEY. Configure model with STT_GROQ_MODEL.
Generic OpenAI-compatibleSTT_PROVIDER=generic-openaiRequires STT_OPEN_AI_COMPATIBLE_KEY, STT_OPEN_AI_COMPATIBLE_MODEL, and STT_OPEN_AI_COMPATIBLE_ENDPOINT.
LemonadeSTT_PROVIDER=lemonadeLocal runner. Configure with STT_LEMONADE_BASE_PATH and STT_LEMONADE_MODEL_PREF.

Text-to-Speech (TTS)

TTS converts the model’s text responses into spoken audio, which is played back in the chat UI.
ProviderEnvironment VariableNotes
Native (browser)TTS_PROVIDER=nativeUses the browser’s built-in Web Speech API. No API key required.
OpenAITTS_PROVIDER=openaiRequires TTS_OPEN_AI_KEY. Configure voice with TTS_OPEN_AI_VOICE_MODEL.
ElevenLabsTTS_PROVIDER=elevenlabsRequires TTS_ELEVEN_LABS_KEY. Configure voice with TTS_ELEVEN_LABS_VOICE_MODEL.
KokoroTTS_PROVIDER=kokoroSelf-hosted. Configure with TTS_KOKORO_ENDPOINT, TTS_KOKORO_KEY, and TTS_KOKORO_VOICE_MODEL.
Piper TTSTTS_PROVIDER=piperSelf-hosted. Configure voice with TTS_PIPER_VOICE_MODEL (default: en_US-hfc_female-medium).
Generic OpenAI-compatibleTTS_PROVIDER=generic-openaiRequires TTS_OPEN_AI_COMPATIBLE_KEY, TTS_OPEN_AI_COMPATIBLE_MODEL, TTS_OPEN_AI_COMPATIBLE_VOICE_MODEL, and TTS_OPEN_AI_COMPATIBLE_ENDPOINT.

Combining Vision, STT, and TTS

These three capabilities are independent — you can mix and match them to build the experience that fits your use case:
  • Vision only — Attach images to text chats for visual question answering or document analysis.
  • STT only — Dictate messages or upload audio recordings for transcription without needing a vision model.
  • TTS only — Have the assistant read its responses aloud while interacting via text.
  • All three — Full voice-and-vision interaction: speak your question, attach an image, and have the response read back to you.

Build docs developers (and LLMs) love