FrostAgent has built-in multimodal support: whenever a user sends an image through the OneBot adapter, FrostAgent downloads it, encodes it as Base64, and sends it to a configurable vision model before the main LLM ever sees the message. The vision model returns a Chinese-language description that is stitched into the user’s prompt so the main LLM can reason about visual content without needing to handle raw image data itself.
How Image Description Works
The image pipeline is triggered automatically inside the reply function whenever the incoming message contains at least one image segment.
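The trigger condition amounts to a scan over the parsed segments. The following is a minimal sketch of that check, not FrostAgent's actual code: the `Segment` type and the `containsImage` function are hypothetical stand-ins for the real segment type and detection helper.

```go
package main

import "fmt"

// Segment is a hypothetical, simplified stand-in for a parsed OneBot message
// segment. The real type in FrostAgent may carry more fields.
type Segment struct {
	Type string
	Data map[string]string
}

// containsImage reports whether at least one segment has type "image".
func containsImage(segments []Segment) bool {
	for _, s := range segments {
		if s.Type == "image" {
			return true
		}
	}
	return false
}

func main() {
	msg := []Segment{
		{Type: "text", Data: map[string]string{"text": "what is this?"}},
		{Type: "image", Data: map[string]string{"url": "https://example.com/cat.jpg"}},
	}
	fmt.Println(containsImage(msg)) // prints "true"
}
```

A text-only message returns false from this check, so it would skip the vision step entirely.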
Detection
content.IsContainImage(segments) iterates over the parsed message segment list and returns true if any segment has type: "image". No configuration is needed; detection is always active.
Download and encode
content.ProcessImage fetches each image URL from the segment’s data.url field using an HTTP client with a 30-second timeout, then Base64-encodes the raw bytes with base64.StdEncoding. Only segments that carry a non-empty url are processed; segments without a URL are logged and skipped.
Build content blocks
The encoded images are assembled into a JSON array of multimodal content parts: one text block followed by one image_url block per image. This array is the contentBlocks argument forwarded to the vision model.
Call vision model
llm.CallVisionModel(provider, baseURL, apiKey, _, contentBlocks) sends the content block array to the vision model and returns its text response.
Configuration
Vision is controlled by a single environment variable alongside the main model settings. Set it in your .env file:
| Variable | Default | Description |
|---|---|---|
| VISUAL_MODEL_NAME | (empty) | The model identifier sent in the model field of the vision request. Must support multimodal (image + text) input in OpenAI-compatible format. If left empty, an empty string is passed to the provider, so set this explicitly to a vision-capable model. |
Vision requires a model that accepts multimodal content parts, for example qwen-vl-plus, qwen-vl-max, or gpt-4o. Setting VISUAL_MODEL_NAME to a text-only model, or leaving it empty, will cause the vision call to fail or return an error string. There is no automatic fallback to MODEL_NAME: CallVisionModel passes VISUAL_MODEL_NAME directly to the provider. See Configuration for the full list of environment variables.
CallVisionModel Function
The vision call is implemented in internal/llm/vision_description.go. It takes the following parameters:
| Parameter | Description |
|---|---|
| provider | The core.LLMProvider instance used by the engine (e.g. the OpenAI-compatible client). |
| baseURL | The upstream API endpoint URL, passed through from the engine configuration. |
| apiKey | The upstream API key, passed through from the engine configuration. |
| _ | Reserved. The caller passes the main modelName here, but CallVisionModel reads VISUAL_MODEL_NAME from the environment directly instead. |
| contentBlocks | A JSON-encoded array of content parts (text and image_url objects). |
If the text content part in contentBlocks is empty or blank, CallVisionModel replaces it with the default prompt 请详细描述这张图片的内容 ("Please describe the content of this image in detail") before making the request. This prevents some models from returning empty responses when no user text accompanies the image.
Content Part Format
content.ProcessImage builds the contentBlocks JSON that is forwarded to CallVisionModel. Each call produces an array containing a single text block followed by the image_url blocks.
If the user's message includes text alongside the image, that text is used as the text value instead of the default prompt. When a single message contains multiple images, each one becomes an additional image_url block appended after the text block.
The ContentBlock Go struct that produces this JSON is defined in internal/adapter/onebot/content/struct.go.