Vision and Image Description in FrostAgent

FrostAgent has built-in multimodal support: whenever a user sends an image through the OneBot adapter, FrostAgent downloads it, encodes it as Base64, and sends it to a configurable vision model before the main LLM ever sees the message. The vision model returns a Chinese-language description that is stitched into the user’s prompt so the main LLM can reason about visual content without needing to handle raw image data itself.

How Image Description Works

The image pipeline is triggered automatically inside the reply function whenever the incoming message contains at least one image segment.

Detection

content.IsContainImage(segments) iterates the parsed message segment list and returns true if any segment has type: "image". No configuration is needed — detection is always active.
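The detection step can be sketched as a simple scan over the segment list. This is a minimal illustration, not the FrostAgent source: the `Segment` type and the `containsImage` name are hypothetical stand-ins for the real segment type and for `content.IsContainImage`.

```go
package main

// Segment is a hypothetical stand-in for a parsed OneBot message segment.
// The real type used by FrostAgent may carry different fields.
type Segment struct {
	Type string            `json:"type"`
	Data map[string]string `json:"data"`
}

// containsImage reports whether any segment in the list has type "image".
func containsImage(segments []Segment) bool {
	for _, seg := range segments {
		if seg.Type == "image" {
			return true
		}
	}
	return false
}
```

A message made only of text segments returns false, so the vision pipeline is skipped entirely for it.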

Download and encode

content.ProcessImage fetches each image URL from the segment’s data.url field using an HTTP client with a 30-second timeout, then Base64-encodes the raw bytes with base64.StdEncoding. Only segments that carry a non-empty url are processed; segments without a URL are logged and skipped.
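A rough sketch of the download-and-encode step follows. The helper names (`fetchImage`, `encodeImage`, `downloadAndEncode`) are hypothetical, and two behaviours are assumptions rather than documented facts: rejecting non-200 responses, and skipping (instead of aborting on) a failed download.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"io"
	"log"
	"net/http"
	"time"
)

// imageClient mirrors the 30-second timeout described above.
var imageClient = &http.Client{Timeout: 30 * time.Second}

// fetchImage downloads the raw bytes behind an image URL.
// Treating any non-200 status as an error is an assumption of this sketch.
func fetchImage(client *http.Client, url string) ([]byte, error) {
	resp, err := client.Get(url)
	if err != nil {
		return nil, fmt.Errorf("download %s: %w", url, err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("download %s: unexpected status %s", url, resp.Status)
	}
	return io.ReadAll(resp.Body)
}

// encodeImage Base64-encodes raw bytes with the standard alphabet.
func encodeImage(raw []byte) string {
	return base64.StdEncoding.EncodeToString(raw)
}

// downloadAndEncode processes each URL in order. Empty URLs are logged and
// skipped, as described above; skipping failed downloads is an assumption.
func downloadAndEncode(client *http.Client, urls []string) []string {
	encoded := make([]string, 0, len(urls))
	for _, u := range urls {
		if u == "" {
			log.Println("image segment has no url, skipping")
			continue
		}
		raw, err := fetchImage(client, u)
		if err != nil {
			log.Printf("skipping image: %v", err)
			continue
		}
		encoded = append(encoded, encodeImage(raw))
	}
	return encoded
}
```

Setting `Timeout` on the client bounds the whole exchange, including reading the response body, so a stalled CDN cannot block the reply function indefinitely.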

Build content blocks

The encoded images are assembled into a JSON array of multimodal content parts — one text block followed by one image_url block per image. This array is the contentBlocks argument forwarded to the vision model.

Call vision model

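llm.CallVisionModel(provider, baseURL, apiKey, modelName, contentBlocks) sends the content block array to the vision model and returns its text response. The fourth argument is ignored: the model identifier is read from VISUAL_MODEL_NAME instead.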

Enrich the prompt

The description is appended to the original user text after the literal marker 【图片内容】： (Chinese for "image content"):

<original user text> 【图片内容】：<vision model description>

The enriched string is then passed to the main LLM as the user turn for that session.
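The enrichment step amounts to string concatenation. The sketch below is illustrative only: `enrichPrompt` is a hypothetical name, and the single-space separator and the trimming of surrounding whitespace are assumptions read off the format shown above.

```go
package main

import "strings"

// imageMarker is the literal marker that precedes the vision description.
const imageMarker = "【图片内容】："

// enrichPrompt appends the vision model's description to the user's text.
// The space separator and the whitespace trimming are assumptions.
func enrichPrompt(userText, description string) string {
	return strings.TrimSpace(userText + " " + imageMarker + description)
}
```

When the message contains only an image, the user text is empty and the enriched prompt starts directly with the marker.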

Configuration

Vision is controlled by a single environment variable alongside the main model settings. Set it in your .env file:

Variable	Default	Description
`VISUAL_MODEL_NAME`	(empty)	The model identifier sent in the `model` field of the vision request. Must support multimodal (image + text) input in OpenAI-compatible format. If left empty, an empty string is passed to the provider — set this explicitly to a vision-capable model.

# Use a dedicated vision model
VISUAL_MODEL_NAME=qwen-vl-plus

Vision requires a model that accepts multimodal content parts — for example qwen-vl-plus, qwen-vl-max, or gpt-4o. Setting VISUAL_MODEL_NAME to a text-only model, or leaving it empty, will cause the vision call to fail or return an error string. There is no automatic fallback to MODEL_NAME — CallVisionModel passes VISUAL_MODEL_NAME directly to the provider. See Configuration for the full list of environment variables.
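Because an empty VISUAL_MODEL_NAME is passed through to the provider unchanged, it can be useful to validate the variable in your own tooling before starting the bot. The following is a hypothetical guard, not something FrostAgent is documented to do; `visionModelName`, `visionModelFromEnv`, and `errNoVisionModel` are names invented for this sketch.

```go
package main

import (
	"errors"
	"os"
	"strings"
)

// errNoVisionModel is a hypothetical error used only in this sketch.
var errNoVisionModel = errors.New("VISUAL_MODEL_NAME is not set")

// visionModelName resolves the vision model through a lookup function,
// which keeps the check testable without touching the real environment.
func visionModelName(getenv func(string) string) (string, error) {
	name := strings.TrimSpace(getenv("VISUAL_MODEL_NAME"))
	if name == "" {
		return "", errNoVisionModel
	}
	return name, nil
}

// visionModelFromEnv applies the same check to the process environment.
func visionModelFromEnv() (string, error) {
	return visionModelName(os.Getenv)
}
```

This check only catches a missing value. It cannot tell whether the named model actually accepts image input.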

`CallVisionModel` Function

The vision call is implemented in internal/llm/vision_description.go:

func CallVisionModel(provider core.LLMProvider, baseURL, apiKey, _ string, contentBlocks string) string

Parameter	Description
`provider`	The `core.LLMProvider` instance used by the engine (e.g. the OpenAI-compatible client).
`baseURL`	The upstream API endpoint URL, passed through from the engine configuration.
`apiKey`	The upstream API key, passed through from the engine configuration.
`_`	Reserved — the caller passes the main `modelName` here but `CallVisionModel` reads `VISUAL_MODEL_NAME` from the environment directly instead.
`contentBlocks`	A JSON-encoded array of content parts (text and image_url objects).

The function returns a plain string: the model’s Chinese-language description of the image. If the model returns a structured content array instead of a string, the response is re-serialised to JSON as a fallback. On error, the error message itself is returned so the main LLM receives context about the failure. The system prompt sent to the vision model is fixed (in English: "Please describe the image in Chinese:"):

请用中文描述图片：
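The return-value handling described above (string, structured-array fallback, error text) might look roughly like the sketch below. `normalizeContent` is a hypothetical helper, and it assumes the reply's content field is available as raw JSON bytes. The actual code in vision_description.go may be organised differently.

```go
package main

import "encoding/json"

// normalizeContent turns the raw JSON "content" value of a model reply
// into a plain string.
func normalizeContent(raw []byte) string {
	// Common case: the content is a JSON string.
	var text string
	if err := json.Unmarshal(raw, &text); err == nil {
		return text
	}
	// Fallback: the content is structured (for example an array of parts),
	// so re-serialise it to compact JSON.
	var value interface{}
	if err := json.Unmarshal(raw, &value); err != nil {
		// Mirror the documented behaviour of returning the error text.
		return err.Error()
	}
	out, err := json.Marshal(value)
	if err != nil {
		return err.Error()
	}
	return string(out)
}
```

Returning the error text instead of a separate error value keeps the function signature a single string, at the cost that callers cannot tell a description from a failure message without inspecting it.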

If the text content part in contentBlocks is empty or blank, CallVisionModel replaces it with the default prompt 请详细描述这张图片的内容 ("Please describe the content of this image in detail") before making the request. This prevents some models from returning empty responses when no user text accompanies the image.
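Putting the fixed system prompt and the default-prompt substitution together, request construction can be sketched as follows. This is an assumption-heavy illustration: the `chatRequest` and `message` shapes follow the common OpenAI-compatible chat format, sending the fixed prompt under the `system` role is assumed, and `buildVisionRequest` and `applyDefaultPrompt` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"strings"
)

const (
	systemPrompt  = "请用中文描述图片："
	defaultPrompt = "请详细描述这张图片的内容"
)

// part mirrors one content part (same layout as the ContentBlock struct).
type part struct {
	Type     string            `json:"type"`
	Text     string            `json:"text,omitempty"`
	ImageURL map[string]string `json:"image_url,omitempty"`
}

// message and chatRequest are assumed OpenAI-compatible request shapes.
type message struct {
	Role    string      `json:"role"`
	Content interface{} `json:"content"`
}

type chatRequest struct {
	Model    string    `json:"model"`
	Messages []message `json:"messages"`
}

// applyDefaultPrompt replaces empty or blank text parts with the default prompt.
func applyDefaultPrompt(parts []part) []part {
	for i := range parts {
		if parts[i].Type == "text" && strings.TrimSpace(parts[i].Text) == "" {
			parts[i].Text = defaultPrompt
		}
	}
	return parts
}

// buildVisionRequest decodes contentBlocks and wraps them in a chat request.
func buildVisionRequest(model, contentBlocks string) (string, error) {
	var parts []part
	if err := json.Unmarshal([]byte(contentBlocks), &parts); err != nil {
		return "", err
	}
	req := chatRequest{
		Model: model,
		Messages: []message{
			{Role: "system", Content: systemPrompt},
			{Role: "user", Content: applyDefaultPrompt(parts)},
		},
	}
	out, err := json.Marshal(req)
	if err != nil {
		return "", err
	}
	return string(out), nil
}
```

The `model` argument here would carry the value of VISUAL_MODEL_NAME, not the main model name.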

Content Part Format

content.ProcessImage builds the contentBlocks JSON that is forwarded to CallVisionModel. Each call produces an array with this structure:

[
  {
    "type": "text",
    "text": "请详细描述这张图片的内容"
  },
  {
    "type": "image_url",
    "image_url": {
      "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgAB..."
    }
  }
]

If the user’s message contained text alongside the image, that text is used as the text value instead of the default prompt. When a single message contains multiple images, each one becomes an additional image_url block appended after the text block. In the example below, the user text asks, in Chinese, what differs between the two images:

[
  { "type": "text", "text": "这两张图有什么不同？" },
  { "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,..." } },
  { "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,..." } }
]

The ContentBlock Go struct that produces this JSON is defined in internal/adapter/onebot/content/struct.go:

type ContentBlock struct {
    Type     string            `json:"type"`
    Text     string            `json:"text,omitempty"`
    ImageURL map[string]string `json:"image_url,omitempty"`
}
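As an illustration of how this struct yields the JSON shown earlier, the sketch below assembles one text block followed by one image_url block per image. `buildContentBlocks` is a hypothetical helper, not the real `content.ProcessImage`, and the hard-coded `image/jpeg` MIME type is an assumption taken from the examples above.

```go
package main

import "encoding/json"

// ContentBlock matches the struct definition shown above.
type ContentBlock struct {
	Type     string            `json:"type"`
	Text     string            `json:"text,omitempty"`
	ImageURL map[string]string `json:"image_url,omitempty"`
}

// buildContentBlocks returns the JSON array of content parts for a message:
// the user's text first, then one data-URL image block per encoded image.
func buildContentBlocks(userText string, base64Images []string) (string, error) {
	blocks := []ContentBlock{{Type: "text", Text: userText}}
	for _, img := range base64Images {
		blocks = append(blocks, ContentBlock{
			Type: "image_url",
			// Assumption: every image is labelled image/jpeg, as in the examples.
			ImageURL: map[string]string{"url": "data:image/jpeg;base64," + img},
		})
	}
	out, err := json.Marshal(blocks)
	if err != nil {
		return "", err
	}
	return string(out), nil
}
```

Because `Text` is tagged `omitempty`, an empty user text serialises the first block as `{"type":"text"}` with no `text` field at all, which is the case the default-prompt substitution in CallVisionModel covers.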

Because FrostAgent downloads and Base64-encodes each image itself before calling the vision model, the vision provider never has to fetch the URL. The URL only needs to be reachable over HTTP from the host running FrostAgent, which includes the CDN URLs that OneBot clients embed in image segments automatically.
