The visual analysis agent converts an image into a human-readable Croatian description through two sequential stages. First, a locally loaded BLIP model produces a short English caption directly from the image pixels. Second, a Groq-hosted Llama model receives that caption and rewrites it as a natural Croatian sentence (standard mode) or a comprehensive accessibility description (detailed mode). The agent is designed to serve visually impaired users: all output is in Croatian and is later consumed by the speech agent.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/dominikKos9/AgentForge/llms.txt
Use this file to discover all available pages before exploring further.
Stage 1: BLIP captioning
The_blip_caption function loads the Salesforce/blip-image-captioning-large model at module initialisation time and keeps it resident in memory for the lifetime of the process. Generation runs inside a torch.no_grad() block to avoid storing gradients.
BLIP generation parameters
| Parameter | Standard | Detailed |
|---|---|---|
max_length | 60 tokens | 120 tokens |
num_beams | 3 | 5 |
temperature | 1.0 | 1.0 |
repetition_penalty | 1.2 | 1.2 |
The BLIP model runs entirely on CPU by default because no
.to(device) call is made and torch.no_grad() is used throughout. For faster inference on machines with a GPU, load the model with model.to("cuda") before calling _blip_caption.Stage 2: LLM expansion with Groq
The_expand_with_llm function takes the English BLIP caption and rewrites it using the llama-3.3-70b-versatile model via the Groq API. The system prompt and temperature differ between the two modes.
LLM parameters by mode
| Parameter | Standard | Detailed |
|---|---|---|
| Model | llama-3.3-70b-versatile | llama-3.3-70b-versatile |
| Temperature | 0.5 | 0.7 |
| Output | One short Croatian sentence | Multi-sentence Croatian accessibility description |
Main agent function
_blip_caption is always called with detailed=False — the BLIP stage always produces a compact caption regardless of mode. The detailed flag only affects the LLM expansion step.
State fields
Inputs
Path to the image file on disk. The file is opened with Pillow and converted to RGB before being passed to the BLIP processor.
When
true, the LLM expansion produces a comprehensive multi-sentence accessibility description at temperature 0.7. When false (the default), a single concise Croatian sentence is returned at temperature 0.5.Outputs
The final Croatian-language image description generated by the Groq LLM. This value is passed directly to the speech agent as the text to synthesise.