Visual analysis agent: BLIP captioning and Groq expansion

The visual analysis agent converts an image into a human-readable Croatian description through two sequential stages. First, a locally loaded BLIP model produces a short English caption directly from the image pixels. Second, a Groq-hosted Llama model receives that caption and rewrites it as a natural Croatian sentence (standard mode) or a comprehensive accessibility description (detailed mode). The agent is designed to serve visually impaired users: all output is in Croatian and is later consumed by the speech agent.

Stage 1: BLIP captioning

The BLIP processor and the Salesforce/blip-image-captioning-large model are loaded once, when the module is imported, and stay resident in memory for the lifetime of the process. The _blip_caption function reuses them on every call and runs generation inside a torch.no_grad() block so that no gradients are tracked during inference.

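import os

import torch
from dotenv import load_dotenv
from groq import Groq
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

# Load environment variables (including the Groq API key) from a local .env file.
load_dotenv()

# Client used by _expand_with_llm. The original listing never defines it;
# reading the key from GROQ_API_KEY is an assumption (it is the Groq SDK default).
groq_client = Groq(api_key=os.environ.get("GROQ_API_KEY"))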


# MODELS INIT
processor = BlipProcessor.from_pretrained(
    "Salesforce/blip-image-captioning-large"
)

model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-large"
)

model.eval()


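def _blip_caption(image, detailed=False):
    """Return a short English caption for a PIL image."""
    # Preprocess the image into the tensors BLIP expects.
    inputs = processor(images=image, return_tensors="pt")

    # Inference only: disable gradient tracking.
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_length=120 if detailed else 60,
            num_beams=5 if detailed else 3,
            # Has no effect under beam search unless do_sample=True is set.
            temperature=1.0,
            repetition_penalty=1.2
        )

    # Decode the best beam into plain text.
    caption = processor.decode(output[0], skip_special_tokens=True)
    return caption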

BLIP generation parameters

Parameter	Standard	Detailed
`max_length`	60 tokens	120 tokens
`num_beams`	3	5
`temperature`	1.0	1.0
`repetition_penalty`	1.2	1.2

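By default the BLIP model runs on CPU, because the code never moves it to another device. (torch.no_grad() only disables gradient tracking; it does not affect device placement.) On a machine with a GPU, call model.to("cuda") once after loading and move the processor output to the same device, for example with inputs.to("cuda"), before model.generate is called. As written, _blip_caption leaves its inputs on CPU, so moving only the model would raise a device-mismatch error.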
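A GPU-aware variant of the captioning step has to place the model and the processor output on the same device. The following is a minimal sketch, not the agent's actual code: select_device and blip_caption_on_device are hypothetical names, and the generation settings are the standard-mode values from the table above.

```python
def select_device(preferred=None):
    """Pick a device string: the caller's choice, else CUDA if available, else CPU."""
    if preferred is not None:
        return preferred
    try:
        import torch  # imported lazily so this helper also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def blip_caption_on_device(model, processor, image, device):
    """Caption `image` with the model and its inputs on the same device.

    Hypothetical helper; `model` and `processor` are the objects loaded
    in the MODELS INIT section.
    """
    import torch

    model.to(device)  # effectively a no-op if the model is already there
    # The processor output exposes .to(device), which moves its tensors.
    inputs = processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_length=60,
            num_beams=3,
            repetition_penalty=1.2,
        )
    return processor.decode(output[0], skip_special_tokens=True)
```

Calling blip_caption_on_device(model, processor, image, select_device()) would then use the GPU when one is present and fall back to CPU otherwise.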

Stage 2: LLM expansion with Groq

The _expand_with_llm function takes the English BLIP caption and rewrites it using the llama-3.3-70b-versatile model via the Groq API. The system prompt and temperature differ between the two modes.

def _expand_with_llm(caption, detailed):

    # standard mode → return a short description in Croatian
    if not detailed:
        response = groq_client.chat.completions.create(
            model="llama-3.3-70b-versatile",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Ti si pomoćnik za pristupačnost slijepim i slabovidnim osobama. "
                        "Tvoj zadatak je pretvoriti opis slike u prirodan i kratak opis na hrvatskom jeziku. "
                        "Uvijek odgovaraj ISKLJUČIVO na hrvatskom jeziku."
                    )
                },
                {
                    "role": "user",
                    "content": (
                        f"Prevedi i prirodno opiši ovu sliku na hrvatskom jeziku:\n\n{caption}\n\n"
                        "Napiši jednu kratku i jasnu rečenicu."
                    )
                }
            ],
            temperature=0.5
        )

        return response.choices[0].message.content


    # detailed mode
    response = groq_client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {
                "role": "system",
                "content": (
                    "Ti si pomoćnik za pristupačnost slijepim i slabovidnim osobama. "
                    "Tvoj zadatak je generirati detaljan i koristan opis slike na hrvatskom jeziku. "
                    "Opis mora biti prirodan, jasan i lako razumljiv osobi koja ne vidi sliku. "
                    "Uvijek odgovaraj ISKLJUČIVO na hrvatskom jeziku. "
                    "Uključi sljedeće ako je vidljivo na slici: "
                    "objekte, boje, raspored elemenata, položaje u prostoru, radnje, izraz ili atmosferu scene i kontekst."
                )
            },
            {
                "role": "user",
                "content": (
                    f"Na temelju ovog opisa slike napravi detaljan opis na hrvatskom jeziku:\n\n{caption}\n\n"
                    "Objasni što se nalazi na slici tako da slijepa ili slabovidna osoba može što bolje razumjeti sadržaj. "
                    "Opis neka bude detaljan, ali prirodan i lako razumljiv."
                )
            }
        ],
        temperature=0.7
    )

    return response.choices[0].message.content

LLM parameters by mode

Parameter	Standard	Detailed
Model	`llama-3.3-70b-versatile`	`llama-3.3-70b-versatile`
Temperature	0.5	0.7
Output	One short Croatian sentence	Multi-sentence Croatian accessibility description

Use detailed mode (detailed set to true in the state) when the image contains important contextual information, for example a scene with multiple objects, spatial relationships, or expressive content. Standard mode returns a shorter response and suits simple images where a single sentence conveys all relevant information.
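Because the two branches of _expand_with_llm differ only in prompts and temperature, the per-mode settings can also be expressed as data. This is a sketch of one possible refactoring, not the agent's code: build_chat_request, MODEL_NAME and TEMPERATURE_BY_MODE are hypothetical names, and the English prompts are placeholders standing in for the Croatian ones shown above.

```python
MODEL_NAME = "llama-3.3-70b-versatile"

# Temperatures from the table above, keyed by the `detailed` flag.
TEMPERATURE_BY_MODE = {False: 0.5, True: 0.7}


def build_chat_request(caption, detailed, system_prompt, user_template):
    """Return keyword arguments for groq_client.chat.completions.create.

    `system_prompt` and `user_template` are supplied by the caller; the
    template must contain a `{caption}` placeholder.
    """
    return {
        "model": MODEL_NAME,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_template.format(caption=caption)},
        ],
        "temperature": TEMPERATURE_BY_MODE[bool(detailed)],
    }


# Placeholder prompts for illustration only; the real agent uses Croatian prompts.
request = build_chat_request(
    caption="a dog running on a beach",
    detailed=True,
    system_prompt="You are an accessibility assistant.",
    user_template="Describe this image in detail:\n\n{caption}",
)
# The request would then be sent with:
# response = groq_client.chat.completions.create(**request)
```

Keeping the settings in one mapping makes it harder for the model name or temperature to drift between the two modes.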

Main agent function

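def visual_analysis_agent(state):
    """Add a Croatian description of the image at state["image_path"] to the state."""
    image_path = state["image_path"]
    detailed = state.get("detailed", False)

    # Normalise to RGB so palette, greyscale and RGBA images are handled uniformly.
    image = Image.open(image_path).convert("RGB")

    # Stage 1: BLIP always produces the compact caption.
    caption = _blip_caption(image, detailed=False)

    # Stage 2: the detailed flag only selects the LLM prompt and temperature.
    description = _expand_with_llm(caption, detailed)

    # Return a copy of the state with the description added.
    return {
        **state,
        "description": description
    }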

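Note that _blip_caption is always called with detailed=False, so the BLIP stage produces a compact caption regardless of mode and the detailed column of the BLIP parameter table (max_length 120, num_beams 5) is never exercised by this agent. The detailed flag only affects the LLM expansion step, which means a detailed description is elaborated from the same short caption rather than from additional visual information.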
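The flow of the detailed flag can be checked without loading BLIP or calling Groq by passing the two stages in as functions. This is a minimal sketch under that assumption: run_visual_pipeline and the fake_* stubs are hypothetical and exist only to show which stage sees which flag.

```python
def run_visual_pipeline(state, caption_fn, expand_fn, open_image):
    """Mirror visual_analysis_agent, with both stages injected as callables."""
    image = open_image(state["image_path"])
    detailed = state.get("detailed", False)
    caption = caption_fn(image, detailed=False)   # BLIP stage: always compact
    description = expand_fn(caption, detailed)    # LLM stage: honours the flag
    return {**state, "description": description}


calls = []


def fake_caption(image, detailed=False):
    # Stand-in for _blip_caption; records the flag it received.
    calls.append(("caption", detailed))
    return "a dog on a beach"


def fake_expand(caption, detailed):
    # Stand-in for _expand_with_llm; records the flag it received.
    calls.append(("expand", detailed))
    return f"[{'long' if detailed else 'short'}] {caption}"


result = run_visual_pipeline(
    {"image_path": "photo.jpg", "detailed": True},
    fake_caption,
    fake_expand,
    open_image=lambda path: path,  # skip real image loading in this sketch
)
```

After the call, calls records that the caption stage received detailed=False while the expansion stage received detailed=True, and every existing key of the input state is preserved in result.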

State fields

Inputs

image_path

string, required

Path to the image file on disk. The file is opened with Pillow and converted to RGB before being passed to the BLIP processor.

detailed

boolean, default: false

When true, the LLM expansion produces a comprehensive multi-sentence accessibility description at temperature 0.7. When false (the default), a single concise Croatian sentence is returned at temperature 0.5.

Outputs

description

string, always present in the returned state

The final Croatian-language image description generated by the Groq LLM. This value is passed directly to the speech agent as the text to synthesise.
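The state fields above can be written down as a typed dictionary so that editors and type checkers catch misspelled keys. This is an illustrative sketch only: the source does not say how the state is typed, and VisualAnalysisState and read_inputs are hypothetical names.

```python
from typing import TypedDict


class _RequiredState(TypedDict):
    image_path: str  # path to the image file on disk


class VisualAnalysisState(_RequiredState, total=False):
    detailed: bool     # optional input, treated as False when absent
    description: str   # output, added by the agent


def read_inputs(state):
    """Return (image_path, detailed) the same way the agent reads them."""
    return state["image_path"], state.get("detailed", False)


state: VisualAnalysisState = {"image_path": "photo.jpg"}
image_path, detailed = read_inputs(state)
```

Splitting the definition into a required base class and a total=False subclass keeps image_path mandatory while leaving detailed and description optional.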
