Model configuration reference: BLIP, Groq LLM, and TTS

AgentForge chains three AI models in sequence to turn an image into spoken Croatian audio. BLIP generates an initial English caption from the image, the Groq LLM rewrites that caption into a fluent Croatian description, and Edge TTS converts the description to an MP3 audio file. Each model has its own configuration parameters, and the detailed flag passed through the LangGraph state controls the behaviour of both BLIP and Groq.
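The chaining described above can be sketched with plain functions. This is a minimal illustration, not AgentForge's actual LangGraph wiring: the `PipelineState` field names and the `run_pipeline` helper are hypothetical, and the three stage functions are passed in so the flow can be followed without loading any model.

```python
from typing import Callable, TypedDict


class PipelineState(TypedDict, total=False):
    # Hypothetical field names; the real LangGraph state may differ.
    image_path: str
    detailed: bool
    caption: str       # English caption from BLIP
    description: str   # Croatian description from the Groq LLM
    audio_path: str    # MP3 written by Edge TTS


def run_pipeline(
    state: PipelineState,
    caption_fn: Callable[[str, bool], str],
    describe_fn: Callable[[str, bool], str],
    speak_fn: Callable[[str], str],
) -> PipelineState:
    """Run the three stages in order, passing `detailed` to the first two."""
    detailed = state.get("detailed", False)
    caption = caption_fn(state["image_path"], detailed)
    description = describe_fn(caption, detailed)
    audio_path = speak_fn(description)  # TTS does not depend on the flag
    return {
        **state,
        "caption": caption,
        "description": description,
        "audio_path": audio_path,
    }
```

Only the captioning and description stages receive the flag, which mirrors the statement that detailed controls BLIP and Groq but not TTS.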

BLIP vision model

The visual agent loads the Salesforce/blip-image-captioning-large model with Hugging Face Transformers at startup:

backend/agents/visual_agent.py

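from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")
model.eval()  # switch to inference mode (disables dropout)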

The _blip_caption function runs inference and its generation parameters vary based on the detailed flag:

backend/agents/visual_agent.py

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_length=120 if detailed else 60,
        num_beams=5 if detailed else 3,
        temperature=1.0,
        repetition_penalty=1.2
    )

max_length (number, default: 60)

Maximum number of tokens in the generated caption. Set to 120 when detailed=True, 60 when detailed=False.

num_beams (number, default: 3)

Beam search width. Higher values improve caption quality at the cost of inference time. Set to 5 when detailed=True, 3 when detailed=False.

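temperature (number, default: 1.0)

Sampling temperature for token generation. Fixed at 1.0 regardless of mode. Because the generate() call uses beam search without do_sample=True, this value does not affect the output.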

repetition_penalty (number, default: 1.2)

Penalises repeated tokens to reduce repetitive output. Fixed at 1.2 regardless of mode.
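The mode-dependent values listed above can be gathered into one helper. This is a small sketch: `blip_generation_kwargs` is a hypothetical name and not part of visual_agent.py, but the values are the ones documented here.

```python
def blip_generation_kwargs(detailed: bool) -> dict:
    """Return keyword arguments for model.generate() for the given mode."""
    return {
        # Longer captions and a wider beam in detailed mode.
        "max_length": 120 if detailed else 60,
        "num_beams": 5 if detailed else 3,
        # Fixed in both modes.
        "temperature": 1.0,
        "repetition_penalty": 1.2,
    }
```

With this helper, the call shown earlier could be written as model.generate(**inputs, **blip_generation_kwargs(detailed)).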

BLIP runs entirely on CPU. Inference is wrapped in torch.no_grad(), which disables gradient tracking and so reduces memory usage. Hugging Face Transformers downloads the model weights automatically on first run and caches them locally, so no manual download is required.

Groq LLM

The visual agent uses the llama-3.3-70b-versatile model via the Groq API to expand the BLIP caption into a natural Croatian description. The system prompt and temperature differ between standard and detailed modes.

Standard mode produces a single concise Croatian sentence:

backend/agents/visual_agent.py

response = groq_client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {
            "role": "system",
            "content": (
                "Ti si pomoćnik za pristupačnost slijepim i slabovidnim osobama. "
                "Tvoj zadatak je pretvoriti opis slike u prirodan i kratak opis na hrvatskom jeziku. "
                "Uvijek odgovaraj ISKLJUČIVO na hrvatskom jeziku."
            )
        },
        {
            "role": "user",
            "content": (
                f"Prevedi i prirodno opiši ovu sliku na hrvatskom jeziku:\n\n{caption}\n\n"
                "Napiši jednu kratku i jasnu rečenicu."
            )
        }
    ],
    temperature=0.5
)

Detailed mode produces a multi-sentence description covering objects, colours, spatial layout, actions, and atmosphere:

backend/agents/visual_agent.py

response = groq_client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {
            "role": "system",
            "content": (
                "Ti si pomoćnik za pristupačnost slijepim i slabovidnim osobama. "
                "Tvoj zadatak je generirati detaljan i koristan opis slike na hrvatskom jeziku. "
                "Opis mora biti prirodan, jasan i lako razumljiv osobi koja ne vidi sliku. "
                "Uvijek odgovaraj ISKLJUČIVO na hrvatskom jeziku. "
                "Uključi sljedeće ako je vidljivo na slici: "
                "objekte, boje, raspored elemenata, položaje u prostoru, radnje, izraz ili atmosferu scene i kontekst."
            )
        },
        {
            "role": "user",
            "content": (
                f"Na temelju ovog opisa slike napravi detaljan opis na hrvatskom jeziku:\n\n{caption}\n\n"
                "Objasni što se nalazi na slici tako da slijepa ili slabovidna osoba može što bolje razumjeti sadržaj. "
                "Opis neka bude detaljan, ali prirodan i lako razumljiv."
            )
        }
    ],
    temperature=0.7
)

model (string, default: "llama-3.3-70b-versatile")

The Groq model identifier. Fixed at llama-3.3-70b-versatile for both modes.

temperature (number, default: 0.5)

Controls output randomness. Set to 0.5 in standard mode for concise, focused output, and 0.7 in detailed mode to allow richer, more varied descriptions.
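Since the two modes differ only in prompts and temperature, the request arguments could be assembled in one function. This is a sketch under assumptions: `build_groq_request` and its `system_prompt` and `user_template` parameters are hypothetical, and the template is assumed to contain a single `{caption}` placeholder.

```python
GROQ_MODEL = "llama-3.3-70b-versatile"


def build_groq_request(
    caption: str,
    detailed: bool,
    system_prompt: str,
    user_template: str,
) -> dict:
    """Assemble keyword arguments for chat.completions.create()."""
    return {
        "model": GROQ_MODEL,  # same model in both modes
        "messages": [
            {"role": "system", "content": system_prompt},
            # The template must contain a {caption} placeholder.
            {"role": "user", "content": user_template.format(caption=caption)},
        ],
        # 0.7 for richer detailed output, 0.5 for concise standard output.
        "temperature": 0.7 if detailed else 0.5,
    }
```

The result can be passed as groq_client.chat.completions.create(**build_groq_request(...)), and the generated text is then read from response.choices[0].message.content.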

The Groq API runs llama-3.3-70b-versatile on dedicated hardware and typically returns responses in under a second. The free tier's token limits are sufficient for typical AgentForge usage without a paid subscription.

Edge TTS voice

The speech agent converts the Croatian description to audio using Microsoft Edge TTS via the edge-tts Python package:

backend/agents/speech_agent.py

communicate = edge_tts.Communicate(
    text=text,
    voice="hr-HR-GabrijelaNeural"
)
await communicate.save(path)

The output file is saved as an MP3 at data/{session_id}/output.mp3, where session_id is the unique identifier assigned to each Streamlit session.

voice (string, default: "hr-HR-GabrijelaNeural")

The Microsoft Edge TTS voice identifier. hr-HR-GabrijelaNeural is a Croatian female neural voice. This value is fixed in speech_agent.py and is not currently user-configurable.

output_path (string, default: "data/{session_id}/output.mp3")

Path where the generated MP3 is saved. The directory is created automatically if it does not exist.
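One way the path handling and synthesis might fit together is sketched below. `output_path_for` and `synthesize` are hypothetical helpers rather than the contents of speech_agent.py, and the `base_dir` parameter is an assumption added so the output location can be changed.

```python
from pathlib import Path

VOICE = "hr-HR-GabrijelaNeural"


def output_path_for(session_id: str, base_dir: str = "data") -> Path:
    """Build data/{session_id}/output.mp3, creating the directory if needed."""
    out_dir = Path(base_dir) / session_id
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / "output.mp3"


async def synthesize(text: str, session_id: str) -> Path:
    """Convert text to speech and save it as an MP3 for this session."""
    # Imported lazily so the path helper can be used without edge-tts installed.
    import edge_tts

    path = output_path_for(session_id)
    communicate = edge_tts.Communicate(text=text, voice=VOICE)
    await communicate.save(str(path))
    return path
```

Outside an existing event loop, the coroutine can be run with asyncio.run(synthesize("Dobar dan", "demo")); this needs network access to the Edge TTS service.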

edge-tts does not require an API key. It communicates with Microsoft’s public Edge TTS service, which is freely accessible without authentication. No additional credentials beyond the Groq API key are needed to run AgentForge.
