AgentForge chains three AI models in sequence to turn an image into spoken Croatian audio. BLIP generates an initial English caption from the image, the Groq LLM rewrites that caption into a fluent Croatian description, and Edge TTS converts the description to an MP3 audio file. Each model has its own configuration parameters, and the detailed flag passed through the LangGraph state controls the behaviour of both BLIP and Groq.
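The flow described above can be pictured as three steps that each read a shared state and return a partial update. The sketch below is only an illustration: the state keys, the step functions and `run_pipeline` are hypothetical names, the steps are stubs rather than real BLIP, Groq or Edge TTS calls, and the actual LangGraph wiring in AgentForge may look different.

```python
from typing import Callable, TypedDict


class PipelineState(TypedDict, total=False):
    """Hypothetical state; the real LangGraph state keys are an assumption here."""
    image_path: str
    detailed: bool
    caption: str       # English caption from BLIP
    description: str   # Croatian description from the LLM
    audio_path: str    # MP3 written by the TTS step


Step = Callable[[PipelineState], PipelineState]


def run_pipeline(state: PipelineState, steps: list[Step]) -> PipelineState:
    """Apply each step in order, merging its partial update into the state."""
    for step in steps:
        state = {**state, **step(state)}
    return state


# Stub steps standing in for the real BLIP, Groq and Edge TTS calls.
def caption_step(state: PipelineState) -> PipelineState:
    length = "long" if state["detailed"] else "short"
    return {"caption": f"<{length} English caption for {state['image_path']}>"}


def describe_step(state: PipelineState) -> PipelineState:
    style = "detailed" if state["detailed"] else "one-sentence"
    return {"description": f"<{style} Croatian description of {state['caption']}>"}


def speak_step(state: PipelineState) -> PipelineState:
    return {"audio_path": "output.mp3"}


final = run_pipeline(
    {"image_path": "cat.jpg", "detailed": True},
    [caption_step, describe_step, speak_step],
)
print(final["description"])
```

Because `detailed` stays in the state for the whole run, both the captioning step and the description step can branch on it without extra plumbing.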

BLIP vision model

The visual agent loads the Salesforce/blip-image-captioning-large model from the Hugging Face Hub with the Transformers library at startup:
backend/agents/visual_agent.py
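from transformers import BlipForConditionalGeneration, BlipProcessor
# Load the processor and model once; eval() puts the model in inference mode (no dropout).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")
model.eval()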
The _blip_caption function runs inference; its generation parameters depend on the detailed flag:
backend/agents/visual_agent.py
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_length=120 if detailed else 60,
        num_beams=5 if detailed else 3,
        temperature=1.0,
        repetition_penalty=1.2
    )
max_length (number, default: 60)
Maximum length, in tokens, of the generated caption. Set to 120 when detailed=True and 60 when detailed=False.
num_beams (number, default: 3)
Beam search width. Higher values improve caption quality at the cost of inference time. Set to 5 when detailed=True and 3 when detailed=False.
temperature (number, default: 1.0)
Sampling temperature for token generation. Fixed at 1.0 in both modes. Because the call uses beam search and does not enable sampling (do_sample=True), this value does not change the output.
repetition_penalty (number, default: 1.2)
Penalises repeated tokens to reduce repetitive output. Fixed at 1.2 in both modes.
BLIP runs entirely on CPU. Inference is wrapped in torch.no_grad(), which disables gradient tracking and so reduces memory usage. The model weights are downloaded automatically by Hugging Face Transformers on first run and cached locally, so no manual download is required.
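The mode-dependent generation settings documented above could be collected in one small helper, which makes the difference between the two modes easy to inspect. This is a minimal sketch: `blip_generation_kwargs` is a hypothetical name and is not part of visual_agent.py; the values are the ones listed in this section.

```python
def blip_generation_kwargs(detailed: bool) -> dict:
    """Return the keyword arguments for model.generate() in the given mode."""
    return {
        "max_length": 120 if detailed else 60,   # longer captions in detailed mode
        "num_beams": 5 if detailed else 3,       # wider beam search in detailed mode
        "temperature": 1.0,                      # same in both modes
        "repetition_penalty": 1.2,               # same in both modes
    }


standard_kwargs = blip_generation_kwargs(detailed=False)
detailed_kwargs = blip_generation_kwargs(detailed=True)

# Only the two mode-dependent keys differ between the dictionaries.
changed = sorted(k for k in standard_kwargs if standard_kwargs[k] != detailed_kwargs[k])
print(changed)  # ['max_length', 'num_beams']
```

With a helper like this, the call inside _blip_caption could be written as model.generate(**inputs, **blip_generation_kwargs(detailed)).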

Groq LLM

The visual agent uses the llama-3.3-70b-versatile model via the Groq API to rewrite the BLIP caption as a natural Croatian description. The system prompt, user prompt and temperature differ between standard and detailed modes.

Standard mode produces a single concise Croatian sentence:
backend/agents/visual_agent.py
response = groq_client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
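        {
            "role": "system",
            # English gloss: "You are an accessibility assistant for blind and visually
            # impaired people. Your task is to turn the image description into a natural,
            # short description in Croatian. Always answer ONLY in Croatian."
            "content": (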
                "Ti si pomoćnik za pristupačnost slijepim i slabovidnim osobama. "
                "Tvoj zadatak je pretvoriti opis slike u prirodan i kratak opis na hrvatskom jeziku. "
                "Uvijek odgovaraj ISKLJUČIVO na hrvatskom jeziku."
            )
        },
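        {
            "role": "user",
            # English gloss: "Translate and naturally describe this image in Croatian:
            # {caption}. Write one short, clear sentence."
            "content": (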
                f"Prevedi i prirodno opiši ovu sliku na hrvatskom jeziku:\n\n{caption}\n\n"
                "Napiši jednu kratku i jasnu rečenicu."
            )
        }
    ],
    temperature=0.5
)
Detailed mode produces a multi-sentence description covering objects, colours, spatial layout, actions, and atmosphere:
backend/agents/visual_agent.py
response = groq_client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
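        {
            "role": "system",
            # English gloss: "You are an accessibility assistant for blind and visually
            # impaired people. Your task is to generate a detailed and useful description
            # of the image in Croatian. The description must be natural, clear and easy to
            # understand for a person who cannot see the image. Always answer ONLY in
            # Croatian. Include the following if visible in the image: objects, colours,
            # arrangement of elements, positions in space, actions, the expression or
            # atmosphere of the scene, and context."
            "content": (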
                "Ti si pomoćnik za pristupačnost slijepim i slabovidnim osobama. "
                "Tvoj zadatak je generirati detaljan i koristan opis slike na hrvatskom jeziku. "
                "Opis mora biti prirodan, jasan i lako razumljiv osobi koja ne vidi sliku. "
                "Uvijek odgovaraj ISKLJUČIVO na hrvatskom jeziku. "
                "Uključi sljedeće ako je vidljivo na slici: "
                "objekte, boje, raspored elemenata, položaje u prostoru, radnje, izraz ili atmosferu scene i kontekst."
            )
        },
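        {
            "role": "user",
            # English gloss: "Based on this image description, create a detailed
            # description in Croatian: {caption}. Explain what is in the image so that a
            # blind or visually impaired person can understand the content as well as
            # possible. Keep the description detailed, but natural and easy to understand."
            "content": (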
                f"Na temelju ovog opisa slike napravi detaljan opis na hrvatskom jeziku:\n\n{caption}\n\n"
                "Objasni što se nalazi na slici tako da slijepa ili slabovidna osoba može što bolje razumjeti sadržaj. "
                "Opis neka bude detaljan, ali prirodan i lako razumljiv."
            )
        }
    ],
    temperature=0.7
)
model (string, default: "llama-3.3-70b-versatile")
The Groq model identifier. Fixed at llama-3.3-70b-versatile for both modes.
temperature (number, default: 0.5)
Controls output randomness. Set to 0.5 in standard mode for concise, focused output, and 0.7 in detailed mode to allow richer, more varied descriptions.
The Groq API serves llama-3.3-70b-versatile on dedicated hardware and returns responses with low latency. The free tier's token limits are sufficient for typical AgentForge usage without a paid subscription.
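Since the two calls above differ only in their prompts and temperature, the request could be assembled by one function keyed on the detailed flag. The sketch below is an assumption about how that might be organised, not the code in visual_agent.py: `PROMPTS` and `build_groq_request` are hypothetical names, while the prompt strings, model name and temperatures are copied from the calls above.

```python
# Prompts copied from the two calls above; "{caption}" is filled in per request.
PROMPTS = {
    False: {  # standard mode
        "system": (
            "Ti si pomoćnik za pristupačnost slijepim i slabovidnim osobama. "
            "Tvoj zadatak je pretvoriti opis slike u prirodan i kratak opis na hrvatskom jeziku. "
            "Uvijek odgovaraj ISKLJUČIVO na hrvatskom jeziku."
        ),
        "user": (
            "Prevedi i prirodno opiši ovu sliku na hrvatskom jeziku:\n\n{caption}\n\n"
            "Napiši jednu kratku i jasnu rečenicu."
        ),
        "temperature": 0.5,
    },
    True: {  # detailed mode
        "system": (
            "Ti si pomoćnik za pristupačnost slijepim i slabovidnim osobama. "
            "Tvoj zadatak je generirati detaljan i koristan opis slike na hrvatskom jeziku. "
            "Opis mora biti prirodan, jasan i lako razumljiv osobi koja ne vidi sliku. "
            "Uvijek odgovaraj ISKLJUČIVO na hrvatskom jeziku. "
            "Uključi sljedeće ako je vidljivo na slici: "
            "objekte, boje, raspored elemenata, položaje u prostoru, radnje, izraz ili atmosferu scene i kontekst."
        ),
        "user": (
            "Na temelju ovog opisa slike napravi detaljan opis na hrvatskom jeziku:\n\n{caption}\n\n"
            "Objasni što se nalazi na slici tako da slijepa ili slabovidna osoba može što bolje razumjeti sadržaj. "
            "Opis neka bude detaljan, ali prirodan i lako razumljiv."
        ),
        "temperature": 0.7,
    },
}


def build_groq_request(caption: str, detailed: bool) -> dict:
    """Return keyword arguments for chat.completions.create() in the given mode."""
    mode = PROMPTS[detailed]
    return {
        "model": "llama-3.3-70b-versatile",
        "messages": [
            {"role": "system", "content": mode["system"]},
            {"role": "user", "content": mode["user"].format(caption=caption)},
        ],
        "temperature": mode["temperature"],
    }


request = build_groq_request("a dog running on a beach", detailed=False)
print(request["temperature"], len(request["messages"]))  # 0.5 2
```

The resulting dictionary could be passed as groq_client.chat.completions.create(**request), with the generated text read from response.choices[0].message.content.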

Edge TTS voice

The speech agent converts the Croatian description to audio using Microsoft Edge TTS via the edge-tts Python package:
backend/agents/speech_agent.py
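import edge_tts

# Runs inside an async function: synthesize `text` with the Croatian voice
# and write the resulting MP3 to `path`.
communicate = edge_tts.Communicate(
    text=text,
    voice="hr-HR-GabrijelaNeural"
)
await communicate.save(path)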
The output file is saved as an MP3 at data/{session_id}/output.mp3, where session_id is the unique identifier assigned to each Streamlit session.
voice (string, default: "hr-HR-GabrijelaNeural")
The Microsoft Edge TTS voice identifier. hr-HR-GabrijelaNeural is a Croatian female neural voice. This value is fixed in speech_agent.py and is not currently user-configurable.
output_path (string, default: "data/{session_id}/output.mp3")
Path where the generated MP3 is saved. The directory is created automatically if it does not exist.
edge-tts does not require an API key. It communicates with Microsoft’s public Edge TTS service, which is freely accessible without authentication. No additional credentials beyond the Groq API key are needed to run AgentForge.
