AgentForge chains three AI models in sequence to turn an image into spoken Croatian audio. BLIP generates an initial English caption from the image, the Groq LLM rewrites that caption into a fluent Croatian description, and Edge TTS converts the description to an MP3 audio file. Each model has its own configuration parameters, and theDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/dominikKos9/AgentForge/llms.txt
Use this file to discover all available pages before exploring further.
detailed flag passed through the LangGraph state controls the behaviour of both BLIP and Groq.
BLIP vision model
The visual agent loads theSalesforce/blip-image-captioning-large model from HuggingFace Transformers at startup:
backend/agents/visual_agent.py
_blip_caption function runs inference and its generation parameters vary based on the detailed flag:
backend/agents/visual_agent.py
Maximum number of tokens in the generated caption. Set to
120 when detailed=True, 60 when detailed=False.Beam search width. Higher values improve caption quality at the cost of inference time. Set to
5 when detailed=True, 3 when detailed=False.Sampling temperature for token generation. Fixed at
1.0 regardless of mode.Penalises repeated tokens to reduce repetitive output. Fixed at
1.2 regardless of mode.BLIP runs entirely on CPU using
torch.no_grad(), which disables gradient computation to reduce memory usage during inference. The model weights are downloaded automatically by HuggingFace Transformers on first run and cached locally — no manual download is required.Groq LLM
The visual agent uses thellama-3.3-70b-versatile model via the Groq API to expand the BLIP caption into a natural Croatian description. The system prompt and temperature differ between standard and detailed modes.
Standard mode — produces a single concise Croatian sentence:
backend/agents/visual_agent.py
backend/agents/visual_agent.py
The Groq model identifier. Fixed at
llama-3.3-70b-versatile for both modes.Controls output randomness. Set to
0.5 in standard mode for concise, focused output, and 0.7 in detailed mode to allow richer, more varied descriptions.Edge TTS voice
The speech agent converts the Croatian description to audio using Microsoft Edge TTS via theedge-tts Python package:
backend/agents/speech_agent.py
data/{session_id}/output.mp3, where session_id is the unique identifier assigned to each Streamlit session.
The Microsoft Edge TTS voice identifier.
hr-HR-GabrijelaNeural is a Croatian female neural voice. This value is fixed in speech_agent.py and is not currently user-configurable.Path where the generated MP3 is saved. The directory is created automatically if it does not exist.
edge-tts does not require an API key. It communicates with Microsoft’s public Edge TTS service, which is freely accessible without authentication. No additional credentials beyond the Groq API key are needed to run AgentForge.