AgentForge is built as a multi-agent pipeline where each agent has a single responsibility: the orchestrator validates input and manages caching, the vision agent produces a Croatian text description, and the speech agent converts that text to audio. All three agents share a single typed state object that flows through a LangGraph StateGraph, giving you a clear separation of concerns with predictable data movement between stages.
The three-agent pipeline
Orchestrator
Validates the image, computes a SHA-256 hash, and checks the in-memory session cache. Returns cached results immediately for duplicate images, or passes a clean state forward to the vision agent.
Vision
Uses a BLIP image captioning model to generate an initial English caption, then calls the Groq LLM (llama-3.3-70b-versatile) to expand it into a fluent Croatian description in either concise or detailed mode.
Speech
Calls Microsoft Edge TTS with the hr-HR-GabrijelaNeural Croatian voice and writes the result to an MP3 file. The output path is stored in audio_path on the shared state.
AgentState: the shared data contract
Every agent reads from and writes to a single AgentState TypedDict, defined in state.py. No agent holds private mutable data; all information moves through this structure.
| Field | Type | Description |
|---|---|---|
| image_path | str | Absolute or relative path to the image file passed in by the caller. |
| session_id | str | Identifies the user session. Used as the key into the in-memory SessionMemory store. |
| image_hash | Optional[str] | SHA-256 hex digest of the image file. Set by the orchestrator after a successful validation. |
| user_prompt | str | Prompt forwarded to the LLM. Defaults to "Describe image". |
| detailed | bool | When True, the LLM produces a multi-sentence description instead of a single sentence. |
| description | Optional[str] | The Croatian text description produced by the vision agent, or retrieved from cache. |
| audio_path | Optional[str] | File path of the generated MP3 audio, set by the speech agent. |
| history | List[Dict[str, Any]] | Per-session conversation history loaded from SessionMemory at the start of each run. |
| valid_image | Optional[bool] | Set to True by the orchestrator when the image passes validation, or False on failure. |
| error | Optional[str] | Human-readable error message populated when valid_image is False. |
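The fields listed above can be sketched as a TypedDict. This is a minimal reconstruction from the table, not the contents of the project's state.py; the initial_state helper and the detailed=False default are assumptions added for illustration.

```python
from typing import Any, Dict, List, Optional, TypedDict


class AgentState(TypedDict):
    """Shared state passed between the orchestrator, vision, and speech agents."""

    image_path: str
    session_id: str
    image_hash: Optional[str]
    user_prompt: str
    detailed: bool
    description: Optional[str]
    audio_path: Optional[str]
    history: List[Dict[str, Any]]
    valid_image: Optional[bool]
    error: Optional[str]


def initial_state(
    image_path: str,
    session_id: str,
    user_prompt: str = "Describe image",  # default taken from the field table
    detailed: bool = False,  # assumed default, not confirmed by the source
) -> AgentState:
    """Hypothetical helper: build a clean state before the graph runs."""
    return AgentState(
        image_path=image_path,
        session_id=session_id,
        image_hash=None,
        user_prompt=user_prompt,
        detailed=detailed,
        description=None,
        audio_path=None,
        history=[],
        valid_image=None,
        error=None,
    )
```

Starting every run from a fully populated dictionary means downstream agents can index any field without guarding against missing keys.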
LangGraph StateGraph
The workflow is assembled in workflow.py using LangGraph's StateGraph. Three nodes are registered ("orchestrator", "vision", and "speech") and connected with both unconditional and conditional edges. The graph is compiled once at import time and reused for every invocation.
The route function in workflow.py inspects state["valid_image"] after the orchestrator runs. If the value is False, the graph terminates immediately; otherwise execution continues to the vision node.
MCP-style tool abstraction
The vision node in workflow.py does not call the visual analysis agent directly. Instead it goes through describe_image_tool from backend/tools/mcp_tools.py, which wraps the agent in a Model Context Protocol (MCP) style interface. This abstraction means the vision agent can be swapped for a different model or API without touching the graph definition.
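To illustrate why the indirection helps, here is a minimal sketch of a tool wrapper and a node that depends only on it. The signature of describe_image_tool, its dictionary return shape, and the backend parameter are all assumptions made for this example; the real function in backend/tools/mcp_tools.py may look different.

```python
from functools import partial
from typing import Any, Callable, Dict

# Hypothetical backend signature: (image_path, user_prompt, detailed) -> text.
DescribeBackend = Callable[[str, str, bool], str]


def fake_vision_backend(image_path: str, user_prompt: str, detailed: bool) -> str:
    """Stand-in for the BLIP + Groq agent; returns a canned string."""
    mode = "detailed" if detailed else "concise"
    return f"[{mode}] description of {image_path}"


def describe_image_tool(
    image_path: str,
    user_prompt: str = "Describe image",
    detailed: bool = False,
    backend: DescribeBackend = fake_vision_backend,
) -> Dict[str, Any]:
    """Tool-style wrapper: structured arguments in, structured result out."""
    return {"description": backend(image_path, user_prompt, detailed)}


def vision_node(
    state: Dict[str, Any],
    tool: Callable[..., Dict[str, Any]] = describe_image_tool,
) -> Dict[str, Any]:
    # The node only knows the tool interface, never the model behind it.
    result = tool(state["image_path"], state["user_prompt"], state["detailed"])
    return {**state, "description": result["description"]}


def other_backend(image_path: str, user_prompt: str, detailed: bool) -> str:
    """A different (equally fake) model plugged in behind the same tool."""
    return f"alt-model: {image_path}"


# Swapping the model changes only how the tool is built, not the node.
alt_tool = partial(describe_image_tool, backend=other_backend)
```

Because vision_node touches only the tool's arguments and result, replacing the backend leaves both the node body and the graph wiring unchanged.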
Every agent returns a new dictionary using the {**state, ...new_fields} pattern rather than mutating the state in place. This means each node receives the full previous state merged with any updates, and no agent ever discards fields set by an earlier stage. For example, when validation succeeds, orchestrator_agent returns a dictionary of the form {**state, "valid_image": True}, so that image_path, session_id, image_hash, and all other fields remain present for the vision and speech agents downstream.