AgentForge is built as a multi-agent pipeline where each agent has a single responsibility: the orchestrator validates input and manages caching, the vision agent produces a Croatian text description, and the speech agent converts that text to audio. All three agents share a single typed state object that flows through a LangGraph StateGraph, giving you a clear separation of concerns with predictable data movement between stages.

The three-agent pipeline

Orchestrator

Validates the image, computes a SHA-256 hash, and checks the in-memory session cache. Returns cached results immediately for duplicate images, or passes a clean state forward to the vision agent.
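The hash-then-lookup step can be sketched as follows. This is a minimal illustration, not AgentForge's actual orchestrator: the `_session_cache` layout and the helper names `sha256_of_file`, `get_cached`, and `put_cached` are assumptions made for the example.

```python
import hashlib
from typing import Dict, Optional

# Hypothetical in-memory cache: session_id -> image_hash -> cached result fields.
_session_cache: Dict[str, Dict[str, Dict[str, str]]] = {}


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def get_cached(session_id: str, image_hash: str) -> Optional[Dict[str, str]]:
    """Look up a previous result for this image in this session, if any."""
    return _session_cache.get(session_id, {}).get(image_hash)


def put_cached(session_id: str, image_hash: str, result: Dict[str, str]) -> None:
    """Remember a result so a duplicate image can skip the later stages."""
    _session_cache.setdefault(session_id, {})[image_hash] = result
```

Hashing the file contents rather than the path means that the same image saved under a different name still produces the same cache key.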

Vision

Uses a BLIP image captioning model to generate an initial English caption, then calls the Groq LLM (llama-3.3-70b-versatile) to expand it into a fluent Croatian description in either concise or detailed mode.

Speech

Calls Microsoft Edge TTS with the hr-HR-GabrijelaNeural Croatian voice and writes the result to an MP3 file. The output path is stored in audio_path on the shared state.

AgentState — the shared data contract

Every agent reads from and writes to a single AgentState TypedDict. No agent holds private mutable data; all information moves through this structure.
state.py
from typing import TypedDict, Optional, List, Dict, Any


class AgentState(TypedDict):
    image_path: str
    session_id: str

    image_hash: Optional[str]

    user_prompt: str
    detailed: bool

    description: Optional[str]
    audio_path: Optional[str]

    history: List[Dict[str, Any]]

    valid_image: Optional[bool]
    error: Optional[str]
| Field | Type | Description |
| --- | --- | --- |
| image_path | str | Absolute or relative path to the image file passed in by the caller. |
| session_id | str | Identifies the user session. Used as the key into the in-memory SessionMemory store. |
| image_hash | Optional[str] | SHA-256 hex digest of the image file. Set by the orchestrator after a successful validation. |
| user_prompt | str | Prompt forwarded to the LLM. Defaults to "Describe image". |
| detailed | bool | When True, the LLM produces a multi-sentence description instead of a single sentence. |
| description | Optional[str] | The Croatian text description produced by the vision agent, or retrieved from cache. |
| audio_path | Optional[str] | File path of the generated MP3 audio, set by the speech agent. |
| history | List[Dict[str, Any]] | Per-session conversation history loaded from SessionMemory at the start of each run. |
| valid_image | Optional[bool] | Set to True by the orchestrator when the image passes validation, or False on failure. |
| error | Optional[str] | Human-readable error message populated when valid_image is False. |
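As an illustration of how a caller might populate this structure before invoking the graph, here is a small sketch. The helper `make_initial_state` is hypothetical and not part of AgentForge. The "Describe image" default comes from the field descriptions above, while the `detailed=False` default and the empty starting history are assumptions.

```python
from typing import Any, Dict, List, Optional, TypedDict


class AgentState(TypedDict):
    image_path: str
    session_id: str
    image_hash: Optional[str]
    user_prompt: str
    detailed: bool
    description: Optional[str]
    audio_path: Optional[str]
    history: List[Dict[str, Any]]
    valid_image: Optional[bool]
    error: Optional[str]


def make_initial_state(
    image_path: str,
    session_id: str,
    user_prompt: str = "Describe image",
    detailed: bool = False,  # assumed default, not confirmed by the source
) -> AgentState:
    """Build a state whose agent-owned fields are still unset (None)."""
    return AgentState(
        image_path=image_path,
        session_id=session_id,
        image_hash=None,    # filled in by the orchestrator
        user_prompt=user_prompt,
        detailed=detailed,
        description=None,   # filled in by the vision agent or the cache
        audio_path=None,    # filled in by the speech agent
        history=[],         # assumed empty here; loaded per session in practice
        valid_image=None,
        error=None,
    )
```

A TypedDict is an ordinary dict at runtime, so the value returned here can be passed straight to a graph that was built with AgentState as its schema.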

LangGraph StateGraph

The workflow is assembled in workflow.py using LangGraph’s StateGraph. Three nodes are registered — "orchestrator", "vision", and "speech" — and connected with both unconditional and conditional edges. The graph is compiled once at import time and reused for every invocation.
workflow.py
from langgraph.graph import StateGraph, START, END
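from backend.graph.state import AgentState

# orchestrator_agent, speech_agent, and route are imported or defined elsewhere
# in the project (not shown in this excerpt); vision_node is defined in this file.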

builder = StateGraph(AgentState)

builder.add_node("orchestrator", orchestrator_agent)
builder.add_node("vision", vision_node)
builder.add_node("speech", speech_agent)

builder.add_edge(START, "orchestrator")
builder.add_conditional_edges("orchestrator", route)
builder.add_edge("vision", "speech")
builder.add_edge("speech", END)

workflow = builder.compile()
The route function inspects state["valid_image"] after the orchestrator runs. If the value is False, the graph terminates immediately; otherwise execution continues to the vision node.
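A routing function matching that description could look like the sketch below. The body is an assumption based only on the behaviour described above, not the project's actual code. In particular, it does not show how a cache hit is handled. The fallback constant exists only so the snippet runs without LangGraph installed.

```python
from typing import Any, Dict

try:
    from langgraph.graph import END
except ImportError:
    # Fallback for running this sketch standalone; LangGraph's END is "__end__".
    END = "__end__"


def route(state: Dict[str, Any]) -> str:
    """Pick the next node after the orchestrator has run."""
    if state.get("valid_image") is False:
        # Validation failed: stop the graph without running vision or speech.
        return END
    # Otherwise continue to the node registered under the name "vision".
    return "vision"
```

The function returns node names as strings, which is the form add_conditional_edges expects when no explicit path mapping is supplied.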

MCP-style tool abstraction

The vision node does not call the visual analysis agent directly. Instead it goes through describe_image_tool from backend/tools/mcp_tools.py, which wraps the agent in a model-context-protocol style interface. This abstraction means the vision agent can be swapped for a different model or API without touching the graph definition.
workflow.py
def vision_node(state):
    return describe_image_tool(visual_analysis_agent, state)
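To make the indirection concrete, here is a hypothetical stand-in for the tool wrapper. The real describe_image_tool in backend/tools/mcp_tools.py may differ. The names `describe_image_tool_sketch` and `stub_vision_agent` are invented for this example. The sketch only shows that the wrapper depends on a callable rather than on one particular model.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]
Agent = Callable[[State], State]


def describe_image_tool_sketch(agent: Agent, state: State) -> State:
    """Hypothetical tool wrapper: delegates to whichever agent it is given."""
    return agent(state)


def stub_vision_agent(state: State) -> State:
    """Placeholder agent used only to exercise the wrapper."""
    return {**state, "description": "Opis slike."}


def vision_node(state: State) -> State:
    # Swapping the model means passing a different agent here;
    # the graph definition itself does not change.
    return describe_image_tool_sketch(stub_vision_agent, state)
```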
Every agent returns a new dictionary using the {**state, ...new_fields} pattern rather than mutating the state in place. Each node's output therefore carries the full previous state plus that node's updates, and no agent discards fields set by an earlier stage. For example, when validation succeeds, orchestrator_agent returns {**state, "valid_image": True} so that image_path, session_id, image_hash, and all other fields remain present for the vision and speech agents downstream.
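The effect of the spread pattern can be checked with a toy node. The function `mark_valid` is a hypothetical example, not one of the project's agents.

```python
from typing import Any, Dict


def mark_valid(state: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict with every existing field plus the update."""
    return {**state, "valid_image": True}


before = {"image_path": "cat.jpg", "session_id": "s1", "image_hash": "abc123"}
after = mark_valid(before)

# The input dict is left untouched, and the output keeps every earlier field.
print("valid_image" in before)
print(sorted(after))
```

Because the dict literal builds a fresh object, a later key in the literal overrides an earlier one of the same name, so a node can both preserve and overwrite fields in a single expression.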
