How AgentForge's three-agent pipeline is structured

AgentForge is built as a multi-agent pipeline where each agent has a single responsibility: the orchestrator validates input and manages caching, the vision agent produces a Croatian text description, and the speech agent converts that text to audio. All three agents share a single typed state object that flows through a LangGraph StateGraph, giving you a clear separation of concerns with predictable data movement between stages.

The three-agent pipeline

Orchestrator

Validates the image, computes a SHA-256 hash, and checks the in-memory session cache. Returns cached results immediately for duplicate images, or passes a clean state forward to the vision agent.
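The orchestrator's validate, hash, and cache-lookup steps can be sketched as follows. This is a minimal illustration rather than the project's actual code: the `_CACHE` layout, the `ALLOWED_SUFFIXES` set, and the `orchestrator_sketch` and `remember` helpers are all assumptions introduced here.

```python
import hashlib
from pathlib import Path
from typing import Any, Dict

# Hypothetical cache layout: {session_id: {image_hash: cached result fields}}.
_CACHE: Dict[str, Dict[str, Dict[str, Any]]] = {}

# Assumed whitelist; the real validation rules are not given in the text.
ALLOWED_SUFFIXES = {".jpg", ".jpeg", ".png"}


def hash_image(path: Path) -> str:
    """Return the SHA-256 hex digest of the file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def orchestrator_sketch(state: Dict[str, Any]) -> Dict[str, Any]:
    """Validate the image, hash it, and merge in any cached result."""
    path = Path(state["image_path"])
    if not path.is_file() or path.suffix.lower() not in ALLOWED_SUFFIXES:
        return {**state, "valid_image": False, "error": f"Invalid image: {path}"}

    image_hash = hash_image(path)
    cached = _CACHE.get(state["session_id"], {}).get(image_hash)
    if cached is not None:
        # Duplicate image: reuse the stored description and audio path.
        return {**state, **cached, "image_hash": image_hash, "valid_image": True}
    return {**state, "image_hash": image_hash, "valid_image": True, "error": None}


def remember(state: Dict[str, Any]) -> None:
    """Store a finished run so a later duplicate can be served from memory."""
    _CACHE.setdefault(state["session_id"], {})[state["image_hash"]] = {
        "description": state["description"],
        "audio_path": state["audio_path"],
    }
```

Keying the cache on the content hash rather than the file path means a renamed copy of the same image still counts as a duplicate.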

Vision

Uses a BLIP image captioning model to generate an initial English caption, then calls the Groq LLM (llama-3.3-70b-versatile) to expand it into a fluent Croatian description in either concise or detailed mode.

Speech

Calls Microsoft Edge TTS with the hr-HR-GabrijelaNeural Croatian voice and writes the result to an MP3 file. The output path is stored in audio_path on the shared state.
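A speech step along these lines could be written with the `edge-tts` Python package, whose `Communicate(text, voice)` object exposes an async `save()` method. The sketch below assumes that package is what the agent uses; the `output` directory, the hash-based file name, and both function names are hypothetical.

```python
import asyncio
from pathlib import Path
from typing import Any, Dict

VOICE = "hr-HR-GabrijelaNeural"  # Croatian voice named in the text above


async def synthesize_croatian(text: str, out_path: Path) -> Path:
    """Write `text` as Croatian speech to `out_path` (MP3) and return the path."""
    if not text.strip():
        raise ValueError("Nothing to synthesize: text is empty")

    # Imported lazily so this module still loads where edge-tts is not installed.
    import edge_tts

    out_path.parent.mkdir(parents=True, exist_ok=True)
    await edge_tts.Communicate(text, VOICE).save(str(out_path))
    return out_path


async def speech_agent_sketch(state: Dict[str, Any]) -> Dict[str, Any]:
    """Generate audio for state["description"] and record its path on the state."""
    # Assumed naming scheme: one MP3 per image hash in a local "output" folder.
    target = Path("output") / f"{state['image_hash']}.mp3"
    audio = await synthesize_croatian(state["description"], target)
    return {**state, "audio_path": str(audio)}
```

Running it requires network access, for example `asyncio.run(speech_agent_sketch(state))` with a state that already carries a description and an image hash.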

AgentState — the shared data contract

Every agent reads from and writes to a single AgentState TypedDict. No agent holds private mutable data; all information moves through this structure.

state.py

from typing import TypedDict, Optional, List, Dict, Any


class AgentState(TypedDict):
    image_path: str
    session_id: str

    image_hash: Optional[str]

    user_prompt: str
    detailed: bool

    description: Optional[str]
    audio_path: Optional[str]

    history: List[Dict[str, Any]]

    valid_image: Optional[bool]
    error: Optional[str]

Field	Type	Description
`image_path`	`str`	Absolute or relative path to the image file passed in by the caller.
`session_id`	`str`	Identifies the user session. Used as the key into the in-memory `SessionMemory` store.
`image_hash`	`Optional[str]`	SHA-256 hex digest of the image file. Set by the orchestrator after a successful validation.
`user_prompt`	`str`	Prompt forwarded to the LLM. Defaults to `"Describe image"`.
`detailed`	`bool`	When `True`, the LLM produces a multi-sentence description instead of a single sentence.
`description`	`Optional[str]`	The Croatian text description produced by the vision agent, or retrieved from cache.
`audio_path`	`Optional[str]`	File path of the generated MP3 audio, set by the speech agent.
`history`	`List[Dict[str, Any]]`	Per-session conversation history loaded from `SessionMemory` at the start of each run.
`valid_image`	`Optional[bool]`	Set to `True` by the orchestrator when the image passes validation, or `False` on failure.
`error`	`Optional[str]`	Human-readable error message populated when `valid_image` is `False`.
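To make the contract concrete, here is one way a caller might build the initial state before invoking the graph. The `make_initial_state` helper is hypothetical and not part of the documented code; the class is repeated only so the sketch runs on its own, and the defaults follow the field table above.

```python
from typing import Any, Dict, List, Optional, TypedDict


class AgentState(TypedDict):
    image_path: str
    session_id: str
    image_hash: Optional[str]
    user_prompt: str
    detailed: bool
    description: Optional[str]
    audio_path: Optional[str]
    history: List[Dict[str, Any]]
    valid_image: Optional[bool]
    error: Optional[str]


def make_initial_state(
    image_path: str,
    session_id: str,
    user_prompt: str = "Describe image",
    detailed: bool = False,
    history: Optional[List[Dict[str, Any]]] = None,
) -> AgentState:
    """Hypothetical helper: caller-supplied fields are filled in, while every
    field that an agent sets later starts out as None."""
    return AgentState(
        image_path=image_path,
        session_id=session_id,
        image_hash=None,
        user_prompt=user_prompt,
        detailed=detailed,
        description=None,
        audio_path=None,
        history=list(history) if history is not None else [],
        valid_image=None,
        error=None,
    )
```

A call such as `make_initial_state("cat.png", "session-1", detailed=True)` yields a plain dictionary; `TypedDict` adds static type checking only and performs no validation at runtime.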

LangGraph StateGraph

The workflow is assembled in workflow.py using LangGraph’s StateGraph. Three nodes are registered — "orchestrator", "vision", and "speech" — and connected with both unconditional and conditional edges. The graph is compiled once at import time and reused for every invocation.

workflow.py

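from langgraph.graph import StateGraph, START, END
from backend.graph.state import AgentState

# orchestrator_agent, speech_agent, vision_node, and route are assumed to be
# imported or defined earlier in this module; their import paths are not shown here.

builder = StateGraph(AgentState)

# Register one node per agent.
builder.add_node("orchestrator", orchestrator_agent)
builder.add_node("vision", vision_node)
builder.add_node("speech", speech_agent)

# Every run enters at the orchestrator; route() then decides where to go next.
builder.add_edge(START, "orchestrator")
builder.add_conditional_edges("orchestrator", route)
builder.add_edge("vision", "speech")
builder.add_edge("speech", END)

# Compiled once at import time and reused for every invocation.
workflow = builder.compile()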

The route function inspects state["valid_image"] after the orchestrator runs. If the value is False, the graph terminates immediately; otherwise execution continues to the vision node.
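The routing rule described above might look roughly like this. It is a sketch of the stated behavior, not the project's `route` implementation: the local `END` constant stands in for `langgraph.graph.END` so the example runs without LangGraph installed, and how a cache hit is routed is not described in the text and is not modelled here.

```python
from typing import Any, Dict

# Stand-in for langgraph.graph.END, defined locally to keep the sketch self-contained.
END = "__end__"


def route_sketch(state: Dict[str, Any]) -> str:
    """Return the name of the next node based on the orchestrator's verdict."""
    if state.get("valid_image") is False:
        # Validation failed: stop before any model or TTS call is made.
        return END
    # Any other value continues the pipeline, matching "otherwise" in the text.
    return "vision"
```

Because the function returns a node name, it can be passed to `add_conditional_edges` without a separate mapping argument.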

MCP-style tool abstraction

The vision node does not call the visual analysis agent directly. Instead it goes through describe_image_tool from backend/tools/mcp_tools.py, which wraps the agent in a model-context-protocol style interface. This abstraction means the vision agent can be swapped for a different model or API without touching the graph definition.

workflow.py

from backend.tools.mcp_tools import describe_image_tool

def vision_node(state: AgentState) -> AgentState:
    return describe_image_tool(visual_analysis_agent, state)
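The value of the indirection is easier to see with two interchangeable agents. The sketch below assumes the tool simply calls whichever agent it is given and checks the result; `describe_image_tool_sketch`, both stub agents, and their canned descriptions are hypothetical and stand in for the real `describe_image_tool` and `visual_analysis_agent`.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]
Agent = Callable[[State], State]


def describe_image_tool_sketch(agent: Agent, state: State) -> State:
    """Hypothetical tool wrapper: run the given agent and check its contract."""
    result = agent(state)
    if "description" not in result:
        raise ValueError("vision agent must set 'description' on the state")
    return result


def local_model_agent(state: State) -> State:
    # Stub standing in for a locally hosted captioning model.
    return {**state, "description": "Pas trči po plaži."}


def remote_api_agent(state: State) -> State:
    # Stub standing in for a hosted vision API.
    return {**state, "description": "Pas trči uz more."}


def vision_node_sketch(state: State, agent: Agent = local_model_agent) -> State:
    # The graph only ever sees this node; swapping the agent changes nothing else.
    return describe_image_tool_sketch(agent, state)
```

Replacing `local_model_agent` with `remote_api_agent` changes the description that comes back while the node's name and signature stay the same.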

Every agent returns a new dictionary using the {**state, ...new_fields} pattern rather than mutating the state in place. Each node therefore receives the full previous state merged with any updates, and no agent discards fields set by an earlier stage. For example, when validation succeeds, orchestrator_agent returns a dictionary of the form {**state, "valid_image": True}, so that image_path, session_id, image_hash, and all other fields remain present for the vision and speech agents downstream.
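A short demonstration of the merge pattern, using stub agents invented for this example, shows that earlier fields survive each stage and that the input dictionary is left untouched:

```python
from typing import Any, Dict

State = Dict[str, Any]


def vision_stub(state: State) -> State:
    # Returns a new dict; only "description" differs from the input.
    return {**state, "description": "Mačka spava na kauču."}


def speech_stub(state: State) -> State:
    # Returns a new dict; only "audio_path" differs from the input.
    return {**state, "audio_path": "output/example.mp3"}


initial: State = {
    "image_path": "cat.png",
    "session_id": "s1",
    "valid_image": True,
    "description": None,
    "audio_path": None,
}

after_vision = vision_stub(initial)
after_speech = speech_stub(after_vision)
```

After both calls, `initial["description"]` is still `None`, while `after_speech` carries the description, the audio path, and the original `image_path`. The copy is shallow, so a nested value such as the `history` list would still be shared between the old and new dictionaries.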
