AgentForge is an accessibility-focused AI application that takes any image and produces a natural-language description in Croatian, then converts it to audio so that blind and visually impaired users can understand visual content. It combines a computer vision model, a large language model, and a text-to-speech engine in a coordinated multi-agent pipeline built with LangGraph.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/dominikKos9/AgentForge/llms.txt
Use this file to discover all available pages before exploring further.
Quickstart
Install dependencies, configure your API key, and generate your first image description in minutes.
Architecture overview
Understand how the orchestrator, vision, and speech agents work together in the LangGraph workflow.
Agents & Tools
Explore each agent’s role — orchestration, visual analysis, speech synthesis, and supporting tools.
Configuration
Set up your environment variables and understand model configuration options.
How it works
AgentForge processes an image through three sequential agents managed by a LangGraph state graph:Upload an image
The Streamlit web interface accepts JPEG, PNG, or WEBP images. Each session gets a unique ID so history is kept separate per user.
Orchestrator validates and routes
The orchestrator agent validates the image format, computes a SHA-256 hash, and checks whether a cached result already exists for that image. If found, the cached description and audio are returned immediately.
Visual agent generates a description
The BLIP image captioning model produces an initial English caption. The Groq LLM (
llama-3.3-70b-versatile) then expands this into a fluent Croatian description — either a concise single sentence or a detailed multi-sentence account, depending on user preference.Key features
Croatian language output
Descriptions and audio are always produced in Croatian, designed specifically for Croatian-speaking blind and visually impaired users.
Concise and detailed modes
Users can toggle a checkbox to choose between a brief one-sentence description or a rich, multi-sentence detailed account.
Result caching
Images are hashed on upload. Identical images return instantly from the in-memory session cache — no redundant model inference.
Session history
The UI keeps the last five descriptions in session memory, allowing users to revisit previous results without re-uploading.