AgentForge: AI image descriptions for accessibility

AgentForge is an accessibility-focused AI application that takes any image and produces a natural-language description in Croatian, then converts it to audio so that blind and visually impaired users can understand visual content. It combines a computer vision model, a large language model, and a text-to-speech engine in a coordinated multi-agent pipeline built with LangGraph.

Quickstart

Install dependencies, configure your API key, and generate your first image description in minutes.

Architecture overview

Understand how the orchestrator, vision, and speech agents work together in the LangGraph workflow.

Agents & Tools

Explore each agent’s role — orchestration, visual analysis, speech synthesis, and supporting tools.

Configuration

Set up your environment variables and understand model configuration options.

How it works

AgentForge processes an image through three sequential agents managed by a LangGraph state graph:

Upload an image

The Streamlit web interface accepts JPEG, PNG, or WEBP images. Each session gets a unique ID so history is kept separate per user.

Orchestrator validates and routes

The orchestrator agent validates the image format, computes a SHA-256 hash, and checks whether a cached result already exists for that image. If found, the cached description and audio are returned immediately.

Visual agent generates a description

The BLIP image captioning model produces an initial English caption. The Groq LLM (llama-3.3-70b-versatile) then expands this into a fluent Croatian description — either a concise single sentence or a detailed multi-sentence account, depending on user preference.

Speech agent converts to audio

The speech agent calls Microsoft Edge TTS using the hr-HR-GabrijelaNeural Croatian voice and saves an MP3 file. The Streamlit UI then plays the audio and displays the text description.

Key features

Croatian language output

Descriptions and audio are always produced in Croatian, designed specifically for Croatian-speaking blind and visually impaired users.

Concise and detailed modes

Users can toggle a checkbox to choose between a brief one-sentence description or a rich, multi-sentence detailed account.

Result caching

Images are hashed on upload. Identical images return instantly from the in-memory session cache — no redundant model inference.

Session history

The UI keeps the last five descriptions in session memory, allowing users to revisit previous results without re-uploading.

Quickstart: run AgentForge locally in five minutes

Get Started

Architecture

Agents & Tools

Configuration

AgentForge: AI image descriptions for accessibility

Quickstart

Architecture overview

Agents & Tools

Configuration

How it works

Key features

Croatian language output

Concise and detailed modes

Result caching

Session history

Build docs developers (and LLMs) love

Get Started

Architecture

Agents & Tools

Configuration

Documentation Index

Quickstart

Architecture overview

Agents & Tools

Configuration

​How it works

​Key features

Croatian language output

Concise and detailed modes

Result caching

Session history

Build docs developers (and LLMs) love

How it works

Key features