Speech agent: Edge TTS audio synthesis for Croatian

The speech agent is the final processing node in the AgentForge pipeline. It takes the Croatian text description produced by the visual analysis agent and synthesises it into an MP3 audio file using Microsoft Edge TTS. Because Edge TTS exposes an async API, the agent wraps the coroutine with asyncio.run() to keep the LangGraph node interface synchronous. The resulting file is written to a session-scoped directory under data/ and its path is stored in state so the caller can serve or stream it.

import os
import asyncio
import edge_tts


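async def _tts(text, path):
    # Stream synthesised speech from the Edge TTS service and save it as MP3.
    communicate = edge_tts.Communicate(
        text=text,
        voice="hr-HR-GabrijelaNeural"  # Croatian female neural voice
    )
    await communicate.save(path)


def speech_agent(state):
    # Read the inputs written by earlier pipeline steps.
    session_id = state["session_id"]
    text = state["description"]

    # Create the session-scoped output directory if it does not exist yet.
    output_dir = f"data/{session_id}"
    os.makedirs(output_dir, exist_ok=True)

    audio_path = os.path.join(output_dir, "output.mp3")

    # Bridge sync -> async: run the coroutine on a fresh event loop.
    asyncio.run(_tts(text, audio_path))

    # Return the existing state plus the path of the generated file.
    return {
        **state,
        "audio_path": audio_path
    }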

How the async bridge works

_tts(text, path) is a coroutine function. It constructs an edge_tts.Communicate instance with the target voice, then awaits communicate.save(path), which streams the audio from the Microsoft Edge TTS service and writes it to disk. speech_agent is a plain synchronous function, so it can be registered as an ordinary LangGraph node and called through the synchronous graph API. It calls asyncio.run(_tts(...)), which creates a new event loop, runs the coroutine to completion, and closes the loop before returning. This pattern is safe only when no event loop is already running in the same thread; otherwise asyncio.run() raises RuntimeError. If the agent is invoked from an async context, await _tts(...) directly instead.

Voice and audio output

The voice used is hr-HR-GabrijelaNeural, a Croatian female neural voice provided by Microsoft Edge TTS. It produces natural-sounding speech suitable for accessibility use cases. Audio is saved to data/{session_id}/output.mp3. The directory is created with os.makedirs(..., exist_ok=True), so no manual setup is required. If a previous run for the same session already wrote a file to this path, it is overwritten.
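The directory and overwrite behaviour can be checked in isolation, without calling the TTS service. This is a minimal sketch, assuming a hypothetical helper audio_path_for and a base_dir parameter that the agent itself does not have (it hard-codes data/):

```python
import os


def audio_path_for(session_id, base_dir="data"):
    """Return the MP3 path for a session, creating its directory if needed."""
    output_dir = os.path.join(base_dir, session_id)
    # exist_ok=True makes repeated calls for the same session a no-op.
    os.makedirs(output_dir, exist_ok=True)
    return os.path.join(output_dir, "output.mp3")
```

Because the file name is fixed, two runs with the same session ID resolve to the same path, and the second write replaces the first.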

State fields

Inputs

session_id (string, required)

Used to construct the output directory path data/{session_id}/. Each session writes its audio file to an isolated subdirectory.

description (string, required)

The Croatian text to synthesise. This is the value written by the visual analysis agent in the previous pipeline step.

Outputs

audio_path (string, required)

Relative path to the generated MP3 file, always of the form data/{session_id}/output.mp3. The caller can use this path to read, stream, or serve the audio file.
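The fields above could be written down as a typed state schema. The sketch below is illustrative only: SpeechInput, SpeechState, and missing_inputs are hypothetical names, and the actual AgentForge state definition may contain more fields or be structured differently.

```python
from typing import TypedDict


class SpeechInput(TypedDict):
    # Fields the speech agent reads; both must be present.
    session_id: str
    description: str


class SpeechState(SpeechInput, total=False):
    # Field the speech agent adds; absent until the node has run.
    audio_path: str


def missing_inputs(state):
    """Return the names of required input fields that are absent from state."""
    return [key for key in ("session_id", "description") if key not in state]
```

A check such as missing_inputs(state) at the top of the node would turn a bare KeyError into a clearer error message, if that is wanted.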

Audio files accumulate in the data/ directory, one subdirectory per session ID. There is no automatic cleanup: long-running deployments should implement a periodic job to remove old session directories or store the files in an object storage bucket and delete them after expiry.
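A periodic cleanup job of the kind described could look like the sketch below. It is an assumption, not an existing AgentForge component: remove_stale_sessions, max_age_seconds, and the one-day default are hypothetical, and the directory modification time is used as a rough proxy for session age.

```python
import os
import shutil
import time


def remove_stale_sessions(base_dir="data", max_age_seconds=24 * 3600, now=None):
    """Delete session directories older than max_age_seconds; return their names."""
    now = time.time() if now is None else now
    if not os.path.isdir(base_dir):
        return []

    # Snapshot the entries first so deletion does not disturb the scan.
    with os.scandir(base_dir) as scan:
        entries = list(scan)

    removed = []
    for entry in entries:
        if not entry.is_dir(follow_symlinks=False):
            continue  # leave stray files and symlinks alone
        age = now - entry.stat(follow_symlinks=False).st_mtime
        if age > max_age_seconds:
            shutil.rmtree(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```

The function could be called from a scheduler such as cron; passing now explicitly makes it straightforward to test.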
