Text-to-Speech: Edge TTS and Gemini Audio Generation

When a user sends a voice note to the bot, the bot processes the audio through Gemini and responds with both a text reply and a spoken OGG voice note generated via TTS. Dummy Gemini Bot supports two TTS engines — Microsoft Edge TTS and Google Gemini TTS — with automatic failover between them, so audio delivery is as reliable as possible even when API quotas are exhausted.

How to Trigger TTS

TTS is triggered automatically whenever a user sends a voice message to the bot:

1. A user sends a voice note to the bot (in a DM, or in a group where the bot would normally respond).
2. The bot downloads and processes the audio through Gemini’s audio understanding capabilities.
3. Gemini generates a text response, which is then fed into the TTS pipeline.
4. A voice note OGG file is sent back in the same chat as the bot’s reply.

No special command is required — sending a voice note is sufficient to trigger the full voice-in, voice-out pipeline.
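The flow above can be sketched as a single handler. This is only an illustration: `handle_voice_note`, `understand`, and `synthesize` are hypothetical names standing in for the bot's Gemini call and TTS pipeline, not its actual functions.

```python
from typing import Callable, Optional, Tuple


def handle_voice_note(
    audio: bytes,
    understand: Callable[[bytes], str],
    synthesize: Callable[[str], Optional[bytes]],
) -> Tuple[str, Optional[bytes]]:
    """Return the text reply and, if TTS succeeded, the OGG voice note bytes.

    `understand` and `synthesize` are hypothetical hooks for the Gemini
    audio-understanding call and the TTS pipeline respectively.
    """
    reply_text = understand(audio)       # voice in: audio -> text reply
    voice_note = synthesize(reply_text)  # voice out: None if every engine failed
    return reply_text, voice_note
```

The text reply is returned even when `synthesize` yields `None`, so a TTS failure does not block the written answer.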

Edge TTS Engine

The Edge TTS engine uses the open-source edge-tts library, which streams audio from the same online service that powers Microsoft Edge’s read-aloud feature.

Setting	Value
Default voice	`fa-IR-FaridNeural` (Persian male)
Alternative voice	`fa-IR-DilaraNeural` (Persian female)
Pitch control	`TTS_VOICE_PITCH` (e.g. `0.85` for a deeper voice)
API cost	Free — no quota consumed

Edge TTS is the fastest engine and carries no API quota cost, making it the preferred fallback option.
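As a rough illustration of the Edge TTS step, the sketch below synthesizes an MP3 with the edge-tts package using the two voices listed above. The helper names and the `"male"`/`"female"` keys are assumptions made for this example, and pitch handling is left out because the source does not say how `TTS_VOICE_PITCH` is applied.

```python
import asyncio

# Voices taken from the settings table; the dictionary keys are hypothetical.
VOICES = {"male": "fa-IR-FaridNeural", "female": "fa-IR-DilaraNeural"}


def pick_voice(kind: str = "male") -> str:
    """Return the voice name for `kind`, defaulting to the male voice."""
    return VOICES.get(kind, VOICES["male"])


async def synthesize_mp3(text: str, out_path: str, kind: str = "male") -> str:
    """Stream speech from the Edge service and save it as an MP3 file."""
    import edge_tts  # third-party package: pip install edge-tts

    communicate = edge_tts.Communicate(text, pick_voice(kind))
    await communicate.save(out_path)
    return out_path


# Example call (needs network access):
# asyncio.run(synthesize_mp3("سلام", "reply.mp3"))
```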

Gemini TTS Engine

The Gemini TTS engine uses Google’s generative audio models to produce more expressive, prompt-steerable speech.

Available voices: Kore, Puck, Fenrir, Aoede, Charon.

Multiple models can be configured in TTS_GEMINI_MODEL as an ordered list. The bot tries each model in sequence: if the first model fails or hits a quota limit, it automatically moves to the next.
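One way to read an ordered model list from TTS_GEMINI_MODEL is shown below. The comma-separated format is an assumption (the source does not specify the separator), and `parse_model_list` and `gemini_tts_models` are hypothetical helper names.

```python
import os
from typing import List, Mapping


def parse_model_list(raw: str) -> List[str]:
    """Split a comma-separated string into an ordered list.

    Blank entries and repeated names are dropped; first occurrence wins,
    so the configured priority order is preserved.
    """
    models: List[str] = []
    for name in raw.split(","):
        name = name.strip()
        if name and name not in models:
            models.append(name)
    return models


def gemini_tts_models(env: Mapping[str, str] = os.environ) -> List[str]:
    """Read TTS_GEMINI_MODEL from the environment (assumed comma-separated)."""
    return parse_model_list(env.get("TTS_GEMINI_MODEL", ""))
```

An empty or missing value produces an empty list, which a caller can treat as "no Gemini models configured".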

Audio Format

Both engines ultimately produce OGG/Opus files, which are natively compatible with Telegram voice notes. Telegram displays them with a waveform and playback controls.

Edge TTS: edge-tts generates an intermediate MP3 file, which ffmpeg then converts to OGG/Opus before it is sent as a voice note.

Gemini TTS: the raw audio output is PCM (audio/L16, 24 kHz, mono). The bot auto-detects this format and converts it to OGG/Opus via ffmpeg before sending. No manual conversion is needed.
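The two conversions can be sketched as a function that builds the ffmpeg argument list from the source MIME type. The function names are hypothetical, and treating the PCM as little-endian (`s16le`) is an assumption about Gemini's output rather than something the source states.

```python
from typing import List


def parse_pcm_rate(mime_type: str, default: int = 24000) -> int:
    """Read the rate= parameter from a type like 'audio/L16;codec=pcm;rate=24000'."""
    for part in mime_type.split(";")[1:]:
        key, _, value = part.strip().partition("=")
        if key.lower() == "rate" and value.isdigit():
            return int(value)
    return default


def ffmpeg_to_ogg_cmd(src: str, dst: str, mime_type: str) -> List[str]:
    """Build an ffmpeg command that converts `src` to OGG/Opus at `dst`."""
    cmd = ["ffmpeg", "-y"]
    if mime_type.lower().startswith("audio/l16"):
        # Raw PCM has no header, so ffmpeg must be told the sample format,
        # sample rate, and channel count. s16le is an assumption here.
        cmd += ["-f", "s16le", "-ar", str(parse_pcm_rate(mime_type)), "-ac", "1"]
    # Containers such as MP3 are self-describing; ffmpeg probes them itself.
    cmd += ["-i", src, "-c:a", "libopus", dst]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where ffmpeg is installed with libopus support.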

Failover Chain

The TTS pipeline follows this ordered failover chain:

Gemini TTS model 1
  → Gemini TTS model 2
    → ... (additional configured models)
      → Edge TTS  (if TTS_FALLBACK_TO_EDGE=True)
        → No voice note sent (all engines failed)

If TTS_FALLBACK_TO_EDGE is False, the chain stops before the Edge TTS step and no voice note is sent if all Gemini models fail.
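A minimal sketch of this chain, assuming each engine is exposed as a callable that raises on failure. `synthesize_with_failover`, `gemini_tts`, and `edge_tts` are hypothetical names for illustration, not the bot's real functions.

```python
from typing import Callable, Optional, Sequence


def synthesize_with_failover(
    text: str,
    gemini_models: Sequence[str],
    gemini_tts: Callable[[str, str], bytes],
    edge_tts: Callable[[str], bytes],
    fallback_to_edge: bool = True,
) -> Optional[bytes]:
    """Try each Gemini model in order, then Edge TTS; None if all fail."""
    for model in gemini_models:
        try:
            return gemini_tts(text, model)
        except Exception:
            continue  # failure or quota error: move to the next model
    if fallback_to_edge:  # mirrors TTS_FALLBACK_TO_EDGE
        try:
            return edge_tts(text)
        except Exception:
            pass
    return None  # all engines failed: no voice note is sent
```

Catching a broad `Exception` keeps the sketch short; a real implementation would likely distinguish quota errors from other failures and log each attempt.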

Edge TTS is faster and has no quota cost. Use Gemini TTS when you need more expressive, emotionally nuanced audio or when the content benefits from prompt-steerable delivery.
