[app], [whisper], [azure], and [siliconflow] sections of config.toml.
Voice settings
The voice engine is selected by the voice_name value you pass when generating a video (via the Web UI or API). The name prefix determines which TTS backend is used:
| Voice name prefix | Backend | Config required |
|---|---|---|
| No prefix (e.g. en-US-JennyNeural-Female) | Edge TTS | None |
| No prefix, name ends in -V2-* (e.g. en-US-AvaMultilingualNeural-V2-Female) | Azure Cognitive Services | [azure] section |
| siliconflow: (e.g. siliconflow:FunAudioLLM/CosyVoice2-0.5B:alex-Male) | SiliconFlow | [siliconflow] section |
| gemini: (e.g. gemini:Zephyr-Female) | Google Gemini TTS | gemini_api_key in [app] |
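The prefix rules in the table can be sketched as a small lookup. This is an illustrative sketch only, not the project's actual routing code: the function name `resolve_tts_backend` and the returned labels are hypothetical, and the `-V2-` check assumes the suffix always takes the form shown in the examples.

```python
import re

# Hypothetical helper: maps a voice name to a backend label using the
# prefix rules from the table. The labels are illustrative, not project constants.
_V2_SUFFIX = re.compile(r"-V2-[^-]+$")  # assumption: V2 names end in -V2-<Gender>


def resolve_tts_backend(voice_name: str) -> str:
    """Return the backend implied by a voice name."""
    if voice_name.startswith("siliconflow:"):
        return "siliconflow"
    if voice_name.startswith("gemini:"):
        return "gemini"
    # No prefix: V2 multilingual names need Azure, everything else uses Edge TTS.
    if _V2_SUFFIX.search(voice_name):
        return "azure"
    return "edge"


print(resolve_tts_backend("en-US-AvaMultilingualNeural-V2-Female"))
```

Checking the explicit prefixes first keeps the fallback simple: anything without a recognised prefix or V2 suffix is treated as an Edge TTS voice.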
voice_name
The exact voice identifier string. You select this in the Web UI from a dropdown that includes a real-time preview. When calling the API directly, pass the voice name as a parameter in your request body.
voice_volume
Controls the output audio volume. Accepts a float between 0.0 (silent) and 1.0 (full volume). Default is 1.0.
voice_rate
A speech rate multiplier. 1.0 is normal speed. Values above 1.0 speed up delivery; values below slow it down.
| Value | Effect |
|---|---|
| 0.75 | 25% slower |
| 1.0 | Normal (default) |
| 1.25 | 25% faster |
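As a rough illustration of the three voice parameters, the sketch below validates them against the ranges described above and serialises them as a fragment of a request body. The `VoiceSettings` class and `rate_change_percent` helper are hypothetical names introduced here; a real request will need additional fields that this section does not cover, and treating non-positive rates as invalid is an assumption.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class VoiceSettings:
    """Hypothetical container for the voice parameters described above."""
    voice_name: str
    voice_volume: float = 1.0  # documented default
    voice_rate: float = 1.0    # 1.0 is normal speed

    def __post_init__(self) -> None:
        if not self.voice_name:
            raise ValueError("voice_name must not be empty")
        if not 0.0 <= self.voice_volume <= 1.0:
            raise ValueError("voice_volume must be between 0.0 and 1.0")
        if self.voice_rate <= 0:
            # Assumption: a zero or negative multiplier is never meaningful.
            raise ValueError("voice_rate must be a positive multiplier")


def rate_change_percent(voice_rate: float) -> int:
    """Express a rate multiplier as a percentage change from normal speed."""
    return round((voice_rate - 1.0) * 100)


settings = VoiceSettings("en-US-JennyNeural-Female", voice_volume=0.8, voice_rate=1.25)
body_fragment = json.dumps(asdict(settings))
print(body_fragment)
print(rate_change_percent(settings.voice_rate))
```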
Edge TTS voices
Edge TTS is the default voice engine. It requires no API key and supports a wide range of languages and locales (400+ voices).
Example Edge TTS voice names:
- en-US-JennyNeural-Female
- en-US-GuyNeural-Male
- zh-CN-XiaoxiaoNeural-Female
- de-DE-KatjaNeural-Female
Azure TTS voices
Azure Neural voices (including multilingual V2 variants) require an Azure Cognitive Services Speech resource. Get your key at portal.azure.com.
Once the [azure] section is configured, the V2 voices (names ending in -V2-Female or -V2-Male) become available in the voice dropdown.
Example Azure V2 voice names:
- en-US-AvaMultilingualNeural-V2-Female
- en-US-AndrewMultilingualNeural-V2-Male
- zh-CN-XiaoxiaoMultilingualNeural-V2-Female
Standard Azure voices (without -V2) use Edge TTS internally and do not require Azure credentials. Only the -V2 multilingual voices require the [azure] section to be configured.
SiliconFlow TTS voices
SiliconFlow provides high-quality Chinese and multilingual voices via the CosyVoice2 model. Get your API key at siliconflow.cn.
Example SiliconFlow voice names:
- siliconflow:FunAudioLLM/CosyVoice2-0.5B:alex-Male
- siliconflow:FunAudioLLM/CosyVoice2-0.5B:anna-Female
- siliconflow:FunAudioLLM/CosyVoice2-0.5B:bella-Female
- siliconflow:FunAudioLLM/CosyVoice2-0.5B:benjamin-Male
- siliconflow:FunAudioLLM/CosyVoice2-0.5B:charles-Male
- siliconflow:FunAudioLLM/CosyVoice2-0.5B:claire-Female
- siliconflow:FunAudioLLM/CosyVoice2-0.5B:david-Male
- siliconflow:FunAudioLLM/CosyVoice2-0.5B:diana-Female
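Judging from the examples, a SiliconFlow voice name appears to follow the pattern siliconflow:&lt;model&gt;:&lt;voice&gt;-&lt;gender&gt;. The sketch below splits a name along those lines. The layout is inferred from the listed names rather than from a documented specification, and `parse_siliconflow_voice` is a hypothetical helper.

```python
from typing import NamedTuple


class SiliconFlowVoice(NamedTuple):
    model: str
    voice: str
    gender: str


def parse_siliconflow_voice(voice_name: str) -> SiliconFlowVoice:
    """Split a name shaped like siliconflow:<model>:<voice>-<gender> (inferred layout)."""
    prefix, sep, rest = voice_name.partition(":")
    if prefix != "siliconflow" or not sep:
        raise ValueError(f"not a SiliconFlow voice name: {voice_name!r}")
    # The model id may contain '/', '-' and '.', so split on the LAST colon.
    model, _, voice_and_gender = rest.rpartition(":")
    voice, _, gender = voice_and_gender.rpartition("-")
    if not (model and voice and gender):
        raise ValueError(f"unexpected SiliconFlow voice format: {voice_name!r}")
    return SiliconFlowVoice(model, voice, gender)


print(parse_siliconflow_voice("siliconflow:FunAudioLLM/CosyVoice2-0.5B:alex-Male"))
```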
Gemini TTS voices
Gemini TTS uses the gemini-2.5-flash-preview-tts model and shares the gemini_api_key from the [app] section.
Example Gemini voice names:
- gemini:Zephyr-Female
- gemini:Puck-Male
- gemini:Aoede-Female
- gemini:Orion-Male
Gemini TTS requires pydub to be installed. Run pip install pydub if you see an import error when using a gemini: voice.
Subtitle settings
Set subtitle_provider in config.toml to control how (or whether) subtitles are generated.
- Edge (recommended)
- Whisper
- Disabled
subtitle_provider = "edge"
Subtitle timing is derived directly from Edge TTS word-boundary events during audio synthesis. This is the fastest option and requires no additional downloads.
- Fast: no extra processing step
- Works with any Edge TTS or Azure TTS voice
- Timing accuracy is good for most use cases
- No large model download required
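To show how word-boundary events can become subtitle timing, here is a minimal sketch that groups words into SRT cues. It is not the project's implementation. The `WordBoundary` class, the fixed five-word grouping, and the assumption that offsets and durations arrive in 100-nanosecond ticks are all illustrative.

```python
from dataclasses import dataclass
from typing import List

TICKS_PER_SECOND = 10_000_000  # assumption: timings are in 100-nanosecond units


@dataclass
class WordBoundary:
    """Hypothetical record for one word-boundary event."""
    offset: int    # start of the word, in ticks
    duration: int  # length of the word, in ticks
    text: str


def format_timestamp(ticks: int) -> str:
    """Format a tick count as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = ticks * 1000 // TICKS_PER_SECOND
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, ms = divmod(rem, 1000)
    return f"{hours:02}:{minutes:02}:{seconds:02},{ms:03}"


def words_to_srt(words: List[WordBoundary], max_words: int = 5) -> str:
    """Group consecutive words into cues of at most max_words words each."""
    cues = []
    for index, start in enumerate(range(0, len(words), max_words), start=1):
        group = words[start:start + max_words]
        begin = group[0].offset
        end = group[-1].offset + group[-1].duration
        text = " ".join(word.text for word in group)
        cues.append(f"{index}\n{format_timestamp(begin)} --> {format_timestamp(end)}\n{text}\n")
    return "\n".join(cues)


sample = [WordBoundary(0, 5_000_000, "Hello"), WordBoundary(5_000_000, 5_000_000, "world")]
print(words_to_srt(sample))
```

Because the timings come from the synthesis step itself, no separate speech-recognition pass is needed, which is why this route avoids the model download that Whisper requires.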
Subtitle appearance
Font, size, colour, and position are configured per video at generation time, not globally in config.toml. You can set these values:
- Web UI: Use the subtitle style controls in the video generation form.
- API: Pass the relevant parameters in your /api/v1/videos request body.