Applio’s TTS feature combines two distinct systems to give you fully synthetic speech in any trained voice. First, Microsoft Edge TTS (via the edge-tts library) converts your text to natural-sounding speech using one of hundreds of neural voices. The output WAV is then immediately passed through the standard RVC inference pipeline to re-voice it using your chosen .pth model. The final result is an audio file that sounds like the target model’s speaker saying your text. Both the intermediate TTS audio and the final RVC-converted audio are saved separately, so you can inspect or use either one.
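The two-stage flow described above can be sketched in a few lines of Python. This is a minimal illustration, not Applio's actual code: the function name run_tts_pipeline and both stage signatures are hypothetical, and the two stages are passed in as callables so the control flow can be read (and tried out) without Edge TTS or an RVC model installed.

```python
from pathlib import Path
from typing import Callable

# Hypothetical stage signatures; each stage writes its result to disk.
Synthesize = Callable[[str, str, Path], None]  # (text, voice, out_path)
Convert = Callable[[Path, Path, Path], None]   # (in_path, model_path, out_path)


def run_tts_pipeline(
    text: str,
    voice: str,
    model_path: Path,
    tts_out: Path,
    rvc_out: Path,
    synthesize: Synthesize,
    convert: Convert,
) -> tuple[Path, Path]:
    """Run text-to-speech, then voice conversion, keeping both audio files."""
    # Stage 1: synthesise speech from text into the intermediate file.
    synthesize(text, voice, tts_out)
    if not tts_out.exists():
        raise RuntimeError(f"TTS stage produced no file at {tts_out}")

    # Stage 2: re-voice the intermediate audio with the chosen model.
    convert(tts_out, model_path, rvc_out)

    # Both paths are returned because both files are kept.
    return tts_out, rvc_out
```

Passing simple stand-in functions for synthesize and convert is enough to exercise the flow; in a real setup they would wrap the Edge TTS call and the RVC inference call respectively.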
The TTS voice list is fetched from the Bing TTS endpoint (https://speech.platform.bing.com/consumer/speech/synthesize/readaloud/voices/list) and cached locally in rvc/lib/tools/tts_voices.json. This file ships with Applio and is loaded at startup via load_voices_data() in core.py. The ShortName field of each voice entry (e.g. en-US-AriaNeural) is what you pass as tts_voice.
How the Pipeline Works
Text to Speech (Edge TTS)
Applio calls rvc/lib/tools/tts.py as a subprocess, passing your text (or a UTF-8 text file), the selected Edge TTS voice short name, and a speech rate adjustment. Edge TTS synthesises a WAV file and saves it to output_tts_path (default: assets/audios/tts_output.wav). If a file already exists at that path, it is deleted first.
Parameters
The text string to synthesise. Used when you type directly in the “Text to Speech” tab. Mutually exclusive with tts_file: if both are provided, tts_file takes precedence when a valid path is given.
Path to a UTF-8 encoded .txt file containing the text to synthesise. The full file content is passed to Edge TTS. The file must be UTF-8; other encodings will cause an error.
The Edge TTS voice short name to use for synthesis (e.g. en-US-AriaNeural, es-ES-AlvaroNeural, ja-JP-NanamiNeural). The complete list of available voice short names is loaded from rvc/lib/tools/tts_voices.json at startup.
Adjusts the Edge TTS speech rate as a percentage relative to the voice’s normal speed. The range is -100 to +100. A value of 0 uses the default speed; -50 speaks at half speed; +50 speeds up by 50%.
Path where the intermediate Edge TTS audio is saved before RVC conversion.
Path where the final RVC-converted audio is saved.
Path to the .pth model file used for voice conversion.
Path to the .index file paired with the model.
Pitch shift in semitones applied during the RVC conversion step (-24 to +24).
Pitch extraction algorithm for the RVC conversion step. Choices: crepe, crepe-tiny, rmvpe, fcpe.
Index file influence during RVC conversion (0.0–1.0). The TTS tab defaults to 0.75 (higher than the standard inference default of 0.3) because synthesised speech is clean and benefits from stronger index guidance.
Volume envelope blending for the RVC output (0.0–1.0).
Consonant and breath protection level for RVC conversion (0.0–0.5). The TTS tab defaults to 0.5 (maximum protection) since synthesised TTS speech is already clean.
Whether to split the TTS output into chunks before RVC conversion. Useful for long texts.
Apply autotune to the RVC output. Useful when converting song lyrics via TTS.
Apply noise reduction to the final RVC output.
Noise reduction intensity (0.0–1.0).
Output audio format. Choices: WAV, MP3, FLAC, OGG, M4A.
Speaker-embedding model for the RVC conversion step. Must match the embedder used when training the model. Choices: contentvec, spin, spin-v2, chinese-hubert-base, japanese-hubert-base, korean-hubert-base, custom.
Speaker ID for multi-speaker models.
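Several of the rules stated in the parameter descriptions above (file input taking precedence over typed text, UTF-8 only, voice short names read from a JSON list, and the numeric ranges) can be checked before any audio is generated. The sketch below is illustrative only: all function names and the RANGES keys are hypothetical labels rather than confirmed Applio identifiers, it assumes the voice list file is a JSON array of objects that each carry a ShortName field, and the signed-percentage string returned by format_rate is the form the edge-tts library is understood to accept for its rate argument.

```python
import json
from pathlib import Path
from typing import Optional

# Numeric ranges quoted in the parameter descriptions above.
# The keys are illustrative labels, not confirmed Applio argument names.
RANGES = {
    "rate": (-100, 100),
    "pitch": (-24, 24),
    "index_rate": (0.0, 1.0),
    "volume_envelope": (0.0, 1.0),
    "protect": (0.0, 0.5),
    "clean_strength": (0.0, 1.0),
}


def resolve_tts_input(text: str, tts_file: Optional[str] = None) -> str:
    """Pick the text to synthesise; a valid tts_file wins over typed text."""
    if tts_file and Path(tts_file).is_file():
        # Files in other encodings raise UnicodeDecodeError here.
        return Path(tts_file).read_text(encoding="utf-8")
    return text


def load_short_names(voices_json: str) -> list[str]:
    """Read a voice list file and return every ShortName it contains."""
    with open(voices_json, encoding="utf-8") as handle:
        return [voice["ShortName"] for voice in json.load(handle)]


def check_range(label: str, value: float) -> float:
    """Return value unchanged, or raise ValueError if it is out of range."""
    low, high = RANGES[label]
    if not low <= value <= high:
        raise ValueError(f"{label} must be in [{low}, {high}], got {value}")
    return value


def format_rate(rate: int) -> str:
    """Turn an integer rate into a signed percentage string such as '+50%'."""
    return f"{int(check_range('rate', rate)):+d}%"
```

With these helpers, format_rate(-50) yields "-50%" and check_range("protect", 0.6) raises a ValueError, so an out-of-range setting fails early instead of partway through synthesis or conversion.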