Applio CLI tts: Text-to-Speech with Voice Conversion

The tts subcommand combines two operations in a single CLI call. It first synthesizes speech from text with Microsoft Edge TTS and saves the raw audio to --output_tts_path, then passes that audio through Applio’s RVC voice-conversion pipeline and writes the converted result to --output_rvc_path. You can therefore go from a plain text string to a converted voice clip, with all the quality controls available in normal inference, without any manual intermediate steps.

Two-stage pipeline

Text-to-speech synthesis

Edge TTS synthesizes the input text using the chosen voice (--tts_voice) at the specified rate (--tts_rate) and saves the raw audio to --output_tts_path.

RVC voice conversion

The synthesized audio is passed through the RVC pipeline using the model at --pth_path and index at --index_path. The converted audio is saved to --output_rvc_path.

Both --tts_text and --tts_file are marked as required by the argument parser. Pass the text directly in --tts_text, or pass the path to a .txt file in --tts_file. If you have only one source, pass an empty string for the flag you do not use (or the same value to both); the underlying TTS script decides which to prefer at runtime.
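When the command is assembled from a script, a small wrapper can fill in the unused flag. The sketch below only mirrors the empty-string convention described above; text_source_args is a hypothetical helper, not part of Applio, and it does not reproduce how the TTS script chooses between the two sources.

```python
from typing import List, Optional


def text_source_args(text: Optional[str] = None,
                     file_path: Optional[str] = None) -> List[str]:
    """Return both --tts_text and --tts_file, leaving the unused one empty.

    Hypothetical helper: it follows the empty-string convention from the
    note above and nothing more.
    """
    if not text and not file_path:
        raise ValueError("provide text, file_path, or both")
    return ["--tts_text", text or "", "--tts_file", file_path or ""]


# Inline text only: --tts_file receives an empty string.
print(text_source_args(text="Hello there."))
```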

Flags

Text input

--tts_text

string

required

The text string to synthesize. Enclose in quotes when passing on the command line. For long or multi-line content, use --tts_file instead.

--tts_file

string

required

Path to a plain-text file whose contents will be synthesized. Use this for long scripts or multi-line text.

Voice and rate

--tts_voice

string

required

Edge TTS voice short name to use for synthesis (e.g., en-US-AriaNeural, en-GB-SoniaNeural, es-ES-ElviraNeural). The full list of available voices is loaded from rvc/lib/tools/tts_voices.json; the relevant field is ShortName.

--tts_rate

integer

default:"0"

Speaking rate adjustment. Range: -100 (much slower) to 100 (much faster). A value of 0 uses the voice’s natural speaking rate.
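The edge-tts Python package takes its rate as a signed percentage string such as +0%. Assuming the integer passed to --tts_rate maps one-to-one onto that percentage (an assumption, not confirmed by this page), the conversion could look like the sketch below. edge_rate is a hypothetical name.

```python
def edge_rate(rate: int) -> str:
    """Format an integer rate as a signed percentage string, e.g. 10 -> "+10%".

    Assumption: --tts_rate is a percentage offset from the voice's natural
    speed. The range check mirrors the documented -100..100 limits.
    """
    if not -100 <= rate <= 100:
        raise ValueError(f"rate must be between -100 and 100, got {rate}")
    return f"{rate:+d}%"


for value in (0, 25, -40):
    print(value, "->", edge_rate(value))
```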

Output paths

--output_tts_path

string

required

Full path where the raw Edge TTS audio will be saved before voice conversion. The file will be overwritten if it already exists inside the assets/ directory.

--output_rvc_path

string

required

Full path where the final RVC-converted audio will be saved.

Model paths

--pth_path

string

required

Full path to the trained RVC model file (.pth).

--index_path

string

required

Full path to the FAISS index file (.index) that accompanies the model.

Voice conversion settings

--pitch

integer

default:"0"

Pitch shift in semitones applied during RVC conversion. Range: -24 to 24. Useful for adapting a model trained on a different pitch register to the TTS voice’s natural pitch range.
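To get a feel for what a given --pitch value does, it helps to convert semitones to a frequency ratio. The sketch below assumes standard equal temperament, where each semitone multiplies the frequency by 2^(1/12); semitone_ratio is an illustrative helper, not an Applio function.

```python
def semitone_ratio(semitones: int) -> float:
    """Frequency multiplier for a shift of `semitones` (equal temperament)."""
    if not -24 <= semitones <= 24:
        raise ValueError("pitch must be between -24 and 24 semitones")
    return 2.0 ** (semitones / 12)


# +12 semitones is one octave up: a 220 Hz tone lands at 440 Hz.
print(220.0 * semitone_ratio(12))
```

Under this assumption the documented range of -24 to 24 corresponds to a shift of two octaves in either direction.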

--index_rate

float

default:"0.3"

Influence of the FAISS index on the conversion output. Range: 0.0–1.0. Higher values bring the output closer to the training voice; lower values reduce artifacts.

--volume_envelope

float

default:"1.0"

Blends the output’s volume envelope with the input’s. Range: 0.0–1.0. A value of 1.0 uses the output envelope fully.

--protect

float

default:"0.33"

Protects consonants and breathing sounds from conversion artifacts. Range: 0.0–0.5.

--f0_method

string

default:"rmvpe"

Pitch-extraction algorithm for RVC conversion. Choices: crepe, crepe-tiny, rmvpe, fcpe, hybrid[crepe+rmvpe], hybrid[crepe+fcpe], hybrid[rmvpe+fcpe], hybrid[crepe+rmvpe+fcpe]. rmvpe is recommended for most TTS voices.

--split_audio

boolean

default:"False"

Split the synthesized TTS audio into smaller segments before RVC conversion. Recommended for long TTS outputs. Accepts True or False.

--f0_autotune

boolean

default:"False"

Apply a light autotune to the RVC-converted output. Can help with TTS-to-singing use cases. Accepts True or False.

--f0_autotune_strength

float

default:"1.0"

Strength of the autotune snap to the chromatic grid. Range: 0.0–1.0. Only active when --f0_autotune True.
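The idea of a strength-controlled snap can be illustrated with a few lines of arithmetic. This is a sketch of the general technique, not Applio's implementation: it assumes an A4 = 440 Hz equal-tempered grid and a linear blend in Hz between the original and the snapped pitch. Both function names are hypothetical.

```python
import math

A4_HZ = 440.0  # reference pitch for the assumed chromatic grid


def nearest_chromatic_hz(f0_hz: float) -> float:
    """Frequency of the equal-tempered note closest to f0_hz."""
    semitones_from_a4 = round(12 * math.log2(f0_hz / A4_HZ))
    return A4_HZ * 2 ** (semitones_from_a4 / 12)


def autotune_f0(f0_hz: float, strength: float) -> float:
    """Move f0_hz toward the nearest note; strength 0.0 keeps it, 1.0 snaps fully."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    if f0_hz <= 0:  # treat non-positive values as unvoiced frames
        return f0_hz
    target = nearest_chromatic_hz(f0_hz)
    return f0_hz + (target - f0_hz) * strength


# A slightly sharp A4 pulled halfway back toward 440 Hz.
print(autotune_f0(450.0, 0.5))
```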

--proposed_pitch

boolean

default:"False"

Enable proposed pitch adjustment mode during conversion. Accepts True or False.

--proposed_pitch_threshold

float

default:"155.0"

Threshold frequency (Hz) for proposed pitch adjustment. Range: 100–499. Only active when --proposed_pitch True.

--clean_audio

boolean

default:"False"

Run noise reduction on the converted output. Recommended for cleaner TTS results. Accepts True or False.

--clean_strength

float

default:"0.7"

Intensity of the noise-reduction pass. Range: 0.0–1.0. Only active when --clean_audio True.

--export_format

string

default:"WAV"

Output file format for the final converted audio. Choices: WAV, MP3, FLAC, OGG, M4A.

--embedder_model

string

default:"contentvec"

Embedder model used to extract speech features from the input audio during RVC conversion. Choices: contentvec, spin, spin-v2, chinese-hubert-base, japanese-hubert-base, korean-hubert-base, custom.

--embedder_model_custom

string

default:"None"

Path to a custom embedding model. Only used when --embedder_model custom.
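Since many of the conversion flags have documented ranges or fixed choices, it can be convenient to check values before launching a long run. The sketch below copies the limits listed in this section into two tables; validate_settings, RANGES, and CHOICES are hypothetical names, and the CLI's own parser may enforce these limits differently.

```python
from typing import Dict, List, Set, Tuple

# Numeric limits as documented in this section.
RANGES: Dict[str, Tuple[float, float]] = {
    "pitch": (-24, 24),
    "index_rate": (0.0, 1.0),
    "volume_envelope": (0.0, 1.0),
    "protect": (0.0, 0.5),
    "f0_autotune_strength": (0.0, 1.0),
    "proposed_pitch_threshold": (100.0, 499.0),
    "clean_strength": (0.0, 1.0),
}

# Fixed choices as documented in this section.
CHOICES: Dict[str, Set[str]] = {
    "f0_method": {
        "crepe", "crepe-tiny", "rmvpe", "fcpe",
        "hybrid[crepe+rmvpe]", "hybrid[crepe+fcpe]",
        "hybrid[rmvpe+fcpe]", "hybrid[crepe+rmvpe+fcpe]",
    },
    "export_format": {"WAV", "MP3", "FLAC", "OGG", "M4A"},
    "embedder_model": {
        "contentvec", "spin", "spin-v2", "chinese-hubert-base",
        "japanese-hubert-base", "korean-hubert-base", "custom",
    },
}


def validate_settings(settings: Dict[str, object]) -> List[str]:
    """Return a list of human-readable problems; an empty list means all good."""
    problems: List[str] = []
    for name, value in settings.items():
        if name in RANGES:
            low, high = RANGES[name]
            if not low <= float(value) <= high:
                problems.append(f"--{name}={value} is outside {low}..{high}")
        elif name in CHOICES and value not in CHOICES[name]:
            problems.append(f"--{name}={value!r} is not a documented choice")
    return problems


print(validate_settings({"pitch": 30, "f0_method": "rmvpe", "export_format": "AIFF"}))
```

Flags that are not in either table pass through unchecked, so the helper stays usable if more options are added.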

Usage example

python core.py tts \
  --tts_text "Welcome to Applio voice conversion." \
  --tts_file "" \
  --tts_voice en-US-JennyNeural \
  --tts_rate 0 \
  --pth_path logs/MyModel/MyModel.pth \
  --index_path logs/MyModel/MyModel.index \
  --output_tts_path assets/tts_out.wav \
  --output_rvc_path assets/final_out.wav \
  --pitch 0 \
  --f0_method rmvpe \
  --index_rate 0.3 \
  --clean_audio True \
  --export_format WAV
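The same call can be driven from Python, which is handy for batch jobs. The sketch below assumes it runs from the directory that contains core.py and only uses flags documented on this page; build_tts_command and run_tts are hypothetical wrappers, not part of Applio.

```python
import shlex
import subprocess
import sys
from typing import Dict, List


def build_tts_command(options: Dict[str, object], script: str = "core.py") -> List[str]:
    """Turn {"flag": value} pairs into an argument list for the tts subcommand."""
    cmd = [sys.executable, script, "tts"]
    for flag, value in options.items():
        cmd += [f"--{flag}", str(value)]  # booleans become "True" / "False"
    return cmd


def run_tts(options: Dict[str, object]) -> None:
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_tts_command(options), check=True)


options = {
    "tts_text": "Welcome to Applio voice conversion.",
    "tts_file": "",
    "tts_voice": "en-US-JennyNeural",
    "pth_path": "logs/MyModel/MyModel.pth",
    "index_path": "logs/MyModel/MyModel.index",
    "output_tts_path": "assets/tts_out.wav",
    "output_rvc_path": "assets/final_out.wav",
    "clean_audio": True,
}

# Print the command for inspection; call run_tts(options) to execute it.
print(shlex.join(build_tts_command(options)))
```

Passing an argument list rather than a single shell string avoids quoting problems when the text contains spaces or quotes.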

Finding available voices

Applio loads its voice list from rvc/lib/tools/tts_voices.json. Each entry in the file represents one Edge TTS voice. The value you pass to --tts_voice must match the ShortName field exactly, including capitalization.

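# Print all available voice short names
python -c "
import json
# Read as UTF-8 explicitly so the result does not depend on the platform default
with open('rvc/lib/tools/tts_voices.json', encoding='utf-8') as f:
    voices = json.load(f)
# De-duplicate and sort the ShortName values
for name in sorted({v['ShortName'] for v in voices}):
    print(name)
"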
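Because the match is case-sensitive, a typo in capitalization is an easy mistake to make. The sketch below checks a name against the voice entries and suggests the correctly cased ShortName when one exists. It assumes the JSON file is a list of objects with a ShortName key, as the snippet above does; check_voice is a hypothetical helper and the sample list stands in for the real file.

```python
from typing import Dict, List


def check_voice(name: str, voices: List[Dict[str, str]]) -> str:
    """Return `name` if it matches a ShortName exactly, else raise ValueError."""
    short_names = {voice["ShortName"] for voice in voices}
    if name in short_names:
        return name
    # Look for a match that differs only in capitalization.
    by_lower = {short.lower(): short for short in short_names}
    suggestion = by_lower.get(name.lower())
    hint = f"; did you mean {suggestion!r}?" if suggestion else ""
    raise ValueError(f"unknown voice {name!r}{hint}")


# Stand-in for json.load() on rvc/lib/tools/tts_voices.json.
sample_voices = [
    {"ShortName": "en-US-AriaNeural"},
    {"ShortName": "en-GB-SoniaNeural"},
]

print(check_voice("en-US-AriaNeural", sample_voices))
```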

Some commonly used voices:

Short name	Language	Gender
`en-US-AriaNeural`	English (US)	Female
`en-US-JennyNeural`	English (US)	Female
`en-US-GuyNeural`	English (US)	Male
`en-GB-SoniaNeural`	English (UK)	Female
`es-ES-ElviraNeural`	Spanish (Spain)	Female
`fr-FR-DeniseNeural`	French	Female
`de-DE-KatjaNeural`	German	Female
`ja-JP-NanamiNeural`	Japanese	Female
`zh-CN-XiaoxiaoNeural`	Chinese (Mandarin)	Female

The full Edge TTS voice catalog — including regional variants and styles — can be browsed at the Microsoft TTS voice gallery. Match the ShortName value shown there to what you pass to --tts_voice.
