The infer and batch_infer subcommands are the core of Applio’s voice-conversion pipeline. infer processes a single input audio file and writes one converted output file, while batch_infer accepts a folder path and converts every audio file inside it in a single run. Both subcommands share an identical set of conversion flags; only the path arguments differ. All audio quality, pitch, effects, and embedding settings described below apply to both unless otherwise noted.
infer — single file
Use infer when you want to convert one audio clip and inspect the result before committing to a full batch run, or when your workflow processes files one at a time.
Required flags
Full path to the input audio file to be converted.
Full path where the converted audio file will be saved. The file extension in this path should match --export_format.
Full path to the trained RVC model file (.pth).
Full path to the FAISS index file (.index) that accompanies the model.
Pitch and voice quality
Pitch shift in semitones. Accepted range: -24 to 24. Positive values raise the pitch; negative values lower it. Use this to adapt a model trained on a different register to the target pitch.
Controls how strongly the FAISS index influences the output. Range: 0.0–1.0. Higher values produce voice characteristics closer to the training data; lower values reduce index-related artifacts at the cost of some accuracy.
Blends the output’s volume envelope with the input’s. A value of 1.0 applies the output envelope fully. Range: 0.0–1.0.
Protects consonants and breath sounds from conversion artifacts. Range: 0.0–0.5. A value of 0.5 applies maximum protection; lower values reduce protection but can also reduce the indexing effect on those sounds.
Pitch-extraction algorithm. Choices: crepe, crepe-tiny, rmvpe, fcpe, hybrid[crepe+rmvpe], hybrid[crepe+fcpe], hybrid[rmvpe+fcpe], hybrid[crepe+rmvpe+fcpe]. rmvpe is recommended for most use cases; crepe variants may handle some voices better at the cost of speed; hybrid modes combine multiple extractors.
Speaker ID for multi-speaker models. For single-speaker models this should remain 0.
Audio processing
Split the input into smaller segments before inference. Recommended for long recordings (several minutes or more) to reduce memory usage and improve quality at segment boundaries. Accepts True or False.
Apply a light autotune to the converted output. Particularly useful for singing voice conversion. Accepts True or False.
Strength of the autotune snap to the chromatic grid. Range: 0.0–1.0. Only active when --f0_autotune True.
Enable proposed pitch adjustment mode. Accepts True or False.
Threshold frequency (Hz) for the proposed pitch adjustment. Range: 50–1199. Only active when --proposed_pitch True.
Run a noise-reduction pass on the output audio. Recommended for speech conversions. Accepts True or False.
Intensity of the noise-reduction pass. Range: 0.0–1.0. Higher values produce a cleaner but potentially more compressed output. Only active when --clean_audio True.
Output file format. Choices: WAV, MP3, FLAC, OGG, M4A.
Embedder model
Speaker-embedding model used during inference. Choices: contentvec, spin, spin-v2, chinese-hubert-base, japanese-hubert-base, korean-hubert-base, custom. Language-specific HuBERT models can improve quality for non-English voices.
Path to a custom embedding model directory. Only used when --embedder_model custom.
Formant shifting
Apply formant shifting to the input before conversion. This adjusts vocal tract resonances independently of pitch and can be useful for more convincing gender conversions. Accepts True or False.
Quefrency factor for the formant shifting effect. Higher values produce a more pronounced shift. Only active when --formant_shifting True.
Timbre factor for the formant shifting effect. Higher values produce a more pronounced timbre change. Only active when --formant_shifting True.
Post-processing effects
All post-processing effects are inactive by default. Enable the --post_process True master switch first, then enable individual effects.
Enable the post-processing effects chain. When False, all effect flags below are ignored. Accepts True or False.
Enable reverb on the output. Accepts True or False.
Simulated room size for the reverb. Range: 0.0–1.0.
High-frequency damping of the reverb tail. Range: 0.0–1.0.
Level of the wet (reverb) signal. Range: 0.0–1.0.
Level of the dry (direct) signal mixed with reverb. Range: 0.0–1.0.
Stereo width of the reverb. Range: 0.0–1.0.
Controls how much the reverb tail is frozen / infinitely sustained. Range: 0.0–1.0.
Enable a secondary pitch shift applied after conversion. Accepts True or False.
Amount of post-processing pitch shift in semitones. Positive values increase pitch, negative values decrease it.
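To get a feel for what a shift in semitones means in frequency terms, here is a minimal sketch assuming the usual 12-tone equal-temperament relation (one semitone is a factor of 2^(1/12)). This page does not describe how Applio implements the shift internally, and the function name is only illustrative.

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency multiplier for a pitch shift, assuming equal temperament."""
    return 2.0 ** (semitones / 12.0)


# A few shifts and the resulting frequency multipliers.
for shift in (-12, 0, 7, 12):
    print(shift, round(semitones_to_ratio(shift), 4))
```

Under this relation, +12 semitones doubles every frequency and -12 halves it.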
Enable a brickwall limiter on the output. Accepts True or False.
Limiter ceiling in dBFS. Output will not exceed this level.
Limiter release time in seconds.
Enable a gain stage on the output. Accepts True or False.
Gain to apply in dB. Positive values amplify; negative values attenuate.
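Several settings in this section are expressed in dB or dBFS. As a rough guide, the sketch below converts a dB value to a linear amplitude multiplier using the standard 20·log10 amplitude convention. It illustrates the unit only and is not taken from Applio's code; the function name is hypothetical.

```python
def db_to_linear(gain_db: float) -> float:
    """Linear amplitude multiplier for a gain given in dB (amplitude convention)."""
    return 10.0 ** (gain_db / 20.0)


# 0 dB leaves the signal unchanged; positive values amplify, negative attenuate.
for gain_db in (-6.0, 0.0, 6.0, 20.0):
    print(gain_db, round(db_to_linear(gain_db), 3))
```

With this convention, a gain of about 6 dB roughly doubles the amplitude and 20 dB multiplies it by 10.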
Enable distortion on the output. Accepts True or False.
Drive amount for the distortion effect. Higher values produce more saturation.
Enable chorus on the output. Accepts True or False.
LFO rate of the chorus in Hz.
Modulation depth of the chorus.
Center delay time of the chorus in milliseconds.
Feedback amount of the chorus. Range: 0.0–1.0.
Wet/dry mix of the chorus. Range: 0.0–1.0.
Enable bit-depth reduction on the output. Accepts True or False.
Target bit depth for the bitcrush effect. Lower values produce a more degraded sound.
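The sketch below shows why a lower bit depth sounds more degraded. It uses one common way to model bit-depth reduction: snapping each sample in the range -1.0 to 1.0 to the nearest of a small number of evenly spaced levels. This is an assumption for illustration and may differ from the effect Applio actually applies.

```python
def bitcrush(sample: float, bit_depth: int) -> float:
    """Quantize a sample in [-1.0, 1.0] to the grid implied by bit_depth."""
    steps = 2 ** (bit_depth - 1)  # quantization steps per polarity
    return round(sample * steps) / steps


original = 0.3
for bits in (2, 8, 16):
    crushed = bitcrush(original, bits)
    # The quantization error grows as the bit depth shrinks.
    print(bits, crushed, abs(crushed - original))
```

At 2 bits the sample 0.3 snaps all the way to 0.5, while at 16 bits the error is barely measurable.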
Enable hard clipping on the output. Accepts True or False.
Clipping ceiling in dBFS. Samples exceeding this threshold are hard-clipped.
Enable dynamic range compression on the output. Accepts True or False.
Threshold in dBFS above which compression is applied.
Compression ratio (input dB : output dB above threshold). A value of 4 means 4:1 compression.
Compressor attack time in milliseconds.
Compressor release time in milliseconds.
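As a worked example of the threshold and ratio settings, the sketch below computes the steady-state output level of an idealized hard-knee compressor. It ignores attack and release behavior, and the function name is hypothetical, so treat it as an illustration of the ratio arithmetic rather than a description of Applio's compressor.

```python
def compressed_level_db(level_db: float, threshold_db: float, ratio: float) -> float:
    """Static output level of an idealized hard-knee compressor, in dB."""
    if level_db <= threshold_db:
        return level_db  # below the threshold the signal passes unchanged
    # Above the threshold, every `ratio` dB of input yields 1 dB of output.
    return threshold_db + (level_db - threshold_db) / ratio


# A peak 8 dB over a -20 dBFS threshold at 4:1 ends up 2 dB over it.
print(compressed_level_db(-12.0, -20.0, 4.0))
```

The same call with an input of -30 dBFS returns -30.0 unchanged, because that level never crosses the threshold.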
Enable a delay (echo) effect on the output. Accepts True or False.
Delay time in seconds.
Feedback amount for the delay; controls how many echoes repeat. Range: 0.0–1.0.
Wet/dry mix for the delay. Range: 0.0–1.0.
batch_infer — folder of files
batch_infer accepts a folder path for input and a folder path for output, then converts every audio file found in the input folder. All conversion flags are identical to infer.
batch_infer–specific flags
Path to the folder containing the audio files to convert.
Path to the folder where converted audio files will be saved. The folder will be created if it does not exist.
All other flags (--pth_path, --index_path, --pitch, --f0_method, etc.) are identical to the infer subcommand and carry the same defaults.
batch_infer does not have --input_path or --output_path; those are replaced by --input_folder and --output_folder.
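The relationship between the two subcommands can be sketched as a small helper that builds an argument list for either one. The flag names come from this page. The file paths and flag values are hypothetical placeholders, and the launcher script that these arguments would be appended to is not named here, so it is deliberately left out.

```python
# Conversion flags shared by both subcommands (names taken from this page;
# the paths and values are placeholders for illustration only).
SHARED_FLAGS = {
    "--pth_path": "models/voice.pth",
    "--index_path": "models/voice.index",
    "--pitch": "0",
    "--f0_method": "rmvpe",
    "--export_format": "WAV",
}


def build_args(subcommand: str, source: str, destination: str) -> list[str]:
    """Return the argument list for `infer` or `batch_infer`."""
    if subcommand == "infer":
        path_args = ["--input_path", source, "--output_path", destination]
    elif subcommand == "batch_infer":
        path_args = ["--input_folder", source, "--output_folder", destination]
    else:
        raise ValueError(f"unknown subcommand: {subcommand}")
    args = [subcommand, *path_args]
    for flag, value in SHARED_FLAGS.items():
        args += [flag, value]
    return args


single = build_args("infer", "clips/take1.wav", "out/take1.wav")
batch = build_args("batch_infer", "clips", "out")
print(" ".join(single))
print(" ".join(batch))
```

Only the subcommand name and the two path flags differ between the two lists; everything after them is the same.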