The infer and batch_infer subcommands are the core of Applio’s voice-conversion pipeline. infer processes a single input audio file and writes one converted output file, while batch_infer accepts a folder path and converts every audio file inside it in a single run. Both subcommands share an identical set of conversion flags; only the path arguments differ. All audio quality, pitch, effects, and embedding settings described below apply to both unless otherwise noted.

infer — single file

Use infer when you want to convert one audio clip and inspect the result before committing to a full batch run, or when your workflow processes files one at a time.

Required flags

--input_path
string
required
Full path to the input audio file to be converted.
--output_path
string
required
Full path where the converted audio file will be saved. The file extension in this path should match --export_format.
--pth_path
string
required
Full path to the trained RVC model file (.pth).
--index_path
string
required
Full path to the FAISS index file (.index) that accompanies the model.
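
The four required flags can be assembled in a script before they are handed to the CLI. The following is a minimal sketch, assuming the tool is launched as `python core.py infer` as in the usage examples further down; the helper name `build_infer_command` is hypothetical and not part of Applio.

```python
def build_infer_command(input_path, output_path, pth_path, index_path,
                        python_exe="python"):
    """Return an argv list for `core.py infer` with the four required flags.

    Hypothetical helper: it only builds the argument list and does not
    launch anything. Paths are passed through unchanged.
    """
    required = {
        "--input_path": input_path,
        "--output_path": output_path,
        "--pth_path": pth_path,
        "--index_path": index_path,
    }
    cmd = [python_exe, "core.py", "infer"]
    for flag, value in required.items():
        if not value:
            raise ValueError(f"{flag} is required")
        cmd += [flag, str(value)]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains core.py.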

Pitch and voice quality

--pitch
integer
default:"0"
Pitch shift in semitones. Accepted range: -24 to 24. Positive values raise the pitch; negative values lower it. Use this when the input voice sits in a different register from the voice the model was trained on.
--index_rate
float
default:"0.3"
Controls how strongly the FAISS index influences the output. Range: 0.0 to 1.0. Higher values produce voice characteristics closer to the training data; lower values reduce index-related artifacts at the cost of some accuracy.
--volume_envelope
float
default:"1.0"
Blends the output’s volume envelope with the input’s. A value of 1.0 applies the output envelope fully. Range: 0.0 to 1.0.
--protect
float
default:"0.33"
Protects consonants and breath sounds from conversion artifacts. Range: 0.0 to 0.5. A value of 0.5 applies maximum protection; lower values reduce protection but can also reduce the indexing effect on those sounds.
--f0_method
string
default:"rmvpe"
Pitch-extraction algorithm. Choices: crepe, crepe-tiny, rmvpe, fcpe, hybrid[crepe+rmvpe], hybrid[crepe+fcpe], hybrid[rmvpe+fcpe], hybrid[crepe+rmvpe+fcpe]. rmvpe is recommended for most use cases; crepe variants may handle some voices better at the cost of speed; hybrid modes combine multiple extractors.
--sid
integer
default:"0"
Speaker ID for multi-speaker models. For single-speaker models this should remain 0.
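
Out-of-range values are easy to catch before a long conversion starts. The sketch below checks the pitch and voice-quality flags against the ranges and choices listed in this section, reading the unit ranges as 0.0 to 1.0 (0.0 to 0.5 for --protect). The function name `check_voice_flags` is hypothetical, and how the CLI itself reacts to invalid values is not covered here.

```python
F0_METHODS = {
    "crepe", "crepe-tiny", "rmvpe", "fcpe",
    "hybrid[crepe+rmvpe]", "hybrid[crepe+fcpe]",
    "hybrid[rmvpe+fcpe]", "hybrid[crepe+rmvpe+fcpe]",
}

# (min, max) pairs taken from the ranges documented in this section.
RANGES = {
    "pitch": (-24, 24),
    "index_rate": (0.0, 1.0),
    "volume_envelope": (0.0, 1.0),
    "protect": (0.0, 0.5),
}


def check_voice_flags(pitch=0, index_rate=0.3, volume_envelope=1.0,
                      protect=0.33, f0_method="rmvpe"):
    """Return a list of problems; an empty list means every value is valid."""
    values = {
        "pitch": pitch,
        "index_rate": index_rate,
        "volume_envelope": volume_envelope,
        "protect": protect,
    }
    problems = []
    for name, value in values.items():
        low, high = RANGES[name]
        if not low <= value <= high:
            problems.append(f"--{name} {value} is outside {low} to {high}")
    if f0_method not in F0_METHODS:
        problems.append(f"--f0_method {f0_method!r} is not a documented choice")
    return problems
```

Calling it with no arguments checks the documented defaults, which all pass.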

Audio processing

--split_audio
boolean
default:"False"
Split the input into smaller segments before inference. Recommended for long recordings (several minutes or more) to reduce memory usage and improve quality at segment boundaries. Accepts True or False.
--f0_autotune
boolean
default:"False"
Apply a light autotune to the converted output. Particularly useful for singing voice conversion. Accepts True or False.
--f0_autotune_strength
float
default:"1.0"
Strength of the autotune snap to the chromatic grid. Range: 0.0 to 1.0. Only active when --f0_autotune True.
--proposed_pitch
boolean
default:"False"
Enable proposed pitch adjustment mode. Accepts True or False.
--proposed_pitch_threshold
float
default:"155.0"
Threshold frequency (Hz) for the proposed pitch adjustment. Range: 50 to 1199. Only active when --proposed_pitch True.
--clean_audio
boolean
default:"False"
Run a noise-reduction pass on the output audio. Recommended for speech conversions. Accepts True or False.
--clean_strength
float
default:"0.7"
Intensity of the noise-reduction pass. Range: 0.0 to 1.0. Higher values produce a cleaner but potentially more compressed output. Only active when --clean_audio True.
--export_format
string
default:"WAV"
Output file format. Choices: WAV, MP3, FLAC, OGG, M4A.
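
Because the extension in --output_path should match --export_format, it can help to derive one from the other. This is a small sketch under the assumption that the expected extension is simply the lowercased format name; `output_path_for` is a hypothetical helper.

```python
from pathlib import Path

EXPORT_FORMATS = ("WAV", "MP3", "FLAC", "OGG", "M4A")


def output_path_for(path, export_format="WAV"):
    """Return `path` with its suffix replaced to match `export_format`.

    Assumption: the extension is the lowercased format name (WAV -> .wav).
    """
    fmt = export_format.upper()
    if fmt not in EXPORT_FORMATS:
        raise ValueError(f"unsupported export format: {export_format}")
    return Path(path).with_suffix("." + fmt.lower())
```

For example, `output_path_for("audio/converted.wav", "FLAC")` gives a path ending in `converted.flac`.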

Embedder model

--embedder_model
string
default:"contentvec"
Speaker-embedding model used during inference. Choices: contentvec, spin, spin-v2, chinese-hubert-base, japanese-hubert-base, korean-hubert-base, custom. Language-specific HuBERT models can improve quality for non-English voices.
--embedder_model_custom
string
default:"None"
Path to a custom embedding model directory. Only used when --embedder_model custom.
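
The dependency between the two embedder flags can be enforced on the calling side. The sketch below is a client-side check only; the name `embedder_flags` is hypothetical, and whether the CLI itself rejects a missing custom path is not stated in this page.

```python
EMBEDDERS = (
    "contentvec", "spin", "spin-v2", "chinese-hubert-base",
    "japanese-hubert-base", "korean-hubert-base", "custom",
)


def embedder_flags(embedder_model="contentvec", embedder_model_custom=None):
    """Return the embedder arguments as a list of CLI tokens."""
    if embedder_model not in EMBEDDERS:
        raise ValueError(f"unknown embedder model: {embedder_model}")
    flags = ["--embedder_model", embedder_model]
    if embedder_model == "custom":
        # The custom path is only meaningful together with `custom`.
        if not embedder_model_custom:
            raise ValueError(
                "--embedder_model custom needs --embedder_model_custom")
        flags += ["--embedder_model_custom", str(embedder_model_custom)]
    return flags
```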

Formant shifting

--formant_shifting
boolean
default:"False"
Apply formant shifting to the input before conversion. This adjusts vocal tract resonances independently of pitch and can be useful for more convincing gender conversions. Accepts True or False.
--formant_qfrency
float
default:"1.0"
Quefrency factor for the formant shifting effect. Higher values produce a more pronounced shift. Only active when --formant_shifting True.
--formant_timbre
float
default:"1.0"
Timbre factor for the formant shifting effect. Higher values produce a more pronounced timbre change. Only active when --formant_shifting True.

Post-processing effects

All post-processing effects are inactive by default. Set the master switch --post_process True first, then enable the individual effects you want.
--post_process
boolean
default:"False"
Enable the post-processing effects chain. When False, all effect flags below are ignored. Accepts True or False.
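
Since effect flags are ignored without the master switch, a wrapper can add it automatically whenever any effect is requested. A minimal sketch, with `post_process_flags` as a hypothetical helper that does not validate flag names or values:

```python
def post_process_flags(**effects):
    """Turn effect settings into CLI tokens, prepending the master switch.

    Keys are flag names without the leading dashes, for example
    reverb=True, reverb_room_size=0.4. Booleans become "True" / "False".
    """
    if not effects:
        return []
    flags = ["--post_process", "True"]
    for name, value in effects.items():
        flags += [f"--{name}", str(value)]
    return flags
```

`post_process_flags(reverb=True, reverb_room_size=0.4)` yields `--post_process True --reverb True --reverb_room_size 0.4` as separate tokens.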
Reverb
--reverb
boolean
default:"False"
Enable reverb on the output. Accepts True or False.
--reverb_room_size
float
default:"0.5"
Simulated room size for the reverb. Range: 0.0 to 1.0.
--reverb_damping
float
default:"0.5"
High-frequency damping of the reverb tail. Range: 0.0 to 1.0.
--reverb_wet_gain
float
default:"0.5"
Level of the wet (reverb) signal. Range: 0.0 to 1.0.
--reverb_dry_gain
float
default:"0.5"
Level of the dry (direct) signal mixed with reverb. Range: 0.0 to 1.0.
--reverb_width
float
default:"0.5"
Stereo width of the reverb. Range: 0.0 to 1.0.
--reverb_freeze_mode
float
default:"0.5"
Controls how much the reverb tail is frozen / infinitely sustained. Range: 0.0 to 1.0.
Pitch shift (post-processing)
--pitch_shift
boolean
default:"False"
Enable a secondary pitch shift applied after conversion. Accepts True or False.
--pitch_shift_semitones
float
default:"0.0"
Amount of post-processing pitch shift in semitones. Positive values increase pitch, negative values decrease it.
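
A semitone value maps to a frequency ratio through the standard equal-temperament relation, ratio = 2^(semitones / 12). The snippet below only illustrates that arithmetic and makes no claim about how the effect is implemented internally.

```python
def semitones_to_ratio(semitones):
    """Frequency ratio for a shift of `semitones` in 12-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)


def shifted_frequency(frequency_hz, semitones):
    """Frequency that results from shifting `frequency_hz` by `semitones`."""
    return frequency_hz * semitones_to_ratio(semitones)
```

A shift of 12 semitones doubles the frequency and -12 halves it; shifting 440 Hz up by 2 semitones lands near 493.9 Hz.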
Limiter
--limiter
boolean
default:"False"
Enable a brickwall limiter on the output. Accepts True or False.
--limiter_threshold
float
default:"-6"
Limiter ceiling in dBFS. Output will not exceed this level.
--limiter_release_time
float
default:"0.01"
Limiter release time in seconds.
Gain
--gain
boolean
default:"False"
Enable a gain stage on the output. Accepts True or False.
--gain_db
float
default:"0.0"
Gain to apply in dB. Positive values amplify; negative values attenuate.
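
Values given in dB are logarithmic, so it can be useful to see what they mean as a linear amplitude factor. The conversion below is the standard amplitude relation, factor = 10^(dB / 20); it is shown for orientation only.

```python
def db_to_amplitude(db):
    """Linear amplitude factor corresponding to a gain of `db` decibels."""
    return 10.0 ** (db / 20.0)
```

A --gain_db of 0.0 leaves the level unchanged (factor 1.0), -6 roughly halves the amplitude, and +6 roughly doubles it.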
Distortion
--distortion
boolean
default:"False"
Enable distortion on the output. Accepts True or False.
--distortion_gain
float
default:"25"
Drive amount for the distortion effect. Higher values produce more saturation.
Chorus
--chorus
boolean
default:"False"
Enable chorus on the output. Accepts True or False.
--chorus_rate
float
default:"1.0"
LFO rate of the chorus in Hz.
--chorus_depth
float
default:"0.25"
Modulation depth of the chorus.
--chorus_center_delay
float
default:"7"
Center delay time of the chorus in milliseconds.
--chorus_feedback
float
default:"0.0"
Feedback amount of the chorus. Range: 0.0 to 1.0.
--chorus_mix
float
default:"0.5"
Wet/dry mix of the chorus. Range: 0.0 to 1.0.
Bitcrush
--bitcrush
boolean
default:"False"
Enable bit-depth reduction on the output. Accepts True or False.
--bitcrush_bit_depth
integer
default:"8"
Target bit depth for the bitcrush effect. Lower values produce a more degraded sound.
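
To get a feel for what bit-depth reduction does, the sketch below quantizes samples onto a coarse grid. It is a textbook illustration, assuming samples are floats in the range -1.0 to 1.0, and is not taken from Applio's effect code.

```python
def bitcrush(samples, bit_depth=8):
    """Snap each sample to a grid with step 1 / 2**(bit_depth - 1)."""
    if bit_depth < 1:
        raise ValueError("bit_depth must be at least 1")
    scale = 2 ** (bit_depth - 1)
    # round() picks the nearest grid point; fewer bits means a coarser grid.
    return [round(s * scale) / scale for s in samples]
```

At a bit depth of 2 the grid step is 0.5, so a sample of 0.3 becomes 0.5; at 8 bits the step shrinks to 1/128 and the change is far subtler.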
Clipping
--clipping
boolean
default:"False"
Enable hard clipping on the output. Accepts True or False.
--clipping_threshold
float
default:"-6"
Clipping ceiling in dBFS. Samples exceeding this threshold are hard-clipped.
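
Hard clipping at a dBFS threshold can be pictured as follows. This is an illustrative sketch assuming float samples where 1.0 corresponds to 0 dBFS; it does not reproduce the effect's actual implementation.

```python
def hard_clip(samples, threshold_db=-6.0):
    """Clamp samples to +/- the linear amplitude of `threshold_db` dBFS."""
    ceiling = 10.0 ** (threshold_db / 20.0)
    return [max(-ceiling, min(ceiling, s)) for s in samples]
```

With the default threshold of -6, the ceiling is about 0.5, so a sample of 0.9 is cut down to roughly 0.5 while a sample of 0.2 passes through untouched.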
Compressor
--compressor
boolean
default:"False"
Enable dynamic range compression on the output. Accepts True or False.
--compressor_threshold
float
default:"0"
Threshold in dBFS above which compression is applied.
--compressor_ratio
float
default:"1"
Compression ratio (input dB : output dB above threshold). A value of 4 means 4:1 compression.
--compressor_attack
float
default:"1.0"
Compressor attack time in milliseconds.
--compressor_release
float
default:"100"
Compressor release time in milliseconds.
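
The relationship between threshold and ratio can be worked through with the usual static compressor curve: above the threshold, each `ratio` dB of input yields 1 dB of output. The function below models only that curve and ignores attack and release; its name is hypothetical.

```python
def compressed_level_db(level_db, threshold_db=0.0, ratio=1.0):
    """Output level in dB for a static compressor curve (no attack/release)."""
    if ratio < 1.0:
        raise ValueError("ratio must be at least 1")
    if level_db <= threshold_db:
        return level_db  # below the threshold nothing changes
    return threshold_db + (level_db - threshold_db) / ratio
```

With a threshold of -12 and a ratio of 4, an input at -4 dBFS is 8 dB over the threshold and comes out 2 dB over it, at -10 dBFS. In this model the default ratio of 1 leaves every level unchanged, so enabling the compressor has an audible effect only once the ratio is raised above 1.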
Delay
--delay
boolean
default:"False"
Enable a delay (echo) effect on the output. Accepts True or False.
--delay_seconds
float
default:"0.5"
Delay time in seconds.
--delay_feedback
float
default:"0.0"
Feedback amount for the delay; controls how many echoes repeat. Range: 0.0 to 1.0.
--delay_mix
float
default:"0.5"
Wet/dry mix for the delay. Range: 0.0 to 1.0.
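
How feedback shapes the echo tail can be seen in a simple feedback-delay model, where each repeat arrives one delay period later and is scaled by the feedback amount. This is a textbook sketch that ignores --delay_mix; the effect itself may behave differently in detail.

```python
def echo_schedule(delay_seconds=0.5, feedback=0.0, count=4):
    """Return (time_in_seconds, relative_level) for the first `count` echoes."""
    return [(delay_seconds * (n + 1), feedback ** n) for n in range(count)]
```

With the default feedback of 0.0 only the first echo is audible (every later level is 0.0). With a feedback of 0.5, the echoes at 0.5 s, 1.0 s, and 1.5 s have relative levels 1.0, 0.5, and 0.25.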

batch_infer — folder of files

batch_infer accepts a folder path for input and a folder path for output, then converts every audio file found in the input folder. All conversion flags are identical to infer.

batch_infer–specific flags

--input_folder
string
required
Path to the folder containing the audio files to convert.
--output_folder
string
required
Path to the folder where converted audio files will be saved. The folder will be created if it does not exist.
All other flags (--pth_path, --index_path, --pitch, --f0_method, etc.) are identical to the infer subcommand and carry the same defaults.
batch_infer does not have --input_path or --output_path; those are replaced by --input_folder and --output_folder.
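
The swap from file paths to folder paths can be made explicit in a wrapper that refuses the infer-only flags. This is a minimal sketch, assuming the `python core.py batch_infer` invocation shown in the usage examples; `build_batch_infer_command` is a hypothetical helper and does not validate the shared conversion flags.

```python
def build_batch_infer_command(input_folder, output_folder, pth_path,
                              index_path, python_exe="python", **flags):
    """Return an argv list for `core.py batch_infer`.

    Extra keyword arguments become shared conversion flags, for example
    pitch=0 or f0_method="rmvpe".
    """
    for name in ("input_path", "output_path"):
        if name in flags:
            raise ValueError(f"--{name} is not accepted by batch_infer")
    cmd = [
        python_exe, "core.py", "batch_infer",
        "--input_folder", str(input_folder),
        "--output_folder", str(output_folder),
        "--pth_path", str(pth_path),
        "--index_path", str(index_path),
    ]
    for name, value in flags.items():
        cmd += [f"--{name}", str(value)]
    return cmd
```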

Usage examples

Single-file conversion

python core.py infer \
  --input_path audio/recording.wav \
  --output_path audio/converted.wav \
  --pth_path logs/MyModel/MyModel.pth \
  --index_path logs/MyModel/MyModel.index \
  --pitch 0 \
  --f0_method rmvpe \
  --index_rate 0.3 \
  --export_format WAV

Single-file conversion with post-processing

python core.py infer \
  --input_path audio/recording.wav \
  --output_path audio/converted_with_fx.wav \
  --pth_path logs/MyModel/MyModel.pth \
  --index_path logs/MyModel/MyModel.index \
  --pitch 2 \
  --f0_method rmvpe \
  --split_audio True \
  --clean_audio True \
  --clean_strength 0.5 \
  --post_process True \
  --reverb True \
  --reverb_room_size 0.4 \
  --reverb_wet_gain 0.3 \
  --reverb_dry_gain 0.7 \
  --export_format FLAC

Batch conversion

python core.py batch_infer \
  --input_folder audio/raw_clips/ \
  --output_folder audio/converted_clips/ \
  --pth_path logs/MyModel/MyModel.pth \
  --index_path logs/MyModel/MyModel.index \
  --pitch 0 \
  --f0_method rmvpe \
  --index_rate 0.3 \
  --export_format WAV
For large batches, enable --split_audio True to reduce memory pressure on long files and --clean_audio True to automatically reduce noise in the output clips.