Applio CLI infer and batch_infer: Voice Conversion

The infer and batch_infer subcommands are the core of Applio’s voice-conversion pipeline. infer processes a single input audio file and writes one converted output file, while batch_infer accepts a folder path and converts every audio file inside it in a single run. Both subcommands share an identical set of conversion flags; only the path arguments differ. All audio quality, pitch, effects, and embedding settings described below apply to both unless otherwise noted.

infer — single file

Use infer when you want to convert one audio clip and inspect the result before committing to a full batch run, or when your workflow processes files one at a time.

Required flags

--input_path (string, required)

Full path to the input audio file to be converted.

--output_path (string, required)

Full path where the converted audio file will be saved. The file extension in this path should match --export_format.

--pth_path (string, required)

Full path to the trained RVC model file (.pth).

--index_path (string, required)

Full path to the FAISS index file (.index) that accompanies the model.
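
Because the input, model, and index paths must all point at existing files, a small pre-flight check can catch typos before a conversion starts. The sketch below is illustrative only: missing_required_paths is a hypothetical helper, not part of Applio, and it skips --output_path because that file is created by the run.

```python
from pathlib import Path


def missing_required_paths(input_path, pth_path, index_path):
    """Return the flags whose values do not point at an existing file."""
    required = {
        "--input_path": input_path,
        "--pth_path": pth_path,
        "--index_path": index_path,
    }
    return [flag for flag, value in required.items() if not Path(value).is_file()]


if __name__ == "__main__":
    # Example paths taken from the usage examples in this page.
    problems = missing_required_paths(
        "audio/recording.wav",
        "logs/MyModel/MyModel.pth",
        "logs/MyModel/MyModel.index",
    )
    print("Missing files for:", ", ".join(problems) if problems else "none")
```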

Pitch and voice quality

--pitch (integer, default: 0)

Pitch shift in semitones. Accepted range: -24 to 24. Positive values raise the pitch; negative values lower it. Use this to adapt a model trained on a different register to the target pitch.
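
To get a feel for what a given --pitch value does, it can help to convert semitones into a frequency ratio. The sketch below uses the standard equal-temperament relation (ratio = 2^(n/12)); the function name is made up for illustration, and the range check simply mirrors the range documented above.

```python
def semitones_to_ratio(semitones):
    """Frequency ratio for an equal-tempered shift of `semitones`."""
    if not -24 <= semitones <= 24:
        raise ValueError("--pitch is documented as accepting -24 to 24")
    return 2.0 ** (semitones / 12.0)


if __name__ == "__main__":
    for shift in (-12, 0, 7, 12):
        print(f"{shift:+d} semitones -> x{semitones_to_ratio(shift):.4f}")
```

A shift of 12 doubles every frequency (one octave up), and -12 halves it.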

--index_rate (float, default: 0.3)

Controls how strongly the FAISS index influences the output. Range: 0.0–1.0. Higher values produce voice characteristics closer to the training data; lower values reduce index-related artifacts at the cost of some accuracy.

--volume_envelope (float, default: 1.0)

Blends the output’s volume envelope with the input’s. A value of 1.0 applies the output envelope fully. Range: 0.0–1.0.

--protect (float, default: 0.33)

Protects consonants and breath sounds from conversion artifacts. Range: 0.0–0.5. A value of 0.5 applies maximum protection; lower values reduce protection but can also reduce the indexing effect on those sounds.

--f0_method (string, default: rmvpe)

Pitch-extraction algorithm. Choices: crepe, crepe-tiny, rmvpe, fcpe, hybrid[crepe+rmvpe], hybrid[crepe+fcpe], hybrid[rmvpe+fcpe], hybrid[crepe+rmvpe+fcpe]. rmvpe is recommended for most use cases; crepe variants may handle some voices better at the cost of speed; hybrid modes combine multiple extractors.
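
The hybrid[...] syntax packs several extractor names into one value. As a rough illustration of how a wrapper script might validate a value and see which extractors it names, the sketch below checks against the choices listed above and splits the hybrid form. f0_extractors is a hypothetical helper; how Applio itself parses and combines the extractors is not shown here.

```python
# Choices copied from the --f0_method description above.
F0_METHODS = (
    "crepe",
    "crepe-tiny",
    "rmvpe",
    "fcpe",
    "hybrid[crepe+rmvpe]",
    "hybrid[crepe+fcpe]",
    "hybrid[rmvpe+fcpe]",
    "hybrid[crepe+rmvpe+fcpe]",
)


def f0_extractors(method):
    """Return the extractor names referred to by an --f0_method value."""
    if method not in F0_METHODS:
        raise ValueError(f"unsupported --f0_method: {method!r}")
    if method.startswith("hybrid["):
        # Strip the "hybrid[" prefix and the closing "]", then split on "+".
        return method[len("hybrid["):-1].split("+")
    return [method]


if __name__ == "__main__":
    print(f0_extractors("rmvpe"))
    print(f0_extractors("hybrid[crepe+rmvpe+fcpe]"))
```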

--sid (integer, default: 0)

Speaker ID for multi-speaker models. For single-speaker models this should remain 0.

Audio processing

--split_audio (boolean, default: False)

Split the input into smaller segments before inference. Recommended for long recordings (several minutes or more) to reduce memory usage and improve quality at segment boundaries. Accepts True or False.

--f0_autotune (boolean, default: False)

Apply a light autotune to the converted output. Particularly useful for singing voice conversion. Accepts True or False.

--f0_autotune_strength (float, default: 1.0)

Strength of the autotune snap to the chromatic grid. Range: 0.0–1.0. Only active when --f0_autotune True.
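
One way to picture "snap to the chromatic grid" is to pull each pitch value toward the nearest equal-tempered note, with the strength setting controlling how far it moves. The sketch below is a conceptual model under that assumption, with A4 = 440 Hz and linear interpolation in Hz; it is not taken from Applio's code, and the function names are hypothetical.

```python
import math

A4_HZ = 440.0  # assumed reference tuning


def nearest_semitone_hz(f0_hz):
    """Frequency of the equal-tempered note closest to f0_hz."""
    midi = 69 + 12 * math.log2(f0_hz / A4_HZ)
    return A4_HZ * 2 ** ((round(midi) - 69) / 12)


def autotune(f0_hz, strength):
    """Move f0_hz toward the nearest note; strength 0.0 leaves it unchanged."""
    if f0_hz <= 0:
        return f0_hz  # unvoiced frames carry no pitch to correct
    target = nearest_semitone_hz(f0_hz)
    return f0_hz + (target - f0_hz) * strength


if __name__ == "__main__":
    for strength in (0.0, 0.5, 1.0):
        print(strength, autotune(450.0, strength))
```

In this model a slightly sharp 450 Hz stays put at strength 0.0, lands halfway at 0.5, and snaps fully to A4 at 1.0.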

--proposed_pitch (boolean, default: False)

Enable proposed pitch adjustment mode. Accepts True or False.

--proposed_pitch_threshold (float, default: 155.0)

Threshold frequency (Hz) for the proposed pitch adjustment. Range: 50–1199. Only active when --proposed_pitch True.

--clean_audio (boolean, default: False)

Run a noise-reduction pass on the output audio. Recommended for speech conversions. Accepts True or False.

--clean_strength (float, default: 0.7)

Intensity of the noise-reduction pass. Range: 0.0–1.0. Higher values produce a cleaner but potentially more compressed output. Only active when --clean_audio True.

--export_format (string, default: WAV)

Output file format. Choices: WAV, MP3, FLAC, OGG, M4A.
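
Since the extension in --output_path should match --export_format, a wrapper script can derive one from the other rather than keeping both in sync by hand. The sketch below assumes a lowercase extension (for example .flac for FLAC); output_path_for is a hypothetical helper, not an Applio function.

```python
from pathlib import Path

# Choices copied from the --export_format description above.
EXPORT_FORMATS = ("WAV", "MP3", "FLAC", "OGG", "M4A")


def output_path_for(output_path, export_format):
    """Return output_path with its extension replaced to match export_format."""
    fmt = export_format.upper()
    if fmt not in EXPORT_FORMATS:
        raise ValueError(f"unsupported --export_format: {export_format!r}")
    return Path(output_path).with_suffix("." + fmt.lower())


if __name__ == "__main__":
    print(output_path_for("audio/converted.wav", "FLAC"))
```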

Embedder model

--embedder_model (string, default: contentvec)

Speaker-embedding model used during inference. Choices: contentvec, spin, spin-v2, chinese-hubert-base, japanese-hubert-base, korean-hubert-base, custom. Language-specific HuBERT models can improve quality for non-English voices.

--embedder_model_custom (string, default: None)

Path to a custom embedding model directory. Only used when --embedder_model custom.

Formant shifting

--formant_shifting (boolean, default: False)

Apply formant shifting to the input before conversion. This adjusts vocal tract resonances independently of pitch and can be useful for more convincing gender conversions. Accepts True or False.

--formant_qfrency (float, default: 1.0)

Quefrency factor for the formant shifting effect. Higher values produce a more pronounced shift. Only active when --formant_shifting True.

--formant_timbre (float, default: 1.0)

Timbre factor for the formant shifting effect. Higher values produce a more pronounced timbre change. Only active when --formant_shifting True.

Post-processing effects

All post-processing effects are off by default. Set --post_process True to enable the effects chain, then switch on the individual effects you want.

--post_process (boolean, default: False)

Master switch for the post-processing effects chain. When False, all effect flags below are ignored. Accepts True or False.
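
Because effect flags are silently ignored while the master switch is off, it is easy to enable an effect and then wonder why nothing changed. The sketch below shows one way a wrapper script could flag that situation before launching a run. The options dictionary and the ignored_effects helper are hypothetical; the effect names are the flags documented in this section.

```python
# Per-effect switches documented in this section.
EFFECT_FLAGS = (
    "reverb",
    "pitch_shift",
    "limiter",
    "gain",
    "distortion",
    "chorus",
    "bitcrush",
    "clipping",
    "compressor",
    "delay",
)


def ignored_effects(options):
    """Effects that are switched on but ignored because post_process is off."""
    if options.get("post_process", False):
        return []
    return [name for name in EFFECT_FLAGS if options.get(name, False)]


if __name__ == "__main__":
    options = {"reverb": True, "reverb_room_size": 0.4}
    for name in ignored_effects(options):
        print(f"--{name} True has no effect without --post_process True")
```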

Reverb

--reverb (boolean, default: False)

Enable reverb on the output. Accepts True or False.

--reverb_room_size (float, default: 0.5)

Simulated room size for the reverb. Range: 0.0–1.0.

--reverb_damping (float, default: 0.5)

High-frequency damping of the reverb tail. Range: 0.0–1.0.

--reverb_wet_gain (float, default: 0.5)

Level of the wet (reverb) signal. Range: 0.0–1.0.

--reverb_dry_gain (float, default: 0.5)

Level of the dry (direct) signal mixed with reverb. Range: 0.0–1.0.

--reverb_width (float, default: 0.5)

Stereo width of the reverb. Range: 0.0–1.0.

--reverb_freeze_mode (float, default: 0.5)

Controls how much the reverb tail is frozen (sustained indefinitely). Range: 0.0–1.0.

Pitch shift (post-processing)

--pitch_shift (boolean, default: False)

Enable a secondary pitch shift applied after conversion. Accepts True or False.

--pitch_shift_semitones (float, default: 0.0)

Amount of post-processing pitch shift in semitones. Positive values increase pitch, negative values decrease it.

Limiter

--limiter (boolean, default: False)

Enable a brickwall limiter on the output. Accepts True or False.

--limiter_threshold (float, default: -6)

Limiter ceiling in dBFS. Output will not exceed this level.

--limiter_release_time (float, default: 0.01)

Limiter release time in seconds.

Gain

--gain (boolean, default: False)

Enable a gain stage on the output. Accepts True or False.

--gain_db (float, default: 0.0)

Gain to apply in dB. Positive values amplify; negative values attenuate.
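
Decibel values are logarithmic, so it can be useful to see the linear amplitude factor a --gain_db value corresponds to. The sketch below applies the standard amplitude relation (factor = 10^(dB/20)); db_to_gain is just an illustrative name.

```python
def db_to_gain(gain_db):
    """Linear amplitude multiplier for a gain expressed in dB."""
    return 10.0 ** (gain_db / 20.0)


if __name__ == "__main__":
    for db in (-6.0, 0.0, 6.0):
        print(f"{db:+.1f} dB -> x{db_to_gain(db):.3f}")
```

Roughly, +6 dB doubles the amplitude and -6 dB halves it. The same relation converts the dBFS thresholds used by the limiter, clipping, and compressor flags into linear levels.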

Distortion

--distortion (boolean, default: False)

Enable distortion on the output. Accepts True or False.

--distortion_gain (float, default: 25)

Drive amount for the distortion effect. Higher values produce more saturation.

Chorus

--chorus (boolean, default: False)

Enable chorus on the output. Accepts True or False.

--chorus_rate (float, default: 1.0)

LFO rate of the chorus in Hz.

--chorus_depth (float, default: 0.25)

Modulation depth of the chorus.

--chorus_center_delay (float, default: 7)

Center delay time of the chorus in milliseconds.

--chorus_feedback (float, default: 0.0)

Feedback amount of the chorus. Range: 0.0–1.0.

--chorus_mix (float, default: 0.5)

Wet/dry mix of the chorus. Range: 0.0–1.0.

Bitcrush

--bitcrush (boolean, default: False)

Enable bit-depth reduction on the output. Accepts True or False.

--bitcrush_bit_depth (integer, default: 8)

Target bit depth for the bitcrush effect. Lower values produce a more degraded sound.
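
Bit-depth reduction amounts to rounding each sample onto a coarser grid. The sketch below is a generic uniform quantizer for samples in the range -1.0 to 1.0, included only to show why lower bit depths sound rougher; it is an assumption-laden illustration, not the effect implementation Applio uses.

```python
def bitcrush(sample, bit_depth):
    """Round a sample in [-1.0, 1.0] onto a grid with step 2 / 2**bit_depth."""
    steps = 2 ** (bit_depth - 1)  # grid points per unit of amplitude
    return round(sample * steps) / steps


if __name__ == "__main__":
    for depth in (16, 8, 4, 2):
        print(depth, bitcrush(0.3, depth))
```

At 8 bits the value 0.3 moves only slightly (to 0.296875), while at 2 bits it jumps to 0.5.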

Clipping

--clipping (boolean, default: False)

Enable hard clipping on the output. Accepts True or False.

--clipping_threshold (float, default: -6)

Clipping ceiling in dBFS. Samples exceeding this threshold are hard-clipped.
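
Hard clipping simply flattens any sample whose magnitude exceeds the ceiling. As a minimal sketch, assuming samples are normalized so that 0 dBFS corresponds to an amplitude of 1.0, the threshold is converted to a linear ceiling and each sample is clamped to it. hard_clip is an illustrative name, not an Applio function.

```python
def hard_clip(sample, threshold_dbfs):
    """Clamp a sample to the linear ceiling implied by threshold_dbfs."""
    ceiling = 10.0 ** (threshold_dbfs / 20.0)  # dBFS -> linear amplitude
    return max(-ceiling, min(ceiling, sample))


if __name__ == "__main__":
    for sample in (0.2, 0.9, -0.9):
        print(sample, "->", round(hard_clip(sample, -6.0), 4))
```

With the default of -6 dBFS the ceiling is about 0.5, so quiet samples pass through unchanged and loud peaks are flattened, which is what produces the harsh character of the effect.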

Compressor

--compressor (boolean, default: False)

Enable dynamic range compression on the output. Accepts True or False.

--compressor_threshold (float, default: 0)

Threshold in dBFS above which compression is applied.

--compressor_ratio (float, default: 1)

Compression ratio (input dB : output dB above threshold). A value of 4 means 4:1 compression. A ratio of 1 (the default) leaves the signal uncompressed.

--compressor_attack (float, default: 1.0)

Compressor attack time in milliseconds.

--compressor_release (float, default: 100)

Compressor release time in milliseconds.
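
The threshold and ratio together define a static gain curve: levels below the threshold pass through, and every dB above it is divided by the ratio. The sketch below models only that steady-state curve and ignores attack and release; it is a textbook approximation with a hypothetical function name rather than a description of Applio's compressor.

```python
def compressed_level_db(input_db, threshold_db, ratio):
    """Steady-state output level of a simple hard-knee compressor."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    if input_db <= threshold_db:
        return input_db  # below threshold: unchanged
    return threshold_db + (input_db - threshold_db) / ratio


if __name__ == "__main__":
    # A peak 12 dB over a -18 dBFS threshold at 4:1 ends up 3 dB over it.
    print(compressed_level_db(-6.0, -18.0, 4.0))
```

With a threshold of -18 dBFS and a ratio of 4, a -6 dBFS peak comes out at -15 dBFS. With the default ratio of 1 the output equals the input at every level.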

Delay

--delay (boolean, default: False)

Enable a delay (echo) effect on the output. Accepts True or False.

--delay_seconds (float, default: 0.5)

Delay time in seconds.

--delay_feedback (float, default: 0.0)

Feedback amount for the delay; controls how many echoes repeat. Range: 0.0–1.0.

--delay_mix (float, default: 0.5)

Wet/dry mix for the delay. Range: 0.0–1.0.
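
To see how feedback controls the number of audible repeats, consider a basic feedback delay line in which each echo arrives one delay time after the previous one and is scaled by the feedback amount. The sketch below lists echo times and relative levels under that simple model (before the wet/dry mix is applied); it is an assumption for illustration, and echoes is a hypothetical helper.

```python
def echoes(delay_seconds, feedback, count):
    """(time in seconds, relative level) for the first `count` echoes."""
    return [(delay_seconds * (n + 1), feedback ** n) for n in range(count)]


if __name__ == "__main__":
    for time_s, level in echoes(0.5, 0.5, 4):
        print(f"{time_s:.1f} s  level {level:.3f}")
```

With the default feedback of 0.0 only the first echo is heard; at 0.5 each repeat is half as loud as the one before.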

batch_infer — folder of files

batch_infer accepts a folder path for input and a folder path for output, then converts every audio file found in the input folder. All conversion flags are identical to infer.

batch_infer–specific flags

--input_folder (string, required)

Path to the folder containing the audio files to convert.

--output_folder (string, required)

Path to the folder where converted audio files will be saved. The folder will be created if it does not exist.

All other flags (--pth_path, --index_path, --pitch, --f0_method, etc.) have the same meaning and defaults as in infer.

batch_infer has no --input_path or --output_path; --input_folder and --output_folder replace them.

Usage examples

Single-file conversion

python core.py infer \
  --input_path audio/recording.wav \
  --output_path audio/converted.wav \
  --pth_path logs/MyModel/MyModel.pth \
  --index_path logs/MyModel/MyModel.index \
  --pitch 0 \
  --f0_method rmvpe \
  --index_rate 0.3 \
  --export_format WAV

Single-file conversion with post-processing

python core.py infer \
  --input_path audio/recording.wav \
  --output_path audio/converted_with_fx.wav \
  --pth_path logs/MyModel/MyModel.pth \
  --index_path logs/MyModel/MyModel.index \
  --pitch 2 \
  --f0_method rmvpe \
  --split_audio True \
  --clean_audio True \
  --clean_strength 0.5 \
  --post_process True \
  --reverb True \
  --reverb_room_size 0.4 \
  --reverb_wet_gain 0.3 \
  --reverb_dry_gain 0.7 \
  --export_format FLAC

Batch conversion

python core.py batch_infer \
  --input_folder audio/raw_clips/ \
  --output_folder audio/converted_clips/ \
  --pth_path logs/MyModel/MyModel.pth \
  --index_path logs/MyModel/MyModel.index \
  --pitch 0 \
  --f0_method rmvpe \
  --index_rate 0.3 \
  --export_format WAV

For large batches, enable --split_audio True to reduce memory pressure on long files and --clean_audio True to automatically reduce noise in the output clips.
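
When these commands are launched from another script, building the argument list programmatically avoids shell-quoting problems with paths that contain spaces. The sketch below assembles such a list from a dictionary of flag names and values. build_command is a hypothetical helper, and it assumes the script is run from the directory that contains core.py, as in the examples above.

```python
import sys


def build_command(subcommand, options, script="core.py"):
    """Build an argument list for `python core.py <subcommand> --flag value ...`."""
    if subcommand not in ("infer", "batch_infer"):
        raise ValueError(f"unexpected subcommand: {subcommand!r}")
    command = [sys.executable, script, subcommand]
    for name, value in options.items():
        # Booleans become the literal strings "True" / "False" the flags accept.
        command += [f"--{name}", str(value)]
    return command


if __name__ == "__main__":
    command = build_command(
        "batch_infer",
        {
            "input_folder": "audio/raw_clips/",
            "output_folder": "audio/converted_clips/",
            "pth_path": "logs/MyModel/MyModel.pth",
            "index_path": "logs/MyModel/MyModel.index",
            "split_audio": True,
            "clean_audio": True,
        },
    )
    print(" ".join(command))
```

Passing the resulting list to subprocess.run(command, check=True) would start the run and raise an error if the process exits with a non-zero status.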
