Inference is the core operation of Applio: you feed it a source audio file, point it at a pre-trained .pth model and its companion .index file, and it returns a new audio file in which the voice has been converted to sound like the target speaker. Under the hood, Applio loads the model, extracts the fundamental frequency (F0) pitch contour from your input using one of several algorithms (RMVPE, FCPE, CREPE, or hybrids of these), encodes the audio using a speaker-embedding model such as ContentVec, and finally synthesises the output through the HiFi-GAN vocoder. The result is a natural-sounding voice conversion that preserves the prosody of the original while matching the timbre of the trained model.
Single vs. Batch Inference
Applio supports two inference modes, both driven by the same underlying pipeline in rvc/infer/infer.py.
- Single inference: converts one audio file at a time. Use this for quick tests and fine-tuning your settings before a larger run.
- Batch inference: converts every compatible audio file in an input folder and writes the results to an output folder. Supported extensions include wav, mp3, flac, ogg, opus, m4a, aac, alac, wma, aiff, webm, and ac3.
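
The way batch mode picks its inputs can be pictured with a small sketch. The helper name find_batch_inputs is hypothetical and not part of Applio. It only illustrates filtering a folder by the extensions listed above, and it assumes a case-insensitive, non-recursive match, which may differ from Applio's own rules.

```python
from pathlib import Path

# Extensions taken from the list above.
SUPPORTED_EXTENSIONS = {
    "wav", "mp3", "flac", "ogg", "opus", "m4a",
    "aac", "alac", "wma", "aiff", "webm", "ac3",
}


def find_batch_inputs(input_folder):
    """Return the supported audio files directly inside input_folder, sorted.

    Hypothetical helper: the case-insensitive, non-recursive matching is an
    assumption made for this sketch.
    """
    folder = Path(input_folder)
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower().lstrip(".") in SUPPORTED_EXTENSIONS
    )
```
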
Model files (.pth) and index files (.index) are expected to live inside logs/<model_name>/. When you select a model in the UI, Applio automatically attempts to locate the matching index file using a fuzzy folder/name matching algorithm.
Parameters
Shifts the output pitch in semitones. The valid range is -24 to +24. Positive values raise the pitch; negative values lower it. For male-to-female conversions, try values between +8 and +12.
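
For intuition, in twelve-tone equal temperament a shift of n semitones multiplies every frequency by 2^(n/12), so +12 doubles the pitch and -12 halves it. The sketch below shows that arithmetic. The function names are hypothetical, and this is not a claim about how Applio applies the shift internally.

```python
def semitones_to_ratio(semitones):
    """Frequency ratio for a pitch shift in 12-tone equal temperament."""
    if not -24 <= semitones <= 24:
        raise ValueError("pitch shift must be between -24 and +24 semitones")
    return 2.0 ** (semitones / 12.0)


def shift_f0(f0_hz, semitones):
    """Apply the shift to a single F0 value in Hz."""
    return f0_hz * semitones_to_ratio(semitones)
```

With these helpers, shift_f0(220.0, 12) moves A3 up one octave to A4.
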
The pitch extraction algorithm used to compute the F0 contour. Available choices:
| Value | Notes |
|---|---|
| rmvpe | Recommended default; accurate and fast |
| fcpe | Fast; good for real-time use |
| crepe | High quality but slower |
| crepe-tiny | Lightweight version of crepe |
| hybrid[crepe+rmvpe] | Blends crepe and rmvpe estimates |
| hybrid[crepe+fcpe] | Blends crepe and fcpe estimates |
| hybrid[rmvpe+fcpe] | Blends rmvpe and fcpe estimates |
| hybrid[crepe+rmvpe+fcpe] | Blends all three estimates |
Hybrid methods average the F0 curves from each constituent algorithm and can yield smoother results on challenging audio.
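
A minimal sketch of what blending F0 estimates could look like, assuming the hybrid modes take a per-frame mean of the constituent curves. The function name is hypothetical, and treating 0 Hz as an unvoiced frame that is left out of the mean is an extra assumption of this sketch. Applio's actual combination rule may differ.

```python
def average_f0(curves):
    """Blend several equal-length F0 curves (Hz) frame by frame.

    Assumption: a value of 0 marks an unvoiced frame and is excluded from
    the mean; a frame is unvoiced in the result only if every curve says so.
    """
    if not curves:
        raise ValueError("at least one F0 curve is required")
    length = len(curves[0])
    if any(len(curve) != length for curve in curves):
        raise ValueError("all F0 curves must have the same number of frames")

    blended = []
    for frame in zip(*curves):
        voiced = [f for f in frame if f > 0]
        blended.append(sum(voiced) / len(voiced) if voiced else 0.0)
    return blended
```
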
Controls how much influence the .index file has on the output, on a scale of 0.0 to 1.0. Higher values push the output closer to the voice characteristics captured in the index. Lower values reduce index influence, which can help when the index introduces audible artefacts.
Blends the volume envelope of the converted output on a scale of 0.0 to 1.0. A value of 1.0 uses the output’s own volume envelope entirely. Lower values blend in the original input’s envelope, which can be useful for preserving the dynamics of the source recording.
Protects voiceless consonants and breath sounds from conversion artefacts on a scale of 0.0 to 0.5. A value of 0.5 provides the strongest protection. Reducing this value may lessen the protection but can also reduce over-indexing side effects.
When enabled, Applio splits the input into smaller segments before inference and re-joins them afterwards. This can significantly improve quality on long recordings where silence handling matters.
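
The volume-envelope blend described above can be sketched as a per-frame gain applied to the converted audio. The formula gain = (rms_input / rms_output)^(1 - rate) is the RMS-ratio form commonly seen in RVC-style pipelines. Whether Applio uses exactly this form is an assumption, and envelope_gain is a hypothetical name.

```python
def envelope_gain(rms_input, rms_output, rate, eps=1e-6):
    """Gain to apply to one output frame for a given blend rate.

    rate = 1.0 leaves the output's own envelope untouched (gain 1.0);
    rate = 0.0 rescales the frame to match the input's RMS level.
    Assumed formula, not taken from Applio's source.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    # Guard against division by zero on silent frames.
    rms_input = max(rms_input, eps)
    rms_output = max(rms_output, eps)
    return (rms_input / rms_output) ** (1.0 - rate)
```
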
Applies a light autotune to the inferred F0 curve, snapping pitches toward the nearest chromatic note. Particularly useful for singing voice conversions where in-tune output is important.
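
Snapping toward the nearest chromatic note can be sketched as follows. The A4 = 440 Hz reference, the linear blend controlled by the strength argument (0.0 to 1.0), and the function name are all assumptions of this sketch rather than details taken from Applio.

```python
import math

A4_HZ = 440.0  # assumed tuning reference


def snap_to_chromatic(f0_hz, strength=1.0):
    """Pull one F0 value (Hz) toward the nearest equal-tempered note.

    strength = 1.0 snaps fully; strength = 0.0 leaves the pitch unchanged.
    Unvoiced frames (f0 <= 0) are returned as 0.0.
    """
    if f0_hz <= 0:
        return 0.0
    # Convert to a fractional MIDI note number, round to the nearest note,
    # then convert that note back to Hz.
    midi = 69.0 + 12.0 * math.log2(f0_hz / A4_HZ)
    nearest_hz = A4_HZ * 2.0 ** ((round(midi) - 69.0) / 12.0)
    return f0_hz + strength * (nearest_hz - f0_hz)
```
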
Controls how aggressively autotune snaps pitches to the chromatic grid on a scale of 0.0 to 1.0. A value of 1.0 gives full snapping; lower values allow more natural pitch variation to pass through.
Runs a noise-reduction pass on the output audio using a noisereduce-based algorithm. Recommended for speech conversions where background noise in the input may bleed through.
Controls the intensity of the noise-reduction pass on a scale of 0.0 to 1.0. Higher values clean more aggressively but may compress the audio and reduce naturalness.
The container format for the output file. Choices: WAV, MP3, FLAC, OGG, M4A.
The speaker-embedding model used to encode the input audio before conversion. Choices: contentvec, spin, spin-v2, chinese-hubert-base, japanese-hubert-base, korean-hubert-base, custom. The default contentvec works well for most voices. Language-specific Hubert models may give better results when training or inferring voices in those languages.
Enables formant shifting on the input audio before conversion. Formant shifting alters the resonant frequencies of the vocal tract and is especially useful for male-to-female or female-to-male conversions where timbre differences are pronounced.
The quefrency (frequency) parameter for formant shifting. Values above 1.0 shift formants upward; values below 1.0 shift them downward. The slider range in the UI is 0.0 to 16.0.
The timbre parameter for formant shifting, which controls the spectral envelope shape. Adjust alongside formant_qfrency for natural-sounding formant adjustments.
Master switch that enables the post-processing effects chain. Must be true for any individual effect (reverb, chorus, distortion, etc.) to be applied. Each effect is still disabled by default and must be individually enabled even when post_process is true.
Speaker ID for multi-speaker models. Most community models are single-speaker (ID 0), but models trained with multiple speakers expose additional IDs here.
CLI — Single File
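
A single-file run can be scripted by assembling the command line in Python. This is a sketch under assumptions: the infer subcommand and the flag names below mirror the parameter names used on this page and should be checked against the help output of core.py before use. build_infer_command is a hypothetical helper, not part of Applio.

```python
import sys


def build_infer_command(input_path, output_path, pth_path, index_path,
                        pitch=0, f0_method="rmvpe"):
    """Assemble argv for one conversion.

    Assumption: subcommand and flag names are inferred from the parameter
    names on this page; verify them against core.py before relying on them.
    """
    return [
        sys.executable, "core.py", "infer",
        "--input_path", str(input_path),
        "--output_path", str(output_path),
        "--pth_path", str(pth_path),
        "--index_path", str(index_path),
        "--pitch", str(pitch),
        "--f0_method", f0_method,
    ]
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the Applio repository root.
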
CLI — Batch Inference
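
A batch run follows the same pattern with folders in place of files. As before, this is a sketch under assumptions: the batch_infer subcommand and the --input_folder and --output_folder flags are inferred from the terminology on this page and should be verified against core.py. build_batch_infer_command is a hypothetical helper.

```python
import sys


def build_batch_infer_command(input_folder, output_folder, pth_path,
                              index_path, f0_method="rmvpe"):
    """Assemble argv for converting every compatible file in a folder.

    Assumption: subcommand and flag names are not confirmed by this page.
    """
    return [
        sys.executable, "core.py", "batch_infer",
        "--input_folder", str(input_folder),
        "--output_folder", str(output_folder),
        "--pth_path", str(pth_path),
        "--index_path", str(index_path),
        "--f0_method", f0_method,
    ]
```
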
Python API
You can call inference directly from Python using run_infer_script from core.py:
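
The sketch below shows one way to prepare and pass arguments. The keyword names are assumed to mirror the parameters documented on this page, and the real run_infer_script signature may require additional arguments, so check core.py before running it. basic_infer_kwargs and convert are hypothetical wrappers.

```python
def basic_infer_kwargs(input_path, output_path, pth_path, index_path):
    """Keyword arguments for a plain conversion with the documented defaults.

    Assumption: keyword names follow this page's parameter names; the real
    function may expect more (or differently named) arguments.
    """
    return {
        "pitch": 0,
        "f0_method": "rmvpe",
        "input_path": input_path,
        "output_path": output_path,
        "pth_path": pth_path,
        "index_path": index_path,
        "export_format": "WAV",
        "embedder_model": "contentvec",
    }


def convert(**kwargs):
    """Call Applio's inference entry point with the given keyword arguments."""
    # Imported lazily so this module can be loaded outside the Applio repo.
    from core import run_infer_script
    return run_infer_script(**kwargs)
```
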
Model File Locations
Applio expects trained models to live under logs/<model_name>/. A typical model directory contains the model's .pth file and its companion .index file.
Applio scans for .pth and .index files under the logs/ tree and populates the dropdowns accordingly. Files prefixed with G_ or D_ (generator/discriminator checkpoints from training) are automatically excluded from the model list.
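
The scan described above can be approximated with a short directory walk. The function name list_models is hypothetical, and the exact filtering Applio applies beyond the G_ and D_ prefixes is not covered here.

```python
import os


def list_models(logs_root="logs"):
    """Return (pth_files, index_files) found anywhere under logs_root.

    Training checkpoints whose names start with G_ or D_ are skipped,
    matching the exclusion rule described above.
    """
    pth_files, index_files = [], []
    for dirpath, _dirnames, filenames in os.walk(logs_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if name.endswith(".pth") and not name.startswith(("G_", "D_")):
                pth_files.append(path)
            elif name.endswith(".index"):
                index_files.append(path)
    return sorted(pth_files), sorted(index_files)
```
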