Training a custom voice model in Applio lets you build a personalised RVC model from scratch using your own audio recordings. The process follows a fixed four-stage pipeline: your raw audio is preprocessed and normalised, speaker embeddings and pitch features are extracted, the neural network is trained against pretrained weights, and finally a FAISS index file is generated so that converted audio can reference learned voice characteristics at inference time. Each stage is exposed both through the Gradio training tab and through core.py sub-commands, making it easy to automate or integrate into scripts.
Dataset Requirements
Before training, prepare a dataset of clean, single-speaker audio recordings. The recommended minimum is 10 minutes of audio, though 20–30 minutes generally produces noticeably better results. Supported formats include wav, mp3, flac, ogg, opus, m4a, mp4, aac, alac, wma, aiff, webm, and ac3. Key quality guidelines:
- Record or source audio at the same sample rate you plan to train at (32000, 40000, or 48000 Hz).
- Use a single speaker only; multi-speaker audio degrades training quality significantly.
- Avoid heavy reverb, music, or background noise; use clean, close-mic speech if possible.
- Remove silent or near-silent sections from the dataset before training.
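The format and sample-rate guidelines above can be checked before training starts. The following is a minimal sketch using only the Python standard library. `SUPPORTED_EXTENSIONS` mirrors the format list above; the helper names `find_supported_audio` and `wav_sample_rate` are hypothetical and are not part of Applio.

```python
import wave
from pathlib import Path

# Extensions copied from the supported-format list above.
SUPPORTED_EXTENSIONS = {
    ".wav", ".mp3", ".flac", ".ogg", ".opus", ".m4a", ".mp4",
    ".aac", ".alac", ".wma", ".aiff", ".webm", ".ac3",
}
# Training sample rates named in the guidelines above.
TRAINING_SAMPLE_RATES = (32000, 40000, 48000)


def find_supported_audio(folder):
    """Return the files in `folder` whose extension is in the supported list."""
    return sorted(
        path for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )


def wav_sample_rate(path):
    """Read the sample rate from a PCM WAV header (other formats need a decoder)."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getframerate()
```

A file whose `wav_sample_rate` is not in `TRAINING_SAMPLE_RATES` is a candidate for resampling or for choosing a different training rate.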
The Four-Stage Pipeline
Preprocess
Applio reads your raw audio files, slices them into short segments, applies optional noise reduction and effects, resamples to the target rate, and saves two copies of each segment: one at the training sample rate and one downsampled to 16 kHz for feature extraction. The output lands in logs/<model_name>/sliced_audios/ and logs/<model_name>/sliced_audios_16k/.
Extract Features
Using the preprocessed 16 kHz audio, Applio extracts two types of features for every segment: an F0 pitch contour (using the chosen algorithm) and a speaker-embedding vector (using the chosen embedder model, e.g. ContentVec). These features are saved to logs/<model_name>/ and used as training targets.
Train
The RVC generator and discriminator networks are trained using the extracted features and the pretrained HiFi-GAN vocoder weights. Checkpoints are saved every save_every_epoch epochs. The optional overtraining detector monitors validation loss and can stop training early to prevent quality degradation.
Stage 1: Preprocess Parameters
The name for your model. Applio creates and uses logs/<model_name>/ as the working directory for all stages.
Absolute or relative path to the folder containing your training audio files.
Target sample rate for the model. Must be one of 32000, 40000, or 48000 Hz. Use 48000 for higher-fidelity output if your source audio supports it.
Number of CPU cores to use for parallel preprocessing. Defaults to the number of logical cores on your machine.
Controls how audio is segmented before preprocessing. Automatic uses a silence-based slicer (Slicer) with a -42 dB threshold, 1500 ms minimum length, and 500 ms max silence. Other modes allow manual chunk lengths.
When enabled, applies a high-pass filter (48 Hz cutoff, 5th-order Butterworth) to each segment before saving, reducing low-frequency rumble.
Applies spectral noise reduction (via noisereduce) to the audio during preprocessing. Useful when the source dataset has consistent background noise.
Intensity of noise reduction when noise_reduction is enabled. Range is 0.0 to 1.0.
Length in seconds for fixed-size chunk cutting (used when cut_preprocess is not Automatic).
Overlap in seconds between adjacent fixed-size chunks, to prevent abrupt cuts.
Audio normalisation applied to each segment. none skips normalisation; post applies amplitude normalisation (targeting 0.675 of max amplitude with a 0.75 blend factor) after slicing.
Stage 2: Extract Parameters
Pitch extraction algorithm for computing F0 features. Choices: crepe, crepe-tiny, rmvpe, fcpe. This should match the method used for inference later; rmvpe is recommended for general use.
GPU index to use for feature extraction. Use 0 for the first (or only) GPU.
Speaker-embedding model for encoding audio during feature extraction. Must match what you plan to use at inference. Choices: contentvec, spin, spin-v2, chinese-hubert-base, japanese-hubert-base, korean-hubert-base, custom.
Path to a custom embedder folder (required only when embedder_model is custom).
Number of silent/mute segments to include in the training data. These help the model learn to handle silence naturally.
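The constraints described for the extract stage (a fixed set of pitch methods, a fixed set of embedders, and a folder path that is required only for a custom embedder) can be checked up front. This is a minimal sketch, not Applio code: the function name `validate_extract_settings` and the argument names `f0_method` and `embedder_model_custom` are hypothetical, chosen for illustration; only `embedder_model` and the listed choices come from this page.

```python
# Choices copied from the extract-stage descriptions above.
F0_METHODS = ("crepe", "crepe-tiny", "rmvpe", "fcpe")
EMBEDDER_MODELS = (
    "contentvec", "spin", "spin-v2", "chinese-hubert-base",
    "japanese-hubert-base", "korean-hubert-base", "custom",
)


def validate_extract_settings(f0_method, embedder_model, embedder_model_custom=None):
    """Raise ValueError if the settings contradict the choices listed above."""
    if f0_method not in F0_METHODS:
        raise ValueError(f"unknown pitch method: {f0_method!r}")
    if embedder_model not in EMBEDDER_MODELS:
        raise ValueError(f"unknown embedder model: {embedder_model!r}")
    # The custom folder is required only when the embedder is "custom".
    if embedder_model == "custom" and not embedder_model_custom:
        raise ValueError("a custom embedder needs a folder path")
    return {
        "f0_method": f0_method,
        "embedder_model": embedder_model,
        "embedder_model_custom": embedder_model_custom,
    }


settings = validate_extract_settings("rmvpe", "contentvec")
print(settings["f0_method"], settings["embedder_model"])
```

Keeping the returned dictionary alongside the model is one way to remember which pitch method and embedder to select again at inference time.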
Stage 3: Train Parameters
How often (in epochs) to save a training checkpoint. Lower values give you more recovery points but use more disk space.
When enabled, only the most recent checkpoint is kept, deleting earlier ones to save disk space.
When enabled, a standalone .pth weights file (suitable for inference) is exported at each checkpoint save, not just the raw training checkpoint.
Total number of training epochs. Typical ranges are 200–800 depending on dataset size and quality. More data generally allows more epochs before overtraining.
Number of audio segments processed per training step. Larger batch sizes speed up training but require more VRAM. Reduce to 2 on GPUs with limited memory.
Enables automatic detection of overtraining. When the validation loss stops improving for overtraining_threshold consecutive evaluations, training is stopped early.
Number of epochs without improvement in validation loss before the overtraining detector triggers early stopping.
When enabled, initialises training from official pretrained HiFi-GAN generator (G) and discriminator (D) weights. Training from pretrained weights converges much faster and produces better results than training from scratch.
When enabled, removes intermediate training files (feature tensors, sliced audio) after training completes to free disk space.
Caches training tensors in GPU memory for faster iteration. Requires sufficient VRAM (typically an extra 2–4 GB depending on dataset size).
Vocoder used for audio synthesis during training. Choices: HiFi-GAN, RefineGAN. HiFi-GAN is the default and is compatible with all clients. RefineGAN offers superior audio quality but requires the Applio client and matching pretrained weights.
Stage 4: Index Parameters
Algorithm used for building the FAISS index. Choices:
- Auto: selects Faiss or KMeans automatically based on dataset size
- Faiss: builds a flat FAISS index; accurate but slower on large datasets
- KMeans: uses KMeans clustering before indexing; more scalable for very large feature sets
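To make the trade-off between these options concrete, the sketch below shows what an exhaustive ("flat") nearest-neighbour lookup does conceptually: every stored feature vector is compared against the query, which is exact but scales linearly with the number of stored vectors. This is plain Python for illustration only. It does not use the FAISS library or Applio's indexing code, and `flat_search` is a hypothetical name.

```python
import math


def flat_search(stored_vectors, query, k=1):
    """Return indices of the k stored vectors closest to `query` (Euclidean distance)."""
    # Exhaustive scan: one distance computation per stored vector.
    distances = [
        (math.dist(vector, query), index)
        for index, vector in enumerate(stored_vectors)
    ]
    distances.sort()
    return [index for _, index in distances[:k]]


features = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
nearest = flat_search(features, (0.9, 1.2), k=2)
print(nearest)  # → [1, 0]
```

Clustering the vectors first, as the KMeans option does, reduces how many vectors end up in the index, which is why it scales better for very large feature sets.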
CLI: Full Training Pipeline
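A training run can be scripted by assembling the command in Python and handing it to `subprocess`. The sketch below only builds the argument list and does not execute anything. The `train` sub-command and the parameter names come from this page, but the `--name value` flag spelling, the set of flags shown, and the example values are assumptions; verify them against the help output of core.py before running the command.

```python
import sys


def build_train_command(model_name, save_every_epoch,
                        overtraining_detector=False,
                        overtraining_threshold=None):
    """Build an argument list for the train sub-command (flag spelling assumed)."""
    # ASSUMPTION: each parameter maps to a `--<parameter_name> <value>` flag.
    command = [
        sys.executable, "core.py", "train",
        "--model_name", model_name,
        "--save_every_epoch", str(save_every_epoch),
        "--overtraining_detector", str(overtraining_detector),
    ]
    # The threshold only matters when the detector is switched on.
    if overtraining_detector and overtraining_threshold is not None:
        command += ["--overtraining_threshold", str(overtraining_threshold)]
    return command


# Example values are illustrative, not recommended defaults.
command = build_train_command(
    "MyModel", save_every_epoch=10,
    overtraining_detector=True, overtraining_threshold=50,
)
print(" ".join(command[1:]))
# To run it for real: import subprocess; subprocess.run(command, check=True)
```

Building the list separately from executing it makes it easy to log or review the exact command before a long training job starts.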
The train sub-command automatically calls run_index_script after training completes, so you do not need to run the index command separately unless you want to regenerate the index with different settings.
Overtraining Detector
The overtraining detector monitors a running validation loss during training. When overtraining_detector is set to True, Applio tracks how many consecutive evaluation windows pass without the loss reaching a new minimum. Once this stall exceeds overtraining_threshold epochs, training stops automatically and the best checkpoint is preserved. This is especially useful for smaller datasets (under 10 minutes) where models can overfit quickly.
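The stall-counting behaviour described above is a form of patience-based early stopping. The following is a minimal sketch of that idea, not Applio's implementation: the class name `OvertrainingDetector` and its attributes are hypothetical, and stopping exactly when the stall count reaches the threshold (rather than one evaluation later) is an assumption.

```python
class OvertrainingDetector:
    """Patience-style early stopping on a stream of validation losses."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.best_loss = float("inf")
        self.best_epoch = None
        self.stalled = 0  # consecutive evaluations without a new minimum

    def update(self, epoch, loss):
        """Record one evaluation; return True when training should stop."""
        if loss < self.best_loss:
            self.best_loss = loss
            self.best_epoch = epoch
            self.stalled = 0
        else:
            self.stalled += 1
        # ASSUMPTION: stop once the stall count reaches the threshold.
        return self.stalled >= self.threshold


detector = OvertrainingDetector(threshold=3)
losses = [0.90, 0.70, 0.65, 0.66, 0.68, 0.67, 0.64]
for epoch, loss in enumerate(losses, start=1):
    if detector.update(epoch, loss):
        print(f"stop at epoch {epoch}, best was epoch {detector.best_epoch}")
        break
```

In this run the loss last improves at epoch 3, so three stalled evaluations later the loop stops at epoch 6 and never sees the lower value at epoch 7. That illustrates the trade-off: a small threshold stops sooner but can miss a late improvement.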
TensorBoard Monitoring
During training, Applio writes TensorBoard logs to logs/<model_name>/. You can launch TensorBoard by pointing it at that directory.
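One way to launch TensorBoard from a script is to build the command for its standard `--logdir` option. This sketch assumes TensorBoard is installed and on your PATH and that you run it from the Applio root so that `logs/` resolves correctly; `tensorboard_command` is a hypothetical helper, not part of Applio.

```python
from pathlib import Path


def tensorboard_command(model_name, logs_root="logs"):
    """Build the TensorBoard command for one model's log directory."""
    log_dir = (Path(logs_root) / model_name).as_posix()
    return ["tensorboard", "--logdir", log_dir]


tb_command = tensorboard_command("MyModel")
print(" ".join(tb_command))  # → tensorboard --logdir logs/MyModel
# To start it for real: import subprocess; subprocess.run(tb_command, check=True)
```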
Output Files
After a successful training run, you will find the key output files for your model. The MyModel.pth and MyModel.index files are the two files needed for inference. Training checkpoints (G_ and D_ prefixed files) are only needed if you want to resume training from a specific epoch.
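The distinction between inference files and training checkpoints can be applied mechanically when tidying up a model folder. This is a small sketch based only on the naming rules stated above; `classify_output_file` is a hypothetical helper, and the checkpoint file names in the example are made up for illustration.

```python
def classify_output_file(filename):
    """Label a file name as 'checkpoint', 'inference', or 'other'."""
    # Check the checkpoint prefixes first, in case those files also end in .pth.
    if filename.startswith(("G_", "D_")):
        return "checkpoint"
    if filename.endswith((".pth", ".index")):
        return "inference"
    return "other"


# Illustrative names only; actual checkpoint names may differ.
files = ["MyModel.pth", "MyModel.index", "G_200.pth", "D_200.pth", "config.json"]
labels = {name: classify_output_file(name) for name in files}
for name, label in labels.items():
    print(f"{name}: {label}")
```

Files labelled `inference` are the ones to copy elsewhere for voice conversion; `checkpoint` files can be archived or deleted once you no longer plan to resume training.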