Train a Custom Voice Model in Applio

Training a custom voice model in Applio lets you build a personalised RVC model from your own audio recordings. The process follows a fixed four-stage pipeline: raw audio is preprocessed and normalised, pitch features and speaker embeddings are extracted, the neural network is trained starting from pretrained weights, and finally a FAISS index file is generated so that inference can reference the learned voice characteristics. Each stage is exposed both through the Gradio training tab and through core.py sub-commands, so the pipeline is easy to automate or call from scripts.

Dataset Requirements

Before training, prepare a dataset of clean, single-speaker audio recordings. The recommended minimum is 10 minutes of audio, though 20–30 minutes generally produces noticeably better results. Supported formats include wav, mp3, flac, ogg, opus, m4a, mp4, aac, alac, wma, aiff, webm, and ac3. Key quality guidelines:

Record or source audio at the same sample rate you plan to train at (32000, 40000, or 48000 Hz).
Use a single speaker only — multi-speaker audio degrades training quality significantly.
Avoid heavy reverb, music, or background noise; use clean, close-mic speech if possible.
Remove silent or near-silent sections from the dataset before training.

Place your dataset files inside assets/datasets/<dataset_name>/ so the training tab can discover them automatically via the dataset browser.
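Before starting a run, it can help to confirm that the dataset folder only contains files Applio can read. The sketch below is a minimal, standard-library check, not part of Applio: the helper names find_dataset_files and unsupported_files are hypothetical, and the extension set is copied from the supported-format list above (the exact extension spelling for each format is an assumption).

```python
from pathlib import Path

# Extensions taken from the supported-format list in this section.
SUPPORTED_EXTENSIONS = {
    ".wav", ".mp3", ".flac", ".ogg", ".opus", ".m4a", ".mp4",
    ".aac", ".alac", ".wma", ".aiff", ".webm", ".ac3",
}


def find_dataset_files(dataset_dir):
    """Return the supported audio files directly inside dataset_dir, sorted."""
    root = Path(dataset_dir)
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


def unsupported_files(dataset_dir):
    """Return files whose extension is not in the supported list."""
    root = Path(dataset_dir)
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() not in SUPPORTED_EXTENSIONS
    )
```

Calling find_dataset_files("assets/datasets/MyDataset") lists what would be picked up, and a non-empty result from unsupported_files points at stray files worth removing first.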

The Four-Stage Pipeline

Preprocess

Applio reads your raw audio files, slices them into short segments, applies optional noise reduction and effects, resamples to the target rate, and saves two copies of each segment: one at the training sample rate and one downsampled to 16 kHz for feature extraction. The output lands in logs/<model_name>/sliced_audios/ and logs/<model_name>/sliced_audios_16k/.

Extract Features

Using the preprocessed 16 kHz audio, Applio extracts two types of features for every segment: an F0 pitch contour (using the chosen algorithm) and a speaker-embedding vector (using the chosen embedder model, e.g. ContentVec). These features are saved to logs/<model_name>/ and serve as the model's inputs during training.

Train

The RVC generator and discriminator networks are trained using the extracted features and the pretrained HiFi-GAN vocoder weights. Checkpoints are saved every save_every_epoch epochs. The optional overtraining detector monitors validation loss and can stop training early to prevent quality degradation.

Generate Index

After training completes, a FAISS (or KMeans) index is built from the extracted feature vectors. This index is used at inference time to look up the nearest training features, improving voice fidelity. The index file is saved as logs/<model_name>/<model_name>.index.

Stage 1 — Preprocess Parameters

model_name (str, required)

The name for your model. Applio creates and uses logs/<model_name>/ as the working directory for all stages.

dataset_path (str, required)

Absolute or relative path to the folder containing your training audio files.

sample_rate (int, default: 40000)

Target sample rate for the model. Must be one of 32000, 40000, or 48000 Hz. Use 48000 for higher-fidelity output if your source audio supports it.

cpu_cores (int)

Number of CPU cores to use for parallel preprocessing. Defaults to the number of logical cores on your machine.

cut_preprocess (str, default: "Automatic")

Controls how audio is segmented before preprocessing. Automatic uses a silence-based slicer (Slicer) with a -42 dB threshold, 1500 ms minimum length, and 500 ms max silence. Other modes rely on manual chunk lengths (see chunk_len and overlap_len).

process_effects (bool, default: false)

When enabled, applies a high-pass filter (48 Hz cutoff, 5th-order Butterworth) to each segment before saving, reducing low-frequency rumble.

noise_reduction (bool, default: false)

Applies spectral noise reduction (via noisereduce) to the audio during preprocessing. Useful when the source dataset has consistent background noise.

clean_strength (float, default: 0.7)

Intensity of noise reduction when noise_reduction is enabled. Range is 0.0 to 1.0. In the CLI example on this page, this value is passed as --noise_reduction_strength.

chunk_len (float)

Length in seconds for fixed-size chunk cutting (used when cut_preprocess is not Automatic).

overlap_len (float)

Overlap in seconds between adjacent fixed-size chunks, to prevent abrupt cuts.
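To make the interaction between chunk_len and overlap_len concrete, here is a minimal sketch of overlapping fixed-size chunking. It is an illustration under stated assumptions, not Applio's actual cutting code: the function name chunk_bounds is hypothetical, the stride is assumed to be chunk_len minus overlap_len, and the last chunk is simply allowed to be shorter.

```python
def chunk_bounds(total_samples, sample_rate, chunk_len=3.0, overlap_len=0.3):
    """Return (start, end) sample indices for overlapping fixed-size chunks."""
    if chunk_len <= 0 or overlap_len < 0 or overlap_len >= chunk_len:
        raise ValueError("need chunk_len > 0 and 0 <= overlap_len < chunk_len")

    size = max(1, round(chunk_len * sample_rate))
    # Assumed stride: each chunk starts (chunk_len - overlap_len) seconds
    # after the previous one, so neighbours share overlap_len seconds.
    step = max(1, round((chunk_len - overlap_len) * sample_rate))

    bounds = []
    start = 0
    while start < total_samples:
        end = min(start + size, total_samples)
        bounds.append((start, end))
        if end == total_samples:
            break  # the final (possibly shorter) chunk reached the end
        start += step
    return bounds
```

With a toy sample rate of 10 Hz, chunk_bounds(70, 10, 3.0, 1.0) yields three chunks of 30 samples whose neighbours share 10 samples, which is the overlap that prevents abrupt cuts.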

normalization_mode (str, default: "none")

Audio normalisation applied to each segment. none skips normalisation; post applies amplitude normalisation (targeting 0.675 of max amplitude with a 0.75 blend factor) after slicing.
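The arithmetic behind the post mode can be sketched as follows. This is one reading of the description above, assuming the blend works as in the original RVC preprocessing code: the segment is rescaled so its peak maps to 0.675, and a (1 - 0.75) share of the untouched signal is added back. The function name normalize_segment is hypothetical, and Applio's real implementation may differ in details.

```python
def normalize_segment(samples, target_peak=0.675, blend=0.75):
    """Blend a peak-normalised copy of the segment with the original signal.

    Assumed formula: out = s / peak * target_peak + (1 - blend) * s
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # empty or fully silent: nothing to scale
    return [s / peak * target_peak + (1.0 - blend) * s for s in samples]
```

For a segment whose loudest sample is 0.5, that sample becomes 0.675 + 0.25 × 0.5 = 0.8, so quiet segments are lifted while some of the original level differences are kept.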

Stage 2 — Extract Parameters

f0_method (str, default: "rmvpe")

Pitch extraction algorithm for computing F0 features. Choices: crepe, crepe-tiny, rmvpe, fcpe. This should match the method used for inference later; rmvpe is recommended for general use.

gpu (int, default: 0)

GPU index to use for feature extraction. Use 0 for the first (or only) GPU.

embedder_model (str, default: "contentvec")

Speaker-embedding model for encoding audio during feature extraction. Must match what you plan to use at inference. Choices: contentvec, spin, spin-v2, chinese-hubert-base, japanese-hubert-base, korean-hubert-base, custom.

embedder_model_custom (str)

Path to a custom embedder folder (required only when embedder_model is custom).

include_mutes (int, default: 2)

Number of silent/mute segments to include in the training data. These help the model learn to handle silence naturally.

Stage 3 — Train Parameters

save_every_epoch (int, default: 10)

How often (in epochs) to save a training checkpoint. Lower values give you more recovery points but use more disk space.

save_only_latest (bool, default: true)

When enabled, only the most recent checkpoint is kept, deleting earlier ones to save disk space.

save_every_weights (bool, default: true)

When enabled, a standalone .pth weights file (suitable for inference) is exported at each checkpoint save, not just the raw training checkpoint.

total_epoch (int, default: 200)

Total number of training epochs. Typical ranges are 200–800 depending on dataset size and quality. More data generally allows more epochs before overtraining.

batch_size (int, default: 4)

Number of audio segments processed per training step. Larger batch sizes speed up training but require more VRAM. Reduce to 2 on GPUs with limited memory.

overtraining_detector (bool, default: false)

Enables automatic detection of overtraining. When the validation loss has not improved for overtraining_threshold consecutive epochs, training is stopped early.

overtraining_threshold (int, default: 50)

Number of epochs without improvement in validation loss before the overtraining detector triggers early stopping.

pretrained (bool, default: true)

When enabled, initialises training from the official pretrained HiFi-GAN generator (G) and discriminator (D) weights. Training from pretrained weights converges much faster and usually produces better results than training from scratch.

cleanup (bool, default: false)

When enabled, removes intermediate training files (feature tensors, sliced audio) after training completes to free disk space.

cache_data_in_gpu (bool, default: false)

Caches training tensors in GPU memory for faster iteration. Requires sufficient VRAM (typically an extra 2–4 GB depending on dataset size).

vocoder (str, default: "HiFi-GAN")

Vocoder used for audio synthesis during training. Choices: HiFi-GAN, RefineGAN. HiFi-GAN is the default and is compatible with all clients. RefineGAN can offer higher audio quality but requires the Applio client and matching pretrained weights.

Stage 4 — Index Parameters

index_algorithm (str, default: "Auto")

Algorithm used for building the FAISS index. Choices:

Auto — selects Faiss or KMeans automatically based on dataset size
Faiss — builds a flat FAISS index; accurate but slower on large datasets
KMeans — uses KMeans clustering before indexing; more scalable for very large feature sets
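Conceptually, a flat index answers each query by comparing it against every stored training vector and returning the closest ones. The sketch below shows that lookup in pure Python as a stand-in for FAISS; it does not use the faiss library, and the function name nearest_features and the toy two-dimensional vectors are illustrative assumptions only.

```python
import math


def nearest_features(query, training_features, k=1):
    """Return the k training vectors closest to query by Euclidean distance.

    Exhaustive search: every stored vector is compared with the query.
    """
    ranked = sorted(training_features, key=lambda v: math.dist(query, v))
    return ranked[:k]


# Toy example with 2-D vectors; real feature vectors are much larger.
stored = [[1.0, 1.0], [0.1, 0.0], [5.0, 5.0]]
closest = nearest_features([0.0, 0.0], stored, k=2)
```

The cost of this exhaustive comparison grows with the number of stored vectors, which is why a clustering step such as KMeans becomes attractive for very large feature sets.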

CLI — Full Training Pipeline

# Stage 1: Preprocess
python core.py preprocess \
  --model_name MyModel \
  --dataset_path assets/datasets/MyDataset \
  --sample_rate 40000 \
  --cut_preprocess Automatic \
  --process_effects False \
  --noise_reduction False \
  --noise_reduction_strength 0.7 \
  --chunk_len 3.0 \
  --overlap_len 0.3 \
  --normalization_mode none

# Stage 2: Extract Features
python core.py extract \
  --model_name MyModel \
  --f0_method rmvpe \
  --gpu 0 \
  --sample_rate 40000 \
  --embedder_model contentvec \
  --include_mutes 2

# Stage 3: Train (also runs index generation automatically)
python core.py train \
  --model_name MyModel \
  --save_every_epoch 10 \
  --save_only_latest True \
  --save_every_weights True \
  --total_epoch 200 \
  --sample_rate 40000 \
  --batch_size 4 \
  --gpu 0 \
  --overtraining_detector False \
  --overtraining_threshold 50 \
  --pretrained True \
  --cleanup False \
  --index_algorithm Auto \
  --cache_data_in_gpu False \
  --vocoder HiFi-GAN

The train sub-command automatically calls run_index_script after training completes, so you do not need to run the index command separately unless you want to regenerate the index with different settings.
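The three commands above can also be driven from a script. The following is a minimal sketch, assuming it is run from the directory that contains core.py and that core.py accepts exactly the flags shown in the example; the helper names build_command, pipeline_commands, and run_pipeline are hypothetical and not part of Applio.

```python
import subprocess
import sys


def build_command(stage, options):
    """Turn a stage name and an options dict into an argument list."""
    command = [sys.executable, "core.py", stage]
    for flag, value in options.items():
        command += [f"--{flag}", str(value)]
    return command


def pipeline_commands(model_name, dataset_path, sample_rate=40000):
    """Return the preprocess, extract, and train commands in execution order."""
    preprocess = {
        "model_name": model_name, "dataset_path": dataset_path,
        "sample_rate": sample_rate, "cut_preprocess": "Automatic",
        "process_effects": False, "noise_reduction": False,
        "noise_reduction_strength": 0.7, "chunk_len": 3.0,
        "overlap_len": 0.3, "normalization_mode": "none",
    }
    extract = {
        "model_name": model_name, "f0_method": "rmvpe", "gpu": 0,
        "sample_rate": sample_rate, "embedder_model": "contentvec",
        "include_mutes": 2,
    }
    train = {
        "model_name": model_name, "save_every_epoch": 10,
        "save_only_latest": True, "save_every_weights": True,
        "total_epoch": 200, "sample_rate": sample_rate, "batch_size": 4,
        "gpu": 0, "overtraining_detector": False,
        "overtraining_threshold": 50, "pretrained": True, "cleanup": False,
        "index_algorithm": "Auto", "cache_data_in_gpu": False,
        "vocoder": "HiFi-GAN",
    }
    return [
        build_command("preprocess", preprocess),
        build_command("extract", extract),
        build_command("train", train),
    ]


def run_pipeline(commands, dry_run=False):
    """Run each stage in order; with dry_run=True, only print the commands."""
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later stages never run on top of a failed one.
            subprocess.run(command, check=True)
```

A call such as run_pipeline(pipeline_commands("MyModel", "assets/datasets/MyDataset"), dry_run=True) prints the commands for review before anything is executed.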

Overtraining Detector

The overtraining detector monitors a running validation loss during training. When overtraining_detector is set to True, Applio counts how many consecutive epochs pass without the loss reaching a new minimum. Once this stall exceeds overtraining_threshold epochs, training stops automatically and the best checkpoint is preserved. This is especially useful for smaller datasets (under 10 minutes), where models can overfit quickly.

Overtraining on a small dataset produces a model that mimics the training audio very literally — it may struggle to generalise to new speaking styles or pitches. Aim for at least 10 minutes of varied, expressive recordings to reduce this risk.
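The stopping rule described in this section is a patience-style early stop. The sketch below illustrates the idea on a plain list of per-epoch loss values; it is not Applio's detector. The function name stopping_epoch is hypothetical, the losses are made up, and stopping once the stall count exceeds the threshold (rather than merely reaching it) is an assumption drawn from the wording above.

```python
def stopping_epoch(losses, threshold=50):
    """Return the 1-based epoch at which training would stop, or None.

    A counter tracks consecutive epochs without a new minimum loss and
    resets whenever the loss improves on the best value seen so far.
    """
    best = float("inf")
    stalled = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            stalled = 0
        else:
            stalled += 1
            if stalled > threshold:  # assumption: stop once the stall exceeds it
                return epoch
    return None


# Made-up losses: the minimum is at epoch 2, then three epochs without improvement.
example = stopping_epoch([1.0, 0.9, 0.95, 0.96, 0.97, 0.98], threshold=2)
```

A lower threshold stops sooner after the loss plateaus, while a higher one tolerates longer noisy stretches before giving up.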

TensorBoard Monitoring

During training, Applio writes TensorBoard logs to logs/<model_name>/. You can launch TensorBoard via:

python core.py tensorboard

This opens a local TensorBoard server where you can track generator/discriminator losses, mel-spectrogram comparisons, and audio samples across epochs.

Output Files

After a successful training run, you will find the following key files:

logs/
└── MyModel/
    ├── MyModel.pth           ← final inference-ready model weights
    ├── MyModel.index         ← FAISS feature index for inference
    ├── G_<epoch>.pth         ← generator training checkpoint(s)
    └── D_<epoch>.pth         ← discriminator training checkpoint(s)

The MyModel.pth and MyModel.index files are the two files needed for inference. Training checkpoints (G_ and D_ prefixed files) are only needed if you want to resume training from a specific epoch.
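A script that hands the model over to inference can first confirm that both files exist. This is a small sketch based only on the layout shown above; the helper names inference_files and missing_inference_files are hypothetical, and the logs directory is assumed to be relative to the current working directory.

```python
from pathlib import Path


def inference_files(model_name, logs_dir="logs"):
    """Return the expected paths of the two files needed for inference."""
    model_dir = Path(logs_dir) / model_name
    return {
        "weights": model_dir / f"{model_name}.pth",
        "index": model_dir / f"{model_name}.index",
    }


def missing_inference_files(model_name, logs_dir="logs"):
    """Return the labels ('weights', 'index') of files that do not exist."""
    expected = inference_files(model_name, logs_dir)
    return [label for label, path in expected.items() if not path.is_file()]
```

An empty list from missing_inference_files("MyModel") means both MyModel.pth and MyModel.index are in place.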
