Training a custom RVC voice model in Applio is a four-stage pipeline. Each stage is exposed as a separate subcommand in core.py, which lets you inspect intermediate results, re-run individual stages without repeating earlier work, and parallelize or schedule stages independently. The stages must be run in order: preprocessing cleans and segments raw audio; extraction computes pitch curves and speaker embeddings; training optimizes the model weights; and index generation builds a FAISS index for retrieval-based conversion. The train subcommand automatically triggers index generation on successful completion, but you can also run index separately at any time.
Every command must use the same --model_name, and preprocess, extract, and train must also share one --sample_rate value (index takes no --sample_rate flag). Mixing these values across stages will cause training failures or poor-quality models.
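
One way to keep the shared values from drifting is to build every command from a single configuration object. The sketch below is illustrative only: `PipelineConfig` and `stage_command` are hypothetical helper names, not part of Applio, and the allowed sample rates are the ones listed for preprocess and train in the flag reference.

```python
from dataclasses import dataclass

# Rates listed for --sample_rate under preprocess and train.
ALLOWED_SAMPLE_RATES = (32000, 40000, 48000)


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical holder for the values every stage must share."""

    model_name: str
    sample_rate: int

    def __post_init__(self):
        if self.sample_rate not in ALLOWED_SAMPLE_RATES:
            raise ValueError(f"unsupported sample rate: {self.sample_rate}")

    def shared_args(self, stage):
        """Return the flags that must be identical across stages."""
        args = ["--model_name", self.model_name]
        # The index flag reference lists no --sample_rate, so skip it there.
        if stage != "index":
            args += ["--sample_rate", str(self.sample_rate)]
        return args


def stage_command(config, stage, *extra):
    """Build the argv list for one core.py subcommand."""
    return ["python", "core.py", stage, *config.shared_args(stage), *extra]


config = PipelineConfig(model_name="MyVoice", sample_rate=40000)
print(stage_command(config, "extract", "--f0_method", "rmvpe"))
print(stage_command(config, "index", "--index_algorithm", "Auto"))
```

Because the model name and sample rate live in one place, changing either one updates every stage at once.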

Complete pipeline

1. Preprocess the dataset

Slice, filter, and normalize your raw audio files so they are ready for feature extraction.
python core.py preprocess \
  --model_name MyVoice \
  --dataset_path /data/voice_samples \
  --sample_rate 40000 \
  --cpu_cores 4 \
  --cut_preprocess Automatic

2. Extract features

Compute F0 pitch curves and speaker embeddings from the preprocessed data.
python core.py extract \
  --model_name MyVoice \
  --f0_method rmvpe \
  --gpu 0 \
  --sample_rate 40000 \
  --embedder_model contentvec \
  --include_mutes 2

3. Train the model

Optimize model weights. Index generation runs automatically at the end.
python core.py train \
  --model_name MyVoice \
  --total_epoch 200 \
  --save_every_epoch 10 \
  --batch_size 8 \
  --gpu 0 \
  --sample_rate 40000 \
  --overtraining_detector True \
  --overtraining_threshold 50

4. (Optional) Regenerate the index

Manually re-run FAISS index generation, for example if you want to change the algorithm after training.
python core.py index \
  --model_name MyVoice \
  --index_algorithm Auto
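
Because each stage depends on the output of the one before it, a wrapper script should stop as soon as any stage fails. The following is a minimal sketch of that idea using Python's standard `subprocess` module; `run_stages` and its `runner` parameter are hypothetical names introduced for illustration, and the argument lists are copied from the steps above.

```python
import subprocess

# Stage commands copied from the steps above, as argv lists.
PIPELINE = [
    ["python", "core.py", "preprocess", "--model_name", "MyVoice",
     "--dataset_path", "/data/voice_samples", "--sample_rate", "40000",
     "--cpu_cores", "4", "--cut_preprocess", "Automatic"],
    ["python", "core.py", "extract", "--model_name", "MyVoice",
     "--f0_method", "rmvpe", "--gpu", "0", "--sample_rate", "40000",
     "--embedder_model", "contentvec", "--include_mutes", "2"],
    ["python", "core.py", "train", "--model_name", "MyVoice",
     "--total_epoch", "200", "--save_every_epoch", "10", "--batch_size", "8",
     "--gpu", "0", "--sample_rate", "40000",
     "--overtraining_detector", "True", "--overtraining_threshold", "50"],
]


def run_stages(commands, runner=subprocess.run):
    """Run each stage in order and return the names of the stages that ran.

    `runner` is injectable so the function can be exercised without Applio.
    """
    finished = []
    for argv in commands:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so a failed stage prevents every later stage from starting.
        runner(argv, check=True)
        finished.append(argv[2])
    return finished
```

Calling `run_stages(PIPELINE)` from the directory that contains core.py would run the three stages in sequence. The index step is omitted because train triggers it automatically.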

preprocess

The preprocess stage reads raw audio files from --dataset_path, slices them into manageable chunks, applies optional noise reduction and normalization, and writes the processed segments into logs/<model_name>/. The quality of this step directly affects the quality of the trained model.

Flags

--model_name
string
required
Name of the model. Determines the subdirectory under logs/ where preprocessed data is stored. Must match the names used in subsequent pipeline stages.
--dataset_path
string
required
Path to the directory containing raw training audio files. Applio accepts common formats (WAV, MP3, FLAC, etc.).
--sample_rate
integer
required
Target sample rate for all processed audio. Choices: 32000, 40000, 48000. All files will be resampled to this rate. Choose a rate that matches the vocoder you plan to use during training.
--cut_preprocess
string
default:"Automatic"
required
Method used to split audio into chunks before processing. Choices: Skip (no splitting), Simple (fixed-length splits), Automatic (silence-based splits). Automatic is recommended for most datasets.
--cpu_cores
integer
Number of CPU cores to use during preprocessing. Accepts 1 to 64. Defaults to using all available cores when not specified.
--process_effects
boolean
default:"False"
When set to True, disables all internal audio filters during preprocessing. Use only if you have already pre-processed your audio externally. Accepts True or False.
--noise_reduction
boolean
default:"False"
Apply spectral noise reduction to each audio segment during preprocessing. Useful for datasets recorded in noisy environments. Accepts True or False.
--noise_reduction_strength
float
default:"0.7"
Intensity of the noise reduction filter. Range: 0.0 to 1.0. Only active when --noise_reduction True.
--chunk_len
float
default:"3.0"
Target chunk length in seconds when splitting audio. Accepts values from 0.5 to 5.0 in steps of 0.5. Shorter chunks increase the total number of training samples; longer chunks preserve more context.
--overlap_len
float
default:"0.3"
Overlap between consecutive chunks in seconds. Choices: 0.0, 0.1, 0.2, 0.3, 0.4. A small overlap reduces boundary artifacts at the cost of slightly redundant data.
--normalization_mode
string
default:"none"
Audio normalization strategy. Choices: none (no normalization), pre (normalize before slicing), post (normalize each chunk after slicing).
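
The numeric ranges above can be checked before launching a long preprocessing run. This is a hedged sketch, not Applio code: `validate_preprocess` is a hypothetical helper, and its limits are taken directly from the flag descriptions in this section.

```python
# Values listed for --overlap_len in the flag reference.
OVERLAP_CHOICES = (0.0, 0.1, 0.2, 0.3, 0.4)


def validate_preprocess(chunk_len=3.0, overlap_len=0.3, noise_reduction_strength=0.7):
    """Return a list of problems; an empty list means every value is in range."""
    problems = []
    # chunk_len runs from 0.5 to 5.0 in steps of 0.5,
    # so twice the value must be a whole number.
    if not 0.5 <= chunk_len <= 5.0 or (chunk_len * 2) % 1 != 0:
        problems.append(f"--chunk_len {chunk_len} is not in 0.5..5.0 (step 0.5)")
    if overlap_len not in OVERLAP_CHOICES:
        problems.append(f"--overlap_len {overlap_len} is not one of {OVERLAP_CHOICES}")
    if not 0.0 <= noise_reduction_strength <= 1.0:
        problems.append(
            f"--noise_reduction_strength {noise_reduction_strength} is not in 0.0..1.0"
        )
    return problems


for problem in validate_preprocess(chunk_len=0.7, overlap_len=0.5):
    print(problem)
```

The defaults in the signature mirror the documented defaults, so calling the function with no arguments reports no problems.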

extract

The extract stage reads the preprocessed audio produced by preprocess and computes two types of features for every segment: F0 pitch curves (using the selected --f0_method) and speaker embeddings (using the selected --embedder_model). These features are written to logs/<model_name>/ and consumed directly by the train stage.

Flags

--model_name
string
required
Name of the model. Must match the name used during preprocessing.
--sample_rate
integer
required
Sample rate of the preprocessed data. Choices: 32000, 40000, 44100, 48000. Must match the value used during preprocessing.
--include_mutes
integer
default:"2"
required
Number of silent (mute) audio files to include in the training data. Range: 0 to 10. Including a small number of silent samples helps the model learn silence handling. Set to 0 to exclude silence entirely.
--f0_method
string
default:"rmvpe"
Pitch-extraction algorithm. Choices for extraction: crepe, crepe-tiny, rmvpe, fcpe. rmvpe is recommended for most voices; crepe may handle falsetto and high-pitched voices better.
--cpu_cores
integer
Number of CPU cores to use for feature extraction. Accepts 1 to 64. Optional; defaults to all available cores.
--gpu
string
default:"-"
GPU device index(es) to use for extraction (e.g., 0, 0-1). Pass - to use CPU only. GPU extraction is significantly faster for large datasets.
--embedder_model
string
default:"contentvec"
Speaker-embedding model. Choices: contentvec, spin, spin-v2, chinese-hubert-base, japanese-hubert-base, korean-hubert-base, custom. Language-specific HuBERT models can produce better embeddings for non-English datasets.
--embedder_model_custom
string
default:"None"
Path to a custom embedding model. Only used when --embedder_model custom.
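
Two of the flags above interact with other values: --gpu takes either - or a hyphen-separated list of device indices, and --embedder_model custom only makes sense together with --embedder_model_custom. The sketch below shows one way to assemble the extract arguments with both rules applied. `gpu_flag` and `extract_args` are hypothetical helpers, and the model path in the usage line is a placeholder.

```python
def gpu_flag(device_indices):
    """Format device indices as described for --gpu.

    An empty sequence means CPU only ("-"); otherwise indices are
    joined with hyphens, e.g. [0, 1] -> "0-1".
    """
    if not device_indices:
        return "-"
    return "-".join(str(index) for index in device_indices)


def extract_args(model_name, sample_rate, device_indices=(),
                 embedder_model="contentvec", embedder_model_custom=None,
                 include_mutes=2):
    """Build the argument list that follows `python core.py`."""
    if embedder_model == "custom" and not embedder_model_custom:
        raise ValueError("--embedder_model custom needs --embedder_model_custom")
    args = [
        "extract",
        "--model_name", model_name,
        "--sample_rate", str(sample_rate),
        "--include_mutes", str(include_mutes),
        "--gpu", gpu_flag(device_indices),
        "--embedder_model", embedder_model,
    ]
    # The custom path is only passed when the custom embedder is selected.
    if embedder_model == "custom":
        args += ["--embedder_model_custom", embedder_model_custom]
    return args


# Placeholder path, for illustration only.
print(extract_args("MyVoice", 40000, [0, 1], "custom", "/models/my_embedder"))
```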

train

The train stage reads the features produced by extract and optimizes the RVC generator and discriminator networks. On completion it automatically calls the index subcommand to build the FAISS retrieval index. Checkpoint weights are saved to logs/<model_name>/ according to the save frequency settings.
Enable --overtraining_detector True to automatically stop training when the model stops improving. Set --overtraining_threshold to the number of consecutive non-improving epochs that should trigger early stopping.
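
The threshold behaves like a patience counter: every epoch without improvement increments it, and any improvement resets it. The sketch below illustrates that general technique on a plain list of per-epoch loss values. It is a generic example, not Applio's implementation, and how Applio measures "improvement" internally is an assumption here.

```python
def stopping_epoch(losses, threshold):
    """Return the 1-based epoch at which training would stop, or None.

    `losses` holds one loss value per epoch; training stops once
    `threshold` consecutive epochs fail to beat the best loss so far.
    """
    best = float("inf")
    stale_epochs = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            # Improvement: remember it and reset the counter.
            best = loss
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= threshold:
                return epoch
    return None


history = [1.0, 0.9, 0.95, 0.92, 0.91]
print(stopping_epoch(history, threshold=3))
```

With a threshold of 50, as in the pipeline example, a run would need 50 consecutive non-improving epochs before stopping, so short plateaus do not end training.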

Flags

--model_name
string
required
Name of the model to train. Must match the name used in preprocessing and extraction.
--save_every_epoch
integer
required
Save a checkpoint every N epochs. Accepts 1 to 100. Lower values give more recovery points but use more disk space.
--sample_rate
integer
required
Sample rate of the training data. Choices: 32000, 40000, 48000. Must match the value used in all previous stages.
--total_epoch
integer
default:"1000"
Total number of epochs to train. Accepts 1 to 10000. For most voices, 100 to 500 epochs is a practical starting range; overtraining detection can stop training early.
--batch_size
integer
default:"8"
Number of audio samples per training step. Accepts 1 to 50. Larger batches train faster but require more GPU memory. Typical values: 4 to 16 on consumer GPUs.
--gpu
string
default:"0"
GPU device index for training (e.g., 0). Multi-GPU training uses a hyphen-separated list (e.g., 0-1).
--save_only_latest
boolean
default:"False"
When True, only the most recent checkpoint is kept on disk; older checkpoints are deleted. Saves disk space at the cost of losing rollback history. Accepts True or False.
--save_every_weights
boolean
default:"True"
When True, a full model weight file (.pth) is saved at each checkpoint interval, not just the training state. Accepts True or False.
--overtraining_detector
boolean
default:"False"
Enable automatic detection of overtraining. When active, training stops if the validation loss does not improve for --overtraining_threshold consecutive epochs. Accepts True or False.
--overtraining_threshold
integer
default:"50"
Number of consecutive epochs without improvement before training is stopped. Accepts 1 to 100. Only active when --overtraining_detector True.
--pretrained
boolean
default:"True"
Initialize the model from Applio’s official pretrained base weights. Strongly recommended — training from scratch requires vastly more data and time. Accepts True or False.
--custom_pretrained
boolean
default:"False"
Use custom pretrained generator and discriminator weights instead of the official Applio pretraineds. When True, you must also supply --g_pretrained_path and --d_pretrained_path. Accepts True or False.
--g_pretrained_path
string
default:"None"
Path to a custom pretrained generator (G) model file. Only used when --custom_pretrained True.
--d_pretrained_path
string
default:"None"
Path to a custom pretrained discriminator (D) model file. Only used when --custom_pretrained True.
--vocoder
string
default:"HiFi-GAN"
Vocoder architecture to use for waveform synthesis. Choices: HiFi-GAN, MRF HiFi-GAN, RefineGAN. HiFi-GAN is the standard choice; MRF HiFi-GAN and RefineGAN may offer quality improvements on some voices.
--index_algorithm
string
default:"Auto"
FAISS index-building algorithm run automatically after training completes. Choices: Auto, Faiss, KMeans. Auto selects the best method based on dataset size.
--cache_data_in_gpu
boolean
default:"False"
Cache training feature tensors in GPU memory for faster data loading. Requires sufficient VRAM (approximately equal to the size of your extracted features). Accepts True or False.
--cleanup
boolean
default:"False"
Delete data from a previous training attempt for this model before starting. Use with caution — this cannot be undone. Accepts True or False.
--checkpointing
boolean
default:"False"
Enable gradient checkpointing to reduce VRAM usage during training at the cost of a small speed penalty. Useful for training on GPUs with limited memory. Accepts True or False.
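
The save settings determine how many checkpoints end up on disk, and the custom pretrained flags only work as a set. The sketch below works through both points. It assumes checkpoints fall on exact multiples of --save_every_epoch and ignores early stopping and any extra save at the final epoch; all function names are hypothetical.

```python
def checkpoint_epochs(total_epoch, save_every_epoch):
    """Epochs at which a checkpoint would be written (assumed: exact multiples)."""
    return list(range(save_every_epoch, total_epoch + 1, save_every_epoch))


def checkpoints_kept(total_epoch, save_every_epoch, save_only_latest=False):
    """Checkpoints left on disk once training has run to completion."""
    epochs = checkpoint_epochs(total_epoch, save_every_epoch)
    # With --save_only_latest True, older checkpoints are deleted.
    return epochs[-1:] if save_only_latest else epochs


def pretrained_args(custom_pretrained=False, g_pretrained_path=None,
                    d_pretrained_path=None):
    """Build the custom-pretrained flags, requiring both paths when enabled."""
    if not custom_pretrained:
        return ["--custom_pretrained", "False"]
    if not (g_pretrained_path and d_pretrained_path):
        raise ValueError("custom pretraineds need both G and D paths")
    return [
        "--custom_pretrained", "True",
        "--g_pretrained_path", g_pretrained_path,
        "--d_pretrained_path", d_pretrained_path,
    ]


# 200 epochs saved every 10 epochs gives 20 checkpoints under these assumptions.
print(len(checkpoints_kept(200, 10)))
print(checkpoints_kept(200, 10, save_only_latest=True))
```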

index

The index subcommand builds or rebuilds the FAISS retrieval index for a model. This is run automatically at the end of train, but can also be called manually — for example to change the index algorithm without retraining.

Flags

--model_name
string
required
Name of the model for which to generate the index. The model must already have completed the train stage.
--index_algorithm
string
default:"Auto"
Algorithm to use when building the FAISS index. Choices: Auto, Faiss, KMeans. Auto is recommended for most dataset sizes.
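
To compare index algorithms for an already-trained model, the index command can be generated once per choice. The sketch below only formats the shell command lines and does not run them; `index_command` is a hypothetical helper, and `shlex.join` from the standard library handles quoting for model names that contain spaces.

```python
import shlex

# Choices listed for --index_algorithm in the flag reference.
INDEX_ALGORITHMS = ("Auto", "Faiss", "KMeans")


def index_command(model_name, index_algorithm="Auto"):
    """Return the shell command line that rebuilds the index."""
    if index_algorithm not in INDEX_ALGORITHMS:
        raise ValueError(f"unknown index algorithm: {index_algorithm}")
    return shlex.join([
        "python", "core.py", "index",
        "--model_name", model_name,
        "--index_algorithm", index_algorithm,
    ])


for algorithm in INDEX_ALGORITHMS:
    print(index_command("MyVoice", algorithm))
```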

End-to-end example

The following example trains a model called MyVoice at 40 kHz using four CPU cores for preprocessing, a single GPU for extraction and training, 200 epochs, and overtraining protection.
python core.py preprocess \
  --model_name MyVoice \
  --dataset_path /data/voice_samples \
  --sample_rate 40000 \
  --cpu_cores 4 \
  --cut_preprocess Automatic \
  --noise_reduction True \
  --noise_reduction_strength 0.5

python core.py extract \
  --model_name MyVoice \
  --f0_method rmvpe \
  --gpu 0 \
  --sample_rate 40000 \
  --embedder_model contentvec \
  --include_mutes 2

python core.py train \
  --model_name MyVoice \
  --total_epoch 200 \
  --save_every_epoch 10 \
  --batch_size 8 \
  --gpu 0 \
  --sample_rate 40000 \
  --overtraining_detector True \
  --overtraining_threshold 50 \
  --pretrained True \
  --vocoder HiFi-GAN \
  --index_algorithm Auto
The train command automatically runs index at the end. You only need to run python core.py index separately if you want to regenerate the index with a different algorithm after training has already finished.
