Training a custom RVC voice model in Applio is a four-stage pipeline. Each stage is exposed as a separate subcommand in core.py, which lets you inspect intermediate results, re-run individual stages without repeating earlier work, and parallelize or schedule stages independently. The stages must be run in order: preprocessing cleans and segments raw audio; extraction computes pitch curves and speaker embeddings; training optimizes the model weights; and index generation builds a FAISS index for retrieval-based conversion. The train subcommand automatically triggers index generation on successful completion, but you can also run index separately at any time.
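The required ordering can be captured in a small driver script. The following is a minimal sketch, not part of Applio: it only builds one command line per stage from the four subcommand names above and leaves the per-stage flags to the caller. The helper name build_pipeline and the stage_args mapping are hypothetical.

```python
import sys

# Stage names as documented; they must run in this order.
STAGES = ("preprocess", "extract", "train", "index")


def build_pipeline(stage_args, script="core.py"):
    """Return one argv list per requested stage, in pipeline order.

    stage_args maps a stage name to its list of flags. This helper is
    hypothetical; the flags themselves are supplied by the caller.
    """
    unknown = set(stage_args) - set(STAGES)
    if unknown:
        raise ValueError(f"unknown stage(s): {sorted(unknown)}")
    return [
        [sys.executable, script, stage, *stage_args[stage]]
        for stage in STAGES
        if stage in stage_args
    ]


if __name__ == "__main__":
    # train already triggers index generation, so index is left out here.
    requested = {"train": [], "preprocess": [], "extract": []}
    for argv in build_pipeline(requested):
        print(" ".join(argv[1:]))
```

Each returned list can be passed to subprocess.run(argv, check=True), so that a failing stage raises an error and the later stages never start.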
Complete pipeline
Preprocess the dataset
Slice, filter, and normalize your raw audio files so they are ready for feature extraction.
preprocess
The preprocess stage reads raw audio files from --dataset_path, slices them into manageable chunks, applies optional noise reduction and normalization, and writes the processed segments into logs/<model_name>/. The quality of this step directly affects the quality of the trained model.
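Because preprocessing a large dataset takes time, it can help to check the chosen values against the ranges listed under Flags before launching the job. The sketch below does only that. The class and field names are illustrative and are not Applio flag names; the allowed values are copied from this page.

```python
from dataclasses import dataclass


@dataclass
class PreprocessSettings:
    # Field names are illustrative only, not Applio's flag names.
    sample_rate: int
    split_method: str
    cpu_cores: int
    noise_strength: float
    chunk_seconds: float
    overlap_seconds: float
    normalization: str

    def problems(self):
        """Return a list of violations of the documented ranges (empty if valid)."""
        found = []
        if self.sample_rate not in (32000, 40000, 48000):
            found.append("sample rate must be 32000, 40000, or 48000")
        if self.split_method not in ("Skip", "Simple", "Automatic"):
            found.append("split method must be Skip, Simple, or Automatic")
        if not 1 <= self.cpu_cores <= 64:
            found.append("CPU cores must be between 1 and 64")
        if not 0.0 <= self.noise_strength <= 1.0:
            found.append("noise reduction strength must be within 0.0-1.0")
        # 0.5 to 5.0 in steps of 0.5 means twice the value is a whole number.
        doubled = self.chunk_seconds * 2
        if not (0.5 <= self.chunk_seconds <= 5.0 and doubled == int(doubled)):
            found.append("chunk length must be 0.5-5.0 in steps of 0.5")
        if self.overlap_seconds not in (0.0, 0.1, 0.2, 0.3, 0.4):
            found.append("overlap must be one of 0.0, 0.1, 0.2, 0.3, 0.4")
        if self.normalization not in ("none", "pre", "post"):
            found.append("normalization must be none, pre, or post")
        return found


if __name__ == "__main__":
    settings = PreprocessSettings(40000, "Automatic", 4, 0.5, 3.0, 0.3, "post")
    print(settings.problems() or "settings are within the documented ranges")
```

The example values are arbitrary valid choices, not Applio's defaults.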
Flags
Name of the model. Determines the subdirectory under logs/ where preprocessed data is stored. Must match the names used in subsequent pipeline stages.
Path to the directory containing raw training audio files. Applio accepts common formats (WAV, MP3, FLAC, etc.).
Target sample rate for all processed audio. Choices: 32000, 40000, 48000. All files will be resampled to this rate. Choose a rate that matches the vocoder you plan to use during training.
Method used to split audio into chunks before processing. Choices: Skip (no splitting), Simple (fixed-length splits), Automatic (silence-based splits). Automatic is recommended for most datasets.
Number of CPU cores to use during preprocessing. Accepts 1–64. Defaults to using all available cores when not specified.
When set to True, disables all internal audio filters during preprocessing. Use only if you have already pre-processed your audio externally. Accepts True or False.
Apply spectral noise reduction to each audio segment during preprocessing. Useful for datasets recorded in noisy environments. Accepts True or False.
Intensity of the noise reduction filter. Range: 0.0–1.0. Only active when --noise_reduction True.
Target chunk length in seconds when splitting audio. Accepts values from 0.5 to 5.0 in steps of 0.5. Shorter chunks increase the total number of training samples; longer chunks preserve more context.
Overlap between consecutive chunks in seconds. Choices: 0.0, 0.1, 0.2, 0.3, 0.4. A small overlap reduces boundary artifacts at the cost of slightly redundant data.
Audio normalization strategy. Choices: none (no normalization), pre (normalize before slicing), post (normalize each chunk after slicing).
extract
The extract stage reads the preprocessed audio produced by preprocess and computes two types of features for every segment: F0 pitch curves (using the selected --f0_method) and speaker embeddings (using the selected --embedder_model). These features are written to logs/<model_name>/ and consumed directly by the train stage.
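The two feature choices map onto the --f0_method and --embedder_model flags named above. As a hedged sketch, the helper below validates both values against the choices listed under Flags and returns only that fragment of the command line. The function name is hypothetical, and the remaining extract flags (model name, sample rate, devices) are deliberately left out because their names are not shown in this text.

```python
# Choices copied from the Flags list for the extract stage.
F0_METHODS = ("crepe", "crepe-tiny", "rmvpe", "fcpe")
EMBEDDERS = (
    "contentvec",
    "spin",
    "spin-v2",
    "chinese-hubert-base",
    "japanese-hubert-base",
    "korean-hubert-base",
    "custom",
)


def extract_feature_flags(f0_method="rmvpe", embedder_model="contentvec"):
    """Return the --f0_method/--embedder_model part of an extract command.

    rmvpe is the documented recommendation for most voices; contentvec is
    used here only because it is the first listed choice. Hypothetical helper.
    """
    if f0_method not in F0_METHODS:
        raise ValueError(f"f0_method must be one of {F0_METHODS}")
    if embedder_model not in EMBEDDERS:
        raise ValueError(f"embedder_model must be one of {EMBEDDERS}")
    # Note: 'custom' additionally needs the path to the custom model.
    return ["--f0_method", f0_method, "--embedder_model", embedder_model]


if __name__ == "__main__":
    print(" ".join(extract_feature_flags("crepe", "japanese-hubert-base")))
```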
Flags
Name of the model. Must match the name used during preprocessing.
Sample rate of the preprocessed data. Choices: 32000, 40000, 44100, 48000. Must match the value used during preprocessing.
Number of silent (mute) audio files to include in the training data. Range: 0–10. Including a small number of silent samples helps the model learn silence handling. Set to 0 to exclude silence entirely.
Pitch-extraction algorithm. Choices for extraction: crepe, crepe-tiny, rmvpe, fcpe. rmvpe is recommended for most voices; crepe may handle falsetto and high-pitched voices better.
Number of CPU cores to use for feature extraction. Accepts 1–64. Optional; defaults to all available cores.
GPU device index(es) to use for extraction (e.g., 0, 0-1). Pass - to use CPU only. GPU extraction is significantly faster for large datasets.
Speaker-embedding model. Choices: contentvec, spin, spin-v2, chinese-hubert-base, japanese-hubert-base, korean-hubert-base, custom. Language-specific HuBERT models can produce better embeddings for non-English datasets.
Path to a custom embedding model. Only used when --embedder_model custom.
train
The train stage reads the features produced by extract and optimizes the RVC generator and discriminator networks. On completion it automatically calls the index subcommand to build the FAISS retrieval index. Checkpoint weights are saved to logs/<model_name>/ according to the save frequency settings.
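Training can end before the configured epoch count when overtraining detection is enabled (see --overtraining_detector and --overtraining_threshold under Flags). The sketch below illustrates that stopping rule on a list of per-epoch validation losses. It is a conceptual illustration, not Applio's implementation: it assumes "improvement" means a loss strictly lower than the best seen so far, and the function name is hypothetical.

```python
def epoch_of_early_stop(val_losses, threshold):
    """Return the 1-based epoch at which training would stop, or None.

    Stops once `threshold` consecutive epochs pass without the validation
    loss dropping below the best value seen so far (assumed rule).
    """
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= threshold:
                return epoch
    return None


if __name__ == "__main__":
    losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.69]
    print(epoch_of_early_stop(losses, threshold=3))  # 6
    print(epoch_of_early_stop(losses, threshold=4))  # None
```

With a threshold of 3 the run stops at epoch 6, one epoch before the loss would have improved again; a threshold of 4 lets it continue. A larger threshold tolerates longer plateaus at the cost of extra training time.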
Flags
Name of the model to train. Must match the name used in preprocessing and extraction.
Save a checkpoint every N epochs. Accepts 1–100. Lower values give more recovery points but use more disk space.
Sample rate of the training data. Choices: 32000, 40000, 48000. Must match the value used in all previous stages.
Total number of epochs to train. Accepts 1–10000. For most voices, 100–500 epochs is a practical starting range; overtraining detection can stop training early.
Number of audio samples per training step. Accepts 1–50. Larger batches train faster but require more GPU memory. Typical values: 4–16 on consumer GPUs.
GPU device index for training (e.g., 0). Multi-GPU training uses a hyphen-separated list (e.g., 0-1).
When True, only the most recent checkpoint is kept on disk; older checkpoints are deleted. Saves disk space at the cost of losing rollback history. Accepts True or False.
When True, a full model weight file (.pth) is saved at each checkpoint interval, not just the training state. Accepts True or False.
Enable automatic detection of overtraining. When active, training stops if the validation loss does not improve for --overtraining_threshold consecutive epochs. Accepts True or False.
Number of consecutive epochs without improvement before training is stopped. Accepts 1–100. Only active when --overtraining_detector True.
Initialize the model from Applio’s official pretrained base weights. Strongly recommended, because training from scratch requires vastly more data and time. Accepts True or False.
Use custom pretrained generator and discriminator weights instead of the official Applio pretrained weights. When True, you must also supply --g_pretrained_path and --d_pretrained_path. Accepts True or False.
Path to a custom pretrained generator (G) model file. Only used when --custom_pretrained True.
Path to a custom pretrained discriminator (D) model file. Only used when --custom_pretrained True.
Vocoder architecture to use for waveform synthesis. Choices: HiFi-GAN, MRF HiFi-GAN, RefineGAN. HiFi-GAN is the standard choice; MRF HiFi-GAN and RefineGAN may offer quality improvements on some voices.
FAISS index-building algorithm run automatically after training completes. Choices: Auto, Faiss, KMeans. Auto selects the best method based on dataset size.
Cache training feature tensors in GPU memory for faster data loading. Requires sufficient VRAM (approximately equal to the size of your extracted features). Accepts True or False.
Delete data from a previous training attempt for this model before starting. Use with caution: this cannot be undone. Accepts True or False.
Enable gradient checkpointing to reduce VRAM usage during training at the cost of a small speed penalty. Useful for training on GPUs with limited memory. Accepts True or False.
index
The index subcommand builds or rebuilds the FAISS retrieval index for a model. It runs automatically at the end of train, but can also be called manually, for example to change the index algorithm without retraining.
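A manual rebuild could be scripted roughly as follows. This is a sketch under stated assumptions: the flag names --model_name and --index_algorithm do not appear in this text and are assumed here, so verify them against the CLI's help output before use. The algorithm choices are those listed under Flags, and the helper name is hypothetical.

```python
import sys

# Choices copied from the Flags list for the index subcommand.
INDEX_ALGORITHMS = ("Auto", "Faiss", "KMeans")


def index_command(model_name, algorithm="Auto", script="core.py"):
    """Build (but do not run) a command that regenerates a model's index."""
    if algorithm not in INDEX_ALGORITHMS:
        raise ValueError(f"algorithm must be one of {INDEX_ALGORITHMS}")
    return [
        sys.executable, script, "index",
        "--model_name", model_name,        # assumed flag name
        "--index_algorithm", algorithm,    # assumed flag name
    ]


if __name__ == "__main__":
    print(" ".join(index_command("MyVoice", "KMeans")[1:]))
```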
Flags
Name of the model for which to generate the index. Must have completed the train stage.
Algorithm to use when building the FAISS index. Choices: Auto, Faiss, KMeans. Auto is recommended for most dataset sizes.
End-to-end example
The following example trains a model called MyVoice at 40 kHz using four CPU cores for preprocessing, a single GPU for extraction and training, 200 epochs, and overtraining protection.
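One way to script that run is sketched below. It only builds and prints the three command lines. Only --dataset_path and --overtraining_detector are named in this text; every other flag name (--model_name, --sample_rate, --cpu_cores, --gpu, --total_epoch) is an assumption inferred from the flag descriptions, and a stage may require further flags. The dataset path is a placeholder. Check each subcommand's help output before relying on it.

```python
import shlex
import sys

MODEL = "MyVoice"
SAMPLE_RATE = "40000"
DATASET = "path/to/dataset"  # placeholder, replace with your audio folder


def end_to_end_commands(model=MODEL, dataset=DATASET, script="core.py"):
    """Return the preprocess, extract, and train commands in run order.

    Flag names other than --dataset_path and --overtraining_detector are
    assumptions, not confirmed by this page.
    """
    py = sys.executable
    common = ["--model_name", model, "--sample_rate", SAMPLE_RATE]
    return [
        # Four CPU cores for preprocessing.
        [py, script, "preprocess", *common,
         "--dataset_path", dataset, "--cpu_cores", "4"],
        # Single GPU (device 0) for feature extraction.
        [py, script, "extract", *common, "--gpu", "0"],
        # 200 epochs on GPU 0 with overtraining protection enabled.
        [py, script, "train", *common, "--gpu", "0",
         "--total_epoch", "200", "--overtraining_detector", "True"],
    ]


if __name__ == "__main__":
    for argv in end_to_end_commands():
        print(shlex.join(argv))
```

To execute the stages instead of printing them, pass each list to subprocess.run(argv, check=True) from the directory that contains core.py; check=True stops the sequence at the first failing stage.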
The train command automatically runs index at the end. You only need to run python core.py index separately if you want to regenerate the index with a different algorithm after training has already finished.