Overview
Matcha-TTS includes a Gradio-based web interface for interactive text-to-speech synthesis. The interface provides real-time parameter adjustment, speaker selection, and immediate audio playback.
Launching the Interface
Command Line
# Launch with default settings
python -m matcha.app
Gradio prints a local URL to open in your browser and, because the app launches with share=True, a shareable public URL as well.
Features
Model Selection
Choose between two pre-trained models:
Multi Speaker (VCTK) : 108 speakers from the VCTK dataset
Single Speaker (LJ Speech) : High-quality single female speaker
Switching models automatically:
Adjusts the speaker selector visibility
Updates default speaking rate
Switches example sets
Loads appropriate vocoder
Text Input
The main text box accepts any English text. The system will:
Clean and normalize the text
Convert to phonemes using english_cleaners2
Display the phonetized sequence
Synthesize speech
Speaker Selection
Multi-Speaker Mode (VCTK)
A slider appears with speaker IDs from 0 to 107. Each ID represents a different speaker from the VCTK dataset with unique voice characteristics.
Single-Speaker Mode (LJ Speech)
The speaker selector is hidden as only one voice is available.
Synthesis Controls
The interface exposes three key hyperparameters:
Number of ODE Steps
Range: 1-100
Default: 10
Impact: Quality vs. speed tradeoff
2 steps: Very fast, slight quality reduction
4 steps: Fast, good quality
10 steps: Balanced (default)
50 steps: High quality, slower
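The tradeoff comes from numerical integration: with a fixed-step solver, each ODE step costs one evaluation of the model, and more steps follow the underlying trajectory more closely. The toy sketch below illustrates the effect with a plain Euler solver on a simple ODE (dx/dt = -x). It is not the Matcha-TTS decoder, and `euler_solve` is a hypothetical helper written only for this illustration.

```python
import math


def euler_solve(n_steps: int, x0: float = 1.0, t_end: float = 1.0) -> float:
    """Integrate dx/dt = -x from t=0 to t_end with n_steps fixed Euler steps."""
    dt = t_end / n_steps
    x = x0
    for _ in range(n_steps):
        x = x + dt * (-x)  # one derivative evaluation per step
    return x


exact = math.exp(-1.0)
errors = {n: abs(euler_solve(n) - exact) for n in (2, 4, 10, 50)}
for n, err in errors.items():
    print(f"{n:>2} steps: abs error = {err:.4f}")
```

The error shrinks as the step count grows, while the cost grows linearly with it, which mirrors the quality and speed behaviour described above.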
Length Scale (Speaking Rate)
Range: 0.5-1.5
Default: 0.85 (VCTK) / 0.95 (LJ Speech), applied when a model is selected (the slider itself is constructed with 1.0)
Impact: Speech pace
0.5: Fast speech (2x speed)
1.0: Normal pace
1.5: Slow speech (0.67x speed)
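The speed figures above follow from treating the length scale as a multiplier on durations, so the relative speed is its reciprocal. A minimal sketch of that arithmetic, where `relative_speed` is a hypothetical helper and not part of Matcha-TTS:

```python
def relative_speed(length_scale: float) -> float:
    """Relative speaking speed if all durations are multiplied by length_scale."""
    if length_scale <= 0:
        raise ValueError("length_scale must be positive")
    return 1.0 / length_scale


for scale in (0.5, 0.85, 0.95, 1.0, 1.5):
    print(f"length_scale={scale}: {relative_speed(scale):.2f}x speed")
```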
Sampling Temperature
Range: 0.00-2.00
Default: 0.667
Step: 0.16675
Impact: Prosody variation
0.00: Deterministic, flat prosody
0.667: Natural variation (default)
1.00: More expressive
2.00: Highly variable, potentially unstable
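The unusual step size makes more sense once the slider positions are enumerated: four steps of 0.16675 land exactly on the 0.667 default, and twelve steps give 2.001, which matches the slider maximum used in the interface code. The check below shows the arithmetic; the reading that the step was chosen for this reason is an inference from the numbers, not something the source states.

```python
STEP = 0.16675

# Slider positions reachable from 0.0 in whole steps, rounded to 5 decimals
positions = [round(i * STEP, 5) for i in range(13)]
print(positions)

# Index of the default temperature among the slider positions
default_index = positions.index(0.667)
print(default_index)
```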
Real-Time Output
After synthesis, the interface displays:
Phonetized Text: Shows how the text was converted to phonemes
Mel-Spectrogram: Visual representation of the generated speech
Audio Player: Playback controls with download option
Interface Code Structure
Main Components
The Gradio app is defined in matcha/app.py:149-357:
with gr.Blocks(title="🍵 Matcha-TTS") as demo:
    # State management
    processed_text = gr.State(value=None)
    processed_text_len = gr.State(value=None)
    # Model selector
    model_type = gr.Radio(
        ["Multi Speaker (VCTK)", "Single Speaker (LJ Speech)"],
        value="Multi Speaker (VCTK)",
        label="Choose a Model"
    )
    # Text input and speaker control
    text = gr.Textbox(value="", lines=2, label="Text to synthesise")
    spk_slider = gr.Slider(
        minimum=0, maximum=107, step=1, value=0,
        label="Speaker ID"
    )
    # Synthesis parameters
    n_timesteps = gr.Slider(
        label="Number of ODE steps",
        minimum=1, maximum=100, step=1, value=10
    )
    length_scale = gr.Slider(
        label="Length scale (Speaking rate)",
        minimum=0.5, maximum=1.5, step=0.05, value=1.0
    )
    mel_temp = gr.Slider(
        label="Sampling temperature",
        minimum=0.00, maximum=2.001, step=0.16675, value=0.667
    )
    # Outputs
    phonetised_text = gr.Textbox(interactive=False)
    mel_spectrogram = gr.Image(interactive=False)
    audio = gr.Audio(interactive=False)
Synthesis Pipeline
The synthesis process follows two steps:
Step 1: Text Processing (matcha/app.py:102-104)
@torch.inference_mode()
def process_text_gradio(text):
    output = process_text(1, text, device)
    return output["x_phones"][1::2], output["x"], output["x_lengths"]
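The `[1::2]` slice keeps every second element of the phoneme sequence. That is consistent with a token sequence in which a blank symbol is interspersed before, between, and after the phonemes; whether `process_text` does exactly this is an assumption here. The sketch below uses a hypothetical `intersperse` helper to show why such a slice recovers the original symbols.

```python
def intersperse(symbols: list, blank) -> list:
    """Place `blank` before, between, and after every symbol."""
    result = [blank] * (len(symbols) * 2 + 1)
    result[1::2] = symbols
    return result


phones = ["h", "ə", "l", "oʊ"]
padded = intersperse(phones, "_")
print(padded)        # ['_', 'h', '_', 'ə', '_', 'l', '_', 'oʊ', '_']
print(padded[1::2])  # ['h', 'ə', 'l', 'oʊ']
```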
Step 2: Mel Synthesis (matcha/app.py:108-122)
@torch.inference_mode()
def synthesise_mel(text, text_length, n_timesteps, temperature, length_scale, spk):
    spk = torch.tensor([spk], device=device, dtype=torch.long) if spk >= 0 else None
    output = model.synthesise(
        text,
        text_length,
        n_timesteps=n_timesteps,
        temperature=temperature,
        spks=spk,
        length_scale=length_scale,
    )
    output["waveform"] = to_waveform(output["mel"], vocoder, denoiser)
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as fp:
        sf.write(fp.name, output["waveform"], 22050, "PCM_24")
    return fp.name, plot_tensor(output["mel"].squeeze().cpu().numpy())
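Note that the speaker argument doubles as a mode switch: any negative speaker ID is mapped to `None`, which is how the single-speaker model is called without a speaker embedding. A dependency-free sketch of that convention follows; `to_speaker_arg` is a hypothetical helper, and the real code builds a `torch.long` tensor rather than a list.

```python
from typing import List, Optional


def to_speaker_arg(spk: int) -> Optional[List[int]]:
    """Return a one-element batch of speaker IDs, or None for negative IDs."""
    return [spk] if spk >= 0 else None


print(to_speaker_arg(16))  # [16]
print(to_speaker_arg(-1))  # None
```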
Dynamic Model Loading
The interface loads models on demand and reloads only when the selected model differs from the one already in memory (matcha/app.py:72-98):
def load_model_ui(model_type, textbox):
    model_name = RADIO_OPTIONS[model_type]["model"]
    vocoder_name = RADIO_OPTIONS[model_type]["vocoder"]
    global model, vocoder, denoiser, CURRENTLY_LOADED_MODEL
    if CURRENTLY_LOADED_MODEL != model_name:
        model, vocoder, denoiser = load_model(model_name, vocoder_name)
        CURRENTLY_LOADED_MODEL = model_name
    # Update UI based on model type
    if model_name == "matcha_ljspeech":
        spk_slider = gr.update(visible=False, value=-1)
        single_speaker_examples = gr.update(visible=True)
        multi_speaker_examples = gr.update(visible=False)
        length_scale = gr.update(value=0.95)
    else:
        spk_slider = gr.update(visible=True, value=0)
        single_speaker_examples = gr.update(visible=False)
        multi_speaker_examples = gr.update(visible=True)
        length_scale = gr.update(value=0.85)
    return (...)
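The core of this function is a load-once cache keyed on the model name: selecting a model that is already loaded skips the expensive load. The sketch below isolates that pattern with a stand-in loader. `fake_load_model`, `select_model`, and the `load_calls` list are hypothetical and only mimic the shape of the real code.

```python
RADIO_OPTIONS = {
    "Multi Speaker (VCTK)": {"model": "matcha_vctk", "vocoder": "hifigan_univ_v1"},
    "Single Speaker (LJ Speech)": {"model": "matcha_ljspeech", "vocoder": "hifigan_T2_v1"},
}

load_calls = []          # every (model, vocoder) pair that was actually loaded
currently_loaded = None  # name of the model held in memory, if any


def fake_load_model(model_name: str, vocoder_name: str) -> None:
    """Stand-in for the expensive checkpoint load."""
    load_calls.append((model_name, vocoder_name))


def select_model(model_type: str) -> str:
    """Load the model for a radio label only if it is not already loaded."""
    global currently_loaded
    names = RADIO_OPTIONS[model_type]
    if currently_loaded != names["model"]:
        fake_load_model(names["model"], names["vocoder"])
        currently_loaded = names["model"]
    return currently_loaded


select_model("Multi Speaker (VCTK)")
select_model("Multi Speaker (VCTK)")        # already loaded, no reload
select_model("Single Speaker (LJ Speech)")  # different model, reload
print(len(load_calls))  # 2
```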
Pre-Cached Examples
The interface includes cached examples for instant playback without synthesis:
Single Speaker Examples
# From matcha/app.py:236-285
examples = [
    [
        "We propose Matcha-TTS, a new approach to non-autoregressive neural TTS...",
        50, 0.677, 0.95
    ],
    [
        "The Secret Service believed that it was very doubtful...",
        2, 0.677, 0.95  # Fast, 2 steps
    ],
    [
        "The Secret Service believed that it was very doubtful...",
        10, 0.677, 0.95  # Default, 10 steps
    ],
    # More variations with different step counts
]
Multi-Speaker Examples
# From matcha/app.py:288-331
multi_speaker_examples = [
    ["Hello everyone! I am speaker 0...", 10, 0.677, 0.85, 0],
    ["Hello everyone! I am speaker 16...", 10, 0.677, 0.85, 16],
    ["Hello everyone! I am speaker 44...", 50, 0.677, 0.85, 44],
    ["Hello everyone! I am speaker 45...", 50, 0.677, 0.85, 45],
    ["Hello everyone! I am speaker 58...", 4, 0.677, 0.85, 58],
]
Examples are cached using cache_examples=True to avoid re-synthesis.
Model Configuration
Default configuration (matcha/app.py:23-28):
args = Namespace(
    cpu=False,                  # Use GPU if available
    model="matcha_vctk",        # Start with multi-speaker
    vocoder="hifigan_univ_v1",  # Universal vocoder
    spk=0,                      # Default speaker ID
)
Model options mapping (matcha/app.py:42-51):
RADIO_OPTIONS = {
    "Multi Speaker (VCTK)": {
        "model": "matcha_vctk",
        "vocoder": "hifigan_univ_v1",
    },
    "Single Speaker (LJ Speech)": {
        "model": "matcha_ljspeech",
        "vocoder": "hifigan_T2_v1",
    },
}
Technical Details
Audio Output Specifications
Sample Rate: 22050 Hz
Format: WAV (PCM_24)
Channels: Mono
Temporary Storage: Files created in system temp directory
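These settings fix the size of the audio data: 24-bit PCM stores 3 bytes per sample, so mono audio at 22050 Hz takes 66,150 bytes per second, plus a small file header that is not counted here. A quick estimate, with `pcm_payload_bytes` as a hypothetical helper:

```python
SAMPLE_RATE = 22050   # Hz
BYTES_PER_SAMPLE = 3  # PCM_24
CHANNELS = 1          # mono


def pcm_payload_bytes(duration_s: float) -> int:
    """Size of the raw audio data (without header) for a clip of this length."""
    n_samples = round(duration_s * SAMPLE_RATE)
    return n_samples * BYTES_PER_SAMPLE * CHANNELS


print(pcm_payload_bytes(1.0))   # 66150
print(pcm_payload_bytes(10.0))  # 661500
```

Because the temporary files are created with delete=False, they stay in the temp directory until removed, so long sessions can accumulate files of this size.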
GPU Support
The interface automatically detects GPU availability:
device = get_device(args)
# Returns torch.device("cuda") if GPU available, else torch.device("cpu")
Model Downloads
Both models and their vocoders are downloaded automatically if they are not already present (matcha/app.py:54-57):
assert_model_downloaded(MATCHA_TTS_LOC("matcha_ljspeech"), MATCHA_URLS["matcha_ljspeech"])
assert_model_downloaded(VOCODER_LOC("hifigan_T2_v1"), VOCODER_URLS["hifigan_T2_v1"])
assert_model_downloaded(MATCHA_TTS_LOC("matcha_vctk"), MATCHA_URLS["matcha_vctk"])
assert_model_downloaded(VOCODER_LOC("hifigan_univ_v1"), VOCODER_URLS["hifigan_univ_v1"])
Models are stored in the user data directory (typically ~/.local/share/matcha_tts/).
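The path above is the usual Linux location; other platforms have their own conventions for per-user data. The sketch below shows one common way to resolve such a directory. It is an assumption about typical behaviour, not a copy of Matcha-TTS's own lookup, and `user_data_dir` is a hypothetical helper.

```python
import os
import sys
from pathlib import Path


def user_data_dir(appname: str = "matcha_tts", platform: str = sys.platform) -> Path:
    """Return a conventional per-user data directory for the given platform."""
    home = Path.home()
    if platform.startswith("win"):
        base = Path(os.environ.get("APPDATA", home / "AppData" / "Roaming"))
    elif platform == "darwin":
        base = home / "Library" / "Application Support"
    else:  # Linux and other Unix-like systems
        base = Path(os.environ.get("XDG_DATA_HOME", home / ".local" / "share"))
    return base / appname


print(user_data_dir())
```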
Customization
You can customize the interface by modifying matcha/app.py:
Add Custom Examples
custom_examples = gr.Examples(
    examples=[
        ["Your custom text here", 10, 0.667, 0.95, 0],
        # Add more examples
    ],
    fn=multispeaker_example_cacher,
    inputs=[text, n_timesteps, mel_temp, length_scale, spk_slider],
    outputs=[phonetised_text, audio, mel_spectrogram],
    cache_examples=True,
)
Adjust Default Parameters
# Change default temperature
mel_temp = gr.Slider(
    label="Sampling temperature",
    minimum=0.00,
    maximum=2.001,
    step=0.16675,
    value=0.8,  # Changed from 0.667
    interactive=True,
)
Share Publicly
The interface launches with share=True by default, creating a public URL:
demo.queue().launch(share=True)
Set share=False for local-only access.
Tips for Best Results
Start with examples: Click cached examples to hear quality before experimenting
Adjust one parameter at a time: Easier to understand each parameter’s effect
Use 10 steps as baseline: Good balance for testing, increase for final output
Keep temperature at 0.667: Sweet spot for natural prosody
Try different speakers: VCTK has diverse voices (try 0, 16, 44, 45, 58)
Check phonetized text: If a word sounds wrong, inspect the phoneme sequence to see how it was transcribed