Quick Start

This guide will help you start synthesizing speech with Matcha-TTS in minutes. It covers three ways to use Matcha-TTS: the command-line interface (CLI), the Python API, and the Gradio web interface, followed by the bundled Jupyter notebook, performance tips, and output formats.

Make sure you’ve installed Matcha-TTS before proceeding. Pre-trained models will be automatically downloaded on first use.

CLI Usage

The command-line interface is the fastest way to synthesize speech from text.

Basic Synthesis

Synthesize a single utterance:

matcha-tts --text "Hello, this is Matcha TTS speaking."

This generates utterance_001.wav in your current directory.

Synthesize from File

Create a text file with one sentence per line, then pass it with --file:

matcha-tts --file sentences.txt

Each line will be synthesized as a separate audio file.
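As a minimal sketch, the input file can be produced from Python before calling the CLI. The helper name write_sentences is an illustrative assumption and not part of Matcha-TTS; it collapses stray newlines and skips empty entries so that every line in the file holds exactly one sentence.

```python
from pathlib import Path


def write_sentences(path, sentences):
    """Write one non-empty sentence per line and return how many were written."""
    # Collapse internal whitespace and newlines so each sentence stays on one line.
    cleaned = [" ".join(s.split()) for s in sentences]
    cleaned = [s for s in cleaned if s]
    Path(path).write_text("".join(s + "\n" for s in cleaned), encoding="utf-8")
    return len(cleaned)
```

Calling write_sentences("sentences.txt", ["Hello there.", "This is the second sentence."]) would produce a two-line file suitable for the --file option.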

Batch Processing

For faster processing of multiple sentences, use batch mode:

matcha-tts --file sentences.txt --batched --batch_size 32

Batched processing is significantly faster when synthesizing many sentences, especially on GPU.
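To illustrate what --batch_size controls, the sketch below splits a list of sentences into consecutive groups. It shows the general idea of batching only; make_batches is a hypothetical helper and is not taken from the Matcha-TTS source.

```python
def make_batches(sentences, batch_size=32):
    """Split sentences into consecutive groups of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [sentences[i:i + batch_size] for i in range(0, len(sentences), batch_size)]


sentences = [f"Sentence number {i}." for i in range(70)]
batches = make_batches(sentences, batch_size=32)
print([len(b) for b in batches])  # → [32, 32, 6]
```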

CLI Parameters

--text

string

Text to synthesize (alternative to --file)

--file

string

Path to text file with one sentence per line

--model

string

default:"matcha_ljspeech"

Model to use: matcha_ljspeech (single speaker) or matcha_vctk (multi-speaker)

--checkpoint_path

string

Path to custom model checkpoint (optional)

--vocoder

string

Vocoder to use: hifigan_T2_v1 or hifigan_univ_v1 (auto-selected based on model)

--speaking_rate

float

default:"0.95"

Speaking rate control (higher = slower). Default: 0.95 for LJSpeech, 0.85 for VCTK

--temperature

float

default:"0.667"

Sampling temperature for variation (higher = more variation)

--steps

int

default:"10"

Number of ODE solver steps (2-100). Fewer steps = faster but potentially lower quality

--spk

int

Speaker ID for multi-speaker models (0-107 for VCTK)

--output_folder

string

Directory to save output files (default: current directory)

--cpu

boolean

Force CPU inference (default: use GPU if available)

--batched

boolean

Enable batch processing mode

--batch_size

int

default:"32"

Batch size for batch mode

Advanced CLI Examples

# Slower speech (durations scaled by 1.2; higher = slower)
matcha-tts --text "Speak slowly and clearly." --speaking_rate 1.2

# Faster speech (durations scaled by 0.8; lower = faster)
matcha-tts --text "Speak quickly!" --speaking_rate 0.8
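When the CLI is driven from a script, the argument list can be assembled programmatically. The sketch below uses only flags listed in the CLI Parameters table; build_matcha_command is a hypothetical helper, and the command is printed rather than executed.

```python
import shlex


def build_matcha_command(text, speaking_rate=None, steps=None,
                         temperature=None, output_folder=None):
    """Assemble a matcha-tts argument list from the flags documented above."""
    cmd = ["matcha-tts", "--text", text]
    options = {
        "--speaking_rate": speaking_rate,
        "--steps": steps,
        "--temperature": temperature,
        "--output_folder": output_folder,
    }
    for flag, value in options.items():
        if value is not None:  # omit unset flags so the CLI defaults apply
            cmd += [flag, str(value)]
    return cmd


cmd = build_matcha_command("Speak slowly and clearly.", speaking_rate=1.2, steps=20)
print(shlex.join(cmd))
# To run it for real: subprocess.run(cmd, check=True)
```

Passing a list to subprocess.run avoids shell quoting problems with punctuation in the text.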

Python API

Use Matcha-TTS directly in your Python code for more control.

Basic Python Example

import torch
import soundfile as sf
from matcha.models.matcha_tts import MatchaTTS
from matcha.text import text_to_sequence
from matcha.utils.utils import intersperse
from matcha.hifigan.config import v1
from matcha.hifigan.models import Generator as HiFiGAN
from matcha.hifigan.env import AttrDict
from matcha.hifigan.denoiser import Denoiser

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load Matcha-TTS model
model = MatchaTTS.load_from_checkpoint(
    "path/to/matcha_ljspeech.ckpt", 
    map_location=device
)
model.eval()

# Load vocoder (HiFi-GAN)
h = AttrDict(v1)
vocoder = HiFiGAN(h).to(device)
vocoder.load_state_dict(
    torch.load("path/to/hifigan_T2_v1", map_location=device)["generator"]
)
vocoder.eval()
vocoder.remove_weight_norm()
denoiser = Denoiser(vocoder, mode="zeros")

# Prepare text
text = "Hello, this is Matcha TTS."
x = torch.tensor(
    intersperse(text_to_sequence(text, ["english_cleaners2"])[0], 0),
    dtype=torch.long,
    device=device
)[None]
x_lengths = torch.tensor([x.shape[-1]], dtype=torch.long, device=device)

# Synthesize
with torch.inference_mode():
    output = model.synthesise(
        x,
        x_lengths,
        n_timesteps=10,
        temperature=0.667,
        spks=None,
        length_scale=1.0
    )
    
    # Generate waveform
    audio = vocoder(output["mel"]).clamp(-1, 1)
    audio = denoiser(audio.squeeze(), strength=0.00025).cpu().squeeze()

# Save audio
sf.write("output.wav", audio.numpy(), 22050, "PCM_24")

print(f"Real-time factor: {output['rtf']:.4f}")
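The text-preparation step in the example above interleaves a blank token (ID 0) with the symbol IDs returned by text_to_sequence. The sketch below is a pure-Python stand-in written for illustration; it assumes that matcha.utils.utils.intersperse places the blank before, between, and after the symbols, and the input IDs are made up.

```python
def intersperse_sketch(symbol_ids, blank_id=0):
    """Return symbol_ids with blank_id placed before, between, and after the items."""
    result = [blank_id] * (len(symbol_ids) * 2 + 1)
    result[1::2] = symbol_ids  # odd positions hold the original symbols
    return result


print(intersperse_sketch([5, 9, 12]))  # → [0, 5, 0, 9, 0, 12, 0]
```

Under this assumption, a sequence of n symbols becomes 2n + 1 tokens, which is the length that x_lengths reports.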

Synthesis Function Parameters

The synthesise() method accepts the following parameters:

x

torch.Tensor

Batch of phoneme sequences. Shape: (batch_size, max_text_length)

x_lengths

torch.Tensor

Lengths of each sequence in the batch. Shape: (batch_size,)

n_timesteps

int

Number of ODE solver steps (2-100)

temperature

float

default:"1.0"

Controls variance of terminal distribution

spks

torch.Tensor

Speaker IDs for multi-speaker models. Shape: (batch_size,)

length_scale

float

default:"1.0"

Controls speech pace (higher = slower)
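For a batch of more than one utterance, x has to be padded to a common length while x_lengths records the true lengths. The following sketch uses plain Python lists to show the expected shapes; pad_batch is a hypothetical helper, padding with 0 is an assumption, and in real use both results would be converted to torch.long tensors on the model's device.

```python
def pad_batch(sequences, pad_id=0):
    """Pad variable-length ID sequences to a rectangle and return (x, x_lengths)."""
    if not sequences:
        raise ValueError("at least one sequence is required")
    x_lengths = [len(seq) for seq in sequences]
    max_len = max(x_lengths)
    x = [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]
    return x, x_lengths


x, x_lengths = pad_batch([[0, 5, 0, 9, 0], [0, 7, 0]])
print(x)          # → [[0, 5, 0, 9, 0], [0, 7, 0, 0, 0]]
print(x_lengths)  # → [5, 3]
```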

Helper Functions

@torch.inference_mode()
def process_text(text: str, device):
    """Convert text to phoneme tensor."""
    x = torch.tensor(
        intersperse(text_to_sequence(text, ["english_cleaners2"])[0], 0),
        dtype=torch.long,
        device=device
    )[None]
    x_lengths = torch.tensor([x.shape[-1]], dtype=torch.long, device=device)
    return x, x_lengths

@torch.inference_mode()
def to_waveform(mel, vocoder, denoiser, strength=0.00025):
    """Convert mel-spectrogram to waveform."""
    audio = vocoder(mel).clamp(-1, 1)
    audio = denoiser(audio.squeeze(), strength=strength).cpu().squeeze()
    return audio

Multi-Speaker Example

# Load VCTK multi-speaker model
model = MatchaTTS.load_from_checkpoint(
    "path/to/matcha_vctk.ckpt", 
    map_location=device
)
model.eval()

# Prepare text
x, x_lengths = process_text("Hello from speaker zero.", device)

# Speaker ID (0-107 for VCTK)
spk = torch.tensor([0], device=device, dtype=torch.long)

# Synthesize with specific speaker
with torch.inference_mode():
    output = model.synthesise(
        x,
        x_lengths,
        n_timesteps=10,
        temperature=0.667,
        spks=spk,
        length_scale=0.85  # VCTK default
    )
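Because the recommended speaking rate and the valid speaker range depend on the model, it can help to resolve both in one place before calling synthesise(). This sketch uses only the values stated in this guide (0.95 for LJSpeech, 0.85 for VCTK, speaker IDs 0-107); resolve_settings and the fallback to speaker 0 are hypothetical choices.

```python
# Values taken from the CLI parameter table in this guide.
DEFAULT_SPEAKING_RATE = {"matcha_ljspeech": 0.95, "matcha_vctk": 0.85}
VCTK_SPEAKER_RANGE = (0, 107)


def resolve_settings(model, spk=None):
    """Return (speaking_rate, spk) for a model, validating the speaker ID."""
    if model not in DEFAULT_SPEAKING_RATE:
        raise ValueError(f"unknown model: {model}")
    if model == "matcha_vctk":
        low, high = VCTK_SPEAKER_RANGE
        spk = low if spk is None else spk  # hypothetical fallback to speaker 0
        if not low <= spk <= high:
            raise ValueError(f"speaker ID must be in {low}-{high}, got {spk}")
    else:
        spk = None  # single-speaker model: no speaker ID is passed
    return DEFAULT_SPEAKING_RATE[model], spk


print(resolve_settings("matcha_vctk", spk=42))  # → (0.85, 42)
print(resolve_settings("matcha_ljspeech"))      # → (0.95, None)
```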

Gradio Web Interface

Launch an interactive web interface for experimenting with Matcha-TTS:

matcha-tts-app

This starts a Gradio interface where you can:

Enter text and synthesize instantly
Switch between single-speaker and multi-speaker models
Adjust hyperparameters in real-time
Select different speakers (for VCTK model)
Listen to pre-cached examples

The Gradio app automatically downloads required models on first launch. The interface will be available at http://localhost:7860 by default.

Gradio Interface Code

A condensed excerpt of the Gradio app implementation in matcha/app.py:

import gradio as gr
import torch
import soundfile as sf
from matcha.cli import (
    load_matcha,
    load_vocoder,
    process_text,
    to_waveform,
)

# Load models
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = load_matcha("matcha_vctk", "path/to/model.ckpt", device)
vocoder, denoiser = load_vocoder("hifigan_univ_v1", "path/to/vocoder", device)

@torch.inference_mode()
def synthesise_mel(text, text_length, n_timesteps, temperature, length_scale, spk):
    spk = torch.tensor([spk], device=device, dtype=torch.long) if spk >= 0 else None
    output = model.synthesise(
        text,
        text_length,
        n_timesteps=n_timesteps,
        temperature=temperature,
        spks=spk,
        length_scale=length_scale,
    )
    output["waveform"] = to_waveform(output["mel"], vocoder, denoiser)
    return output["waveform"], output["mel"]

Jupyter Notebook

Matcha-TTS includes a Jupyter notebook (synthesis.ipynb) for interactive experimentation:

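# From synthesis.ipynb. synthesise(), to_waveform(), and save_to_folder() are
# helpers expected to be defined in earlier notebook cells, along with the
# loaded model and vocoder.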
import datetime as dt
import IPython.display as ipd
import numpy as np
from tqdm.auto import tqdm

# Configuration
n_timesteps = 10
length_scale = 1.0
temperature = 0.667

# Synthesize and display
texts = [
    "The Secret Service believed that it was very doubtful that any "
    "President would ride regularly in a vehicle with a fixed top, "
    "even though transparent."
]

for i, text in enumerate(tqdm(texts)):
    output = synthesise(text)
    output['waveform'] = to_waveform(output['mel'], vocoder)
    
    # Calculate RTF
    t = (dt.datetime.now() - output['start_t']).total_seconds()
    rtf_w = t * 22050 / (output['waveform'].shape[-1])
    
    print(f"RTF: {output['rtf']:.6f}")
    print(f"RTF Waveform: {rtf_w:.6f}")
    
    # Display audio in notebook
    ipd.display(ipd.Audio(output['waveform'], rate=22050))
    
    # Save to file
    save_to_folder(i, output, "synth_output")

Performance Tips

Choosing the right number of steps

2-4 steps: Ultra-fast, slight quality reduction
10 steps (default): Good balance of speed and quality
50+ steps: Highest quality, diminishing returns beyond 50
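The trade-off comes from numerical integration: the output is obtained by integrating an ODE from t = 0 to t = 1 in a fixed number of steps. The toy below assumes a fixed-step Euler solver and substitutes the simple ODE dx/dt = x (exact value e at t = 1) for the learned vector field, so it only illustrates how integration error shrinks as the step count grows; it says nothing about actual audio quality.

```python
import math


def euler_solve(vector_field, x0, n_timesteps):
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, dt = x0, 1.0 / n_timesteps
    for step in range(n_timesteps):
        t = step * dt
        x = x + dt * vector_field(x, t)
    return x


def toy_field(x, t):
    return x  # stand-in for the learned vector field


for n in (2, 10, 50):
    error = abs(euler_solve(toy_field, 1.0, n) - math.e)
    print(f"{n:>2} steps: error = {error:.4f}")
```

Each extra step costs one more evaluation of the vector field, which is why fewer steps run faster.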

GPU vs CPU

GPU is highly recommended:

GPU: RTF ~0.02 (50x real-time)
CPU: RTF ~0.5-1.0 (1-2x real-time)

Use the --cpu flag only if no GPU is available.
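The real-time factor (RTF) is the time spent synthesizing divided by the duration of the audio produced, so an RTF of 0.02 corresponds to 50x faster than real time. A minimal sketch of the arithmetic, using the 22050 Hz sample rate from this guide and a made-up timing:

```python
SAMPLE_RATE = 22050  # Hz, the output rate stated in this guide


def real_time_factor(synthesis_seconds, num_samples, sample_rate=SAMPLE_RATE):
    """RTF = time spent synthesizing / duration of the audio produced."""
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


# Hypothetical timing: 0.1 s of compute for 5 s of audio.
rtf = real_time_factor(0.1, 5 * SAMPLE_RATE)
print(f"RTF = {rtf:.3f}, speed-up = {1 / rtf:.0f}x real time")
```

Values below 1.0 mean synthesis finishes before the audio would finish playing.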

Batch processing

For many utterances, use --batched mode:

matcha-tts --file large_file.txt --batched --batch_size 32

This can be 3-5x faster than processing individually.

Temperature and variation

0.333: Less variation, more consistent
0.667 (default): Natural variation
1.0+: More variation, potentially less stable
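One way to picture the effect is that temperature scales the random noise from which sampling starts. The sketch below assumes exactly that (standard-normal noise multiplied by temperature) and uses Python's random module instead of the model; sample_initial_noise is a hypothetical helper.

```python
import random
import statistics


def sample_initial_noise(n, temperature, seed=0):
    """Draw n standard-normal values and scale them by temperature."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) * temperature for _ in range(n)]


for temperature in (0.333, 0.667, 1.0):
    noise = sample_initial_noise(10_000, temperature)
    print(f"temperature={temperature}: std = {statistics.pstdev(noise):.3f}")
```

The measured standard deviation tracks the temperature, which is why lower values give more consistent output.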

Output Format

Matcha-TTS generates:

Audio files: .wav format, 22050 Hz, PCM_24
Mel-spectrograms: .npy files (NumPy arrays)
Visualizations: .png spectrogram plots (when using CLI)
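To confirm the audio format of a generated file, the header can be inspected with Python's standard wave module. This is a sketch that assumes the file uses plain PCM encoding that the wave module can read; describe_wav is a hypothetical helper.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, bits_per_sample, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        sample_rate = wav.getframerate()
        bits = wav.getsampwidth() * 8
        duration = wav.getnframes() / sample_rate
    return sample_rate, bits, duration


# Example, using the file name from the Basic Synthesis section:
# print(describe_wav("utterance_001.wav"))
```

For a file matching the format listed above, the first two values would be 22050 and 24.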

Next Steps

Training Custom Models

Learn how to train Matcha-TTS on your own dataset

ONNX Export

Export models to ONNX for deployment

API Reference

Detailed API documentation

Examples

More advanced usage examples
