The parakeet_mlx.audio module provides utilities for loading audio files and converting them to log-mel spectrograms for processing by Parakeet models.
Functions
load_audio
Load an audio file and resample it to the target sampling rate.
from parakeet_mlx.audio import load_audio
import mlx.core as mx
audio = load_audio(
    filename="audio.wav",
    sampling_rate=16000,
    dtype=mx.bfloat16
)
Parameters
filename
Path to the audio file. Supports any format that FFmpeg can read (WAV, MP3, FLAC, etc.). Example: "audio.wav", Path("/path/to/audio.mp3")
sampling_rate
Target sampling rate in Hz. Audio will be resampled to this rate. Parakeet models typically use 16000 Hz. Example: 16000
dtype
mx.Dtype
default: mx.bfloat16
MLX data type for the output array. Common options:
mx.bfloat16: Memory efficient (recommended)
mx.float32: Higher precision
Example: mx.bfloat16
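To make the memory trade-off between the two dtypes concrete, the sketch below estimates the size of the sample buffer alone. It relies only on the element sizes (2 bytes for bfloat16, 4 bytes for float32); the helper name audio_buffer_bytes is hypothetical and not part of parakeet_mlx.

```python
# Bytes per element for the two dtypes listed above.
BYTES_PER_SAMPLE = {"bfloat16": 2, "float32": 4}


def audio_buffer_bytes(duration_s: float, sampling_rate: int, dtype_name: str) -> int:
    """Return the size in bytes of a 1D sample buffer of the given duration."""
    num_samples = int(duration_s * sampling_rate)
    return num_samples * BYTES_PER_SAMPLE[dtype_name]


# One hour of audio at 16 kHz
one_hour_bf16 = audio_buffer_bytes(3600, 16000, "bfloat16")
one_hour_f32 = audio_buffer_bytes(3600, 16000, "float32")
print(f"bfloat16: {one_hour_bf16 / 1e6:.1f} MB")
print(f"float32:  {one_hour_f32 / 1e6:.1f} MB")
```

For short clips the difference is negligible; it starts to matter for long recordings held fully in memory.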
Returns
1D array of audio samples normalized to the range [-1.0, 1.0]. Shape: [num_samples]
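As an illustration of what the normalized range means, here is a minimal sketch that assumes the decoded audio arrives as signed 16-bit little-endian PCM. That intermediate format is an assumption, as this page does not state how load_audio decodes internally, and pcm16_to_float is a hypothetical helper.

```python
import struct


def pcm16_to_float(raw: bytes) -> list[float]:
    """Convert signed 16-bit little-endian PCM bytes to floats in [-1.0, 1.0)."""
    count = len(raw) // 2
    samples = struct.unpack(f"<{count}h", raw[: count * 2])
    # Dividing by 2**15 maps the integer range -32768..32767 onto [-1.0, 1.0)
    return [s / 32768.0 for s in samples]


# Four hand-picked samples: full negative scale, half scale, silence, half scale
raw = struct.pack("<4h", -32768, -16384, 0, 16384)
print(pcm16_to_float(raw))
```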
Requirements
FFmpeg must be installed and available in your PATH. Install it with:
# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt-get install ffmpeg
# Windows
winget install ffmpeg
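Since a missing FFmpeg binary only surfaces when audio is actually loaded, it can help to check for it up front. The sketch below uses the standard library's shutil.which; the ffmpeg_available helper is hypothetical and not part of parakeet_mlx.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


if not ffmpeg_available():
    print("FFmpeg not found on PATH; install it before calling load_audio.")
```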
Example
from parakeet_mlx.audio import load_audio
from parakeet_mlx import from_pretrained
import mlx.core as mx
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
# Load audio with the model's sample rate
audio = load_audio(
    "audio.wav",
    model.preprocessor_config.sample_rate,
    dtype=mx.bfloat16
)
print(f"Audio shape: {audio.shape}")
print(f"Duration: {len(audio) / model.preprocessor_config.sample_rate:.2f}s")
get_logmel
Convert audio samples to log-mel spectrogram.
from parakeet_mlx.audio import get_logmel, load_audio
from parakeet_mlx import from_pretrained
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
audio = load_audio("audio.wav", model.preprocessor_config.sample_rate)
mel = get_logmel(audio, model.preprocessor_config)
Parameters
1D array of audio samples (output from load_audio). Shape: [num_samples]
Preprocessing configuration. Use model.preprocessor_config to ensure compatibility with your model. This dataclass contains:
sample_rate: Audio sample rate
features: Number of mel filterbanks
n_fft: FFT window size
window_size: STFT window size in seconds
window_stride: STFT hop length in seconds
window: Window function (“hann”, “hamming”, “blackman”, “bartlett”)
normalize: Normalization strategy (“per_feature” or “global”)
preemph: Pre-emphasis coefficient
Other parameters (see PreprocessArgs)
Returns
Log-mel spectrogram ready for model input. Shape: [1, sequence_length, mel_features]. The output includes:
- Batch dimension of 1
- Normalized log-mel features
- Data type matching the input array
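The sequence_length dimension scales with the audio length divided by the STFT hop. The sketch below gives a rough estimate under two assumptions that this page does not confirm: the hop in samples equals window_stride times sample_rate, and the STFT is centered so that it yields 1 + num_samples // hop_length frames. The 10 ms stride is an illustrative value, and approx_num_frames is a hypothetical helper.

```python
def approx_num_frames(num_samples: int, sample_rate: int, window_stride: float) -> int:
    """Estimate the STFT frame count for a centered STFT (assumed convention)."""
    hop_length = int(round(window_stride * sample_rate))  # samples between frames
    return 1 + num_samples // hop_length


# 10 seconds at 16 kHz with an illustrative 10 ms stride
frames = approx_num_frames(160_000, 16_000, 0.01)
print(frames)
```

Treat the result as an order-of-magnitude check; the actual value is whatever mel.shape[1] reports.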
Processing Steps
get_logmel performs the following operations:
- Pre-emphasis: Apply first-order filter (if enabled)
- STFT: Short-time Fourier transform with specified window
- Magnitude: Compute the magnitude spectrum raised to mag_power (the default of 2.0 gives the power spectrum)
- Mel filterbank: Apply mel-scale filterbank
- Logarithm: Convert to log scale
- Normalization: Normalize features (per-feature or global)
- Reshape: Add batch dimension
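Two of the steps above, pre-emphasis and per-feature normalization, can be sketched in plain Python. This is an illustration under stated assumptions rather than the library's implementation: the filter form y[t] = x[t] - coeff * x[t-1], the use of population variance, and the eps value are all assumptions, and both function names are hypothetical.

```python
import math


def preemphasis(x: list[float], coeff: float = 0.97) -> list[float]:
    """First-order high-pass filter: y[0] = x[0], y[t] = x[t] - coeff * x[t-1]."""
    return [x[0]] + [x[t] - coeff * x[t - 1] for t in range(1, len(x))]


def normalize_per_feature(frames: list[list[float]], eps: float = 1e-5) -> list[list[float]]:
    """Normalize each feature (column) to zero mean and unit variance over time."""
    n_frames = len(frames)
    n_feats = len(frames[0])
    out = [[0.0] * n_feats for _ in range(n_frames)]
    for j in range(n_feats):
        column = [frames[i][j] for i in range(n_frames)]
        mean = sum(column) / n_frames
        # Population variance, chosen here for simplicity
        var = sum((v - mean) ** 2 for v in column) / n_frames
        std = math.sqrt(var)
        for i in range(n_frames):
            out[i][j] = (column[i] - mean) / (std + eps)
    return out


# A constant signal is almost entirely cancelled after the first sample
filtered = preemphasis([1.0, 1.0, 1.0, 1.0])
print(filtered)

# 3 frames x 2 features of toy log-mel values
log_mel = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0]]
normalized = normalize_per_feature(log_mel)
print(normalized)
```

Per-feature normalization computes statistics for each mel bin across time, which is why each column of the toy output is centered on zero independently of the other.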
Example
from parakeet_mlx.audio import get_logmel, load_audio
from parakeet_mlx import from_pretrained
import mlx.core as mx
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
# Load and convert audio
audio = load_audio("audio.wav", model.preprocessor_config.sample_rate)
mel = get_logmel(audio, model.preprocessor_config)
print(f"Audio shape: {audio.shape}")
print(f"Mel shape: {mel.shape}") # [1, seq_len, n_mels]
# Use with model
results = model.generate(mel)
print(results[0].text)
Complete Example
Low-Level API Usage
import mlx.core as mx
from parakeet_mlx import from_pretrained, DecodingConfig
from parakeet_mlx.audio import load_audio, get_logmel
# Load model
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
# Load and preprocess audio manually
audio = load_audio(
    "audio.wav",
    model.preprocessor_config.sample_rate,
    dtype=mx.bfloat16
)
mel = get_logmel(audio, model.preprocessor_config)
print(f"Audio: {audio.shape}")
print(f"Mel: {mel.shape}")
# Generate transcription
results = model.generate(mel, decoding_config=DecodingConfig())
for result in results:
    print(result.text)
Batch Processing
import mlx.core as mx
from parakeet_mlx import from_pretrained
from parakeet_mlx.audio import load_audio, get_logmel
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
# Load multiple audio files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
audio_data = [
    load_audio(f, model.preprocessor_config.sample_rate)
    for f in audio_files
]
# Convert to mel spectrograms
mels = [
    get_logmel(audio, model.preprocessor_config)
    for audio in audio_data
]
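# Batch process: concatenating along axis 0 requires equal sequence lengths,
# so zero-pad shorter spectrograms to the longest one. Padding may affect the
# output for the shorter files, so inputs of similar length batch best.
max_len = max(m.shape[1] for m in mels)
mels = [mx.pad(m, [(0, 0), (0, max_len - m.shape[1]), (0, 0)]) for m in mels]
mel_batch = mx.concatenate(mels, axis=0)
results = model.generate(mel_batch)
for filename, result in zip(audio_files, results):
    print(f"{filename}: {result.text}")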
Custom Audio Processing
import mlx.core as mx
from parakeet_mlx import from_pretrained
from parakeet_mlx.audio import load_audio, get_logmel
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
# Load audio
audio = load_audio("audio.wav", model.preprocessor_config.sample_rate)
# Process specific segment (10s to 20s)
sample_rate = model.preprocessor_config.sample_rate
start_sample = int(10.0 * sample_rate)
end_sample = int(20.0 * sample_rate)
audio_segment = audio[start_sample:end_sample]
mel_segment = get_logmel(audio_segment, model.preprocessor_config)
results = model.generate(mel_segment)
print(f"10s-20s: {results[0].text}")
Chunking Long Audio
import mlx.core as mx
from parakeet_mlx import from_pretrained
from parakeet_mlx.audio import load_audio, get_logmel
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
# Load long audio
audio = load_audio("long_audio.wav", model.preprocessor_config.sample_rate)
# Process in chunks
chunk_duration = 60.0 # 60 seconds
sample_rate = model.preprocessor_config.sample_rate
chunk_samples = int(chunk_duration * sample_rate)
all_results = []
for i in range(0, len(audio), chunk_samples):
    chunk = audio[i:i + chunk_samples]
    # Skip very small chunks
    if len(chunk) < model.preprocessor_config.hop_length:
        break
    mel = get_logmel(chunk, model.preprocessor_config)
    results = model.generate(mel)
    all_results.append(results[0].text)
    print(f"Chunk {i // chunk_samples + 1}: {results[0].text}")
full_text = " ".join(all_results)
print(f"\nComplete: {full_text}")
Best Practice: Use model.transcribe() instead of manually calling load_audio and get_logmel. The high-level API handles chunking, overlaps, and merging automatically.
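To show what overlapping chunks look like in terms of sample ranges, here is a minimal sketch of one way to compute chunk boundaries. It is not the logic used by model.transcribe(); the function name chunk_bounds and the 60 s / 15 s values are illustrative assumptions.

```python
def chunk_bounds(
    total_samples: int, chunk_samples: int, overlap_samples: int
) -> list[tuple[int, int]]:
    """Return (start, end) sample ranges where consecutive chunks share overlap_samples."""
    if not 0 <= overlap_samples < chunk_samples:
        raise ValueError("overlap must be non-negative and smaller than the chunk size")
    step = chunk_samples - overlap_samples
    bounds = []
    start = 0
    while start < total_samples:
        end = min(start + chunk_samples, total_samples)
        bounds.append((start, end))
        if end == total_samples:
            break  # the final chunk already reaches the end of the audio
        start += step
    return bounds


# 150 s of 16 kHz audio, 60 s chunks, 15 s overlap (illustrative values)
sr = 16_000
for start, end in chunk_bounds(150 * sr, 60 * sr, 15 * sr):
    print(f"{start / sr:.0f}s - {end / sr:.0f}s")
```

Overlap gives each boundary region two chances to be transcribed with context on both sides, at the cost of having to merge the duplicated text afterwards, which is the part the high-level API takes care of.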
PreprocessArgs
The PreprocessArgs dataclass (from parakeet_mlx.audio) contains all audio preprocessing configuration:
@dataclass
class PreprocessArgs:
    sample_rate: int
    normalize: str
    window_size: float
    window_stride: float
    window: str
    features: int
    n_fft: int
    dither: float
    pad_to: int = 0
    pad_value: float = 0
    preemph: float | None = 0.97
    mag_power: float = 2.0
Access it from any model:
from parakeet_mlx import from_pretrained
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
config = model.preprocessor_config
print(f"Sample rate: {config.sample_rate}")
print(f"Mel features: {config.features}")
print(f"Window: {config.window}")
print(f"Hop length: {config.hop_length}")
print(f"Win length: {config.win_length}")
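The hop_length and win_length attributes printed above are not fields in the dataclass listing, which suggests they are derived values. The sketch below shows one plausible derivation, assuming each is the corresponding duration in seconds multiplied by the sample rate. WindowConfig is a hypothetical stand-in, not the real PreprocessArgs, and the rounding behavior is an assumption.

```python
from dataclasses import dataclass


@dataclass
class WindowConfig:
    """Hypothetical stand-in holding only the fields needed for this illustration."""

    sample_rate: int
    window_size: float    # seconds
    window_stride: float  # seconds

    @property
    def win_length(self) -> int:
        # Window length in samples (assumed: seconds * sample rate)
        return int(round(self.window_size * self.sample_rate))

    @property
    def hop_length(self) -> int:
        # Hop length in samples (assumed: seconds * sample rate)
        return int(round(self.window_stride * self.sample_rate))


# Illustrative values: 25 ms window, 10 ms stride at 16 kHz
cfg = WindowConfig(sample_rate=16000, window_size=0.025, window_stride=0.01)
print(cfg.win_length, cfg.hop_length)
```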