Local attention is a memory-efficient alternative to full attention that restricts the attention mechanism to a fixed-size context window. This optimization is particularly valuable for processing long audio files without chunking.
Why Local Attention?
Full self-attention has quadratic memory complexity, O(n²), with respect to the sequence length n, so memory usage grows quickly on long audio files. Local attention reduces this to O(n·w), where w is the context window size, making it practical to transcribe hours of audio without chunking.
Local attention is especially useful when transcribing long audio files (30+ minutes) without chunking, or when running on devices with limited memory.
How It Works
Instead of attending to all positions in the sequence, local attention restricts each position to attend only to a fixed window of neighboring positions:
- Left context: How many frames before the current position
- Right context: How many frames after the current position
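The windowing rule can be sketched in plain Python. This is an illustration of the idea only, not the Parakeet MLX implementation: the function names are hypothetical, and the dense mask is built purely for display (an efficient kernel would never materialize an n × n mask, since that would defeat the memory savings).

```python
def local_window(position, seq_len, left, right):
    """Return the inclusive (start, end) frame indices visible from `position`."""
    start = max(0, position - left)
    end = min(seq_len - 1, position + right)
    return start, end


def local_attention_mask(seq_len, left, right):
    """Build a boolean mask where mask[i][j] is True if frame i may attend to frame j."""
    mask = []
    for i in range(seq_len):
        start, end = local_window(i, seq_len, left, right)
        mask.append([start <= j <= end for j in range(seq_len)])
    return mask


if __name__ == "__main__":
    # Visualize a tiny sequence: "x" marks frames that are visible from each row.
    for row in local_attention_mask(seq_len=6, left=2, right=1):
        print("".join("x" if allowed else "." for allowed in row))
```

Each row contains at most left + right + 1 visible frames, which is the window size w used in the complexity estimates.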
Context Size Selection
The context size determines the trade-off between memory usage and model accuracy:
| Context Size | Memory Usage | Accuracy | Use Case |
|---|---|---|---|
| (128, 128) | Low | Good | Memory-constrained devices |
| (256, 256) | Medium | Better | Recommended default |
| (512, 512) | High | Best | When memory allows |
Memory Savings Example
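As a rough illustration, the sketch below estimates the size of one layer's attention score matrix under full and local attention. The encoder frame rate (12.5 frames per second), the head count (8), and the 2-byte element size are assumptions chosen for illustration, not values measured from Parakeet MLX, and the estimate ignores weights and other activations.

```python
def attention_scores_bytes(n_frames, n_heads, window=None, bytes_per_element=2):
    """Estimate bytes needed for one layer's attention score matrix.

    window=None models full attention (n x n scores per head);
    otherwise each frame scores at most `window` neighbours.
    """
    keys_per_query = n_frames if window is None else min(window, n_frames)
    return n_frames * keys_per_query * n_heads * bytes_per_element


# Illustrative assumptions, not values taken from Parakeet MLX.
FRAMES_PER_SECOND = 12.5   # assumed encoder frame rate
N_HEADS = 8                # assumed number of attention heads
LEFT, RIGHT = 256, 256     # the (256, 256) context size

n_frames = int(60 * 60 * FRAMES_PER_SECOND)   # one hour of audio
window = LEFT + RIGHT + 1                     # neighbours plus the frame itself

full = attention_scores_bytes(n_frames, N_HEADS)
local = attention_scores_bytes(n_frames, N_HEADS, window=window)

print(f"frames:          {n_frames}")
print(f"full attention:  {full / 1e9:.2f} GB per layer")
print(f"local attention: {local / 1e9:.2f} GB per layer")
print(f"reduction:       {full / local:.1f}x")
```

Under these assumptions, one hour of audio is 45,000 frames, and the score matrix shrinks from about 32.4 GB to about 0.37 GB per layer, a reduction of roughly n / w.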
Technical Implementation
Local attention in Parakeet MLX uses custom Metal kernels for efficient computation on Apple Silicon.
Combining with Chunking
Switching Back to Full Attention
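A minimal sketch of toggling between the two modes is shown below. It assumes the encoder exposes a `set_attention_model(name, context_size)` method that accepts the NeMo-style names `"rel_pos_local_attn"` and `"rel_pos"`; treat those names, and the two wrapper functions (which are hypothetical helpers, not part of the library), as assumptions to verify against the version you have installed.

```python
def use_local_attention(model, context_size=(256, 256)):
    """Switch the encoder to windowed attention with (left, right) context frames."""
    # Assumed API: encoder.set_attention_model(name, context_size)
    model.encoder.set_attention_model("rel_pos_local_attn", context_size)


def use_full_attention(model):
    """Switch the encoder back to full relative-position attention."""
    # Assumed API: the name "rel_pos" selects full attention
    model.encoder.set_attention_model("rel_pos")


# Example usage (assumes parakeet-mlx is installed and a model is available):
#   from parakeet_mlx import from_pretrained
#   model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v2")
#   use_local_attention(model, (256, 256))
#   result = model.transcribe("long_audio.wav")
#   use_full_attention(model)
```

Wrapping the calls in small functions keeps the attention mode names in one place, so they are easy to adjust if the library's naming differs.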
CLI Usage
Local attention can also be enabled from the command line.
Performance Characteristics
Time Complexity
- Full attention: O(n² · d) where n is sequence length, d is feature dimension
- Local attention: O(n · w · d) where w is context window size
Memory Complexity
- Full attention: O(n² · h) where h is number of heads
- Local attention: O(n · w · h)
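To see how these bounds scale, the sketch below counts approximate multiply-adds for the attention scores as the sequence grows. The feature dimension is an assumed value for illustration, and constant factors are ignored; the point is only that doubling n quadruples the full-attention cost but merely doubles the local-attention cost.

```python
def attention_cost(n, d, w=None):
    """Approximate multiply-adds for attention scores: n*n*d (full) or n*w*d (local)."""
    keys_per_query = n if w is None else min(w, n)
    return n * keys_per_query * d


FEATURE_DIM = 1024        # assumed feature dimension, for illustration only
WINDOW = 256 + 256 + 1    # (256, 256) context plus the current frame

for n in (2_000, 4_000, 8_000, 16_000):
    full = attention_cost(n, FEATURE_DIM)
    local = attention_cost(n, FEATURE_DIM, WINDOW)
    print(f"n={n:>6}  full={full:.2e}  local={local:.2e}  ratio={full / local:.1f}x")
```

The same n / w ratio applies to the memory bounds, with the number of heads h in place of d.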
Accuracy Impact
- For most speech, a local context of 256 frames on each side is sufficient
- Accuracy degradation compared to full attention is minimal
- Very long-range dependencies, which are rare in speech, may be affected
Best Practices
- Default Context: Use (256, 256) for most applications
- Long Audio: Enable local attention for files longer than 30 minutes
- Memory Constraints: Reduce context size to (128, 128) if needed
- Quality Critical: Increase to (512, 512) for maximum accuracy
- Benchmarking: Test on your specific audio to find optimal settings
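These guidelines can be encoded as a small decision helper. The function below is a hypothetical sketch, not part of Parakeet MLX, and the order in which the rules are applied (memory constraints first, then audio length, then quality) is an assumption; benchmark on your own audio before relying on it.

```python
def recommend_context(duration_minutes, memory_constrained=False, quality_critical=False):
    """Return a (left, right) context size, or None to keep full attention.

    Hypothetical helper that encodes the guidance above; it is not a library API.
    """
    if memory_constrained:
        return (128, 128)          # smallest window, lowest memory use
    if duration_minutes <= 30:
        return None                # short audio: full attention is affordable
    if quality_critical:
        return (512, 512)          # largest window when memory allows
    return (256, 256)              # recommended default for long audio


if __name__ == "__main__":
    cases = [
        (10, {}),
        (90, {}),
        (90, {"quality_critical": True}),
        (90, {"memory_constrained": True}),
    ]
    for minutes, options in cases:
        print(minutes, options, "->", recommend_context(minutes, **options))
```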
Related
- Beam Decoding - Improve transcription quality
- Sentence Splitting - Control output segmentation
- Low-Level API - Direct feature extraction