High-Level Architecture
Core Modules
speak-mintlify is organized into several core modules, each with a specific responsibility:
1. Configuration Manager (config.ts)
Purpose: Unifies configuration from multiple sources with clear priority.
Priority Order:
- CLI flags (highest priority)
- Environment variables
- speaker-config.yaml file
- Default values (lowest priority)
Key Functions:
- loadSpeakerConfig() - Loads voice and component settings from YAML
- resolveConfig() - Merges all config sources and validates required fields
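The priority order above can be read as a layered merge in which later layers win. The following is a minimal sketch of that idea, not the actual config.ts: the SpeakerSettings fields and the mergeConfigLayers and resolveSettingsSketch helpers are hypothetical names chosen for illustration.

```typescript
// Hypothetical settings shape; the real fields in config.ts may differ.
type SpeakerSettings = {
  apiKey?: string;
  bucket?: string;
  voices?: string[];
};

// Assumed defaults, used only when no other layer supplies a value.
const DEFAULT_SETTINGS: SpeakerSettings = { voices: [] };

// Merge layers from lowest to highest priority. Undefined values are skipped
// so an unset CLI flag never erases a value from a lower-priority layer.
function mergeConfigLayers(...layers: SpeakerSettings[]): SpeakerSettings {
  const merged: Record<string, unknown> = {};
  for (const layer of layers) {
    for (const key of Object.keys(layer)) {
      const value = (layer as Record<string, unknown>)[key];
      if (value !== undefined) {
        merged[key] = value;
      }
    }
  }
  return merged as SpeakerSettings;
}

// Priority: CLI flags > environment variables > YAML file > defaults.
function resolveSettingsSketch(
  cli: SpeakerSettings,
  env: SpeakerSettings,
  yaml: SpeakerSettings,
): SpeakerSettings {
  return mergeConfigLayers(DEFAULT_SETTINGS, yaml, env, cli);
}
```

Passing the layers in ascending priority keeps the rule in one place: whichever layer is listed last wins for every key it defines.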
2. Content Extractor (extractor.ts)
Purpose: Converts MDX files into clean, TTS-friendly text.
Processing Pipeline:
Key Features:
- Uses unified/remark for AST-based processing
- Removes imports, exports, and JSX components
- Preserves paragraph structure and readability
- Cleans up orphaned punctuation
Key Functions:
- extractCleanText() - Main text extraction function
- findFrontmatterEnd() - Locates frontmatter boundary
- extractFrontmatter() - Parses YAML frontmatter
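To illustrate what locating the frontmatter boundary involves, here is a small line-scanning sketch. It is an assumption about the behaviour, not the extractor.ts implementation (which works on the remark AST); the function name findFrontmatterEndSketch and its return convention are hypothetical.

```typescript
// Returns the index of the first character after the closing `---` line,
// or 0 when the document does not start with a frontmatter block.
function findFrontmatterEndSketch(source: string): number {
  const lines = source.split("\n");
  const first = lines[0] ?? "";
  if (first.trim() !== "---") {
    return 0;
  }
  // Track the character offset just past each line (including its newline).
  let offset = first.length + 1;
  for (let i = 1; i < lines.length; i++) {
    const line = lines[i] ?? "";
    offset += line.length + 1;
    if (line.trim() === "---") {
      return Math.min(offset, source.length);
    }
  }
  // An opening fence with no closing fence is treated as "no frontmatter".
  return 0;
}
```

Slicing the source at the returned index yields the body that would then be handed to the text extraction step.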
3. Hash Tracker (hash-tracker.ts)
Purpose: Tracks content changes to avoid regenerating unchanged files.
Metadata Storage:
Key Functions:
- generateHash() - Creates SHA-256 hash of content
- loadMetadata() - Loads .audio-metadata.json
- saveMetadata() - Persists metadata to disk
- hasContentChanged() - Compares hashes to detect changes
- updateMetadata() - Updates metadata for a file
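The change-detection idea can be sketched with Node's built-in crypto module. The metadata shape below (a map from file path to a stored hash) is an assumption made for illustration; the actual layout of .audio-metadata.json is not shown in this document, and the function names carry a Sketch suffix to mark them as hypothetical.

```typescript
import { createHash } from "node:crypto";

// Assumed metadata shape: file path -> last known content hash.
type AudioMetadata = Record<string, { hash: string }>;

// SHA-256 of the content, hex-encoded (64 characters).
function generateHashSketch(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

// A file counts as changed when it has no stored entry
// or when its current hash differs from the stored one.
function hasContentChangedSketch(
  metadata: AudioMetadata,
  filePath: string,
  content: string,
): boolean {
  const previous = metadata[filePath];
  return previous === undefined || previous.hash !== generateHashSketch(content);
}
```

Treating a missing entry as "changed" means new files are picked up without a separate code path.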
4. Fish Audio Client (fish-api.ts)
Purpose: Wrapper around the Fish Audio SDK for TTS generation.
Key Features:
- Automatic retry logic with exponential backoff (3 retries)
- Parallel voice generation
- Buffer management for audio data
Key Functions:
- generateTTS() - Generates audio for a single voice
- generateMultipleVoices() - Generates audio for multiple voices in parallel
- createFishAudioClient() - Factory function
5. S3 Uploader (s3-upload.ts)
Purpose: Handles uploads to S3-compatible storage (AWS S3, Cloudflare R2, MinIO).
File Organization:
Key Functions:
- uploadAudio() - Uploads single audio file
- uploadMultipleVoices() - Uploads multiple files in parallel
- listAllAudioFiles() - Lists all files with pagination
- deleteMultiple() - Batch deletes files
- extractKeyFromUrl() - Converts public URL to S3 key
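Converting a public URL back to an object key could look like the sketch below. It assumes the key is simply the URL path that follows a configured public base URL; the real extractKeyFromUrl() may use different rules, and both the function name and the publicBaseUrl parameter are hypothetical.

```typescript
// Assumption: public URL = public base URL + "/" + object key.
function extractKeyFromUrlSketch(publicUrl: string, publicBaseUrl: string): string {
  const url = new URL(publicUrl);
  const base = new URL(publicBaseUrl);
  if (url.origin !== base.origin) {
    throw new Error(`URL ${publicUrl} is not served from ${base.origin}`);
  }
  // Normalize the base path so "/audio" and "/audio/" behave the same.
  const basePath = base.pathname.endsWith("/") ? base.pathname : `${base.pathname}/`;
  if (!url.pathname.startsWith(basePath)) {
    throw new Error(`URL ${publicUrl} is outside the base path ${basePath}`);
  }
  // Keys are stored unescaped, so undo the percent-encoding from the URL.
  return decodeURIComponent(url.pathname.slice(basePath.length));
}
```

Rejecting URLs from another origin or base path avoids deleting objects that were never uploaded under this configuration.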
6. Component Injector (injector.ts)
Purpose: Injects audio player components into MDX files using AST manipulation.
Injection Strategy:
- Parse MDX content into AST
- Locate import section (after frontmatter)
- Find first content node (heading/paragraph)
- Insert import statement
- Insert hash comment and component
Key Functions:
- injectAudioComponent() - Adds audio component to MDX
- extractExistingAudioData() - Reads existing component from AST
- hasAudioComponent() - Checks if component exists
- removeAudioComponent() - Removes component from MDX
7. Utilities (utils.ts)
Purpose: Common file operations and MDX discovery.
Key Functions:
- findMDXFiles() - Discovers MDX files using glob patterns
- readFile() / writeFile() - File I/O operations
- loadSpeakIgnore() - Loads .speakignore patterns
8. Validators (validators.ts)
Purpose: Validates configuration for each command.
Key Functions:
- validateGenerateConfig() - Ensures Fish API key and voices are configured
- validateCleanupConfig() - Validates S3 configuration
Data Flow
Generate Command Flow
Cleanup Command Flow
Error Handling
speak-mintlify implements robust error handling at multiple levels:
Retry Logic
Fish Audio API calls use p-retry with:
- 3 retry attempts
- Exponential backoff
- Console warnings on failed attempts
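The retry behaviour listed above can be illustrated with a hand-rolled helper. This sketch deliberately does not use p-retry, so it stays dependency-free; the retryWithBackoff name, the base delay, and the doubling factor are assumptions rather than the project's actual settings.

```typescript
// Runs `task`, retrying up to `retries` times after the first failure.
// The wait doubles after each failed attempt: base, 2x base, 4x base, ...
async function retryWithBackoff<T>(
  task: () => Promise<T>,
  retries = 3,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (error) {
      if (attempt >= retries) {
        throw error; // retries exhausted: surface the last error
      }
      const delayMs = baseDelayMs * 2 ** attempt;
      console.warn(`Attempt ${attempt + 1} failed; retrying in ${delayMs} ms`);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

With the default of 3 retries the task runs at most four times in total: the initial attempt plus three retries.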
Validation
Configuration validation happens early:
- Missing required fields throw descriptive errors
- Provides guidance on how to set missing values
- Validates voice ID/name array lengths match
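The checks above might be collected into a single early validation step, as in this sketch. The GenerateConfig field names and the validateGenerateConfigSketch function are hypothetical; only the rules themselves (API key present, voices configured, ID and name arrays of equal length) come from the text.

```typescript
// Hypothetical config shape for the generate command.
type GenerateConfig = {
  fishApiKey?: string;
  voiceIds?: string[];
  voiceNames?: string[];
};

// Collects every problem before throwing, so the user can fix them in one pass.
function validateGenerateConfigSketch(config: GenerateConfig): void {
  const problems: string[] = [];
  const ids = config.voiceIds ?? [];
  const names = config.voiceNames ?? [];

  if (!config.fishApiKey) {
    problems.push("Fish API key is missing; pass it as a CLI flag or set it in the environment.");
  }
  if (ids.length === 0) {
    problems.push("No voices configured; add at least one voice ID.");
  }
  if (ids.length !== names.length) {
    problems.push(`Voice ID count (${ids.length}) does not match voice name count (${names.length}).`);
  }
  if (problems.length > 0) {
    throw new Error(problems.join("\n"));
  }
}
```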
File Operations
File I/O errors are handled gracefully:
- Missing .audio-metadata.json returns empty metadata
- Missing speaker-config.yaml uses defaults
- Missing .speakignore uses default patterns
Performance Optimizations
Parallel Processing
- Multiple voices generated in parallel per file
- Multiple audio uploads to S3 in parallel
- Batch deletion of S3 objects (up to 1000 at once)
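The 1000-object figure matches the per-request limit of the S3 DeleteObjects operation, so larger deletions have to be split into batches. A minimal sketch of that splitting step follows; chunkKeys is a hypothetical helper, not a function from s3-upload.ts.

```typescript
// Splits a list of object keys into batches of at most `batchSize` keys,
// so each batch fits in a single delete request.
function chunkKeys(keys: string[], batchSize = 1000): string[][] {
  if (batchSize < 1) {
    throw new RangeError("batchSize must be at least 1");
  }
  const batches: string[][] = [];
  for (let i = 0; i < keys.length; i += batchSize) {
    batches.push(keys.slice(i, i + batchSize));
  }
  return batches;
}
```

Each resulting batch can then be sent as one delete request, either sequentially or in parallel.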
Change Detection
- SHA-256 hashing prevents unnecessary regeneration
- Only changed files are processed
- Metadata cached in .audio-metadata.json
AST-Based Processing
- Efficient MDX parsing with unified/remark
- Precise component injection and extraction
- No regex-based text manipulation
File Structure
Extension Points
speak-mintlify is designed to be extensible:
Custom Storage Backends
The S3Uploader class can be extended to support other storage providers.
Custom TTS Providers
The FishAudioClient interface can be implemented for other TTS services.
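As a rough illustration of this extension point, the sketch below defines a stand-in provider and generates several voices in parallel. The TTSClientLike interface, its generateTTS signature, and the EchoTTSClient class are all assumptions; the real FishAudioClient interface may expose different methods and types.

```typescript
// Hypothetical provider contract; the real FishAudioClient interface may differ.
interface TTSClientLike {
  generateTTS(text: string, voiceId: string): Promise<Uint8Array>;
}

// Stand-in provider for local testing: returns UTF-8 bytes instead of audio.
class EchoTTSClient implements TTSClientLike {
  async generateTTS(text: string, voiceId: string): Promise<Uint8Array> {
    return new TextEncoder().encode(`${voiceId}:${text}`);
  }
}

// Generates audio for every voice concurrently and keys the results by voice ID.
async function generateAllVoices(
  client: TTSClientLike,
  text: string,
  voiceIds: string[],
): Promise<Map<string, Uint8Array>> {
  const entries = await Promise.all(
    voiceIds.map(async (voiceId) => [voiceId, await client.generateTTS(text, voiceId)] as const),
  );
  return new Map(entries);
}
```

Because callers depend only on the interface, swapping in another TTS service would mean writing one new class rather than touching the generation flow.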
Custom Components
The component injector supports custom audio player components.
Dependencies
Key dependencies and their purposes:
- unified/remark - MDX parsing and AST manipulation
- fish-audio - Official Fish Audio SDK
- @aws-sdk/client-s3 - S3-compatible storage
- p-retry - Automatic retry logic
- glob - File pattern matching
- commander - CLI framework
Security Considerations
Secrets Management
- API keys and credentials via environment variables
- No secrets stored in speaker-config.yaml
- S3 credentials use the AWS SDK’s secure credential chain
Content Safety
- Frontmatter and imports stripped from TTS content
- JSX components removed to prevent code injection
- File paths normalized to prevent directory traversal
Access Control
- S3 bucket permissions managed externally
- Public URLs configured independently
- No built-in authentication (relies on S3/CDN)
Best Practices
- Use .speakignore to exclude sensitive or auto-generated files
- Set up a CDN for S3 public URLs to improve performance
- Use environment variables for credentials (never commit them)
- Run with --dry-run first to preview changes
- Keep speaker-config.yaml in version control (no secrets)
- Add .audio-metadata.json to .gitignore if preferred
Future Architecture Considerations
Potential enhancements for future versions:
- Plugin system for custom extractors and injectors
- Streaming TTS generation for large documents
- Multi-language support with language detection
- Audio caching service to avoid regeneration
- Webhook integration for CI/CD pipelines
- Real-time progress tracking with WebSockets
