ReadRealm converts book text into audio using Azure OpenAI TTS (the tts deployment). Two HTTP endpoints handle on-demand and streaming audio generation. A separate real-time speech feature, backed by Azure Cognitive Services Realtime, is WebSocket-based.
Configuration
The TTS and real-time speech services are configured through environment variables:
| Variable | Used by | Description |
|---|---|---|
| AZURE_API_TTS_KEY | TTS service | API key for the Azure OpenAI TTS deployment |
| AZURE_API_TTS_ENDPOINT | TTS service | Endpoint URL for the Azure OpenAI TTS deployment |
| AZURE_API_TTS_MODEL | TTS service | Deployment name (e.g. tts) |
| AZURE_API_REALTIME_KEY | Real-time speech | API key for the Azure Realtime speech deployment |
| AZURE_API_REALTIME_ENDPOINT | Real-time speech | Endpoint URL for the Azure Realtime speech gateway |
| AZURE_API_REALTIME_MODEL | Real-time speech | Deployment name for real-time voice |
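The variables in the table can be checked at startup so that a missing key fails fast rather than surfacing later as a 500 from Azure. The following is a minimal sketch, not ReadRealm's actual code: `readDeployment`, `AzureDeploymentConfig`, and `Env` are hypothetical names introduced for illustration, and only the variable names come from the table.

```typescript
// Hypothetical helper types; only the environment variable names come from the docs.
type Env = Record<string, string | undefined>;

interface AzureDeploymentConfig {
  apiKey: string;
  endpoint: string;
  model: string;
}

/**
 * Read the KEY / ENDPOINT / MODEL triple for one service and throw if any
 * value is missing or blank.
 */
function readDeployment(
  env: Env,
  prefix: "AZURE_API_TTS" | "AZURE_API_REALTIME",
): AzureDeploymentConfig {
  const get = (suffix: string): string => {
    const name = `${prefix}_${suffix}`;
    const value = env[name];
    if (value === undefined || value.trim() === "") {
      throw new Error(`Missing required environment variable: ${name}`);
    }
    return value;
  };
  return { apiKey: get("KEY"), endpoint: get("ENDPOINT"), model: get("MODEL") };
}
```

In a Node process the helper could be called as `readDeployment(process.env, "AZURE_API_TTS")`. Inside a NestJS provider, the same lookups would normally go through `ConfigService.get()` instead of reading `process.env` directly.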
These variables are read through the NestJS ConfigService. Map them in your environment or .env file before starting the API.
GET /book/tts/stream/:title
Looks up a book by title on Gutendex, fetches its plain-text content, and streams the TTS audio as a chunked audio/mpeg response. The connection stays open until the full audio is piped through.
Path parameters
title: URL-encoded book title. The server decodes it before querying Gutendex (e.g. alice%20in%20wonderland).
Response headers
| Header | Value |
|---|---|
| Content-Type | audio/mpeg |
| Transfer-Encoding | chunked |
| Cache-Control | no-cache |
| Content-Disposition | inline |
Response body
A binary MP3 audio stream piped directly from the Azure OpenAI response. Write it to a file or pipe it to a media player.
Error responses
| Status | Condition |
|---|---|
| 404 | No book with the given title found on Gutendex, or book has no plain-text content |
| 500 | Azure TTS call failed or stream error occurred |
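On the client side, consuming this endpoint comes down to two steps: percent-encoding the title into the path, and draining the chunked body. The sketch below illustrates both under stated assumptions: `DEFAULT_BASE_URL`, `buildTtsStreamUrl`, and `collectAudio` are hypothetical names, and the base URL is a placeholder for wherever the API is actually hosted.

```typescript
// Hypothetical base URL; replace it with the real host of the ReadRealm API.
const DEFAULT_BASE_URL = "http://localhost:3000";

/** Build the streaming URL, percent-encoding the title for use as a path segment. */
function buildTtsStreamUrl(title: string, baseUrl: string = DEFAULT_BASE_URL): string {
  const trimmedBase = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return `${trimmedBase}/book/tts/stream/${encodeURIComponent(title)}`;
}

/** Drain a chunked audio stream and concatenate the chunks into one buffer. */
async function collectAudio(chunks: AsyncIterable<Uint8Array>): Promise<Uint8Array> {
  const parts: Uint8Array[] = [];
  let total = 0;
  for await (const chunk of chunks) {
    parts.push(chunk);
    total += chunk.length;
  }
  const out = new Uint8Array(total);
  let offset = 0;
  for (const part of parts) {
    out.set(part, offset);
    offset += part.length;
  }
  return out;
}
```

With Node 18 or later, the body of a `fetch` response is async-iterable and could be passed to `collectAudio` after checking the status for 404 or 500. For long books, writing each chunk to a file or player as it arrives avoids holding the whole MP3 in memory.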
POST /book/ebook
Generates TTS audio from a book body supplied directly in the request. Use this when you already have the book text and do not need the server to fetch it from Gutendex. If the textData field is an empty string, the server substitutes a default CreateBookDto instance before invoking Azure TTS.
Request body
Numeric book ID.
Author name.
Book title.
Publication year.
Page count.
Cover image URL.
Genre.
textData: Plain-text content to synthesise. Only the first 4,096 characters are sent to Azure. If empty, a default book object is used.
Response
A binary audio response body (audio/mpeg) returned by the Azure OpenAI TTS API. The voice used is alloy and the output format is mp3.
Error responses
| Status | Condition |
|---|---|
| 404 | textData is absent and the fallback CreateBookDto also has no text |
| 500 | Azure TTS call failed |
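The 4,096-character limit, the alloy voice, and the mp3 format together determine what is sent to Azure. The sketch below shows one way such a payload could be assembled. It is an assumption about the request shape, based on the field names of the OpenAI speech API (`model`, `input`, `voice`, `response_format`), and `buildSpeechRequest` and `MAX_TTS_CHARS` are hypothetical names rather than ReadRealm's own.

```typescript
// Limit stated in the docs: only the first 4,096 characters are synthesised.
const MAX_TTS_CHARS = 4096;

// Assumed payload shape, following the OpenAI speech API field names.
interface SpeechRequest {
  model: string;
  voice: string;
  input: string;
  response_format: string;
}

/** Truncate the book text and pair it with the documented voice and format. */
function buildSpeechRequest(textData: string, deployment: string): SpeechRequest {
  return {
    model: deployment,
    voice: "alloy",
    // slice() counts UTF-16 code units, which matches "characters" for most text.
    input: textData.slice(0, MAX_TTS_CHARS),
    response_format: "mp3",
  };
}
```

Because anything beyond the limit is dropped, a client that needs a full book read aloud would have to split the text and call the endpoint once per segment.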
Real-time speech (WebSocket)
ReadRealm also supports two-way real-time voice powered by the Azure Cognitive Services Realtime API. This uses a WebSocket connection managed by the SpeechRealtimeService:
- The client streams raw PCM audio buffers to the server.
- The server forwards them to Azure in fixed-size chunks (4,800 bytes at 24 kHz, mono).
- Azure performs server-side VAD (voice activity detection) and uses whisper-1 for transcription.
- Audio responses (response.audio.delta) and transcript deltas are forwarded back to the client in real time via the socket.
- On stop, the session audio is saved to MP3 using FFmpeg (libmp3lame, 128 kbps).
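The fixed chunk size in the list above can be illustrated with a small slicing helper. This is a sketch under assumptions, not the SpeechRealtimeService implementation: `chunkPcm` and `chunkDurationMs` are hypothetical names, and the audio is assumed to be 16-bit PCM (2 bytes per sample), which the page does not state. Under that assumption, 4,800 bytes is 2,400 samples, or 100 ms of mono audio at 24 kHz.

```typescript
const CHUNK_BYTES = 4800; // chunk size stated in the docs

/** Split a PCM buffer into fixed-size chunks; the last chunk may be shorter. */
function chunkPcm(pcm: Uint8Array, chunkBytes: number = CHUNK_BYTES): Uint8Array[] {
  const chunks: Uint8Array[] = [];
  for (let offset = 0; offset < pcm.length; offset += chunkBytes) {
    // subarray() creates a view over the same memory, so nothing is copied.
    chunks.push(pcm.subarray(offset, Math.min(offset + chunkBytes, pcm.length)));
  }
  return chunks;
}

/** Duration of a mono chunk in milliseconds. bytesPerSample = 2 assumes 16-bit PCM. */
function chunkDurationMs(bytes: number, sampleRateHz: number, bytesPerSample: number = 2): number {
  return (bytes * 1000) / (bytesPerSample * sampleRateHz);
}
```

This sketch emits the remainder as a short final chunk. A real service might instead hold it back until more audio arrives, so that every forwarded chunk is exactly 4,800 bytes.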
For connection setup, event names, and payload formats, see WebSockets.
WebSocket events overview
| Direction | Event | Description |
|---|---|---|
| Server → Client | audio | Base64-encoded audio delta or "Session start" / "clear" control strings |
| Server → Client | transcript | Incremental transcript text or status markers |
| Server → Client | state | Input state change: 0 (Working), 1 (ReadyToStart), 2 (ReadyToStop) |
| Server → Client | error | Error message string |
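A client has to distinguish the two kinds of audio payload and map the numeric state values to their names. The sketch below shows one way to do that using only the values in the table. `classifyAudioPayload`, `describeState`, and the `AudioPayload` type are hypothetical names, and the real payload format may carry more structure than a bare string.

```typescript
// Control strings and state values as listed in the events table.
const AUDIO_CONTROL_STRINGS = ["Session start", "clear"];

const INPUT_STATES: Record<number, string> = {
  0: "Working",
  1: "ReadyToStart",
  2: "ReadyToStop",
};

type AudioPayload =
  | { kind: "control"; value: string }
  | { kind: "audio"; base64: string };

/** Separate control markers from Base64-encoded audio deltas. */
function classifyAudioPayload(payload: string): AudioPayload {
  if (AUDIO_CONTROL_STRINGS.indexOf(payload) !== -1) {
    return { kind: "control", value: payload };
  }
  return { kind: "audio", base64: payload };
}

/** Map a numeric state event to its name, or "Unknown" for unlisted values. */
function describeState(state: number): string {
  const name = INPUT_STATES[state];
  return name === undefined ? "Unknown" : name;
}
```

A handler for the audio event could decode and queue the `audio` variant for playback, and reset its playback buffer when it sees the `clear` control string.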
WebSocket reference
Full connection flow, client events, and payload schemas for both chat and real-time speech.