ReadRealm turns any book in the catalog into spoken audio using Azure Cognitive Services. You can stream a full book as audio directly from the API, or engage in a real-time voice conversation powered by Azure’s realtime speech model. Voice features make ReadRealm useful for commuters, people with visual impairments, and anyone who wants to listen rather than read.

Stream audio by title

GET a continuous audio stream for any book — no download required.

Generate audio from text

POST book text directly to get back an MP3 audio response.

Real-time speech

Two-way voice interaction over a WebSocket connection, backed by the Azure realtime API.

Accessibility

Designed for hands-free reading and listeners with visual impairments.

Streaming audio by title

The quickest way to listen to a book is to call the TTS stream endpoint with the book’s title. ReadRealm looks up the book on Project Gutenberg, fetches its full text, and pipes the audio back to you as a continuous MP3 stream.
GET /book/tts/stream/{title}
Example:
curl -o moby-dick.mp3 \
  "http://localhost:3000/book/tts/stream/Moby%20Dick"
Response headers:
| Header | Value |
| --- | --- |
| Content-Type | audio/mpeg |
| Transfer-Encoding | chunked |
| Cache-Control | no-cache |
| Content-Disposition | inline |
Audio is delivered with chunked transfer encoding, so your client can begin playback before the entire book has been synthesized. This matters for long texts, where generating the full audio upfront would take too long.
TTS requires the book to have plain-text content available on Project Gutenberg. If no text is found, the API returns a 404 with "No text content available for this book".
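The request above can also be made from code. The following is a minimal sketch, assuming a runtime with a global fetch (Node 18+ or a browser) and the local base URL used in the curl example. The helper names buildStreamUrl and streamBookAudio, and the onChunk callback, are illustrative and not part of the ReadRealm API.

```javascript
// Build the stream URL. The title is percent-encoded so spaces and
// punctuation survive as a single path segment.
function buildStreamUrl(baseUrl, title) {
  const trimmedBase = baseUrl.replace(/\/+$/, ''); // drop trailing slashes
  return `${trimmedBase}/book/tts/stream/${encodeURIComponent(title)}`;
}

// Read the chunked MP3 response piece by piece instead of buffering it all.
// onChunk is a hypothetical callback that receives each Uint8Array as it arrives.
async function streamBookAudio(baseUrl, title, onChunk) {
  const response = await fetch(buildStreamUrl(baseUrl, title));
  if (!response.ok) {
    // For example, a 404 when no plain-text content exists for the book.
    throw new Error(`TTS stream failed with status ${response.status}`);
  }
  const reader = response.body.getReader();
  let totalBytes = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    totalBytes += value.byteLength;
    onChunk(value);
  }
  return totalBytes; // number of audio bytes received
}
```

Calling streamBookAudio('http://localhost:3000', 'Moby Dick', chunk => player.append(chunk)) would hand each chunk to a player as it arrives, where player stands in for whatever audio sink you use.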

Generating audio from text

You can also send your own text content and receive synthesized audio back. This is useful when you want to listen to a passage, a review, or any custom text.
POST /book/ebook
Request body:
{
  "id": 0,
  "title": "",
  "author": "",
  "publicationDate": 0,
  "numOfPages": 0,
  "coverImage": "",
  "genre": "",
  "textData": "It was the best of times, it was the worst of times."
}
Only the textData field is used for audio generation. The other fields are required by the schema but do not affect the output. If textData is empty, a default book is used.
The voice used for synthesis is Alloy (an Azure OpenAI voice), and audio is encoded as MP3. Text is processed in chunks of up to 4,096 characters.
Response: A readable audio stream (audio/mpeg).
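As a rough illustration, the request could be sent like this. This is a sketch under assumptions: it assumes the endpoint accepts a JSON body with a Content-Type of application/json, and that a global fetch is available. The helper names buildEbookRequest and synthesizeText are hypothetical.

```javascript
// Fill the schema-required fields with the same placeholder values shown
// in the request body above; only textData carries real content.
function buildEbookRequest(textData) {
  return {
    id: 0,
    title: '',
    author: '',
    publicationDate: 0,
    numOfPages: 0,
    coverImage: '',
    genre: '',
    textData,
  };
}

// POST the text and collect the MP3 bytes from the response.
// Assumption: the server accepts application/json for this route.
async function synthesizeText(baseUrl, textData) {
  const response = await fetch(`${baseUrl}/book/ebook`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildEbookRequest(textData)),
  });
  if (!response.ok) {
    throw new Error(`Audio generation failed with status ${response.status}`);
  }
  return new Uint8Array(await response.arrayBuffer());
}
```

For short passages, buffering the whole response as above is usually fine. For long text, reading the body incrementally is the better fit.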

How audio streaming works

1. Fetch book text

ReadRealm retrieves the book’s plain-text content from Project Gutenberg using the book title. The text is cleaned (line breaks normalized, extra spaces removed) before being sent to Azure.

2. Send to Azure

The text is sent to Azure OpenAI’s TTS API using the alloy voice and mp3 response format. Text longer than 4,096 characters is chunked before processing.

3. Pipe the audio stream

The API streams the audio response directly to your client using chunked transfer encoding. The Content-Type is audio/mpeg, so any audio player or <audio> element can consume it.

4. End of stream

When the audio is fully generated, the stream ends and the connection closes cleanly. If an error occurs before the response headers have been sent, the server responds with a 500.
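The cleaning and chunking described in the first two steps can be sketched as follows. This is not the server's actual code: the exact cleaning rules and the choice to split on word boundaries are assumptions, and only the 4,096-character limit comes from the text above.

```javascript
const MAX_CHUNK_CHARS = 4096; // limit stated in the steps above

// Assumed cleaning: unify line endings, then collapse every run of
// whitespace (including line breaks) into a single space.
function cleanText(raw) {
  return raw.replace(/\r\n?/g, '\n').replace(/\s+/g, ' ').trim();
}

// Split text into pieces of at most maxChars characters.
// Assumption: cutting at the last space in the window keeps words whole.
function chunkText(text, maxChars = MAX_CHUNK_CHARS) {
  const chunks = [];
  let rest = text;
  while (rest.length > maxChars) {
    let cut = rest.lastIndexOf(' ', maxChars);
    if (cut <= 0) cut = maxChars; // no usable space: hard cut
    const piece = rest.slice(0, cut).trim();
    if (piece) chunks.push(piece);
    rest = rest.slice(cut).trim();
  }
  if (rest.length > 0) chunks.push(rest);
  return chunks;
}
```

Each chunk would then be sent to the TTS API in order, with the resulting audio written to the response as it comes back.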

Real-time speech

ReadRealm includes a two-way real-time voice interface powered by Azure’s realtime speech model. Connect with a Socket.IO client over the WebSocket transport to start a session in which you speak and receive both transcript and audio responses live.

Connecting

import { io } from 'socket.io-client';

const socket = io('http://localhost:3000', {
  transports: ['websocket'],
});

socket.on('connectionStatus', ({ connected }) => {
  console.log('Connected:', connected);
});

Starting a session

Send a start event with a system message (instructions for the AI) and a temperature value:
socket.emit('start', {
  systemMessage: 'You are a helpful reading assistant for ReadRealm.',
  temperature: 0.8,
});
You will receive a sessionStatus event confirming the session is active.

Sending audio

Send audio chunks as base64-encoded strings:
socket.emit('sendAudio', {
  audio: base64EncodedAudioChunk,
});
Audio chunks are accumulated on the server until a minimum buffer size (4,800 bytes) is reached, then forwarded to Azure in real time. Azure uses server-side VAD (voice activity detection) to determine when you have finished speaking.
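To make the buffering behavior concrete, here is an illustrative sketch of accumulate-then-forward logic. It is not ReadRealm's server code: the AudioAccumulator class and the forward callback are hypothetical, and only the 4,800-byte threshold comes from the description above. It uses Node's global Buffer.

```javascript
const MIN_BUFFER_BYTES = 4800; // threshold stated above

class AudioAccumulator {
  // forward is a hypothetical callback that receives one combined Buffer.
  constructor(forward, minBytes = MIN_BUFFER_BYTES) {
    this.forward = forward;
    this.minBytes = minBytes;
    this.pending = [];
    this.pendingBytes = 0;
  }

  // Decode one base64 chunk and forward once enough bytes have built up.
  push(base64Chunk) {
    const bytes = Buffer.from(base64Chunk, 'base64');
    this.pending.push(bytes);
    this.pendingBytes += bytes.length;
    if (this.pendingBytes >= this.minBytes) this.flush();
  }

  // Forward whatever is buffered, for example when the session stops.
  flush() {
    if (this.pendingBytes === 0) return;
    this.forward(Buffer.concat(this.pending));
    this.pending = [];
    this.pendingBytes = 0;
  }
}
```

On the client, this suggests that very small chunks are fine to send: they are batched before reaching Azure rather than forwarded one by one.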

Stopping a session

socket.emit('stop');
The server flushes any remaining audio to Azure, saves the session audio as an MP3 file, and emits a sessionStatus event with { active: false }.

Real-time events reference

Events you send

| Event | Payload | Description |
| --- | --- | --- |
| start | { systemMessage, temperature } | Start a new speech session |
| sendAudio | { audio: string } | Send a base64 audio chunk |
| stop | (none) | End the session |

Events you receive

| Event | Payload | Description |
| --- | --- | --- |
| connectionStatus | { connected: true } | Emitted on initial connection |
| sessionStatus | { active: boolean } | Session started or stopped |
| transcript | string | Live text transcription of speech |
| audio | string (base64 delta) | Audio response chunk from AI |
| state | InputState | Session ready state |
| error | string | Error message |
The transcript event streams incrementally as speech is recognized. Assemble the deltas in order to build the complete transcript.
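One way to assemble the deltas is sketched below. The ResponseCollector class and attachCollector helper are hypothetical names. The sketch assumes each audio delta is independently valid base64, so that deltas can be decoded one at a time, and it uses Node's global Buffer.

```javascript
// Collects transcript and audio deltas in arrival order.
class ResponseCollector {
  constructor() {
    this.transcriptParts = [];
    this.audioParts = [];
  }
  addTranscript(delta) {
    this.transcriptParts.push(delta);
  }
  addAudio(base64Delta) {
    // Assumption: every delta decodes on its own.
    this.audioParts.push(Buffer.from(base64Delta, 'base64'));
  }
  get transcript() {
    return this.transcriptParts.join('');
  }
  get audio() {
    return Buffer.concat(this.audioParts);
  }
}

// Wire a collector to any object with an on(event, handler) method,
// such as the socket created in the Connecting section.
function attachCollector(socket) {
  const collector = new ResponseCollector();
  socket.on('transcript', (delta) => collector.addTranscript(delta));
  socket.on('audio', (delta) => collector.addAudio(delta));
  return collector;
}
```

After the session stops, collector.transcript holds the full text and collector.audio holds the concatenated response audio.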

Use cases

| Use case | Feature to use |
| --- | --- |
| Commuting: listen to a full book | GET /book/tts/stream/{title} |
| Accessibility: hands-free reading | GET /book/tts/stream/{title} or POST /book/ebook |
| Interactive reading assistant | Real-time speech WebSocket |
| Listen to a specific passage | POST /book/ebook with custom textData |
