Voice integration

ReadRealm turns any book in the catalog into spoken audio using Azure Cognitive Services. You can stream a full book as audio directly from the API, or engage in a real-time voice conversation powered by Azure’s realtime speech model. Voice features make ReadRealm useful for commuters, people with visual impairments, and anyone who wants to listen rather than read.

Stream audio by title

GET a continuous audio stream for any book — no download required.

Generate audio from text

POST book text directly to get back an MP3 audio response.

Real-time speech

Two-way voice interaction over a WebSocket connection, backed by Azure’s realtime API.

Accessibility

Designed for hands-free reading and listeners with visual impairments.

Streaming audio by title

The quickest way to listen to a book is to call the TTS stream endpoint with the book’s title. ReadRealm looks up the book on Project Gutenberg, fetches its full text, and pipes the audio back to you as a continuous MP3 stream.

GET /book/tts/stream/{title}

Example:

curl -o moby-dick.mp3 \
  "http://localhost:3000/book/tts/stream/Moby%20Dick"

Response headers:

Header	Value
`Content-Type`	`audio/mpeg`
`Transfer-Encoding`	`chunked`
`Cache-Control`	`no-cache`
`Content-Disposition`	`inline`

Audio is delivered with chunked transfer encoding, so your client can begin playback before the entire book has been synthesized. This matters for long texts, where generating the full audio up front would take too long.

TTS requires the book to have plain-text content available on Project Gutenberg. If no text is found, the API returns a 404 with "No text content available for this book".
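As a rough client-side sketch of the request above, the TypeScript below builds the stream URL and writes the MP3 to disk as chunks arrive, using the fetch built into Node 18+. The helper names (buildStreamUrl, downloadBook) and the base URL are assumptions made for illustration; only the route and the 404 behaviour come from this page.

```typescript
import { createWriteStream } from 'node:fs';
import { Readable } from 'node:stream';
import { pipeline } from 'node:stream/promises';

// Hypothetical helper: percent-encode the title and append it to the documented route.
function buildStreamUrl(baseUrl: string, title: string): string {
  return `${baseUrl}/book/tts/stream/${encodeURIComponent(title)}`;
}

// Hypothetical helper: save the chunked MP3 response to a file without buffering it all in memory.
async function downloadBook(baseUrl: string, title: string, outPath: string): Promise<void> {
  const res = await fetch(buildStreamUrl(baseUrl, title));
  if (res.status === 404) {
    // Documented case: the book has no plain-text content on Project Gutenberg.
    throw new Error(`No text content available for "${title}"`);
  }
  if (!res.ok || !res.body) {
    throw new Error(`TTS stream failed with status ${res.status}`);
  }
  // Convert the web stream to a Node stream; the cast bridges DOM and Node stream typings.
  await pipeline(Readable.fromWeb(res.body as any), createWriteStream(outPath));
}
```

Calling downloadBook('http://localhost:3000', 'Moby Dick', 'moby-dick.mp3') would mirror the curl example, assuming the server is running locally.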

Generating audio from text

You can also send your own text content and receive synthesized audio back. This is useful when you want to listen to a passage, a review, or any custom text.

POST /book/ebook

Request body:

{
  "id": 0,
  "title": "",
  "author": "",
  "publicationDate": 0,
  "numOfPages": 0,
  "coverImage": "",
  "genre": "",
  "textData": "It was the best of times, it was the worst of times."
}

Only the textData field is used for audio generation. The other fields are required by the schema but do not affect the output. If textData is empty, a default book is used.

The voice used for synthesis is Alloy (an Azure OpenAI voice), and audio is encoded as MP3. Text is processed in chunks of up to 4,096 characters.

Response: a readable audio stream (audio/mpeg).
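A minimal sketch of calling this endpoint from TypeScript follows. It assumes the body is sent as JSON with a Content-Type of application/json, which this page does not state explicitly; the interface and helper names (EbookRequest, buildEbookRequest, synthesize) are hypothetical.

```typescript
// Shape of the request body shown above.
interface EbookRequest {
  id: number;
  title: string;
  author: string;
  publicationDate: number;
  numOfPages: number;
  coverImage: string;
  genre: string;
  textData: string;
}

// Hypothetical helper: fill the schema-required fields with placeholders,
// since only textData affects the generated audio.
function buildEbookRequest(textData: string): EbookRequest {
  return {
    id: 0,
    title: '',
    author: '',
    publicationDate: 0,
    numOfPages: 0,
    coverImage: '',
    genre: '',
    textData,
  };
}

// Hypothetical helper: POST the text and collect the MP3 bytes.
// Assumption: the endpoint accepts a JSON body.
async function synthesize(baseUrl: string, textData: string): Promise<Uint8Array> {
  const res = await fetch(`${baseUrl}/book/ebook`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildEbookRequest(textData)),
  });
  if (!res.ok) {
    throw new Error(`Audio generation failed with status ${res.status}`);
  }
  return new Uint8Array(await res.arrayBuffer());
}
```

For short passages, reading the whole response with arrayBuffer() is simple enough; for longer text, streaming the body to a file or player is likely a better fit.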

How audio streaming works

Fetch book text

ReadRealm retrieves the book’s plain-text content from Project Gutenberg using the book title. The text is cleaned — line breaks normalized, extra spaces removed — before being sent to Azure.

Send to Azure

The text is sent to Azure OpenAI’s TTS API using the alloy voice and mp3 response format. Text longer than 4,096 characters is chunked before processing.

Pipe the audio stream

The API streams the audio response directly to your client using chunked transfer encoding. The Content-Type is audio/mpeg so any audio player or <audio> element can consume it.

End of stream

When the audio is fully generated, the stream ends and the connection closes cleanly. If an error occurs before any response headers have been sent, the server responds with a 500.
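The cleaning and chunking steps above could look roughly like the sketch below. The exact cleaning rules and the way the server splits text are not documented here, so both functions are assumptions: cleanText normalizes line endings and collapses repeated spaces, and chunkText cuts at the last space before the 4,096-character limit so that words stay intact.

```typescript
const MAX_CHUNK_CHARS = 4096;

// Assumed cleaning rules: convert CRLF/CR line endings to LF and collapse runs of spaces or tabs.
function cleanText(raw: string): string {
  return raw.replace(/\r\n?/g, '\n').replace(/[ \t]+/g, ' ').trim();
}

// Assumed split strategy: pieces of at most `limit` characters, breaking at a space when possible.
function chunkText(text: string, limit: number = MAX_CHUNK_CHARS): string[] {
  if (limit < 1) {
    throw new RangeError('limit must be at least 1');
  }
  const chunks: string[] = [];
  let rest = text;
  while (rest.length > limit) {
    // Last space at or before the limit; fall back to a hard cut if there is none.
    const cut = rest.lastIndexOf(' ', limit);
    const end = cut > 0 ? cut : limit;
    chunks.push(rest.slice(0, end));
    rest = rest.slice(end).trimStart();
  }
  if (rest.length > 0) {
    chunks.push(rest);
  }
  return chunks;
}
```

Each chunk would then be sent to the TTS API in order, and the resulting audio written to the response as it comes back.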

Real-time speech

ReadRealm includes a two-way real-time voice interface powered by Azure’s realtime speech model. Connect via WebSocket to start a session where you can speak and receive both transcript and audio responses back live.

Connecting

import { io } from 'socket.io-client';

const socket = io('http://localhost:3000', {
  transports: ['websocket'],
});

socket.on('connectionStatus', ({ connected }) => {
  console.log('Connected:', connected);
});

Starting a session

Send a start event with a system message (instructions for the AI) and a temperature value:

socket.emit('start', {
  systemMessage: 'You are a helpful reading assistant for ReadRealm.',
  temperature: 0.8,
});

You will receive a sessionStatus event confirming the session is active.

Sending audio

Send audio chunks as base64-encoded strings:

socket.emit('sendAudio', {
  audio: base64EncodedAudioChunk,
});

Audio chunks are accumulated on the server until a minimum buffer size (4,800 bytes) is reached, then forwarded to Azure in real time. Azure uses server-side VAD (voice activity detection) to determine when you have finished speaking.
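To illustrate the accumulate-then-forward behaviour described above, here is a small sketch of such a buffer. It is not the server's actual code: the class name AudioAccumulator, the forward callback, and the choice to count decoded bytes (rather than base64 characters) against the 4,800-byte threshold are all assumptions.

```typescript
const MIN_BUFFER_BYTES = 4800;

// Hypothetical class: collects decoded audio chunks and releases them in batches.
class AudioAccumulator {
  private pending: Buffer[] = [];
  private size = 0;
  private readonly forward: (audio: Buffer) => void;

  constructor(forward: (audio: Buffer) => void) {
    this.forward = forward;
  }

  // Decode one base64 chunk and forward the batch once the threshold is reached.
  push(base64Chunk: string): void {
    const bytes = Buffer.from(base64Chunk, 'base64');
    this.pending.push(bytes);
    this.size += bytes.length;
    if (this.size >= MIN_BUFFER_BYTES) {
      this.flush();
    }
  }

  // Forward whatever is buffered, for example when the session stops.
  flush(): void {
    if (this.size === 0) {
      return;
    }
    this.forward(Buffer.concat(this.pending));
    this.pending = [];
    this.size = 0;
  }
}
```

Batching like this avoids sending many tiny frames upstream while keeping latency low, since 4,800 bytes is only a fraction of a second of audio at typical speech sample rates.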

Stopping a session

socket.emit('stop');

The server flushes any remaining audio to Azure, saves the session audio as an MP3 file, and emits a sessionStatus event with { active: false }.

Real-time events reference

Events you send

Event	Payload	Description
`start`	`{ systemMessage, temperature }`	Start a new speech session
`sendAudio`	`{ audio: string }`	Send a base64 audio chunk
`stop`	—	End the session

Events you receive

Event	Payload	Description
`connectionStatus`	`{ connected: true }`	Emitted on initial connection
`sessionStatus`	`{ active: boolean }`	Session started or stopped
`transcript`	`string`	Live text transcription of speech
`audio`	`string` (base64 delta)	Audio response chunk from AI
`state`	`InputState`	Session ready state
`error`	`string`	Error message

The transcript event streams incrementally as speech is recognized. Assemble the deltas in order to build the complete transcript.
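One way to assemble the deltas is a small accumulator like the sketch below. The class name TranscriptAssembler is hypothetical, and the sketch assumes deltas can be joined directly with no separator, which this page does not confirm.

```typescript
// Hypothetical helper: concatenates transcript deltas in arrival order.
class TranscriptAssembler {
  private parts: string[] = [];

  // Add one delta and return the transcript so far.
  append(delta: string): string {
    this.parts.push(delta);
    return this.text;
  }

  // The full transcript assembled from all deltas received so far.
  get text(): string {
    return this.parts.join('');
  }

  // Clear state, for example when a new session starts.
  reset(): void {
    this.parts = [];
  }
}

// Usage with the socket from "Connecting" (shown as comments, not executed here):
// const transcript = new TranscriptAssembler();
// socket.on('transcript', (delta: string) => {
//   console.log(transcript.append(delta));
// });
```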

Use cases

Use case	Feature to use
Commuting — listen to a full book	`GET /book/tts/stream/{title}`
Accessibility — hands-free reading	`GET /book/tts/stream/{title}` or `POST /book/ebook`
Interactive reading assistant	Real-time speech WebSocket
Listen to a specific passage	`POST /book/ebook` with custom `textData`
