
The TTS module in foliate-js converts e-book content to SSML (Speech Synthesis Markup Language) documents. It does not produce audio itself — instead, its methods return complete <speak> XML strings that you pass to the browser’s Web Speech API, a cloud synthesizer, or any other speech engine. The module understands the structure of the document well enough to walk through it block by block, maintain language markup, and preserve phoneme annotations written with ssml:ph and ssml:alphabet attributes.

Initializing TTS for the current section

Call view.initTTS() to prepare the TTS engine for whichever section is currently loaded:
await view.initTTS(granularity, highlight)
Parameters:
  • granularity ('word' | 'sentence', default 'word'): how the text is segmented. 'word' moves word by word; 'sentence' moves sentence by sentence.
  • highlight ((range: Range) => void, default: scroll to the range): callback invoked with a DOM Range each time the current word or sentence changes. Use this to visually highlight the spoken text.
view.initTTS() creates a TTS instance and assigns it to view.tts. If TTS is already initialized for the current section (i.e., the section has not changed), calling it again is a no-op.
Call view.initTTS() again after the user navigates to a new section. The TTS instance is tied to a single section document.
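As a minimal sketch, initialization might be wrapped in a helper like the following (the helper name prepareTTS and the logging callback are illustrative, not part of foliate-js):

```javascript
// Minimal init sketch: prepare word-level TTS for the current
// section and register a highlight callback. The helper name and
// the logging callback are illustrative, not part of foliate-js.
async function prepareTTS(view) {
    await view.initTTS('word', range => {
        // called each time the spoken word changes
        console.log('speaking:', range.toString())
    })
    return view.tts
}
```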

The TTS class methods

After calling view.initTTS(), use view.tts to control playback:
view.tts.next(paused?) → string | undefined
    Advances to the next block and returns an SSML string for that block. Returns undefined when the end of the section is reached. Pass true for paused to trigger the highlight callback for the block without synthesizing.
view.tts.prev(paused?) → string | undefined
    Moves to the previous block and returns its SSML string.
view.tts.start() → string | undefined
    Returns the SSML string for the first block (or resumes from the last known mark position). Call this when starting playback from the beginning of the section.
view.tts.resume() → string | undefined
    Returns the SSML string for the current block, starting from the last mark that was set. Use this after a pause to resume mid-block.
view.tts.from(range) → string | undefined
    Returns the SSML string for the block that contains the given DOM Range, starting from the position within that block that best matches the range. Useful for starting TTS from a user’s text selection.
view.tts.setMark(mark) → void
    Called by your speech engine’s boundary/mark event handler to tell the TTS module which <mark> element was just reached. This keeps the highlight callback in sync with the synthesizer’s progress. mark is the name string from the SSML <mark> element.
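Putting these methods together, a playback loop over a section might look like the following sketch. Here speak is a stand-in for whatever engine you use; it should resolve once the SSML for one block has finished playing:

```javascript
// Playback loop sketch: walk the section block by block. `speak` is
// a stand-in for your engine (it should resolve when one block has
// finished playing); `view` is an open foliate-view element.
async function playSection(view, speak) {
    let ssml = view.tts.start()
    while (ssml) {
        await speak(ssml)
        ssml = view.tts.next()
    }
}
```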

The SSML output format

Each call to next(), prev(), start(), resume(), or from() returns a serialized XML string like:
<speak xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en">
  <mark name="0"/>This is the first word
  <mark name="1"/> of the sentence.
</speak>
Each <mark> element names a word or sentence boundary (depending on granularity). The marks correspond directly to the DOM ranges tracked internally, so setMark(name) can fire the highlight callback with the matching range. The module also preserves EPUB content semantics:
  • <em> and <strong> become SSML <emphasis>
  • <br> becomes SSML <break>
  • lang attributes become SSML xml:lang
  • ssml:ph attributes become SSML <phoneme ph="...">
  • ssml:alphabet attributes set the alphabet attribute on <phoneme> elements
There is no support for PLS lexicons or CSS Speech properties.
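For engines that cannot parse SSML, you can recover the per-mark text segments yourself. The following is a naive sketch that splits a <speak> string on its <mark> elements with a regular expression; it assumes the self-closing <mark name="…"/> form shown above, and a real implementation should use an XML parser instead:

```javascript
// Naive sketch: split a <speak> string into plain-text segments
// keyed by mark name, for engines that cannot parse SSML. Assumes
// self-closing <mark name="..."/> elements as in the example above;
// a real implementation should use an XML parser (e.g. DOMParser).
const stripTags = s => s.replace(/<[^>]+>/g, '')

function splitByMarks(ssml) {
    const body = ssml
        .replace(/^[\s\S]*?<speak[^>]*>/, '')
        .replace(/<\/speak>\s*$/, '')
    const segments = []
    const re = /<mark name="([^"]+)"\s*\/>/g
    let match, name = null, from = 0
    while ((match = re.exec(body))) {
        if (name !== null)
            segments.push({ mark: name, text: stripTags(body.slice(from, match.index)) })
        name = match[1]
        from = re.lastIndex
    }
    if (name !== null)
        segments.push({ mark: name, text: stripTags(body.slice(from)) })
    return segments
}
```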

Example: integrating with the Web Speech API

import './foliate-js/view.js'

const view = document.createElement('foliate-view')
document.body.append(view)
await view.open('book.epub')

// Track the range currently being spoken; apply your own visual
// highlight here (e.g. via a CSS class or the CSS Custom Highlight
// API), clearing the previous one first
let activeRange = null
const highlight = range => {
    activeRange = range
}

await view.initTTS('word', highlight)

const synth = window.speechSynthesis
let utterance = null
let playing = false

function speakSSML(ssml) {
    if (!ssml) {
        playing = false
        return
    }

    // The Web Speech API does not natively parse SSML,
    // so extract plain text for the utterance.
    const parser = new DOMParser()
    const doc = parser.parseFromString(ssml, 'application/xml')
    const text = doc.documentElement.textContent

    utterance = new SpeechSynthesisUtterance(text)
    utterance.lang = doc.documentElement.getAttributeNS(
        'http://www.w3.org/XML/1998/namespace', 'lang') ?? 'en'

    let markIndex = 0
    utterance.onboundary = e => {
        // The Web Speech API does not support SSML marks; as a
        // rough proxy, advance one mark per word boundary (mark
        // names are sequential integers, as in the example above)
        if (e.name === 'word') view.tts.setMark(String(markIndex++))
    }

    utterance.onend = () => {
        if (playing) {
            const next = view.tts.next()
            speakSSML(next)
        }
    }

    synth.speak(utterance)
}

function startTTS() {
    playing = true
    speakSSML(view.tts.start())
}

function stopTTS() {
    playing = false
    synth.cancel()
}

function pauseTTS() {
    playing = false
    synth.pause()
}

function resumeTTS() {
    playing = true
    synth.resume()
}

document.getElementById('play').addEventListener('click', startTTS)
document.getElementById('stop').addEventListener('click', stopTTS)
document.getElementById('pause').addEventListener('click', pauseTTS)
document.getElementById('resume').addEventListener('click', resumeTTS)
Cloud TTS providers such as Google Cloud Text-to-Speech and Amazon Polly accept SSML natively, so you can pass the full string returned by next() directly. This gives you proper mark events and phoneme pronunciation support.

Using a cloud TTS service with full SSML

async function speakWithCloud(ssml) {
    if (!ssml) return

    const response = await fetch('https://your-tts-endpoint/synthesize', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ ssml }),
    })

    const audioBlob = await response.blob()
    const audio = new Audio(URL.createObjectURL(audioBlob))

    // Listen for mark events if your cloud provider streams them
    audio.addEventListener('ended', () => {
        speakWithCloud(view.tts.next())
    })

    await audio.play()
}

await view.initTTS('sentence', range => {
    // scroll to the sentence being spoken; startContainer is usually
    // a Text node, which has no scrollIntoView, so use its parent
    range.startContainer.parentElement?.scrollIntoView({ block: 'nearest' })
})

speakWithCloud(view.tts.start())
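If your provider reports when each <mark> is reached (for example, Google Cloud Text-to-Speech can return timepoints carrying a markName and timeSeconds per mark), you can forward them to setMark() as the audio plays. A sketch, with the timepoint shape assumed from Google's API:

```javascript
// Sketch: collect the marks whose scheduled time has passed, so a
// 'timeupdate' handler can forward them to view.tts.setMark(). The
// timepoint shape ({ markName, timeSeconds }) follows Google Cloud
// Text-to-Speech; adapt it to your provider.
function dueMarks(timepoints, currentTime) {
    const due = []
    for (const tp of timepoints) {
        if (!tp.done && tp.timeSeconds <= currentTime) {
            tp.done = true
            due.push(tp.markName)
        }
    }
    return due
}

// Browser wiring (illustrative):
// audio.addEventListener('timeupdate', () =>
//     dueMarks(timepoints, audio.currentTime)
//         .forEach(name => view.tts.setMark(name)))
```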
view.tts is scoped to the currently loaded section. When the reader moves to a new section, re-initialize:
view.addEventListener('load', async () => {
    if (playing) {
        synth.cancel()
        await view.initTTS('word', highlight)
        speakSSML(view.tts.start())
    }
})
