
The TTS module in foliate-js converts e-book content to SSML (Speech Synthesis Markup Language) documents. It does not produce audio itself — instead, its methods return complete <speak> XML strings that you pass to the browser’s Web Speech API, a cloud synthesizer, or any other speech engine. The module understands the structure of the document well enough to walk through it block by block, maintain language markup, and preserve phoneme annotations written with ssml:ph and ssml:alphabet attributes.

Initializing TTS for the current section

Call view.initTTS() to prepare the TTS engine for whichever section is currently loaded:
await view.initTTS(granularity, highlight)
Parameters:
  • granularity ('word' | 'sentence', default 'word'): how the text is segmented. 'word' moves word by word; 'sentence' moves sentence by sentence.
  • highlight ((range: Range) => void, default: scroll to the range): callback invoked with a DOM Range each time the current word or sentence changes. Use this to visually highlight the spoken text.
view.initTTS() creates a TTS instance and assigns it to view.tts. If TTS is already initialized for the current section (i.e., the section has not changed), calling it again is a no-op.
Call view.initTTS() again after the user navigates to a new section. The TTS instance is tied to a single section document.
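As a minimal sketch, initialization might be wrapped in a helper like the following (the helper name prepareTTS and the logging callback are illustrative, not part of foliate-js):

```javascript
// Minimal init sketch: prepare word-level TTS for the current
// section and register a highlight callback. The helper name and
// the logging callback are illustrative, not part of foliate-js.
async function prepareTTS(view) {
    await view.initTTS('word', range => {
        // called each time the spoken word changes
        console.log('speaking:', range.toString())
    })
    return view.tts
}
```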

The TTS class methods

After calling view.initTTS(), use view.tts to control playback:
view.tts.next(paused?) → string | undefined
    Advances to the next block and returns an SSML string for that block. Returns undefined when the end of the section is reached. Pass true for paused to trigger the highlight callback for the block without synthesizing.
view.tts.prev(paused?) → string | undefined
    Moves to the previous block and returns its SSML string.
view.tts.start() → string | undefined
    Returns the SSML string for the first block (or resumes from the last known mark position). Call this when starting playback from the beginning of the section.
view.tts.resume() → string | undefined
    Returns the SSML string for the current block, starting from the last mark that was set. Use this after a pause to resume mid-block.
view.tts.from(range) → string | undefined
    Returns the SSML string for the block that contains the given DOM Range, starting from the position within that block that best matches the range. Useful for starting TTS from a user’s text selection.
view.tts.setMark(mark) → void
    Called by your speech engine’s boundary/mark event handler to tell the TTS module which <mark> element was just reached. This keeps the highlight callback in sync with the synthesizer’s progress. mark is the name string from the SSML <mark> element.
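Putting these methods together, a playback loop over a section might look like the following sketch. Here speak is a stand-in for whatever engine you use; it should resolve once the SSML for one block has finished playing:

```javascript
// Playback loop sketch: walk the section block by block. `speak` is
// a stand-in for your engine (it should resolve when one block has
// finished playing); `view` is an open foliate-view element.
async function playSection(view, speak) {
    let ssml = view.tts.start()
    while (ssml) {
        await speak(ssml)
        ssml = view.tts.next()
    }
}
```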

The SSML output format

Each call to next(), prev(), start(), resume(), or from() returns a serialized XML string like:
<speak xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en">
  <mark name="0"/>This is the first word
  <mark name="1"/> of the sentence.
</speak>
Each <mark> element names a word or sentence boundary (depending on granularity). The marks correspond directly to the DOM ranges tracked internally, so setMark(name) can fire the highlight callback with the matching range. The module also preserves EPUB content semantics:
  • <em> and <strong> become SSML <emphasis>
  • <br> becomes SSML <break>
  • lang attributes become SSML xml:lang
  • ssml:ph attributes become SSML <phoneme ph="...">
  • ssml:alphabet attributes set the alphabet attribute on <phoneme> elements
There is no support for PLS lexicons or CSS Speech properties.
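For engines that cannot parse SSML, you can recover the per-mark text segments yourself. The following is a naive sketch that splits a <speak> string on its <mark> elements with a regular expression; it assumes the self-closing <mark name="…"/> form shown above, and a real implementation should use an XML parser instead:

```javascript
// Naive sketch: split a <speak> string into plain-text segments
// keyed by mark name, for engines that cannot parse SSML. Assumes
// self-closing <mark name="..."/> elements as in the example above;
// a real implementation should use an XML parser (e.g. DOMParser).
const stripTags = s => s.replace(/<[^>]+>/g, '')

function splitByMarks(ssml) {
    const body = ssml
        .replace(/^[\s\S]*?<speak[^>]*>/, '')
        .replace(/<\/speak>\s*$/, '')
    const segments = []
    const re = /<mark name="([^"]+)"\s*\/>/g
    let match, name = null, from = 0
    while ((match = re.exec(body))) {
        if (name !== null)
            segments.push({ mark: name, text: stripTags(body.slice(from, match.index)) })
        name = match[1]
        from = re.lastIndex
    }
    if (name !== null)
        segments.push({ mark: name, text: stripTags(body.slice(from)) })
    return segments
}
```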

Example: integrating with the Web Speech API

import './foliate-js/view.js'

const view = document.createElement('foliate-view')
document.body.append(view)
await view.open('book.epub')

// Track the range currently being spoken; apply your own visual
// highlight here (e.g. via a CSS class or the CSS Custom Highlight
// API), clearing the previous one first
let activeRange = null
const highlight = range => {
    activeRange = range
}

await view.initTTS('word', highlight)

const synth = window.speechSynthesis
let utterance = null
let playing = false

function speakSSML(ssml) {
    if (!ssml) {
        playing = false
        return
    }

    // The Web Speech API does not natively parse SSML,
    // so extract plain text for the utterance.
    const parser = new DOMParser()
    const doc = parser.parseFromString(ssml, 'application/xml')
    const text = doc.documentElement.textContent

    utterance = new SpeechSynthesisUtterance(text)
    utterance.lang = doc.documentElement.getAttributeNS(
        'http://www.w3.org/XML/1998/namespace', 'lang') ?? 'en'

    let markIndex = 0
    utterance.onboundary = e => {
        // The Web Speech API does not support SSML marks; as a
        // rough proxy, advance one mark per word boundary (mark
        // names are sequential integers, as in the example above)
        if (e.name === 'word') view.tts.setMark(String(markIndex++))
    }

    utterance.onend = () => {
        if (playing) {
            const next = view.tts.next()
            speakSSML(next)
        }
    }

    synth.speak(utterance)
}

function startTTS() {
    playing = true
    speakSSML(view.tts.start())
}

function stopTTS() {
    playing = false
    synth.cancel()
}

function pauseTTS() {
    playing = false
    synth.pause()
}

function resumeTTS() {
    playing = true
    synth.resume()
}

document.getElementById('play').addEventListener('click', startTTS)
document.getElementById('stop').addEventListener('click', stopTTS)
document.getElementById('pause').addEventListener('click', pauseTTS)
document.getElementById('resume').addEventListener('click', resumeTTS)
Cloud TTS providers such as Google Cloud Text-to-Speech and Amazon Polly accept SSML natively, so you can pass the full string returned by next() directly. This gives you proper mark events and phoneme pronunciation support.

Using a cloud TTS service with full SSML

async function speakWithCloud(ssml) {
    if (!ssml) return

    const response = await fetch('https://your-tts-endpoint/synthesize', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ ssml }),
    })

    const audioBlob = await response.blob()
    const audio = new Audio(URL.createObjectURL(audioBlob))

    // Listen for mark events if your cloud provider streams them
    audio.addEventListener('ended', () => {
        speakWithCloud(view.tts.next())
    })

    await audio.play()
}

await view.initTTS('sentence', range => {
    // scroll to the sentence being spoken; startContainer is usually
    // a Text node, which has no scrollIntoView, so use its parent
    range.startContainer.parentElement?.scrollIntoView({ block: 'nearest' })
})

speakWithCloud(view.tts.start())
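If your provider reports when each <mark> is reached (for example, Google Cloud Text-to-Speech can return timepoints carrying a markName and timeSeconds per mark), you can forward them to setMark() as the audio plays. A sketch, with the timepoint shape assumed from Google's API:

```javascript
// Sketch: collect the marks whose scheduled time has passed, so a
// 'timeupdate' handler can forward them to view.tts.setMark(). The
// timepoint shape ({ markName, timeSeconds }) follows Google Cloud
// Text-to-Speech; adapt it to your provider.
function dueMarks(timepoints, currentTime) {
    const due = []
    for (const tp of timepoints) {
        if (!tp.done && tp.timeSeconds <= currentTime) {
            tp.done = true
            due.push(tp.markName)
        }
    }
    return due
}

// Browser wiring (illustrative):
// audio.addEventListener('timeupdate', () =>
//     dueMarks(timepoints, audio.currentTime)
//         .forEach(name => view.tts.setMark(name)))
```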
view.tts is scoped to the currently loaded section. When the reader moves to a new section, re-initialize:
view.addEventListener('load', async () => {
    if (playing) {
        synth.cancel()
        await view.initTTS('word', highlight)
        speakSSML(view.tts.start())
    }
})
