
Overview

The TTS (Text-to-Speech) class is an abstract base class for implementing speech synthesis in Vision Agents. It handles text-to-audio conversion, audio resampling, event emission, and performance tracking. Location: vision_agents.core.tts.tts.TTS

Usage

from vision_agents.core.tts.tts import TTS
from getstream.video.rtc.track_util import PcmData, AudioFormat

class MyTTS(TTS):
    async def stream_audio(self, text: str, *args, **kwargs) -> PcmData:
        # Call TTS service
        audio_bytes = await self.tts_api.synthesize(text)
        
        # Return as PcmData
        return PcmData.from_bytes(
            audio_bytes,
            sample_rate=24000,
            channels=1,
            format=AudioFormat.S16
        )
    
    async def stop_audio(self) -> None:
        # Stop any ongoing synthesis
        await self.tts_api.cancel()

Constructor

def __init__(
    self,
    provider_name: Optional[str] = None,
):
provider_name
Optional[str]
Name of the TTS provider (e.g., “openai”, “elevenlabs”, “cartesia”). If not provided, uses the class name.

Abstract Methods

stream_audio
async method
required
Convert text to speech audio data. This method must be implemented by subclasses. It can return audio in several formats:
  • Single PcmData object (entire audio)
  • Iterator of PcmData objects (streaming)
  • Async iterator of PcmData objects (async streaming)
The base TTS class handles resampling to the desired format and emits the appropriate events.
Parameters:
  • text (str): The text to convert to speech
  • *args: Additional arguments
  • **kwargs: Additional keyword arguments
Returns: PcmData | Iterator[PcmData] | AsyncIterator[PcmData]
Example (single buffer):
async def stream_audio(self, text: str, *_, **__) -> PcmData:
    response = await self.client.audio.speech.create(
        model="gpt-4o-mini-tts",
        voice="alloy",
        input=text,
        response_format="pcm",
    )
    return PcmData.from_bytes(
        response.content,
        sample_rate=24000,
        channels=1,
        format=AudioFormat.S16
    )
Example (streaming):
async def stream_audio(self, text: str, *_, **__):
    async for chunk in self.client.synthesize_stream(text):
        yield PcmData.from_bytes(
            chunk,
            sample_rate=22050,
            channels=1,
            format=AudioFormat.S16
        )
stop_audio
async method
required
Stop audio synthesis and clear any queues. This method is called when:
  • User interrupts the agent
  • Turn detection indicates user started speaking
  • Agent needs to stop talking
Implementations should:
  • Cancel ongoing synthesis requests
  • Clear internal audio buffers
  • Stop any playback tasks
Example:
async def stop_audio(self) -> None:
    if self.synthesis_task:
        self.synthesis_task.cancel()
    if self.audio_queue:
        self.audio_queue.clear()

Public Methods

send
async method
Convert text to speech and emit audio events. This is the main method used by agents. It:
  1. Calls stream_audio() to get audio
  2. Resamples audio to the desired format
  3. Emits TTSAudioEvent for each chunk
  4. Tracks performance metrics
  5. Emits TTSSynthesisCompleteEvent when done
Parameters:
  • text (str): The text to convert to speech
  • participant (Optional[Participant]): Participant to associate with the audio
  • *args: Additional arguments passed to stream_audio()
  • **kwargs: Additional keyword arguments passed to stream_audio()
Example:
# Agent calls send() to generate and emit audio
await tts.send(
    text="Hello, how can I help you?",
    participant=current_participant
)
set_output_format
method
Set the desired output audio format for emitted events. The agent should call this with its output track properties so the TTS can automatically resample audio.
Parameters:
  • sample_rate (int): Desired sample rate in Hz (e.g., 48000)
  • channels (int): Desired channel count (1 for mono, 2 for stereo). Default: 1
  • audio_format (AudioFormat): Desired audio format. Default: PCM_S16
Example:
# Agent sets output format to match audio track
tts.set_output_format(
    sample_rate=48000,
    channels=1,
    audio_format=AudioFormat.PCM_S16
)
close
async method
Close the TTS service and release resources. Override in subclasses to clean up connections, tasks, etc.

Properties

session_id
str
Unique session identifier (UUID). Automatically generated.
provider_name
str
Name of the TTS provider.
events
EventManager
Event manager for emitting TTS events.

Internal Properties

_desired_sample_rate
int
Target sample rate for output audio. Set via set_output_format(). Default: 16000
_desired_channels
int
Target channel count for output audio. Set via set_output_format(). Default: 1
_desired_format
AudioFormat
Target audio format for output audio. Set via set_output_format(). Default: PCM_S16
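Standard PCM arithmetic relates these defaults to buffer sizes: at 16 kHz, mono, S16, each sample is 2 bytes, so one second of audio is 32 000 bytes. A quick sketch (the helper name is illustrative, not part of the API):

```python
def pcm_duration_ms(
    n_bytes: int,
    sample_rate: int,
    channels: int = 1,
    bytes_per_sample: int = 2,  # S16 = signed 16-bit = 2 bytes per sample
) -> float:
    """Duration of a raw PCM buffer in milliseconds."""
    return n_bytes / (sample_rate * channels * bytes_per_sample) * 1000.0

# With the defaults above (16 kHz mono S16), 32 000 bytes is one second.
ms = pcm_duration_ms(32_000, 16_000)
```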

Plugin Example

Here’s how the OpenAI plugin implements the TTS interface:
import os
from typing import Optional

from vision_agents.core.tts.tts import TTS as BaseTTS
from getstream.video.rtc.track_util import PcmData, AudioFormat
from openai import AsyncOpenAI

class TTS(BaseTTS):
    """OpenAI Text-to-Speech implementation."""
    
    def __init__(
        self,
        *,
        api_key: Optional[str] = None,
        model: str = "gpt-4o-mini-tts",
        voice: str = "alloy",
        client: Optional[AsyncOpenAI] = None,
    ) -> None:
        super().__init__(provider_name="openai_tts")
        
        api_key = api_key or os.environ.get("OPENAI_API_KEY")
        if not api_key:
            raise ValueError("OPENAI_API_KEY not set")
        
        self.client = client or AsyncOpenAI(api_key=api_key)
        self.model = model
        self.voice = voice
    
    async def stream_audio(self, text: str, *_, **__) -> PcmData:
        """Synthesize the entire speech to a single PCM buffer.
        
        Base TTS handles resampling and event emission.
        """
        resp = await self.client.audio.speech.create(
            model=self.model,
            voice=self.voice,
            input=text,
            response_format="pcm",
        )
        
        return PcmData.from_bytes(
            resp.content,
            sample_rate=24_000,
            channels=1,
            format=AudioFormat.S16
        )
    
    async def stop_audio(self) -> None:
        # No internal playback queue; agent manages output track
        return None

Streaming Example

For services that support streaming synthesis (like Cartesia or ElevenLabs):
class StreamingTTS(BaseTTS):
    async def stream_audio(self, text: str, *_, **__):
        """Stream audio chunks as they're generated."""
        async for chunk in self.client.synthesize_stream(
            text=text,
            voice_id=self.voice_id,
            model_id=self.model_id,
        ):
            # Yield each chunk as PcmData
            yield PcmData.from_bytes(
                chunk.audio,
                sample_rate=chunk.sample_rate,
                channels=1,
                format=AudioFormat.S16
            )

Audio Resampling

The TTS base class automatically handles resampling:
  1. Your stream_audio() returns audio in the provider’s native format
  2. The agent calls set_output_format() with its output track’s requirements
  3. Each audio chunk is automatically resampled before emission
This means you don’t need to handle resampling in your implementation: just return the audio in whatever format your provider gives you.
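For intuition, here is a naive linear-interpolation resampler for mono S16 PCM. This is a sketch of the kind of rate conversion the base class performs, not its actual implementation:

```python
import struct

def resample_s16_mono(data: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Naive linear-interpolation resampler for mono signed-16-bit PCM.

    Illustrative only; the TTS base class handles this automatically.
    """
    samples = struct.unpack(f"<{len(data) // 2}h", data)
    if not samples or src_rate == dst_rate:
        return data
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate       # fractional source position
        j = int(pos)
        frac = pos - j
        a = samples[j]
        b = samples[min(j + 1, len(samples) - 1)]
        out.append(int(a + (b - a) * frac))  # interpolate between neighbors
    return struct.pack(f"<{n_out}h", *out)

# Upsampling 24 kHz audio to 48 kHz doubles the sample count.
src = struct.pack("<4h", 0, 100, 200, 300)
dst = resample_s16_mono(src, 24_000, 48_000)
```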

Performance Metrics

The TTS class automatically tracks and emits performance metrics:
  • synthesis_time_ms: Time taken to synthesize audio
  • audio_duration_ms: Duration of generated audio
  • real_time_factor: Ratio of synthesis time to audio duration
  • chunk_count: Number of audio chunks emitted
  • total_audio_bytes: Total bytes of audio generated
These metrics are included in the TTSSynthesisCompleteEvent.
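The real-time factor follows directly from the first two metrics: synthesis time divided by audio duration, where values below 1.0 mean synthesis runs faster than playback. A quick sketch using the metric names listed above:

```python
def real_time_factor(synthesis_time_ms: float, audio_duration_ms: float) -> float:
    """Ratio of synthesis time to audio duration (RTF < 1.0: faster than real time)."""
    return synthesis_time_ms / audio_duration_ms

# Synthesizing 2 s of audio in 0.5 s gives an RTF of 0.25.
rtf = real_time_factor(500.0, 2000.0)
```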

Events

TTS implementations emit the following events:
  • TTSSynthesisStartEvent - When synthesis begins
  • TTSAudioEvent - For each audio chunk (automatically emitted by base class)
  • TTSSynthesisCompleteEvent - When synthesis completes with performance metrics
  • TTSErrorEvent - When errors occur during synthesis
See Events for more details.
