Overview
The `TTS` (Text-to-Speech) class is an abstract base class for implementing speech synthesis in Vision Agents. It handles text-to-audio conversion, audio resampling, event emission, and performance tracking.

Location: `vision_agents.core.tts.tts.TTS`
Usage
Constructor
Name of the TTS provider (e.g., “openai”, “elevenlabs”, “cartesia”). If not provided, uses the class name.
Abstract Methods
Convert text to speech audio data. This method must be implemented by subclasses. It can return audio in several formats:

- A single `PcmData` object (entire audio)
- An iterator of `PcmData` objects (streaming)
- An async iterator of `PcmData` objects (async streaming)

Parameters:

- `text` (str): The text to convert to speech
- `*args`: Additional arguments
- `**kwargs`: Additional keyword arguments

Returns: `PcmData | Iterator[PcmData] | AsyncIterator[PcmData]`

Stop audio synthesis and clear any queues. This method is called when:
- User interrupts the agent
- Turn detection indicates user started speaking
- Agent needs to stop talking
Implementations should:

- Cancel ongoing synthesis requests
- Clear internal audio buffers
- Stop any playback tasks
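A minimal subclass sketch covering both abstract methods. The `PcmData` stand-in and all field names here are assumptions for illustration, not the real types from `vision_agents.core`:

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator

# Simplified stand-in for the real PcmData (fields assumed).
@dataclass
class PcmData:
    samples: bytes
    sample_rate: int = 16000
    channels: int = 1

class ExampleTTS:
    """Sketch of a subclass fulfilling the two abstract methods."""

    def __init__(self) -> None:
        self._tasks: set[asyncio.Task] = set()

    async def stream_audio(self, text: str) -> AsyncIterator[PcmData]:
        # Async-streaming form: yield one chunk per word as it is "synthesized".
        for word in text.split():
            await asyncio.sleep(0)  # stands in for a provider round trip
            yield PcmData(samples=b"\x00\x00" * len(word))

    async def stop_audio(self) -> None:
        # Cancel in-flight synthesis and drop pending work.
        for task in self._tasks:
            task.cancel()
        self._tasks.clear()
```

The single-buffer and synchronous-iterator forms differ only in the return annotation and the absence of `async`.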
Public Methods
Convert text to speech and emit audio events. This is the main method used by agents. It:

- Calls `stream_audio()` to get audio
- Resamples audio to the desired format
- Emits a `TTSAudioEvent` for each chunk
- Tracks performance metrics
- Emits a `TTSSynthesisCompleteEvent` when done

Parameters:

- `text` (str): The text to convert to speech
- `participant` (Optional[Participant]): Participant to associate with the audio
- `*args`: Additional arguments passed to `stream_audio()`
- `**kwargs`: Additional keyword arguments passed to `stream_audio()`
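This pipeline can be sketched with a simplified stand-in; apart from the documented event names, everything below (class, attributes, chunk sizes) is assumed for illustration:

```python
import time
from typing import Iterator

class MiniTTS:
    """Simplified model of the send pipeline; real resampling and event
    plumbing live in the actual base class."""

    def __init__(self) -> None:
        self.events: list[tuple[str, object]] = []

    def stream_audio(self, text: str) -> Iterator[bytes]:
        for word in text.split():
            yield b"\x00" * 320  # pretend chunk: 10 ms of 16 kHz mono s16

    def send(self, text: str) -> None:
        start = time.perf_counter()
        chunk_count = 0
        total_audio_bytes = 0
        for chunk in self.stream_audio(text):
            # The real base class resamples each chunk here first.
            self.events.append(("TTSAudioEvent", chunk))
            chunk_count += 1
            total_audio_bytes += len(chunk)
        self.events.append(("TTSSynthesisCompleteEvent", {
            "synthesis_time_ms": (time.perf_counter() - start) * 1000.0,
            "chunk_count": chunk_count,
            "total_audio_bytes": total_audio_bytes,
        }))
```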
Set the desired output audio format for emitted events. The agent should call this with its output track properties so the TTS can automatically resample audio.

Parameters:

- `sample_rate` (int): Desired sample rate in Hz (e.g., 48000)
- `channels` (int): Desired channel count (1 for mono, 2 for stereo). Default: 1
- `audio_format` (AudioFormat): Desired audio format. Default: PCM_S16
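A sketch of this contract with the documented defaults; the attribute names below are assumptions, not the library's internals:

```python
class OutputFormatHolder:
    """Stand-in showing the documented defaults and the override call."""

    def __init__(self) -> None:
        self.sample_rate = 16000       # default
        self.channels = 1              # default: mono
        self.audio_format = "PCM_S16"  # default

    def set_output_format(self, sample_rate: int, channels: int = 1,
                          audio_format: str = "PCM_S16") -> None:
        self.sample_rate = sample_rate
        self.channels = channels
        self.audio_format = audio_format

# The agent would call this with its output track properties:
holder = OutputFormatHolder()
holder.set_output_format(48000, channels=2)
```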
Close the TTS service and release resources. Override in subclasses to clean up connections, tasks, etc.
Properties
Unique session identifier (UUID). Automatically generated.
Name of the TTS provider.
Event manager for emitting TTS events.
Internal Properties
Target sample rate for output audio. Set via `set_output_format()`. Default: 16000

Target channel count for output audio. Set via `set_output_format()`. Default: 1

Target audio format for output audio. Set via `set_output_format()`. Default: PCM_S16

Plugin Example
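A hedged sketch of the general single-buffer plugin shape; the class name, client API, and sample rate below are assumptions, and the real plugin code will differ:

```python
from dataclasses import dataclass

@dataclass
class PcmData:  # simplified stand-in for the real PcmData
    samples: bytes
    sample_rate: int = 24000
    channels: int = 1

class ProviderTTS:
    """Single-buffer plugin shape: one request, one PcmData back."""

    def __init__(self, client, voice: str = "default") -> None:
        self._client = client  # provider SDK client, injected for clarity
        self._voice = voice

    def stream_audio(self, text: str) -> PcmData:
        raw = self._client.synthesize(text=text, voice=self._voice)
        return PcmData(samples=raw)

    async def stop_audio(self) -> None:
        # Nothing is buffered in this single-shot sketch.
        pass
```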
The OpenAI plugin is one implementation of this interface.

Streaming Example
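A hedged sketch of a streaming implementation; the `chunk_source` callable stands in for a provider's websocket or chunked HTTP API, and the type names are assumptions:

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator

@dataclass
class PcmData:  # simplified stand-in
    samples: bytes
    sample_rate: int = 22050
    channels: int = 1

class StreamingProviderTTS:
    """Yields PcmData chunks as the provider produces them, so playback can
    start before the full utterance is synthesized."""

    def __init__(self, chunk_source) -> None:
        # chunk_source: async callable yielding raw PCM bytes.
        self._chunk_source = chunk_source

    async def stream_audio(self, text: str) -> AsyncIterator[PcmData]:
        async for raw in self._chunk_source(text):
            yield PcmData(samples=raw)
```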
Services that support streaming synthesis (like Cartesia or ElevenLabs) can yield audio chunks as they are generated instead of returning a single buffer.

Audio Resampling
The TTS base class automatically handles resampling:

- The agent calls `set_output_format()` with its output requirements
- Your `stream_audio()` returns audio in the provider's native format
- Each audio chunk is automatically resampled before emission
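To make the rate conversion concrete, here is a naive nearest-neighbour resampler for mono s16 PCM; this is illustration only, and the base class uses a proper resampler internally:

```python
import array

def resample_s16_mono(samples: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Naive nearest-neighbour resampler for mono signed-16-bit PCM."""
    src = array.array("h")          # "h" = signed 16-bit samples
    src.frombytes(samples)
    n_out = int(len(src) * dst_rate / src_rate)
    out = array.array("h", (
        src[min(i * src_rate // dst_rate, len(src) - 1)] for i in range(n_out)
    ))
    return out.tobytes()
```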
Performance Metrics
The TTS class automatically tracks and emits performance metrics:

- `synthesis_time_ms`: Time taken to synthesize audio
- `audio_duration_ms`: Duration of generated audio
- `real_time_factor`: Ratio of synthesis time to audio duration
- `chunk_count`: Number of audio chunks emitted
- `total_audio_bytes`: Total bytes of audio generated

These metrics are included in the `TTSSynthesisCompleteEvent`.
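The derived metrics follow from the raw ones; the formulas below are assumptions consistent with the metric names, not read from the library source:

```python
def audio_duration_ms(total_audio_bytes: int, sample_rate: int = 16000,
                      channels: int = 1, bytes_per_sample: int = 2) -> float:
    """Duration implied by a byte count of s16 PCM audio."""
    frames = total_audio_bytes / (channels * bytes_per_sample)
    return frames / sample_rate * 1000.0

def real_time_factor(synthesis_time_ms: float, duration_ms: float) -> float:
    # A value below 1.0 means audio is produced faster than it plays back.
    return synthesis_time_ms / duration_ms
```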
Events
TTS implementations emit the following events:

- `TTSSynthesisStartEvent`: When synthesis begins
- `TTSAudioEvent`: For each audio chunk (automatically emitted by the base class)
- `TTSSynthesisCompleteEvent`: When synthesis completes, with performance metrics
- `TTSErrorEvent`: When errors occur during synthesis