Overview
The STT (Speech-to-Text) class is an abstract base class for implementing speech recognition in Vision Agents. It provides a standardized interface for transcribing audio streams to text, with support for partial transcripts, turn detection, and error handling.
Location: vision_agents.core.stt.stt.STT
Usage
Constructor
Name of the STT provider (e.g., “elevenlabs”, “deepgram”, “whisper”). If not provided, uses the class name.
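To illustrate the provider-name fallback described above, here is a minimal stand-alone sketch. The `provider_name` constructor parameter and the stub base class are assumptions for illustration; the real class lives at vision_agents.core.stt.stt.STT.

```python
from __future__ import annotations


class STT:
    """Minimal stub illustrating provider-name defaulting (not the real base)."""

    def __init__(self, provider_name: str | None = None) -> None:
        # Fall back to the subclass name when no provider name is given.
        self.provider_name = provider_name or type(self).__name__


class WhisperSTT(STT):
    pass


print(WhisperSTT().provider_name)                        # class name fallback
print(WhisperSTT(provider_name="whisper").provider_name)
```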
Abstract Methods
Process incoming audio data for transcription. This method is called approximately every 20 ms with new audio data. Implementations should:
- Buffer or send audio to the STT service
- Emit partial transcripts as they become available
- Emit final transcripts when speech segments complete
- Emit turn events if turn detection is supported
Parameters:
- pcm_data (PcmData): PCM audio data to process
- participant (Participant): Participant who is speaking
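The buffering pattern above can be sketched as follows. The method name `process_audio` and the stubbed `PcmData`/`Participant` types are assumptions standing in for the real framework types; a real implementation would stream audio to the STT service rather than checking a buffer length.

```python
import asyncio
from dataclasses import dataclass, field


# Stubbed framework types; the real ones come from vision_agents.core.
@dataclass
class PcmData:
    samples: bytes


@dataclass
class Participant:
    user_id: str


@dataclass
class BufferingSTT:
    """Sketch of an STT subclass; the abstract method name is an assumption."""

    buffer: bytearray = field(default_factory=bytearray)
    transcripts: list = field(default_factory=list)

    async def process_audio(self, pcm_data: PcmData, participant: Participant) -> None:
        # Called roughly every 20 ms with a fresh chunk of PCM audio.
        self.buffer.extend(pcm_data.samples)
        # A real implementation would send the buffer to the STT service and
        # emit partial/final transcript events as results come back.
        if len(self.buffer) >= 8:  # pretend a speech segment just completed
            self.transcripts.append((participant.user_id, bytes(self.buffer)))
            self.buffer.clear()


async def main():
    stt = BufferingSTT()
    alice = Participant(user_id="alice")
    for chunk in (b"1234", b"5678"):
        await stt.process_audio(PcmData(samples=chunk), alice)
    return stt.transcripts


transcripts = asyncio.run(main())
print(transcripts)  # [('alice', b'12345678')]
```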
Lifecycle Methods
Initialize the STT service and prepare for audio processing. Override this method to:
- Establish connections to STT APIs
- Start background tasks
- Initialize audio buffers
The base implementation sets self.started = True and prevents double-start.

Clear any pending audio or internal state. Called when:
- User stops speaking (turn ends)
- Agent needs to interrupt
- Conversation needs to be reset
Close the STT service and release resources. The base implementation sets self.closed = True. Override to:
- Close WebSocket connections
- Cancel background tasks
- Clean up buffers
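The lifecycle above (start, clearing pending audio, close) can be sketched with a stubbed base class. The method name `flush` for clearing pending audio, the stub base behavior, and the connection placeholder are assumptions for illustration.

```python
import asyncio


class BaseSTT:
    """Stub mirroring the documented started/closed flags (not the real base)."""

    def __init__(self):
        self.started = False
        self.closed = False

    async def start(self):
        self.started = True

    async def close(self):
        self.closed = True


class MySTT(BaseSTT):
    def __init__(self):
        super().__init__()
        self._connection = None
        self._buffer = bytearray()

    async def start(self):
        if self.started:
            return  # prevent double-start, as the base implementation does
        await super().start()
        # Establish a connection to the STT API, start background tasks, etc.
        self._connection = "ws://stt.example"  # stand-in for a real WebSocket

    def flush(self):
        # Clear pending audio when a turn ends or the agent interrupts.
        self._buffer.clear()

    async def close(self):
        # Close WebSocket connections, cancel tasks, clean up buffers.
        self._connection = None
        await super().close()


stt = MySTT()
asyncio.run(stt.start())
conn_after_start = stt._connection
stt.flush()
asyncio.run(stt.close())
print(stt.started, stt.closed, stt._connection)  # True True None
```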
Event Emission Methods
STT implementations must call these methods to emit events.

Emit a final transcript event.

Parameters:
- text (str): The transcribed text
- participant (Participant): Participant metadata
- response (TranscriptResponse): Transcription response metadata
Emit a partial (interim) transcript event.

Parameters:
- text (str): The partial transcribed text
- participant (Participant): Participant metadata
- response (TranscriptResponse): Transcription response metadata

Partial transcripts are useful for:
- Real-time UI updates
- Early turn detection
- Responsive user feedback
Emit an error event for temporary errors.

Parameters:
- error (Exception): The error that occurred
- participant (Optional[Participant]): Participant metadata
- context (str): Error context description
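The emitter signatures above can be exercised in a stand-alone sketch. The emitter method names (`_emit_transcript_event`, `_emit_partial_transcript_event`, `_emit_error_event`) and the stubbed `Participant` type are assumptions; check the class itself for the real names. Here they simply record what would be emitted.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    user_id: str


class RecordingSTT:
    """Records emitted events; emitter names are assumptions for illustration."""

    def __init__(self):
        self.emitted = []

    def _emit_transcript_event(self, text, participant, response=None):
        self.emitted.append(("final", text, participant.user_id))

    def _emit_partial_transcript_event(self, text, participant, response=None):
        self.emitted.append(("partial", text, participant.user_id))

    def _emit_error_event(self, error, participant=None, context=""):
        self.emitted.append(("error", context, str(error)))


stt = RecordingSTT()
alice = Participant(user_id="alice")
stt._emit_partial_transcript_event("hel", alice)   # interim result for the UI
stt._emit_transcript_event("hello world", alice)   # speech segment finished
stt._emit_error_event(TimeoutError("timed out"), alice, context="recv")
print(stt.emitted)
```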
Turn Detection Methods
If your STT service supports turn detection, set turn_detection = True and emit these events:
Emit an event when a user starts speaking.

Parameters:
- participant (Participant): Participant who started speaking
- confidence (Optional[float]): Confidence of turn detection (0.0-1.0). Default: 0.5
Emit an event when a user stops speaking.

Parameters:
- participant (Participant): Participant who stopped speaking
- eager_end_of_turn (bool): Whether this is an early/eager turn end. Default: False
- confidence (Optional[float]): Confidence of turn detection (0.0-1.0). Default: 0.5
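A sketch of turn-detection support, using the flag and parameters described above. The emitter method names and the stubbed `Participant` type are assumptions for illustration; only the `turn_detection` flag, the `eager_end_of_turn` flag, and the confidence defaults come from the documentation.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    user_id: str


class TurnAwareSTT:
    turn_detection = True  # advertise turn-detection support

    def __init__(self):
        self.turn_events = []

    # Emitter names below are assumptions for illustration.
    def _emit_turn_started(self, participant, confidence=0.5):
        self.turn_events.append(("started", participant.user_id, confidence))

    def _emit_turn_ended(self, participant, eager_end_of_turn=False, confidence=0.5):
        self.turn_events.append(
            ("ended", participant.user_id, eager_end_of_turn, confidence)
        )


stt = TurnAwareSTT()
bob = Participant(user_id="bob")
stt._emit_turn_started(bob, confidence=0.9)
# eager_end_of_turn=True flags an early end before silence is fully confirmed.
stt._emit_turn_ended(bob, eager_end_of_turn=True, confidence=0.7)
print(stt.turn_events)
```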
Properties
Whether the STT service has been closed.
Whether the STT service has been started.
Whether this STT implementation supports turn detection. Set to True in subclasses that support it.
Unique session identifier (UUID). Automatically generated.
Name of the STT provider.
Event manager for emitting STT events.
Plugin Example
See the ElevenLabs plugin for a complete reference implementation of the STT interface.

Events
STT implementations emit the following events:
- STTTranscriptEvent - Final transcript
- STTPartialTranscriptEvent - Partial/interim transcript
- STTErrorEvent - Errors during processing
- TurnStartedEvent - User started speaking (if turn detection supported)
- TurnEndedEvent - User stopped speaking (if turn detection supported)
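Consumers subscribe to these events through the event manager. The subscription API sketched below (a `subscribe` decorator keyed by event type) and the tiny `EventManager` stand-in are assumptions for illustration; only the event names come from the documentation.

```python
from collections import defaultdict
from dataclasses import dataclass


# Simplified event payloads; the real events carry participant/response data.
@dataclass
class STTTranscriptEvent:
    text: str


@dataclass
class STTPartialTranscriptEvent:
    text: str


class EventManager:
    """Tiny stand-in for the real event manager; subscription API is assumed."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type):
        def register(fn):
            self._handlers[event_type].append(fn)
            return fn
        return register

    def emit(self, event):
        # Dispatch to every handler registered for this event's type.
        for fn in self._handlers[type(event)]:
            fn(event)


events = EventManager()
seen = []


@events.subscribe(STTPartialTranscriptEvent)
def on_partial(event):
    seen.append(("partial", event.text))


@events.subscribe(STTTranscriptEvent)
def on_final(event):
    seen.append(("final", event.text))


events.emit(STTPartialTranscriptEvent(text="hel"))
events.emit(STTTranscriptEvent(text="hello"))
print(seen)  # [('partial', 'hel'), ('final', 'hello')]
```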