Both VoiceAgent and VideoAgent extend Node.js EventEmitter and emit events throughout the lifecycle of conversation processing. This page documents all available events, their payloads, and when they’re triggered.
Event Categories
Text Events: User input and LLM text streaming
Speech Events: TTS generation and audio output
Tool Events: Tool invocations and results
Connection Events: WebSocket lifecycle
Video Events: Frame capture and processing (VideoAgent only)
Error Events: Errors and warnings
Text Events
Events related to text input and LLM streaming output.
text
Emitted when user input is received or when the full assistant response is ready.
Payload:
{
  role: "user" | "assistant";
  text: string;
  hasImage?: boolean; // VideoAgent only
}
When:
role: "user" - After user sends text input or audio is transcribed
role: "assistant" - After LLM completes full response
Example:
agent.on("text", ({ role, text }) => {
  const prefix = role === "user" ? "👤" : "🤖";
  console.log(`${prefix} ${text}`);
});
chunk:text_delta
Emitted for each streaming text token from the LLM.
Payload:
{
  id: string;
  text: string;
}
When: During LLM streaming response, once per token.
Example:
agent.on("chunk:text_delta", ({ text }) => {
  process.stdout.write(text); // Stream to console
});
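To rebuild complete text from the stream, the deltas can be concatenated in arrival order. The sketch below is only an illustration and not part of the library: DeltaAccumulator is a hypothetical helper, and it assumes that deltas sharing an id belong to the same piece of text, which this page does not state explicitly.

```typescript
// Payload shape as documented for chunk:text_delta.
interface TextDelta {
  id: string;
  text: string;
}

// Hypothetical helper (not a library API): collects deltas per id
// and joins them in the order they were pushed.
class DeltaAccumulator {
  private readonly parts = new Map<string, string[]>();

  push({ id, text }: TextDelta): void {
    const list = this.parts.get(id) ?? [];
    list.push(text);
    this.parts.set(id, list);
  }

  // Returns the text accumulated so far for an id ("" if none was seen).
  textFor(id: string): string {
    return (this.parts.get(id) ?? []).join("");
  }
}
```

In a chunk:text_delta listener, each payload would be handed to push, and textFor could be read once the assistant text event arrives.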
chunk:reasoning_delta
Emitted for each reasoning token (for models that support reasoning).
Payload:
{
  id: string;
  text: string;
}
When: During reasoning phase of models like o1.
Example:
agent.on("chunk:reasoning_delta", ({ text }) => {
  console.log("[Reasoning]", text);
});
Speech Events
Events related to text-to-speech generation and audio output.
speech_start
Emitted when TTS generation begins.
Payload:
{
  streaming: boolean; // field taken from the example; the boolean type is an assumption
}
When: When the first text chunk is sent to the speech model.
Example:
agent.on("speech_start", ({ streaming }) => {
  console.log(`Speech started (streaming: ${streaming})`);
});
speech_complete
Emitted when all TTS chunks have been sent.
Payload:
{
  streaming: boolean; // field taken from the example; the boolean type is an assumption
}
When: After all speech chunks are generated and sent to the client.
Example:
agent.on("speech_complete", ({ streaming }) => {
  console.log("All speech chunks complete");
});
speech_interrupted
Emitted when speech generation is cancelled.
Payload:
{
  reason: string; // field taken from the example; see the common values listed here
}
Common reasons:
"interrupted" - Manual interruption via interruptSpeech()
"user_speaking" - User started speaking (barge-in)
"client_request" - Client sent interrupt message
"disconnected" - WebSocket disconnected
Example:
agent.on("speech_interrupted", ({ reason }) => {
  console.log(`Speech interrupted: ${reason}`);
});
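Because the reasons listed for this event are described as common rather than exhaustive, a handler may want a fallback branch for values it does not recognize. A minimal sketch, assuming the reason arrives as a plain string as in the example; describeInterruption is a hypothetical helper, not a library function.

```typescript
// Hypothetical helper: maps the reasons documented on this page
// to log-friendly descriptions, with a fallback for anything else.
function describeInterruption(reason: string): string {
  switch (reason) {
    case "interrupted":
      return "stopped manually via interruptSpeech()";
    case "user_speaking":
      return "user started speaking (barge-in)";
    case "client_request":
      return "client sent an interrupt message";
    case "disconnected":
      return "WebSocket disconnected";
    default:
      // Any reason not listed on this page ends up here.
      return `other reason: ${reason}`;
  }
}
```

Inside a speech_interrupted listener, the reason field would be passed straight to this function.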
speech_chunk_queued
Emitted when a text chunk enters the TTS queue.
Payload:
{
  id: number;
  text: string;
}
When: After text is split into chunks and queued for TTS generation.
Example:
agent.on("speech_chunk_queued", ({ id, text }) => {
  console.log(`Chunk ${id} queued: "${text.substring(0, 50)}..."`);
});
audio_chunk
Emitted when a single TTS chunk is ready and sent.
Payload:
{
  chunkId: number;
  data: string; // Base64-encoded audio
  format: string; // e.g., "mp3", "opus"
  text: string; // Original text for this chunk
  uint8Array: Uint8Array; // Raw audio bytes
}
When: After each chunk is generated by the speech model.
Example:
import * as fs from "node:fs";
agent.on("audio_chunk", ({ chunkId, format, uint8Array, text }) => {
  console.log(`Chunk ${chunkId}: ${uint8Array.length} bytes (${format})`);
  // Save or stream audio
  fs.writeFileSync(`chunk-${chunkId}.${format}`, uint8Array);
});
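When chunks are collected rather than played immediately, they can be joined into a single byte array. The sketch below is a hedged illustration: concatChunks is a hypothetical helper, it assumes chunkId increases in playback order (not stated on this page), and whether plain byte concatenation yields a playable file depends on the audio format.

```typescript
// Subset of the documented audio_chunk payload that this helper needs.
interface AudioChunk {
  chunkId: number;
  uint8Array: Uint8Array;
}

// Hypothetical helper: orders chunks by chunkId and joins their bytes.
function concatChunks(chunks: AudioChunk[]): Uint8Array {
  // Copy before sorting so the caller's array is left untouched.
  const ordered = [...chunks].sort((a, b) => a.chunkId - b.chunkId);
  const total = ordered.reduce((sum, c) => sum + c.uint8Array.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of ordered) {
    out.set(chunk.uint8Array, offset);
    offset += chunk.uint8Array.length;
  }
  return out;
}
```

A listener could push each payload into an array and call concatChunks once speech_complete fires.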
audio
Emitted for full non-streaming TTS audio.
Payload:
{
  data: string; // Base64-encoded audio
  format: string; // e.g., "mp3", "opus"
  uint8Array: Uint8Array; // Raw audio bytes
}
When: When using generateAndSendSpeechFull() instead of streaming.
Example:
import * as fs from "node:fs";
agent.on("audio", ({ uint8Array, format }) => {
  fs.writeFileSync(`response.${format}`, uint8Array);
});
Tool Events
Events related to AI SDK tool invocations.
chunk:tool_call
Emitted when a tool invocation is detected during streaming.
Payload:
{
  toolName: string;
  toolCallId: string;
  input: any; // Tool input parameters
}
When: When LLM decides to call a tool.
Example:
agent.on("chunk:tool_call", ({ toolName, toolCallId, input }) => {
  console.log(`Calling tool: ${toolName}`, input);
});
tool_result
Emitted when a tool execution completes.
Payload:
{
  name: string;
  toolCallId: string;
  result: any; // Tool output
}
When: After the tool’s execute function finishes.
Example:
agent.on("tool_result", ({ name, toolCallId, result }) => {
  console.log(`Tool ${name} result:`, result);
});
Transcription Events
Events related to audio transcription.
transcription
Emitted when audio is successfully transcribed to text.
Payload:
{
  text: string;
  language?: string;
}
When: After transcribeAudio() or audio WebSocket message is processed.
Example:
agent.on("transcription", ({ text, language }) => {
  console.log(`Transcribed (${language || "unknown"}): ${text}`);
});
audio_received
Emitted when raw audio input is received before transcription.
Payload:
{
  size: number; // Audio size in bytes
}
When: After audio WebSocket message arrives, before transcription starts.
Example:
agent.on("audio_received", ({ size }) => {
  console.log(`Received ${(size / 1024).toFixed(1)} KB of audio`);
});
History Events
Events related to conversation memory management.
history_cleared
Emitted when conversation history is manually cleared.
Payload: None
When: After clearHistory() is called.
Example:
agent.on("history_cleared", () => {
  console.log("Conversation history cleared");
});
history_trimmed
Emitted when old messages are automatically removed from history.
Payload:
{
  removedCount: number;
  reason: "max_messages" | "max_chars";
}
When: When history exceeds maxMessages or maxTotalChars limits.
Example:
agent.on("history_trimmed", ({ removedCount, reason }) => {
  console.log(`Trimmed ${removedCount} messages (reason: ${reason})`);
});
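Tracking which limit triggers trimming can help when tuning maxMessages and maxTotalChars. The following is a minimal sketch built only on the documented payload; tallyTrims is a hypothetical helper, not a library function.

```typescript
type TrimReason = "max_messages" | "max_chars";

// Payload shape as documented for history_trimmed.
interface HistoryTrimmed {
  removedCount: number;
  reason: TrimReason;
}

// Hypothetical helper: sums removed messages per trimming reason.
function tallyTrims(events: HistoryTrimmed[]): Record<TrimReason, number> {
  const totals: Record<TrimReason, number> = { max_messages: 0, max_chars: 0 };
  for (const event of events) {
    totals[event.reason] += event.removedCount;
  }
  return totals;
}
```

If one reason dominates the totals, that limit is the one most likely worth adjusting.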
Connection Events
WebSocket lifecycle events.
connected
Emitted when WebSocket connection is established.
Payload: None
When: After connect() succeeds or handleSocket() is called.
Example:
agent.on("connected", () => {
  console.log("WebSocket connected");
});
disconnected
Emitted when WebSocket connection closes.
Payload: None
When: When socket closes (client disconnect, network error, disconnect() called).
Example:
agent.on("disconnected", () => {
  console.log("WebSocket disconnected");
  agent.destroy(); // Clean up resources
});
Video Events (VideoAgent Only)
Events specific to VideoAgent for video frame processing.
frame_received
Emitted when a video frame is received and processed.
Payload:
{
  sequence: number;
  timestamp: number;
  triggerReason: FrameTriggerReason;
  size: number; // Frame size in bytes
  dimensions: {
    width: number;
    height: number;
  };
}
When: After frame passes validation and is added to context buffer.
Example:
videoAgent.on("frame_received", ({ sequence, triggerReason, size, dimensions }) => {
  console.log(`Frame ${sequence} received (${triggerReason}): ${dimensions.width}x${dimensions.height}, ${(size / 1024).toFixed(1)} KB`);
});
frame_requested
Emitted when the agent requests the client to capture a frame.
Payload:
{
  reason: FrameTriggerReason;
}
When: After requestFrameCapture() is called.
Example:
videoAgent.on("frame_requested", ({ reason }) => {
  console.log(`Requesting frame capture: ${reason}`);
});
client_ready
Emitted when client connects and reports capabilities.
Payload:
any // Client-reported capabilities object
When: After receiving client_ready WebSocket message.
Example:
videoAgent.on("client_ready", (capabilities) => {
  console.log("Client ready with capabilities:", capabilities);
});
config_changed
Emitted when video agent configuration is updated.
Payload: A configuration object, passed as the listener’s single argument.
When: After updateConfig() is called.
Example:
videoAgent.on("config_changed", (config) => {
  console.log("Config updated:", config);
});
Error Events
Error and warning events.
error
Emitted when an error occurs in any subsystem.
Payload: An error object exposing a message property, passed as the listener’s single argument.
Common sources:
LLM stream errors
TTS generation failures
Transcription errors
WebSocket errors
Invalid input (oversized audio/frames)
Example:
agent.on("error", (error) => {
  console.error("Agent error:", error.message);
  // Handle error gracefully
});
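Since both agents extend Node.js EventEmitter, the standard EventEmitter rule for the error event should apply unless the library overrides it: emitting error with no listener registered throws instead of being silently dropped. The sketch below shows that behavior with a plain EventEmitter standing in for an agent, an assumption made so the snippet runs without the library.

```typescript
import { EventEmitter } from "node:events";

// Stand-in for an agent: a bare EventEmitter, not the real VoiceAgent.
const emitter = new EventEmitter();

// Without a listener, emitting "error" throws the Error that was passed in.
let threwWithoutListener = false;
try {
  emitter.emit("error", new Error("no listener registered"));
} catch {
  threwWithoutListener = true;
}

// With a listener attached, the same emit is delivered to the handler instead.
const seenMessages: string[] = [];
emitter.on("error", (err: Error) => {
  seenMessages.push(err.message);
});
emitter.emit("error", new Error("TTS generation failed"));
```

For this reason it is usually worth attaching the error listener before calling anything that can start processing.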
warning
Emitted for non-fatal issues that don’t stop processing.
Payload:
string // Warning message
Common warnings:
Empty transcript message
Invalid audio message
Empty video frame
Example:
agent.on("warning", (message) => {
  console.warn("Warning:", message);
});
Listening to Events
Basic Event Handling
const agent = new VoiceAgent({
  model: openai("gpt-4o"),
  // ... other options
});
// Text streaming
agent.on("chunk:text_delta", ({ text }) => {
  process.stdout.write(text);
});
// Speech events
agent.on("speech_start", () => console.log("🔊 Speaking..."));
agent.on("speech_complete", () => console.log("✅ Done speaking"));
// Tool usage
agent.on("chunk:tool_call", ({ toolName, input }) => {
  console.log(`🔧 Calling ${toolName}:`, input);
});
agent.on("tool_result", ({ name, result }) => {
  console.log(`✓ ${name} returned:`, result);
});
// Errors
agent.on("error", (error) => {
  console.error("❌ Error:", error);
});
WebSocket Integration
import { WebSocketServer } from "ws";
const wss = new WebSocketServer({ port: 8080 });
wss.on("connection", (socket) => {
  const agent = new VoiceAgent({
    model: openai("gpt-4o"),
    transcriptionModel: openai.transcription("whisper-1"),
    speechModel: openai.speech("gpt-4o-mini-tts"),
  });
  agent.handleSocket(socket);
  // Forward events to client
  agent.on("text", (data) => {
    socket.send(JSON.stringify({ type: "text", ...data }));
  });
  agent.on("audio_chunk", ({ chunkId, data, format }) => {
    socket.send(JSON.stringify({
      type: "audio_chunk",
      chunkId,
      data,
      format,
    }));
  });
  // Cleanup on disconnect
  agent.on("disconnected", () => {
    agent.destroy();
  });
});
VideoAgent Events
const videoAgent = new VideoAgent({
  model: openai("gpt-4o"), // Vision model
  speechModel: openai.speech("gpt-4o-mini-tts"),
});
// Frame events
videoAgent.on("frame_received", ({ sequence, triggerReason }) => {
  console.log(`📸 Frame ${sequence} (${triggerReason})`);
});
videoAgent.on("frame_requested", ({ reason }) => {
  console.log(`🎥 Requesting frame: ${reason}`);
});
// Multimodal text events include image context
videoAgent.on("text", ({ role, text, hasImage }) => {
  console.log(`${role}: ${text} ${hasImage ? "📷" : ""}`);
});
Event Timing Diagram
Typical event flow for a user query:
1. User Input
└─> text (role: "user")
2. LLM Streaming
├─> chunk:text_delta (multiple)
├─> chunk:tool_call (if tools used)
└─> tool_result (if tools used)
3. Speech Generation
├─> speech_chunk_queued (multiple)
├─> speech_start
├─> audio_chunk (multiple)
└─> speech_complete
4. Response Complete
└─> text (role: "assistant")
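One practical use of this flow is measuring how long a user waits before audio starts. A minimal sketch, assuming the application records each event together with its own timestamp; LoggedEvent and timeToFirstAudio are hypothetical names, not library APIs.

```typescript
// Application-side log entry; none of these fields come from the library.
interface LoggedEvent {
  name: string; // event name, e.g. "text" or "audio_chunk"
  role?: string; // only set for "text" events
  at: number; // timestamp in milliseconds, recorded by the application
}

// Returns the delay between the first user text event and the first
// audio_chunk logged after it, or undefined if either is missing.
function timeToFirstAudio(log: LoggedEvent[]): number | undefined {
  const start = log.find((e) => e.name === "text" && e.role === "user");
  if (!start) return undefined;
  const later = log.slice(log.indexOf(start) + 1);
  const firstAudio = later.find((e) => e.name === "audio_chunk");
  return firstAudio ? firstAudio.at - start.at : undefined;
}
```

Each listener would append an entry such as { name: "audio_chunk", at: Date.now() } to the log, and the function could be called after speech_complete.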
Types & Interfaces: Type definitions for event payloads
VoiceAgent: Voice agent class reference
VideoAgent: Video agent class reference
WebSocket Protocol: Complete WebSocket message protocol