VideoAgent extends voice capabilities with vision, allowing your agent to see what users are showing via webcam and respond intelligently.
Complete Video Server Example
import "dotenv/config";
import { WebSocketServer } from "ws";
import { VideoAgent } from "voice-agent-ai";
import { tool } from "ai";
import { z } from "zod";
import { openai } from "@ai-sdk/openai";
import { mkdirSync, writeFileSync } from "fs";
import { join } from "path";
// Optional: Save frames to disk for debugging
const FRAMES_DIR = join(__dirname, "frames");
mkdirSync(FRAMES_DIR, { recursive: true });
let frameCounter = 0;
function saveFrame(msg: {
sequence?: number;
timestamp?: number;
triggerReason?: string;
image: { data: string; format?: string; width?: number; height?: number };
}) {
const idx = ++frameCounter; // 1-based, so the first saved file is frame_00001
const ext = msg.image.format === "jpeg" ? "jpg" : (msg.image.format || "webp");
const filename = `frame_${String(idx).padStart(5, "0")}.${ext}`;
const filepath = join(FRAMES_DIR, filename);
const buf = Buffer.from(msg.image.data, "base64");
writeFileSync(filepath, buf);
console.log(
`[frames] Saved ${filename} (${(buf.length / 1024).toFixed(1)} kB, ` +
`${msg.image.width}×${msg.image.height}, ${msg.triggerReason})`
);
}
const endpoint = process.env.VIDEO_WS_ENDPOINT || "ws://localhost:8081";
const url = new URL(endpoint);
const port = Number(url.port || 8081);
const host = url.hostname || "localhost";
// Define tools
const weatherTool = tool({
description: "Get the weather in a location",
inputSchema: z.object({
location: z.string().describe("The location to get the weather for"),
}),
execute: async ({ location }) => ({
location,
temperature: 72 + Math.floor(Math.random() * 21) - 10,
conditions: ["sunny", "cloudy", "rainy", "partly cloudy"][
Math.floor(Math.random() * 4)
],
}),
});
const timeTool = tool({
description: "Get the current time",
inputSchema: z.object({}),
execute: async () => ({
time: new Date().toLocaleTimeString(),
timezone: Intl.DateTimeFormat().resolvedOptions().timeZone,
}),
});
const wss = new WebSocketServer({ port, host });
wss.on("listening", () => {
console.log(`[video-ws] listening on ${endpoint}`);
console.log(`[video-ws] Connect your video client to ${endpoint}`);
});
wss.on("connection", (socket) => {
console.log("[video-ws] ✓ client connected");
const agent = new VideoAgent({
model: openai("gpt-4o"), // Vision-enabled model required
transcriptionModel: openai.transcription("whisper-1"),
speechModel: openai.speech("gpt-4o-mini-tts"),
instructions: `You are a helpful video+voice assistant.
You can SEE what the user is showing via webcam.
Describe what you see when it helps answer the question.
Keep spoken answers concise and natural.`,
voice: "echo",
streamingSpeech: {
minChunkSize: 25,
maxChunkSize: 140,
parallelGeneration: true,
maxParallelRequests: 3,
},
tools: { getWeather: weatherTool, getTime: timeTool },
// Video-specific configuration
maxContextFrames: 6, // Keep last 6 frames in context
maxFrameInputSize: 2_500_000, // ~2.5 MB max frame size
});
// Text and streaming events
agent.on("text", (data: { role: string; text: string }) => {
console.log(`[video] Text (${data.role}): ${data.text?.substring(0, 100)}...`);
});
agent.on("chunk:text_delta", (data: { text: string }) => {
process.stdout.write(data.text || "");
});
// Video frame events
agent.on("frame_received", ({ sequence, size, dimensions, triggerReason }) => {
console.log(
`[video] Frame #${sequence} (${triggerReason}) ` +
`${size / 1024 | 0} kB ${dimensions.width}×${dimensions.height}`
);
});
agent.on("frame_requested", ({ reason }) => {
console.log(`[video] Requested frame: ${reason}`);
});
// Audio and transcription events
agent.on("audio_received", ({ size, format }) => {
console.log(`[video] Audio received: ${size} bytes, format: ${format}`);
});
agent.on("transcription", ({ text, language }) => {
console.log(`[video] Transcription: "${text}" (${language || "unknown"})`);
});
// Speech events
agent.on("speech_start", () => console.log(`[video] Speech started`));
agent.on("speech_complete", () => console.log(`[video] Speech complete`));
agent.on("audio_chunk", ({ chunkId, text }) => {
console.log(`[video] Audio chunk #${chunkId}: "${text?.substring(0, 50)}..."`);
});
// Error handling
agent.on("error", (error: Error) => {
console.error(`[video] ERROR:`, error);
});
agent.on("warning", (warning: string) => {
console.warn(`[video] WARNING:`, warning);
});
agent.on("disconnected", () => {
agent.destroy();
console.log("[video-ws] ✗ client disconnected (agent destroyed)");
});
// Intercept raw messages to save frames to disk (optional)
socket.on("message", (raw) => {
try {
const msg = JSON.parse(raw.toString());
if (msg.type === "video_frame" && msg.image?.data) {
saveFrame(msg);
}
} catch {
// Not JSON - ignore, agent will handle binary etc.
}
});
// Hand socket to agent
agent.handleSocket(socket);
});
process.on("SIGINT", () => {
console.log("\n[video-ws] Shutting down...");
wss.close(() => {
console.log("[video-ws] Server closed");
process.exit(0);
});
});
Vision-Enabled Models
The VideoAgent requires a vision-capable model:
OpenAI
import { openai } from "@ai-sdk/openai";
const agent = new VideoAgent({
model: openai("gpt-4o"), // ✅ Supports vision
// model: openai("gpt-4-turbo"), // ✅ Also supports vision
// ...
});
Anthropic
import { anthropic } from "@ai-sdk/anthropic";
const agent = new VideoAgent({
model: anthropic("claude-3-5-sonnet-20241022"), // ✅ Supports vision
// ...
});
Google
import { google } from "@ai-sdk/google";
const agent = new VideoAgent({
model: google("gemini-1.5-flash"), // ✅ Supports vision
// model: google("gemini-1.5-pro"), // ✅ Also supports vision
// ...
});
Frame Management
Configuration Options
const agent = new VideoAgent({
// Maximum frames to keep in context buffer
// Higher = more visual history, more tokens used
maxContextFrames: 6, // Default: 10
// Maximum frame size in bytes
// Larger frames = better quality, more bandwidth
maxFrameInputSize: 2_500_000, // Default: 5 MB
// ...
});
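To see how a limit like `maxFrameInputSize` relates to base64-encoded frames, the sketch below estimates the decoded byte size of a payload without allocating a buffer. `base64DecodedSize` and `fitsFrameLimit` are hypothetical helpers, not part of `voice-agent-ai`. Whether the library measures its limit against decoded bytes or the encoded string is an assumption here.

```typescript
// Hypothetical helpers for illustration; not part of voice-agent-ai.

// Estimate the decoded byte length of a padded base64 string.
// Every 4 base64 characters carry 3 bytes; each trailing "=" removes one byte.
function base64DecodedSize(data: string): number {
  const len = data.length;
  if (len === 0) return 0;
  let padding = 0;
  if (data.endsWith("==")) padding = 2;
  else if (data.endsWith("=")) padding = 1;
  return Math.floor((len * 3) / 4) - padding;
}

// Assumption: the limit is compared against the decoded size in bytes.
function fitsFrameLimit(data: string, maxFrameInputSize: number): boolean {
  return base64DecodedSize(data) <= maxFrameInputSize;
}
```

A client could run a check like this before sending, and re-encode at lower quality when a frame is too large.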
Frame Context Buffer
The agent maintains a rolling buffer of recent frames:
// Get current frame context
const frames = agent.getFrameContext();
console.log(`Buffered frames: ${frames.length}`);
frames.forEach(frame => {
console.log(`Frame #${frame.sequence}: ${frame.triggerReason}`);
});
// Clear frame history
agent.clearHistory(); // Also clears frame buffer
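As an illustration of what "rolling buffer" means here, the following is a minimal sketch of a fixed-capacity buffer that drops the oldest frame once the limit is reached. The `RollingFrameBuffer` class and `BufferedFrame` type are hypothetical and only mirror the fields used above; this is not the library's actual implementation.

```typescript
// Hypothetical sketch of a rolling frame buffer; not voice-agent-ai internals.
interface BufferedFrame {
  sequence: number;
  triggerReason: string;
}

class RollingFrameBuffer {
  private frames: BufferedFrame[] = [];
  private readonly maxFrames: number;

  constructor(maxFrames: number) {
    this.maxFrames = maxFrames;
  }

  // Append a frame and evict the oldest entries beyond the capacity.
  push(frame: BufferedFrame): void {
    this.frames.push(frame);
    while (this.frames.length > this.maxFrames) {
      this.frames.shift();
    }
  }

  // Return a copy so callers cannot mutate the internal state.
  snapshot(): BufferedFrame[] {
    return [...this.frames];
  }

  clear(): void {
    this.frames = [];
  }
}
```

With a capacity of 6, pushing frames 1 through 8 leaves frames 3 through 8 in the buffer, which is the behaviour `maxContextFrames: 6` suggests.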
Requesting Frames
You can programmatically request the client to capture a frame:
// Request frame with reason
agent.requestFrameCapture("user_request");
// Trigger reasons:
// - "scene_change": Automatic capture on scene change
// - "user_request": Manual request from server
// - "timer": Periodic capture
// - "initial": First frame on connection
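If you pass trigger reasons around in your own code, a string-literal union keeps typos out. The sketch below is built from the four reasons listed above; the `TriggerReason` type and `isTriggerReason` guard are hypothetical names, and the library may export its own types.

```typescript
// Hypothetical type and guard, derived from the documented trigger reasons.
const TRIGGER_REASONS = ["scene_change", "user_request", "timer", "initial"] as const;

type TriggerReason = (typeof TRIGGER_REASONS)[number];

// Narrow an untrusted value (for example, a field from a parsed message).
function isTriggerReason(value: unknown): value is TriggerReason {
  return (
    typeof value === "string" &&
    (TRIGGER_REASONS as readonly string[]).includes(value)
  );
}
```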
Client WebSocket Messages
Video Frame
socket.send(JSON.stringify({
type: "video_frame",
sessionId: "session_123",
sequence: 1,
timestamp: Date.now(),
triggerReason: "scene_change",
image: {
data: base64EncodedImage,
format: "webp", // or "jpeg", "png"
width: 640,
height: 480
}
}));
Audio (same as VoiceAgent)
socket.send(JSON.stringify({
type: "audio",
sessionId: "session_123",
data: base64AudioData,
format: "mp3",
timestamp: Date.now()
}));
Text Transcript
socket.send(JSON.stringify({
type: "transcript",
text: "What am I holding?"
}));
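The three message shapes above can be modelled as a discriminated union keyed on `type`. The sketch below validates only the fields it needs; `ClientMessage` and `parseClientMessage` are hypothetical names, and the field set is a subset assumed from the examples above rather than the library's full schema.

```typescript
// Hypothetical message types, reduced to the fields checked below.
type ClientMessage =
  | { type: "video_frame"; sequence: number; image: { data: string } }
  | { type: "audio"; data: string; format: string }
  | { type: "transcript"; text: string };

// Parse and validate a raw text message; returns null for anything unrecognised.
function parseClientMessage(raw: string): ClientMessage | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // not JSON, for example a binary payload
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const msg = parsed as Record<string, unknown>;

  switch (msg.type) {
    case "video_frame": {
      const image = msg.image as Record<string, unknown> | null | undefined;
      const sequence = msg.sequence;
      const data = image?.data;
      if (typeof sequence !== "number" || typeof data !== "string") return null;
      return { type: "video_frame", sequence, image: { data } };
    }
    case "audio": {
      const data = msg.data;
      const format = msg.format;
      if (typeof data !== "string" || typeof format !== "string") return null;
      return { type: "audio", data, format };
    }
    case "transcript": {
      const text = msg.text;
      if (typeof text !== "string") return null;
      return { type: "transcript", text };
    }
    default:
      return null;
  }
}
```

A guard like this is useful in a raw `socket.on("message")` interceptor, where the payload has not yet been validated.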
Server Responses
In addition to standard VoiceAgent responses, VideoAgent sends:
Frame Acknowledgment
{
"type": "frame_ack",
"sequence": 1,
"timestamp": 1234567890
}
Frame Request
{
"type": "capture_frame",
"reason": "user_request",
"timestamp": 1234567890
}
Session Initialization
{
"type": "session_init",
"sessionId": "vs_abc123_xyz789"
}
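On the client, these three responses can be folded into a small piece of state. The following sketch assumes the message shapes shown above; `ServerMessage`, `ClientState`, and `applyServerMessage` are hypothetical names and not part of any client SDK.

```typescript
// Hypothetical client-side types mirroring the JSON shapes above.
type ServerMessage =
  | { type: "frame_ack"; sequence: number; timestamp: number }
  | { type: "capture_frame"; reason: string; timestamp: number }
  | { type: "session_init"; sessionId: string };

interface ClientState {
  sessionId: string | null;
  lastAckedSequence: number;
  pendingCaptureReason: string | null;
}

// Pure reducer: returns the next state without mutating the input.
function applyServerMessage(state: ClientState, msg: ServerMessage): ClientState {
  switch (msg.type) {
    case "session_init":
      return { ...state, sessionId: msg.sessionId };
    case "frame_ack":
      // Keep the highest acknowledged sequence in case acks arrive out of order.
      return {
        ...state,
        lastAckedSequence: Math.max(state.lastAckedSequence, msg.sequence),
      };
    case "capture_frame":
      // The capture loop can read this and send a frame with the given reason.
      return { ...state, pendingCaptureReason: msg.reason };
  }
}
```

Keeping the handler pure makes it easy to unit test independently of the WebSocket connection.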
Usage Patterns
Visual Question Answering
// User asks: "What color is this?"
// Agent automatically uses latest frame from buffer
// Response: "That's a blue coffee mug."
Scene Understanding
// Agent can describe scenes
// User: "Describe what you see"
// Response: "I can see a desk with a laptop, a coffee mug,
// and some papers. The room appears to be an office."
Object Detection
// User: "How many people are in the room?"
// Agent analyzes frame and counts
// Response: "I can see 3 people in the frame."
Visual Context in Conversations
// User: "What's the weather like?" (shows window view)
// Agent sees sunny sky in frame
// Response: "Based on what I can see through your window,
// it looks sunny! Let me check the forecast..."
// [Tool call: getWeather]
Performance Optimization
Token Usage
Each frame adds approximately 100-400 tokens depending on resolution:
// Conservative: fewer frames, less cost
maxContextFrames: 3,
// Balanced: good history, moderate cost
maxContextFrames: 6,
// Rich history: more context, higher cost
maxContextFrames: 10,
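To compare these settings, you can turn the 100-400 tokens-per-frame estimate into a rough range for a full buffer. `estimateFrameTokens` is a hypothetical helper that multiplies that estimate by the frame count; actual usage depends on the model and image resolution.

```typescript
// Hypothetical helper using the approximate per-frame range quoted above.
const TOKENS_PER_FRAME_MIN = 100;
const TOKENS_PER_FRAME_MAX = 400;

// Rough token overhead when the frame buffer is full.
function estimateFrameTokens(maxContextFrames: number): { min: number; max: number } {
  const frames = Math.max(0, Math.floor(maxContextFrames));
  return {
    min: frames * TOKENS_PER_FRAME_MIN,
    max: frames * TOKENS_PER_FRAME_MAX,
  };
}
```

Under this estimate, `maxContextFrames: 6` adds roughly 600 to 2,400 tokens per request once the buffer is full.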
Frame Quality vs Bandwidth
// Lower quality, faster transmission
maxFrameInputSize: 500_000, // 500 KB
// Balanced
maxFrameInputSize: 2_500_000, // 2.5 MB
// High quality
maxFrameInputSize: 5_000_000, // 5 MB (default)
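Because frames travel as base64 inside JSON, the bytes on the wire are about a third larger than the raw image. The sketch below computes the encoded length and a rough upload rate for a fixed capture interval. Both helpers are hypothetical, and the estimate ignores the JSON envelope and WebSocket framing.

```typescript
// Hypothetical helpers; they ignore JSON envelope and WebSocket framing overhead.

// Padded base64 emits 4 characters for every 3 input bytes, rounded up.
function base64EncodedLength(rawBytes: number): number {
  return 4 * Math.ceil(rawBytes / 3);
}

// Approximate upload rate in bytes per second for one frame per interval.
function uploadBytesPerSecond(frameBytes: number, intervalMs: number): number {
  return base64EncodedLength(frameBytes) / (intervalMs / 1000);
}
```

For example, a 3,000-byte frame sent every 5 seconds encodes to 4,000 characters, or about 800 bytes per second.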
Frame Capture Strategy
On the client side, optimize when to send frames:
// Option 1: On scene change (smart, efficient)
if (sceneChangeDetected()) {
captureAndSendFrame("scene_change");
}
// Option 2: On user speech (contextual)
audioStream.on("start", () => {
captureAndSendFrame("user_request");
});
// Option 3: Periodic (predictable, may be wasteful)
setInterval(() => {
captureAndSendFrame("timer");
}, 5000); // Every 5 seconds
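The `sceneChangeDetected()` call in Option 1 is left undefined above. One simple way to implement it is to compare the mean absolute difference between two downscaled grayscale frames against a threshold. The sketch below is an assumption, not the library's method: `hasSceneChanged` is a hypothetical name, and the default threshold of 12 (on a 0-255 scale) is an arbitrary starting point to tune.

```typescript
// Hypothetical scene-change check over grayscale pixel arrays (one byte per pixel).
function meanAbsoluteDifference(prev: Uint8Array, next: Uint8Array): number {
  let total = 0;
  for (let i = 0; i < prev.length; i++) {
    total += Math.abs((prev[i] ?? 0) - (next[i] ?? 0));
  }
  return prev.length === 0 ? 0 : total / prev.length;
}

// Returns true when the frames differ enough to justify sending a new one.
// Assumption: threshold 12 is an arbitrary default, not a recommended value.
function hasSceneChanged(
  prev: Uint8Array | null,
  next: Uint8Array,
  threshold: number = 12
): boolean {
  // No previous frame, or a resolution change, always counts as a change.
  if (prev === null || prev.length !== next.length) return true;
  return meanAbsoluteDifference(prev, next) > threshold;
}
```

In a browser, the grayscale arrays could come from drawing the video onto a small canvas and averaging the RGB channels. Keeping the comparison resolution low (for example 32×24) keeps the check cheap.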
Example Output
[video-ws] listening on ws://localhost:8081
[video-ws] Connect your video client to ws://localhost:8081
[video-ws] ✓ client connected
[video] Frame #1 (initial) 234 kB 640×480
[frames] Saved frame_00001.webp (234.5 kB, 640×480, initial)
[video] Audio received: 45320 bytes, format: mp3
[video] Transcription: "What am I holding?" (en)
[video] Frame #2 (user_request) 228 kB 640×480
[frames] Saved frame_00002.webp (228.3 kB, 640×480, user_request)
[video] Text (user): What am I holding?
[video] Speech started
You're holding a blue coffee mug with a white handle.
[video] Audio chunk #1: "You're holding a blue coffee mug with a white..."
[video] Speech complete
Next Steps
- Basic Usage - VoiceAgent fundamentals
- Custom Tools - Add custom functionality
- API Reference - Full VideoAgent API