WZRD Studio has a built-in voice control layer powered by the OpenAI Realtime API. Every authenticated page in the app — from the project timeline to the Kanvas generative canvas to the QCut editor — can respond to spoken commands. Pages register their own voice actions at mount time, and a global set of navigation actions is always available. The system is designed so developers can expose new actions on any page with a single hook call.

Architecture Overview

VoiceAgentProvider

Wraps the authenticated app shell. Creates the VoiceActionRegistry, registers global navigation actions, boots the Realtime session, and renders the VoiceActionButton UI element.

VoiceActionRegistry

A Map<VoiceActionName, VoiceActionRegistration[]>. Registrations are stacked: the most recently registered handler wins. A registration is removed automatically when the registering component unmounts.

useWzrdRealtimeSession

Manages the WebRTC transport to the OpenAI Realtime API. Handles push-to-talk, tool-call dispatch, status transitions, and error normalization.

VoiceSelectionContext

Provides scrollVoiceTargetIntoView and useVoiceSelection for highlighting and selecting UI elements by voice command — used by the timeline, shot grid, and editor panels.

VoiceAgentProvider

VoiceAgentProvider is the root context provider for the entire voice system. It lives inside the authenticated router shell and wraps VoiceSelectionProvider:
// src/voice/VoiceAgentProvider.tsx (simplified)
export function VoiceAgentProvider({ children }: { children: React.ReactNode }) {
  const registry = useMemo(() => createVoiceActionRegistry(), []);
  const navigate = useNavigate();
  const location = useLocation();
  const { isAuthenticated } = useAuth();

  // Recompute global navigation actions on every route change...
  const globalActions = useMemo(
    () =>
      createGlobalVoiceActions({
        navigate,
        getLocationPath: () => `${location.pathname}${location.search}`,
        getCurrentProjectId: () => getProjectIdFromPath(location.pathname),
        getAvailableActionNames: () =>
          Array.from(new Set(registry.list().map((r) => r.name))).sort(),
      }),
    [location.pathname, location.search, navigate, registry],
  );

  // ...and register them, cleaning up the previous set on each route change
  useEffect(() => {
    const unregister = globalActions.map((registration) => registry.register(registration));
    return () => unregister.forEach((fn) => fn());
  }, [globalActions, registry]);

  // Boot the OpenAI Realtime session
  const voiceSession = useWzrdRealtimeSession({ registry });

  // Only show the mic button on authenticated pages (not / or /login)
  const showVoiceControl = shouldShowVoiceControl(location.pathname, isAuthenticated);

  return (
    <VoiceAgentContext.Provider value={registry}>
      <VoiceSelectionProvider>
        {children}
        {showVoiceControl ? (
          <VoiceActionButton
            status={voiceSession.status}
            errorMessage={voiceSession.errorMessage}
            onPressStart={voiceSession.pushToTalkStart}
            onPressEnd={voiceSession.pushToTalkStop}
            onDisconnect={voiceSession.disconnect}
          />
        ) : null}
      </VoiceSelectionProvider>
    </VoiceAgentContext.Provider>
  );
}
The VoiceActionButton is rendered on all authenticated routes except / and /login.
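The route gating can be pictured with a small predicate. This is only a sketch: the real shouldShowVoiceControl is not shown on this page, and the excluded-path set below is an assumption taken from the comment in the provider (/ and /login).

```typescript
// Hypothetical sketch of shouldShowVoiceControl; the actual implementation may differ.
const VOICE_CONTROL_EXCLUDED_PATHS = new Set<string>(['/', '/login']);

function shouldShowVoiceControl(pathname: string, isAuthenticated: boolean): boolean {
  // Unauthenticated users never see the mic button.
  if (!isAuthenticated) return false;
  // Authenticated users see it everywhere except the excluded routes.
  return !VOICE_CONTROL_EXCLUDED_PATHS.has(pathname);
}
```

Keeping the rule in a pure function like this makes it easy to unit test without rendering the provider.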

VoiceActionRegistry

The registry is created once with createVoiceActionRegistry() and exposed through context. It supports stacked registrations — if two components register the same action name, the most recently registered handler is called. When the component unmounts, its registration is automatically removed and the previous handler takes over.
interface VoiceActionRegistry {
  register: <Input = unknown>(registration: VoiceActionRegistration<Input>) => () => void;
  execute: (
    name: VoiceActionName,
    input?: unknown,
    context?: VoiceActionExecutionContext,
  ) => Promise<VoiceActionResult>;
  list: () => VoiceActionRegistration[];
  clear: () => void;
}
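To make the stacking behaviour concrete, here is a minimal sketch of a registry with the same shape. It is not the code in src/voice: the Sketch* types are simplified stand-ins, action names are plain strings, and the 'unavailable' fallback for an unknown action is an assumption.

```typescript
// Simplified stand-ins for the types described on this page (assumptions).
type SketchResult = { ok: boolean; status: string; message: string };
type SketchHandler = (input: unknown) => Promise<SketchResult> | SketchResult;
interface SketchRegistration {
  name: string;
  scope: string;
  handler: SketchHandler;
}

function createSketchRegistry() {
  // One stack of registrations per action name; the last entry is the active one.
  const stacks = new Map<string, SketchRegistration[]>();

  return {
    register(registration: SketchRegistration): () => void {
      const stack = stacks.get(registration.name) ?? [];
      stack.push(registration);
      stacks.set(registration.name, stack);
      // The returned function removes exactly this registration.
      return () => {
        const current = stacks.get(registration.name);
        if (!current) return;
        const index = current.lastIndexOf(registration);
        if (index !== -1) current.splice(index, 1);
        if (current.length === 0) stacks.delete(registration.name);
      };
    },
    async execute(name: string, input?: unknown): Promise<SketchResult> {
      const stack = stacks.get(name);
      const active = stack?.[stack.length - 1];
      if (!active) {
        return { ok: false, status: 'unavailable', message: `No handler registered for ${name}.` };
      }
      return active.handler(input);
    },
    list(): SketchRegistration[] {
      const all: SketchRegistration[] = [];
      stacks.forEach((stack) => all.push(...stack));
      return all;
    },
    clear(): void {
      stacks.clear();
    },
  };
}
```

Because each register call returns its own unregister function, a component can remove its handler without knowing what else is stacked under the same name.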

Action Types

VoiceActionName

Every registerable action name is a member of the VoiceActionName union type. This gives full TypeScript autocomplete and prevents typos at the registration site.
export type VoiceActionName =
  | 'get_app_context'
  | 'navigate_app'
  | 'start_new_project'
  | 'timeline_select_shot'
  | 'timeline_generate_shot_image'
  | 'timeline_generate_all_images'
  | 'timeline_start_directors_cut'
  | 'kanvas_set_studio'
  | 'kanvas_generate'
  | 'editor_import_media_by_url'
  | 'editor_add_clip'
  | 'editor_split_element'
  | 'editor_delete_element'
  | 'editor_add_title'
  | 'editor_export'
  // ... and many more (see src/voice/actions/registry.ts)

VoiceActionResult

Every handler must return a VoiceActionResult, which is a discriminated union:
type VoiceActionResult =
  | {
      ok: true;
      status: 'completed';
      message: string;   // Spoken back to the user
      data?: unknown;    // Optional structured payload
    }
  | {
      ok: false;
      status: 'needs_confirmation' | 'unavailable' | 'invalid_input' | 'failed';
      message: string;
      data?: unknown;
      confirmation?: VoiceActionConfirmation;
      errorCode?: string;
    };
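Because the union is discriminated on ok, TypeScript narrows the result inside each branch. The sketch below redeclares a trimmed copy of the type so it stands alone; summarizeResult is a hypothetical helper, not part of the voice module.

```typescript
// Trimmed local copy of VoiceActionResult so the example is self-contained.
type SketchActionResult =
  | { ok: true; status: 'completed'; message: string; data?: unknown }
  | {
      ok: false;
      status: 'needs_confirmation' | 'unavailable' | 'invalid_input' | 'failed';
      message: string;
      errorCode?: string;
    };

// Hypothetical helper: turn a result into a short log line.
function summarizeResult(result: SketchActionResult): string {
  if (result.ok) {
    // Narrowed to the success branch: status is always 'completed' here.
    return `done: ${result.message}`;
  }
  // Narrowed to the failure branch: errorCode is only available here.
  switch (result.status) {
    case 'needs_confirmation':
      return `confirm: ${result.message}`;
    case 'unavailable':
    case 'invalid_input':
      return `rejected (${result.status}): ${result.message}`;
    case 'failed':
      return `failed${result.errorCode ? ` [${result.errorCode}]` : ''}: ${result.message}`;
  }
}
```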

VoiceActionRisk

Actions that modify state can declare a risk level to require user confirmation before execution:
type VoiceActionRisk = 'navigation' | 'write' | 'generation' | 'sensitive';

VoiceActionConfirmation

When a result has status 'needs_confirmation', its confirmation field contains:
interface VoiceActionConfirmation {
  actionName: VoiceActionName;  // The action that needs confirming
  risk: VoiceActionRisk;
  message: string;              // Spoken to the user: "Are you sure you want to…?"
  input: unknown;               // Original input, echoed back for re-execution
}
When a confirmation is declared in the registration, the registry returns { status: 'needs_confirmation' } on the first call. The agent asks the user to confirm, then re-calls with context.confirmed = true.
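The two-step flow can be sketched as a small gate in front of the handler. This is an illustration under assumptions: executeWithConfirmation and the Gate* types are hypothetical, the handler is synchronous for brevity, and the real registry may structure this differently.

```typescript
// Simplified stand-ins for the registration and result types (assumptions).
type GateResult = {
  ok: boolean;
  status: string;
  message: string;
  confirmation?: { actionName: string; risk: string; message: string; input: unknown };
};
interface GateRegistration {
  name: string;
  confirmation?: { risk: string; message: string };
  handler: (input: unknown) => GateResult;
}

// Hypothetical gate: runs the handler only when no confirmation is required
// or the caller has already confirmed.
function executeWithConfirmation(
  registration: GateRegistration,
  input: unknown,
  context: { confirmed?: boolean } = {},
): GateResult {
  if (registration.confirmation && !context.confirmed) {
    return {
      ok: false,
      status: 'needs_confirmation',
      message: registration.confirmation.message,
      // Echo the original input so the caller can re-execute unchanged.
      confirmation: { actionName: registration.name, ...registration.confirmation, input },
    };
  }
  return registration.handler(input);
}
```

The first call returns the prompt without touching the handler; calling again with { confirmed: true } and the echoed input runs it.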

Registering Voice Actions

useRegisterVoiceActions

Pages register their local actions with the useRegisterVoiceActions hook. Registrations are automatically cleaned up when the component unmounts.
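export function useRegisterVoiceActions(registrations: VoiceActionRegistration[]) {
  const registry = useContext(VoiceAgentContext);

  // The effect re-runs whenever the `registrations` array identity changes,
  // so passing a stable (for example memoized) array avoids re-registering on every render.
  useEffect(() => {
    if (!registry) return;
    // Register each action and keep its unregister function for cleanup
    const unregister = registrations.map((registration) => registry.register(registration));
    return () => {
      unregister.forEach((fn) => fn());
    };
  }, [registry, registrations]);
}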

VoiceActionRegistration

interface VoiceActionRegistration<Input = unknown> {
  name: VoiceActionName;       // Must be a known VoiceActionName
  scope: string;               // Descriptive string, e.g. 'timeline-page'
  description?: string;        // Shown in the agent's tool definition
  confirmation?: {
    risk: VoiceActionRisk;
    message: string;           // Spoken to user: "Are you sure you want to…?"
  };
  handler: VoiceActionHandler<Input>;
}

Example: Registering a Custom Voice Action

import { useRegisterVoiceActions } from '@/voice/VoiceAgentProvider';

function TimelinePage() {
  const { triggerGeneration } = useGeneration();

  useRegisterVoiceActions([
    {
      name: 'timeline_generate_all_images',
      scope: 'timeline-page',
      description: 'Generate all shot images for the current project timeline',
      confirmation: {
        risk: 'generation',
        message: 'This will use credits to generate all shot images. Continue?',
      },
      handler: async (input, context) => {
        if (!context.currentProjectId) {
          return {
            ok: false,
            status: 'invalid_input',
            message: 'No project is open. Please open a project first.',
          };
        }
        await triggerGeneration(context.currentProjectId);
        return {
          ok: true,
          status: 'completed',
          message: 'Shot image generation started.',
        };
      },
    },
  ]);

  // ...
}

Accessing the Registry Directly

Use useVoiceActionRegistry() to read or execute actions imperatively from child components:
import { useVoiceActionRegistry } from '@/voice/VoiceAgentProvider';

function DebugPanel() {
  const registry = useVoiceActionRegistry();

  const listAll = () => {
    const names = registry.list().map((r) => r.name);
    console.log('Registered actions:', names);
  };

  return <button onClick={listAll}>List voice actions</button>;
}
createGlobalVoiceActions produces the baseline set of actions that are always available regardless of which page is open. These are re-registered on every route change so they always reflect the current location and project context.
| Action name | What it does |
| --- | --- |
| get_app_context | Returns the current locationPath, currentProjectId, and availableActions array |
| navigate_app | Navigates to any VoiceNavigationTarget (home, kanvas, project timeline, editor, Directors’ Cut, etc.) |
| start_new_project | Opens the project setup page |
| open_project_view | Opens a specific view within the current project (timeline, editor, studio, observability) |
| open_ip_vault | Opens the IP Vault page |
| character_open | Opens Kanvas with the character-creation studio |
| kanvas_set_studio | Opens Kanvas with a specific studio mode and an optional text prompt |
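As an illustration of how get_app_context might assemble its payload from the getters passed to createGlobalVoiceActions, consider the sketch below. The function names, the message text, and the nullable project id are assumptions; only the getter names and the three data fields come from this page.

```typescript
// Shape of the injected getters, mirroring the provider's call site.
interface AppContextDeps {
  getLocationPath: () => string;
  getCurrentProjectId: () => string | null; // nullability is an assumption
  getAvailableActionNames: () => string[];
}

// Hypothetical handler body for get_app_context.
function buildAppContextResult(deps: AppContextDeps) {
  return {
    ok: true as const,
    status: 'completed' as const,
    message: 'Here is the current app context.',
    data: {
      locationPath: deps.getLocationPath(),
      currentProjectId: deps.getCurrentProjectId(),
      availableActions: deps.getAvailableActionNames(),
    },
  };
}

// Stacked registrations repeat names, so dedupe and sort as the provider does.
function uniqueSortedNames(registrations: { name: string }[]): string[] {
  return Array.from(new Set(registrations.map((r) => r.name))).sort();
}
```

Reading the values through getters at call time, rather than capturing them when the action is created, keeps the answer current.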
Navigation targets include all major app surfaces:
type VoiceNavigationTarget =
  | 'home' | 'project_setup' | 'assets' | 'ip_vault'
  | 'kanvas' | 'kanvas_image' | 'kanvas_video' | 'kanvas_edit'
  | 'kanvas_cinema' | 'kanvas_lipsync' | 'kanvas_character_creation'
  | 'project_studio' | 'project_timeline' | 'project_editor'
  | 'project_directors_cut' | 'project_observability'
  | 'settings_billing' | 'learning_studio';

Realtime Session

useWzrdRealtimeSession manages the full lifecycle of the OpenAI Realtime WebRTC session. It exposes push-to-talk controls and a typed VoiceSessionStatus.
type VoiceSessionStatus =
  | 'idle'        // No session active
  | 'connecting'  // WebRTC handshake in progress
  | 'connected'   // Ready, microphone off
  | 'listening'   // Microphone active, capturing speech
  | 'thinking'    // Model processing input / executing tool calls
  | 'speaking'    // Model audio playing back
  | 'confirming'  // Awaiting user confirmation for a risky action
  | 'error';      // Session error — see errorMessage

Push-to-Talk API

const {
  status,
  errorMessage,
  pushToTalkStart,  // Call on mic button press
  pushToTalkStop,   // Call on mic button release
  disconnect,       // Tear down the WebRTC session
} = useWzrdRealtimeSession({ registry });
pushToTalkStart lazily connects the session on the first press. Subsequent presses interrupt any in-progress model response before capturing new input. pushToTalkStop commits the audio buffer and sends response.create to the model.
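The event ordering behind these two calls can be sketched as follows. The event type strings are standard OpenAI Realtime client events, but the controller itself is hypothetical: `send` stands in for the WebRTC data channel, and the lazy connect and microphone handling are left out.

```typescript
type RealtimeClientEvent = { type: string };

// Hypothetical controller; not the actual useWzrdRealtimeSession internals.
function createPushToTalkSketch(send: (event: RealtimeClientEvent) => void) {
  let responseInProgress = false;

  return {
    // Call when server events show the model has started or finished responding.
    setResponseInProgress(value: boolean): void {
      responseInProgress = value;
    },
    start(): void {
      // Interrupt any in-progress model response before capturing new input.
      if (responseInProgress) {
        send({ type: 'response.cancel' });
        responseInProgress = false;
      }
      // Drop stale audio so the next commit only contains this press.
      send({ type: 'input_audio_buffer.clear' });
    },
    stop(): void {
      // Commit the captured audio and ask the model to respond.
      send({ type: 'input_audio_buffer.commit' });
      send({ type: 'response.create' });
      responseInProgress = true;
    },
  };
}
```

With turn detection disabled, nothing is sent to the model until stop() commits the buffer, which is what makes the button behave as push-to-talk.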

Session Configuration

The session is initialized with the app’s voice instructions, all registered tool definitions, and turn_detection: null (push-to-talk mode). Transcription uses gpt-4o-mini-transcribe.
sessionConfig: {
  modalities: ['text', 'audio'],
  voice: 'marin',   // VITE_WZRD_REALTIME_VOICE
  instructions: getVoiceInstructions(),
  tools: getVoiceToolDefinitions(registry),
  tool_choice: 'auto',
  turn_detection: null,
  input_audio_transcription: { model: 'gpt-4o-mini-transcribe' },
}
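getVoiceToolDefinitions is not shown on this page. One plausible shape, assuming each registered action becomes a Realtime function tool and that the newest registration per name supplies the description, is sketched below. The permissive parameter schema and the fallback description are placeholders, not the app's real schemas.

```typescript
interface ToolSourceRegistration {
  name: string;
  description?: string;
}
interface RealtimeFunctionTool {
  type: 'function';
  name: string;
  description: string;
  parameters: { type: 'object'; properties: Record<string, unknown>; additionalProperties: boolean };
}

// Hypothetical conversion from registrations to tool definitions.
function toToolDefinitions(registrations: ToolSourceRegistration[]): RealtimeFunctionTool[] {
  // Later registrations win, mirroring the stacked registry.
  const latestByName = new Map<string, ToolSourceRegistration>();
  for (const registration of registrations) {
    latestByName.set(registration.name, registration);
  }

  return Array.from(latestByName.values()).map((registration) => ({
    type: 'function' as const,
    name: registration.name,
    description: registration.description ?? `Run the ${registration.name} voice action.`,
    // Placeholder schema: accept any object. Real actions would declare their inputs.
    parameters: { type: 'object' as const, properties: {}, additionalProperties: true },
  }));
}
```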

Authentication & API Key

The session key is fetched through the realtime-client-secret Supabase Edge Function, keeping your OpenAI key off the client:
// Internally called by useWzrdRealtimeSession
const sessionInfo = await fetchRealtimeClientSecret();
// → { clientSecret: '...', model: 'gpt-4o-realtime-preview' }
The OpenAI API key never reaches the client. useWzrdRealtimeSession always calls the realtime-client-secret Supabase Edge Function to obtain a short-lived ephemeral key. Never expose your OpenAI service key in the renderer process.

VoiceSelectionContext

Pages that need to highlight or scroll UI elements in response to voice commands use VoiceSelectionContext:
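import { useEffect, useRef } from 'react';
import { useVoiceSelection } from '@/voice/VoiceSelectionContext';

function ShotCard({ shotId }: { shotId: string }) {
  const ref = useRef<HTMLDivElement>(null);
  const { registerTarget } = useVoiceSelection();

  // Register this card as a voice target; the value returned by
  // registerTarget is used as the effect cleanup on unmount.
  useEffect(() => {
    return registerTarget(shotId, ref);
  }, [registerTarget, shotId]);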

  return <div ref={ref}>{/* ... */}</div>;
}
scrollVoiceTargetIntoView(id) scrolls the registered element into the viewport and applies a brief selection highlight — used by timeline_select_shot, ip_vault_select_item, and similar actions.
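A rough mental model of the target bookkeeping is sketched below, assuming the context keeps a map from id to ref. The names ending in Sketch and the ScrollableTarget type are hypothetical, the highlight step is omitted, and the scroll options are only an example.

```typescript
// Structural stand-in for an HTMLElement so the sketch runs outside the browser.
interface ScrollableTarget {
  scrollIntoView(options?: { behavior?: 'smooth' | 'auto'; block?: 'center' | 'nearest' }): void;
}
type TargetRef = { current: ScrollableTarget | null };

function createTargetRegistrySketch() {
  const targets = new Map<string, TargetRef>();

  return {
    registerTarget(id: string, ref: TargetRef): () => void {
      targets.set(id, ref);
      // Only remove the entry if it still points at this ref, so a newer
      // registration under the same id is not dropped by a stale cleanup.
      return () => {
        if (targets.get(id) === ref) targets.delete(id);
      };
    },
    scrollVoiceTargetIntoView(id: string): boolean {
      const element = targets.get(id)?.current;
      if (!element) return false;
      element.scrollIntoView({ behavior: 'smooth', block: 'center' });
      return true;
    },
  };
}
```

Returning a boolean lets a voice action handler report 'unavailable' when the requested element is not mounted.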

Test Harness

In development builds with VITE_BYPASS_AUTH_FOR_TESTS=true, a global test harness is attached to window:
window.__wzrdVoiceActionTest = {
  execute: (name, input?, options?) => registry.execute(name, input, options),
};
This lets you invoke any registered voice action from the browser console or from Playwright tests without speaking:
// In browser DevTools or a Playwright test
await window.__wzrdVoiceActionTest.execute('navigate_app', { target: 'kanvas_image' });
// → { ok: true, status: 'completed', message: 'Opened kanvas_image.', data: { path: '/kanvas?studio=image' } }
The __wzrdVoiceActionTest harness is only attached when both import.meta.env.DEV is true and VITE_BYPASS_AUTH_FOR_TESTS is "true". It is stripped from production builds.
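The double guard can be expressed as a small function. This is a sketch with assumed names: attachVoiceTestHarness is hypothetical, `env` mirrors import.meta.env, and `target` mirrors window so the logic can run outside the browser.

```typescript
type HarnessExecute = (name: string, input?: unknown, options?: unknown) => unknown;

// Hypothetical helper; returns true when the harness was attached.
function attachVoiceTestHarness(
  target: { __wzrdVoiceActionTest?: { execute: HarnessExecute } },
  env: { DEV: boolean; VITE_BYPASS_AUTH_FOR_TESTS?: string },
  execute: HarnessExecute,
): boolean {
  // Both conditions must hold; the flag is compared as the string "true".
  if (!env.DEV || env.VITE_BYPASS_AUTH_FOR_TESTS !== 'true') return false;
  target.__wzrdVoiceActionTest = { execute };
  return true;
}
```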

Out-of-Band Narration

Pages can push contextual narration to the voice agent without the user pressing the mic button. Fire the wzrd:voice-oob-narrate custom DOM event:
window.dispatchEvent(
  new CustomEvent('wzrd:voice-oob-narrate', {
    detail: {
      text: 'Storyline generation is complete. Your project has 12 scenes and 48 shots.',
      topic: 'storyline_stream',
    },
  })
);
The active Realtime session will speak the update as a single concise highlight. If no session is connected, the event is silently ignored.
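How the session turns this event into speech is not documented here. One way it could work, assuming the listener forwards the text as an out-of-band response.create request, is sketched below. buildOobNarrationEvent, the instruction wording, and the 'general' fallback topic are all hypothetical.

```typescript
interface OobNarrationDetail {
  text: string;
  topic?: string;
}

// Hypothetical: builds the client event for a narration, or null when it should be ignored.
function buildOobNarrationEvent(detail: OobNarrationDetail, isConnected: boolean) {
  // No connected session or empty text: ignore silently.
  if (!isConnected || detail.text.trim() === '') return null;
  return {
    type: 'response.create' as const,
    response: {
      // 'none' keeps this response out of the default conversation history.
      conversation: 'none' as const,
      metadata: { topic: detail.topic ?? 'general' },
      instructions: `Briefly tell the user, in one sentence: ${detail.text}`,
    },
  };
}
```

A listener for wzrd:voice-oob-narrate would pass event.detail to this function and send the result over the data channel when it is not null.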
