Audio/Video Transcription

Overview

The transcription feature uses the ElevenLabs Scribe API to convert audio and video files into text transcripts. It supports multiple audio formats, language detection, and configuration options for timestamps, speaker diarization, keyterms, and sampling behavior.

Supported Audio Formats

The application accepts the following audio and video formats:

type AudioType =
  | "audio/mpeg"
  | "audio/wav"
  | "audio/ogg"
  | "audio/mp3"
  | "audio/m4a"
  | "audio/aac"
  | "audio/webm";

The system automatically detects the audio type from the file extension or MIME type.
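The detection logic itself is not shown on this page. As a minimal sketch of one way it could work, the helper below prefers a recognized MIME type and falls back to the file extension. The name `detectAudioType` and the extension map are assumptions for illustration, not the playground's code:

```typescript
type AudioType =
  | "audio/mpeg"
  | "audio/wav"
  | "audio/ogg"
  | "audio/mp3"
  | "audio/m4a"
  | "audio/aac"
  | "audio/webm";

// Hypothetical extension map; the playground's real mapping may differ.
const EXTENSION_TO_TYPE = new Map<string, AudioType>([
  ["mp3", "audio/mpeg"],
  ["wav", "audio/wav"],
  ["ogg", "audio/ogg"],
  ["m4a", "audio/m4a"],
  ["aac", "audio/aac"],
  ["webm", "audio/webm"],
]);

const KNOWN_TYPES = new Set<string>([
  "audio/mpeg",
  "audio/wav",
  "audio/ogg",
  "audio/mp3",
  "audio/m4a",
  "audio/aac",
  "audio/webm",
]);

function isAudioType(value: string): value is AudioType {
  return KNOWN_TYPES.has(value);
}

// Accepts anything shaped like a browser File (a name plus a MIME type).
function detectAudioType(file: { name: string; type: string }): AudioType | undefined {
  // Prefer the MIME type when it is present and recognized.
  if (isAudioType(file.type)) return file.type;
  // Otherwise fall back to the (case-insensitive) file extension.
  const dot = file.name.lastIndexOf(".");
  if (dot < 0) return undefined;
  return EXTENSION_TO_TYPE.get(file.name.slice(dot + 1).toLowerCase());
}
```

Returning `undefined` for unrecognized files lets the caller reject the upload before any API call is made.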

Configuration Options

All transcription options are defined in the TranscriptOptions type:

export type TranscriptOptions = {
  modelId: "scribe_v1" | "scribe_v2";
  languageCode?: string;
  tagAudioEvents: boolean;
  numSpeakers?: number;
  timestampsGranularity: "none" | "word" | "character";
  diarize: boolean;
  diarizationThreshold?: number;
  temperature?: number;
  seed?: number;
  useMultiChannel: boolean;
  keyterms?: string[];
  entityDetection?: string;
};

Key Options

Model Selection

Choose between two transcription models:

scribe_v1: First generation model
scribe_v2: Latest model with improved accuracy (default)

modelId: "scribe_v2"

Timestamps

Control the granularity of timestamp data:

none: No timestamps
word: Word-level timestamps
character: Character-level timestamps (default)

timestampsGranularity: "character"

Language

Optionally specify a language code to improve accuracy:

languageCode: "en"  // English
languageCode: "es"  // Spanish
languageCode: "fr"  // French

Making API Calls

The transcription is performed using the ElevenLabs JavaScript SDK. Here’s the actual implementation from the playground:

import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";

async function handleTranscribe(file: File, apiKey: string, options: TranscriptOptions) {
  const browserClient = new ElevenLabsClient({ apiKey });
  
  const transcriptResponse = await browserClient.speechToText.convert({
    file,
    modelId: options.modelId || "scribe_v2",
    languageCode: options.languageCode || undefined,
    tagAudioEvents: options.tagAudioEvents || false,
    numSpeakers: options.numSpeakers || undefined,
    timestampsGranularity: options.timestampsGranularity || "character",
    diarize: options.diarize || false,
    diarizationThreshold: options.diarizationThreshold || undefined,
    temperature: options.temperature || undefined,
    seed: options.seed || undefined,
    useMultiChannel: options.useMultiChannel || false,
    keyterms: options.keyterms || undefined,
    entityDetection: options.entityDetection || undefined,
  });
  
  return transcriptResponse;
}

The API call is made in speech-to-text-playground.tsx:48-62. Each option is passed through to the ElevenLabs SDK; values that are not set fall back to the defaults shown in the call.
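One detail in the call above is worth knowing: `||` falls back whenever the left-hand value is falsy, so a `temperature` or `seed` of `0` is sent as `undefined`. Where an explicit zero should be preserved, nullish coalescing (`??`) is an alternative. The sketch below illustrates the difference; `SamplingOptions` and both helper functions are hypothetical names, not part of the playground:

```typescript
// Hypothetical subset of the options: just the numeric sampling fields.
// null stands in for a form field the user has cleared.
type SamplingOptions = {
  temperature?: number | null;
  seed?: number | null;
};

// `??` only falls back on null or undefined, so an explicit 0 survives.
function buildSamplingParams(options: SamplingOptions) {
  return {
    temperature: options.temperature ?? undefined,
    seed: options.seed ?? undefined,
  };
}

// For comparison: the `||` form treats 0 as unset and drops it.
function buildSamplingParamsWithOr(options: SamplingOptions) {
  return {
    temperature: options.temperature || undefined,
    seed: options.seed || undefined,
  };
}

const kept = buildSamplingParams({ temperature: 0, seed: 0 });
const dropped = buildSamplingParamsWithOr({ temperature: 0, seed: 0 });
```

Here `kept.temperature` is `0`, while `dropped.temperature` is `undefined`.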

Advanced Options

Temperature & Seed

Control the randomness and reproducibility of transcriptions:

temperature: 0.5  // Range: 0.0-2.0 (lower = more deterministic)
seed: 42          // Fixed seed for reproducible results
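A small guard can catch out-of-range values before a request is sent. This sketch assumes the 0.0-2.0 range stated above and text input from a form field; `normalizeTemperature` is a hypothetical helper, not part of the playground:

```typescript
const MIN_TEMPERATURE = 0.0;
const MAX_TEMPERATURE = 2.0;

// Returns undefined for empty or non-numeric input so the option can be
// omitted, and clamps numeric input into the stated range.
function normalizeTemperature(input: string): number | undefined {
  const trimmed = input.trim();
  if (trimmed === "") return undefined;
  const value = Number(trimmed);
  if (!Number.isFinite(value)) return undefined;
  return Math.min(MAX_TEMPERATURE, Math.max(MIN_TEMPERATURE, value));
}
```

Clamping is one design choice; rejecting out-of-range input with a validation message is an equally reasonable alternative.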

Keyterms

Provide domain-specific terms to improve recognition accuracy:

keyterms: ["ElevenLabs", "API", "transcription"]

Keyterms are parsed from comma-separated input:

export function parseKeytermsInput(value: string): string[] | undefined {
  const trimmed = value.trim();
  if (!trimmed) return undefined;
  return trimmed
    .split(",")
    .map((item) => item.trim())
    .filter((item) => item.length > 0);
}
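A few sample inputs show how the parser behaves. The function is repeated here so the snippet runs on its own; the input strings are illustrative:

```typescript
function parseKeytermsInput(value: string): string[] | undefined {
  const trimmed = value.trim();
  if (!trimmed) return undefined;
  return trimmed
    .split(",")
    .map((item) => item.trim())
    .filter((item) => item.length > 0);
}

// Whitespace around each term is trimmed and empty entries are dropped.
const terms = parseKeytermsInput(" ElevenLabs, API ,, transcription ");
// terms is ["ElevenLabs", "API", "transcription"]

// Blank input yields undefined, so the option can be omitted entirely.
const blank = parseKeytermsInput("   ");

// Input made only of separators yields an empty array, not undefined.
const separatorsOnly = parseKeytermsInput(" , ,");
```

Callers that want to omit `keyterms` in the separators-only case can check the array's `length` before passing it on.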

Entity Detection

Detect and optionally redact sensitive information:

entityDetection: "pii"  // Personally Identifiable Information
entityDetection: "phi"  // Protected Health Information
entityDetection: "all"  // All entity types

Audio Event Tagging

Tag non-speech audio events (laughter, applause, etc.):

tagAudioEvents: true

Response Handling

The API returns a SpeechToTextChunkResponseModel containing:

Full transcript text
Word-level data with timestamps
Speaker information (if diarization enabled)
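Character-level alignment data

The playground stores the response together with an object URL for the uploaded file and alignment data derived from the response: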

const audioUrl = URL.createObjectURL(file);
const alignment = convertToAlignment(transcriptResponse);

setResult({
  transcript: transcriptResponse,
  audioUrl,
  alignment,
});
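When diarization is enabled, the word list can be folded into per-speaker segments for display. The sketch below assumes each word entry carries `text`, `start`, `end`, and `speakerId` fields, and that spacing appears as its own entries so that concatenation reproduces the text. Both the `TranscriptWord` shape and the `groupBySpeaker` helper are assumptions; check them against the SDK's response type before relying on them:

```typescript
// Assumed shape of one entry in the response's word list.
type TranscriptWord = {
  text: string;
  start?: number;
  end?: number;
  speakerId?: string;
};

type SpeakerSegment = {
  speakerId: string;
  text: string;
  start: number | undefined;
  end: number | undefined;
};

// Merge consecutive entries from the same speaker into one segment.
function groupBySpeaker(words: TranscriptWord[]): SpeakerSegment[] {
  const segments: SpeakerSegment[] = [];
  for (const word of words) {
    const speakerId = word.speakerId ?? "unknown";
    const last = segments[segments.length - 1];
    if (last && last.speakerId === speakerId) {
      // Same speaker: extend the current segment.
      last.text += word.text;
      last.end = word.end ?? last.end;
    } else {
      // Speaker changed (or first entry): start a new segment.
      segments.push({ speakerId, text: word.text, start: word.start, end: word.end });
    }
  }
  return segments;
}
```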

Error Handling

The application includes specialized error parsing for ElevenLabs API errors:

export function getElevenLabsErrorMessage(error: unknown): string | undefined {
  if (!error || typeof error !== "object") return undefined;
  const body = Reflect.get(error, "body");
  const parsedBody = typeof body === "string" ? safeParseJson(body) : body;
  if (!parsedBody || typeof parsedBody !== "object") return undefined;

  const detail = Reflect.get(parsedBody, "detail");
  if (typeof detail === "string") return detail;

  if (detail && typeof detail === "object") {
    const detailMessage = Reflect.get(detail, "message");
    if (typeof detailMessage === "string") return detailMessage;
    return JSON.stringify(detail);
  }

  return undefined;
}

Always handle API errors gracefully and provide meaningful feedback to users about authentication issues, file format problems, or quota limits.
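As a sketch of how the parser might feed user-facing messages, the snippet below wraps it in a fallback chain. The `safeParseJson` body is an assumed implementation (the playground's version is not shown on this page), and `toUserMessage` and its fallback text are hypothetical:

```typescript
// Assumed implementation of the safeParseJson helper referenced above.
function safeParseJson(text: string): unknown {
  try {
    return JSON.parse(text);
  } catch {
    return undefined;
  }
}

// Same logic as getElevenLabsErrorMessage above, repeated so the sketch runs alone.
function getElevenLabsErrorMessage(error: unknown): string | undefined {
  if (!error || typeof error !== "object") return undefined;
  const body = Reflect.get(error, "body");
  const parsedBody = typeof body === "string" ? safeParseJson(body) : body;
  if (!parsedBody || typeof parsedBody !== "object") return undefined;
  const detail = Reflect.get(parsedBody, "detail");
  if (typeof detail === "string") return detail;
  if (detail && typeof detail === "object") {
    const detailMessage = Reflect.get(detail, "message");
    return typeof detailMessage === "string" ? detailMessage : JSON.stringify(detail);
  }
  return undefined;
}

// Hypothetical wrapper: prefer the API's own message, then Error.message,
// then a generic fallback.
function toUserMessage(error: unknown): string {
  const apiMessage = getElevenLabsErrorMessage(error);
  if (apiMessage) return apiMessage;
  if (error instanceof Error && error.message) return error.message;
  return "Transcription failed. Please try again.";
}
```

A caller would typically wrap the transcription call in `try`/`catch` and pass the caught value to `toUserMessage`.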

Default Configuration

The playground uses these defaults:

const defaultTranscriptOptions: TranscriptOptions = {
  modelId: "scribe_v2",
  tagAudioEvents: false,
  timestampsGranularity: "character",
  diarize: false,
  useMultiChannel: false,
};
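Per-request overrides can be layered on top of these defaults with an object spread. A minimal sketch, where `withDefaults` is a hypothetical helper and the type is repeated so the snippet runs on its own:

```typescript
type TranscriptOptions = {
  modelId: "scribe_v1" | "scribe_v2";
  languageCode?: string;
  tagAudioEvents: boolean;
  numSpeakers?: number;
  timestampsGranularity: "none" | "word" | "character";
  diarize: boolean;
  diarizationThreshold?: number;
  temperature?: number;
  seed?: number;
  useMultiChannel: boolean;
  keyterms?: string[];
  entityDetection?: string;
};

const defaultTranscriptOptions: TranscriptOptions = {
  modelId: "scribe_v2",
  tagAudioEvents: false,
  timestampsGranularity: "character",
  diarize: false,
  useMultiChannel: false,
};

// Later spreads win, so overrides replace defaults. A new object is
// returned; the defaults themselves are not mutated.
function withDefaults(overrides: Partial<TranscriptOptions>): TranscriptOptions {
  return { ...defaultTranscriptOptions, ...overrides };
}

const options = withDefaults({ diarize: true, numSpeakers: 2, languageCode: "en" });
```

An override explicitly set to `undefined` replaces the default as well, so omit a key rather than setting it to `undefined`.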

Next Steps

Learn about Speaker Diarization to identify different speakers
Explore the Transcript Viewer for displaying results
Configure Audio Playback controls
