Overview
Speaker diarization automatically detects and separates different speakers in an audio file. This feature is essential for transcribing meetings, interviews, podcasts, and any multi-speaker content.
Enabling Diarization
To enable speaker detection, set the diarize option to true:
const options: TranscriptOptions = {
modelId: "scribe_v2",
timestampsGranularity: "character",
diarize: true, // Enable speaker detection
// ... other options
};
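The options above use camelCase names. The ElevenLabs speech-to-text endpoint takes multipart form fields in snake_case, so a request has to translate them at some point. The sketch below shows one possible mapping. The field names (model_id, timestamps_granularity, diarize, num_speakers, diarization_threshold, use_multi_channel) are assumed from the public API reference. The helper toFormFields and the trimmed-down TranscriptOptions shape are hypothetical and are not taken from this app's source.

```typescript
// Assumed, trimmed-down option shape for this sketch; the app's real
// TranscriptOptions type may contain more fields.
type TranscriptOptions = {
  modelId: string;
  timestampsGranularity?: "none" | "word" | "character";
  diarize?: boolean;
  numSpeakers?: number;
  diarizationThreshold?: number;
  useMultiChannel?: boolean;
};

// Hypothetical helper: converts camelCase options into the snake_case
// form fields assumed by the speech-to-text endpoint. Undefined values
// are omitted so the API can apply its own defaults.
function toFormFields(options: TranscriptOptions): Record<string, string> {
  const fields: Record<string, string> = { model_id: options.modelId };
  if (options.timestampsGranularity !== undefined) {
    fields.timestamps_granularity = options.timestampsGranularity;
  }
  if (options.diarize !== undefined) {
    fields.diarize = String(options.diarize);
  }
  if (options.numSpeakers !== undefined) {
    fields.num_speakers = String(options.numSpeakers);
  }
  if (options.diarizationThreshold !== undefined) {
    fields.diarization_threshold = String(options.diarizationThreshold);
  }
  if (options.useMultiChannel !== undefined) {
    fields.use_multi_channel = String(options.useMultiChannel);
  }
  return fields;
}
```

Each returned entry could then be appended to a FormData object alongside the audio file. Omitting unset fields, rather than sending empty strings, leaves the decision to the API's defaults.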
Configuration Options
Number of Speakers
You can specify the expected number of speakers, or let the API auto-detect:
Auto-detect: leave numSpeakers undefined to automatically detect the number of speakers:
diarize: true,
numSpeakers: undefined // Auto-detect
Fixed count: specify the exact number of speakers (1-32):
diarize: true,
numSpeakers: 3 // Expect 3 speakers
This can improve accuracy when you know the speaker count in advance.
Diarization Threshold
When auto-detecting speakers, you can control the sensitivity with diarizationThreshold:
diarize: true,
numSpeakers: undefined,
diarizationThreshold: 0.6 // Range: 0.0-1.0
- Lower values (e.g., 0.3): More speakers detected, may split single speaker
- Higher values (e.g., 0.8): Fewer speakers detected, may merge different speakers
- Default: API determines optimal threshold
The diarizationThreshold option only applies when numSpeakers is not specified. If you provide a fixed speaker count, this threshold is ignored.
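These rules can be checked on the client before a request is sent. The following is a minimal sketch assuming the ranges stated above (1-32 speakers, a threshold between 0.0 and 1.0). The function validateDiarizationOptions and the DiarizationOptions type are hypothetical names for illustration. Whether the API rejects a threshold combined with a fixed speaker count, or silently ignores it, is not established here, so the sketch only reports it.

```typescript
// Subset of transcription options relevant to diarization (assumed shape).
type DiarizationOptions = {
  diarize?: boolean;
  numSpeakers?: number;
  diarizationThreshold?: number;
};

// Hypothetical client-side check mirroring the rules described above.
// Returns a list of human-readable problems; an empty list means valid.
function validateDiarizationOptions(options: DiarizationOptions): string[] {
  const errors: string[] = [];
  const { diarize, numSpeakers, diarizationThreshold } = options;

  if (numSpeakers !== undefined) {
    if (!Number.isInteger(numSpeakers) || numSpeakers < 1 || numSpeakers > 32) {
      errors.push("numSpeakers must be an integer between 1 and 32");
    }
  }

  if (diarizationThreshold !== undefined) {
    if (!diarize) {
      errors.push("diarizationThreshold requires diarize to be true");
    }
    if (numSpeakers !== undefined) {
      errors.push("diarizationThreshold has no effect when numSpeakers is set");
    }
    // Written as a negated range test so that NaN is also rejected.
    if (!(diarizationThreshold >= 0 && diarizationThreshold <= 1)) {
      errors.push("diarizationThreshold must be between 0.0 and 1.0");
    }
  }
  return errors;
}
```

For example, { diarize: true, numSpeakers: 3, diarizationThreshold: 0.5 } yields one message. The same options without the threshold pass cleanly.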
UI Implementation
The transcription form includes conditional rendering for the diarization threshold input:
{options.diarize && !options.numSpeakers && (
<div className="space-y-2">
<Label htmlFor="diarization-threshold">
Diarization Threshold (0.0-1.0)
</Label>
<Input
id="diarization-threshold"
type="number"
step="0.01"
min="0"
max="1"
placeholder="Auto"
value={options.diarizationThreshold ?? ""}
onChange={handleDiarizationThresholdChange}
/>
</div>
)}
This input only appears when:
- Diarization is enabled (options.diarize === true)
- Speaker count is not fixed (!options.numSpeakers)
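The handleDiarizationThresholdChange handler is not shown on this page. One plausible way to turn the input's raw string into the option value is sketched below, assuming an empty field means "Auto" (undefined) and that out-of-range values should be clamped rather than rejected. parseDiarizationThreshold is a hypothetical helper, not part of the app.

```typescript
// Hypothetical parser for the threshold input's raw string value.
// An empty or non-numeric field maps to undefined ("Auto"); any other
// value is clamped to the 0.0-1.0 range accepted by the option.
function parseDiarizationThreshold(raw: string): number | undefined {
  const trimmed = raw.trim();
  if (trimmed === "") return undefined;
  const value = Number(trimmed);
  if (Number.isNaN(value)) return undefined;
  return Math.min(1, Math.max(0, value));
}
```

Inside the component, the handler could then call setOptions (assumed state setter) with { ...prev, diarizationThreshold: parseDiarizationThreshold(e.target.value) }. Note that a threshold of 0 is a real value here, which is why the input should display it with ?? rather than ||.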
Working with Speaker Data
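When diarization is enabled, each word in the transcript carries a speaker ID:
type TranscriptWord = {
text: string;
type: "word" | "spacing" | "audio_event"; // token kind, checked by the export code below
start: number; // seconds
end: number; // seconds
speakerId?: string; // e.g., "speaker_0", "speaker_1"
// ... other fields
};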
Get all unique speakers from a transcript:
export function getUniqueSpeakers(words: TranscriptWord[]): string[] {
const speakers = new Set<string>();
words.forEach((word) => {
if (word.speakerId) {
speakers.add(word.speakerId);
}
});
return Array.from(speakers).sort();
}
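Speaker IDs can also be used to summarize how much each person spoke. The sketch below assumes start and end are in seconds and sums the duration of word tokens only, skipping spacing tokens so that pauses are not counted. getSpeakingTime is a hypothetical helper, and the local TranscriptWord type is a reduced, assumed version of the one above.

```typescript
// Reduced word shape for this sketch (assumed; times in seconds).
type TranscriptWord = {
  text: string;
  start: number;
  end: number;
  type?: string;
  speakerId?: string;
};

// Hypothetical helper: total speaking time per speaker, in seconds,
// computed by summing each word's duration (end - start). Words without
// a speakerId and spacing tokens are skipped.
function getSpeakingTime(words: TranscriptWord[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const word of words) {
    if (!word.speakerId || word.type === "spacing") continue;
    const duration = Math.max(0, word.end - word.start);
    totals.set(word.speakerId, (totals.get(word.speakerId) ?? 0) + duration);
  }
  return totals;
}
```

Combined with getUniqueSpeakers, this gives a per-speaker overview that can be shown next to the speaker name inputs.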
Speaker Name Mapping
The application allows users to assign custom names to detected speakers:
type SpeakerNames = Record<string, string>;
// Example:
const speakerNames: SpeakerNames = {
"speaker_0": "Alice",
"speaker_1": "Bob",
"speaker_2": "Charlie"
};
Users can update speaker names dynamically:
function handleSpeakerNameChange(speakerId: string, newName: string) {
setSpeakerNames((prev) => ({
...prev,
[speakerId]: newName,
}));
}
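The export function further down expects a getSpeakerName callback. A minimal sketch of one is shown here, assuming that speakers without a custom name, or with a blank one, should fall back to a readable label derived from the "speaker_<n>" ID. defaultSpeakerLabel and makeSpeakerNameResolver are hypothetical helpers, and the 1-based "Speaker N" label format is an assumption.

```typescript
type SpeakerNames = Record<string, string>;

// Hypothetical fallback label: "speaker_0" -> "Speaker 1". IDs that do not
// match the "speaker_<n>" pattern are returned unchanged.
function defaultSpeakerLabel(speakerId: string): string {
  const match = /^speaker_(\d+)$/.exec(speakerId);
  return match ? `Speaker ${Number(match[1]) + 1}` : speakerId;
}

// Builds a getSpeakerName callback: a non-blank custom name wins,
// otherwise the default label is used.
function makeSpeakerNameResolver(
  names: SpeakerNames
): (speakerId: string) => string {
  return (speakerId) => {
    const custom = (names[speakerId] ?? "").trim();
    return custom !== "" ? custom : defaultSpeakerLabel(speakerId);
  };
}
```

Falling back per call, rather than pre-filling the map, keeps the stored names limited to what the user actually typed.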
Transcript Export with Speakers
The application can generate markdown transcripts with speaker labels:
export function buildTranscriptMarkdown(
words: TranscriptWord[],
options: {
includeTimestamps: boolean;
includeSpeakers: boolean;
getSpeakerName: (speakerId: string) => string;
}
): string {
let markdown = "# Transcript\n\n";
let currentSpeaker: string | undefined;
let currentParagraph: MarkdownWord[] = [];
function flushParagraph() {
if (currentParagraph.length === 0) return;
if (options.includeSpeakers && currentSpeaker) {
markdown += `**${options.getSpeakerName(currentSpeaker)}:** `;
}
// Trim the trailing spacing token that precedes a speaker change
markdown += currentParagraph.map((word) => word.text).join("").trim();
if (
options.includeTimestamps &&
currentParagraph[0] &&
currentParagraph[0].time !== undefined
) {
markdown += ` _(${formatTimestamp(currentParagraph[0].time)})_`;
}
markdown += "\n\n";
currentParagraph = [];
}
words.forEach((word) => {
if (word.type === "word") {
const hasSpeakerChanged =
word.speakerId && word.speakerId !== currentSpeaker;
if (hasSpeakerChanged) {
flushParagraph();
currentSpeaker = word.speakerId;
}
currentParagraph.push({
text: word.text,
time: word.start,
});
} else if (word.type === "spacing") {
if (currentParagraph.length > 0) {
currentParagraph.push({ text: word.text });
}
}
});
flushParagraph();
return markdown;
}
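buildTranscriptMarkdown refers to a MarkdownWord type and a formatTimestamp helper that are not shown on this page. Minimal versions are sketched below, assuming times are in seconds and that timestamps use the m:ss style seen in the example output (switching to h:mm:ss past one hour is an added assumption). Both definitions are hypothetical stand-ins, not the app's actual code.

```typescript
// Assumed shape of the items collected per paragraph: spacing tokens
// carry no time, words carry their start time in seconds.
type MarkdownWord = {
  text: string;
  time?: number;
};

// Pads a number to two digits, e.g. 5 -> "05".
function pad2(n: number): string {
  return n < 10 ? `0${n}` : String(n);
}

// Hypothetical formatter: m:ss, or h:mm:ss once the time reaches an hour.
// Fractional seconds are truncated and negative input is clamped to 0.
function formatTimestamp(seconds: number): string {
  const total = Math.max(0, Math.floor(seconds));
  const h = Math.floor(total / 3600);
  const m = Math.floor((total % 3600) / 60);
  const s = total % 60;
  return h > 0 ? `${h}:${pad2(m)}:${pad2(s)}` : `${m}:${pad2(s)}`;
}
```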
Example Output
With speaker detection enabled, the markdown output looks like:
# Transcript
**Alice:** Hello everyone, welcome to today's meeting. _(0:00)_
**Bob:** Thanks for having me. I'd like to discuss the new features. _(0:05)_
**Alice:** Great! Let's start with the transcription API. _(0:12)_
Multi-Channel Audio
For recordings with separate audio channels per speaker (e.g., professional studio recordings):
diarize: true,
useMultiChannel: true // Process channels separately
Multi-channel processing requires audio files where each speaker is recorded on a separate channel. This is different from stereo audio where both channels contain mixed audio.
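When each channel holds one speaker, the per-channel results eventually need to be read as a single conversation. The sketch below assumes the words for each channel are available as separate arrays in channel order. The exact multi-channel response shape is not covered on this page, so treat both the input format and the channel-to-speaker labeling as assumptions. mergeChannels and ChannelWord are hypothetical names.

```typescript
// Reduced word shape for this sketch (assumed; times in seconds).
type ChannelWord = {
  text: string;
  start: number;
  end: number;
  speakerId?: string;
};

// Hypothetical merge step: flattens per-channel word lists into a single
// timeline ordered by start time. Because each channel holds one speaker,
// words lacking a speakerId are labeled from their channel index.
function mergeChannels(channels: ChannelWord[][]): ChannelWord[] {
  const merged: ChannelWord[] = [];
  channels.forEach((words, channelIndex) => {
    for (const word of words) {
      merged.push({
        ...word,
        speakerId: word.speakerId ?? `speaker_${channelIndex}`,
      });
    }
  });
  // Array.prototype.sort is stable (ES2019+), so ties keep channel order.
  return merged.sort((a, b) => a.start - b.start);
}
```

The merged list has the same shape as a diarized single-channel transcript, so the speaker helpers and the Markdown export on this page can consume it unchanged.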
UI Controls
The transcription form includes checkboxes for diarization controls:
<div className="flex items-center space-x-2">
<Checkbox
id="diarize"
checked={options.diarize}
onCheckedChange={handleDiarizeChange}
/>
<Label htmlFor="diarize" className="cursor-pointer">
Diarize (Speaker Detection)
</Label>
</div>
<div className="flex items-center space-x-2">
<Checkbox
id="multichannel"
checked={options.useMultiChannel}
onCheckedChange={handleMultiChannelChange}
/>
<Label htmlFor="multichannel" className="cursor-pointer">
Multi-channel Audio
</Label>
</div>
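The checkbox handlers are not shown either. Radix-based checkboxes (which shadcn/ui's Checkbox wraps) report a checked state of boolean | "indeterminate", so a handler should not assume a plain boolean. A minimal sketch is below, assuming the form keeps its options in a single state object. withDiarize and withMultiChannel are hypothetical pure updaters, and CheckedState is redeclared locally so the example stands alone.

```typescript
// Local copy of the checkbox state type: boolean | "indeterminate".
type CheckedState = boolean | "indeterminate";

// Assumed subset of the form state touched by these checkboxes.
type DiarizationToggles = {
  diarize: boolean;
  useMultiChannel: boolean;
};

// Hypothetical pure updaters the change handlers could delegate to.
// Only an explicit `true` enables an option; "indeterminate" counts as off.
function withDiarize(
  prev: DiarizationToggles,
  checked: CheckedState
): DiarizationToggles {
  return { ...prev, diarize: checked === true };
}

function withMultiChannel(
  prev: DiarizationToggles,
  checked: CheckedState
): DiarizationToggles {
  return { ...prev, useMultiChannel: checked === true };
}
```

A handler could then read (checked: CheckedState) => setOptions((prev) => withDiarize(prev, checked)), with setOptions being the form's assumed state setter. Keeping the update logic pure makes it testable without rendering the form.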
Best Practices
- Use fixed speaker count when you know it in advance for better accuracy
- Adjust threshold based on audio quality and speaker similarity
- Enable multi-channel only for properly recorded multi-track audio
- Label speakers with recognizable names for better readability
- Test different thresholds if auto-detection merges or splits speakers incorrectly
Next Steps