Frames give a model the visual track of a video. Transcription gives it the words. Together they let the LLM see, read, and reason about a video at the same time — without the model having to perform speech recognition itself or receiving a wall of unstructured audio.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/HUANGCHIHHUNGLeo/claude-real-video/llms.txt
Use this file to discover all available pages before exploring further.
claude-real-video follows a strict priority chain for transcription: it always prefers the most accurate, fastest source available before falling back to anything slower. Whisper is a last resort, not the default — and when a video genuinely has no audio, that fact is reported honestly in MANIFEST.txt rather than treated as a tool failure.
Transcription priority
Drawn from theprocess() function in core.py:
-
Sidecar subtitle file — if the source is a local file (not a URL), the tool looks for a
.srtor.vttfile with the same basename in the same directory. For example,lecture.mp4paired withlecture.srtwill use the.srtdirectly. Timecodes, sequence numbers, and inline tags are stripped, leaving clean plain text. -
Embedded subtitle stream — if the video file carries an embedded subtitle track (detected via
ffprobe -select_streams s), it is extracted withffmpeg -map 0:s:0and converted to plain text the same way. -
Whisper transcription — when no subtitles exist, the tool extracts a 16 kHz mono WAV (
audio.wav) from the video and passes it to thewhisperCLI. This requires the[whisper]extra (see below). Whisper uses thebasemodel by default and writes its output to{out_dir}/audio.txt; the file is then renamed to{out_dir}/transcript.txt. -
No transcript — if the video has no audio track, or
--no-transcribewas passed, the manifest records the reason honestly (e.g."(none — this video has no subtitles and no audio track)"). This is not an error.
{out_dir}/transcript.txt and its content is inlined into MANIFEST.txt after the --- transcript --- separator.
Installing Whisper
Whisper is an optional dependency. Install it with the bundled extra or directly:ffmpeg to be installed on your system. See the installation guide for platform-specific instructions.
Whisper is invoked via the
whisper CLI using the base model by default (--model base). The base model balances speed and accuracy well for most speech. If you need higher accuracy on difficult audio (heavy accents, noisy environments), install openai-whisper and run whisper manually on {out_dir}/audio.wav with a larger model such as --model medium or --model large.Language detection
By default, Whisper auto-detects the spoken language (--lang auto). Pass --lang with a BCP-47 language code to force a specific language and improve accuracy:
lang is set to anything other than auto, it is passed to Whisper as --language {lang}. Auto-detect is omitted from the Whisper command entirely, letting Whisper use its built-in language identification.
Skipping transcription
Use--no-transcribe to skip all transcription steps entirely. No subtitle extraction, no Whisper invocation, no WAV file:
transcript: (skipped: --no-transcribe). This is useful when you only need the frames, or when you plan to transcribe separately with a different tool.
Keeping the full audio track
--keep-audio saves the complete original soundtrack — music, speech, and sound effects — as {out_dir}/audio.m4a. The implementation first attempts a lossless stream copy (works for AAC and ALAC sources), and falls back to a high-quality AAC re-encode at 192 kbps for other codecs (Opus, Vorbis, …).
audio.m4a is recorded in MANIFEST.txt under the audio: key.
If the video has no audio track, --keep-audio is a no-op and the manifest records audio: (none — this video has no audio track).