claude-real-video turns any video — a URL or a local file — into a clean, LLM-readable folder in five stages. It fetches the source, extracts scene-aware frames in a single chronological pass, deduplicates them with a sliding-window pixel-difference algorithm, transcribes the audio (preferring existing subtitles over Whisper), and writes a MANIFEST.txt the model can read as context. The result: fewer, more meaningful frames alongside a full transcript — cheaper context, sharper understanding.
Fetch
Function:
fetch_video(src, out_dir, cookies=None)

If src starts with http:// or https://, yt-dlp downloads the video and writes it to source.mp4 inside the output directory. An optional --cookies argument accepts a Netscape-format cookie file for login-gated sources (your own, authorised access only). If yt-dlp cannot write source.mp4 directly (e.g. the container differs), the function falls back to the first matching source.* file.

For a local path, shutil.copy copies the file verbatim to source.mp4. Either way, every subsequent stage reads a single canonical file at {out_dir}/source.mp4.

Extract frames
Function:
extract_frames(video, frames_dir, scene, fps_floor)

A single ffmpeg pass with a select filter captures every scene change and a density floor simultaneously, keeping all frames in chronological order:
- gt(scene,{scene}) fires on any frame where the scene-change score exceeds the threshold (default 0.30).
- not(mod(n,{every_n})) also keeps every Nth frame, where every_n = max(1, round(fps × fps_floor)). At 25 fps with the default --fps-floor 1.0, that is every 25th frame, i.e. at least one frame per second regardless of how static the video is.
- scale=640:-1 resizes to 640 px wide (aspect ratio preserved) for consistent, manageable image sizes.
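As a rough illustration, the filter expression above could be assembled as follows. This is a minimal sketch, not the project's actual code: the helper names every_n, build_filter and build_ffmpeg_cmd are hypothetical, and it assumes the caller already knows the source frame rate (for example from ffprobe).

```python
import os
from typing import List


def every_n(fps: float, fps_floor: float) -> int:
    """Density-floor interval: every_n = max(1, round(fps * fps_floor))."""
    return max(1, round(fps * fps_floor))


def build_filter(scene: float, fps: float, fps_floor: float) -> str:
    """Build the -vf value (hypothetical helper, for illustration only)."""
    n = every_n(fps, fps_floor)
    # '+' acts as a logical OR between the two select conditions.
    # The single quotes keep ffmpeg's filtergraph parser from treating
    # the commas inside the expression as filter separators.
    return f"select='gt(scene,{scene})+not(mod(n,{n}))',scale=640:-1"


def build_ffmpeg_cmd(video: str, frames_dir: str, scene: float = 0.30,
                     fps: float = 25.0, fps_floor: float = 1.0) -> List[str]:
    """Return the argument list; the caller decides how to run it."""
    return [
        "ffmpeg", "-i", video,
        "-vf", build_filter(scene, fps, fps_floor),
        "-vsync", "vfr",
        os.path.join(frames_dir, "raw_%05d.jpg"),
    ]
```

Returning the argument list rather than running ffmpeg directly keeps the sketch easy to inspect; it can be passed to subprocess.run when ffmpeg is installed.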
The -vsync vfr flag prevents duplicated timestamps. Frames are written as raw_00001.jpg, raw_00002.jpg, … into {out_dir}/frames/.

Deduplicate
Function:
dedup_frames(frames_dir, threshold, window, max_frames, dropped_dir)

Near-duplicate frames (repeated shots, static screencasts, A-B-A cutaways) are removed by a sliding-window pixel-difference algorithm:
- Each raw_*.jpg frame is downscaled to 16×16 RGB to form a compact signature.
- The per-pixel maximum channel difference is computed against each signature in a window of the last window kept frames (default 4). A pixel is considered “changed” when any colour channel differs by more than a 25-unit tolerance.
- A frame is kept if its minimum distance (the percentage of changed pixels) to any window frame exceeds threshold % (default 8). Otherwise it is dropped (or moved to dropped/ if --report is active).
- The window catches A-B-A cutaways: if shot A appeared earlier within the window and shot B sat in between, the model won’t see shot A a second time.
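The steps above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the project's implementation: it takes signatures that have already been computed (each a flat sequence of 256 RGB tuples, e.g. from an image library's 16×16 resize), and the names changed_percent and dedup are hypothetical.

```python
from collections import deque
from typing import Deque, List, Sequence, Tuple

Pixel = Tuple[int, int, int]
Signature = Sequence[Pixel]

TOLERANCE = 25  # per-channel tolerance from the description above


def changed_percent(a: Signature, b: Signature, tolerance: int = TOLERANCE) -> float:
    """Percentage of pixels whose largest channel difference exceeds tolerance."""
    changed = sum(
        1
        for pa, pb in zip(a, b)
        if max(abs(x - y) for x, y in zip(pa, pb)) > tolerance
    )
    return 100.0 * changed / len(a)


def dedup(signatures: Sequence[Signature], threshold: float = 8.0,
          window: int = 4) -> List[int]:
    """Return the indices of frames to keep, in chronological order."""
    recent: Deque[Signature] = deque(maxlen=window)  # last `window` kept frames
    kept: List[int] = []
    for i, sig in enumerate(signatures):
        # Keep the frame only if it differs enough from every recent keeper.
        if not recent or min(changed_percent(sig, r) for r in recent) > threshold:
            kept.append(i)
            recent.append(sig)
    return kept
```

Because only kept frames enter the window, a long static stretch never pushes an earlier shot out of it; with window=1 the same routine degrades to a plain previous-frame comparison and an A-B-A sequence keeps all three frames.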
Surviving frames are named frame_001.jpg, frame_002.jpg, … If more survivors remain than --max-frames (default 150), the list is uniformly thinned: with step = len(kept) / max_frames, one frame is kept per step, so the survivors stay spread evenly across the video timeline.

Transcription
Function:
existing_subtitles() → transcribe()

Transcription follows a strict priority chain so that the fastest, most accurate source available is used:
- Sidecar subtitle file: a .srt or .vtt file with the same basename next to a local source file (e.g. lecture.srt alongside lecture.mp4).
- Embedded subtitle stream: extracted via ffmpeg -map 0:s:0 and converted to plain text.
- Whisper fallback: the whisper CLI is invoked on a 16 kHz mono WAV extracted from the video (requires the [whisper] extra).
- No transcript: if the video has no audio track, or --no-transcribe was passed, this is reported honestly in MANIFEST.txt rather than treated as an error.
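The first link of the chain and the conversion to plain text might look roughly like this. It is a simplified sketch with hypothetical helper names (find_sidecar, subtitles_to_text), not the project's code, and it handles only the common SRT/VTT layout.

```python
import re
from pathlib import Path
from typing import Optional

TAG = re.compile(r"<[^>]+>")  # inline tags such as <i>...</i>


def find_sidecar(video: Path) -> Optional[Path]:
    """Look for lecture.srt / lecture.vtt next to lecture.mp4."""
    for ext in (".srt", ".vtt"):
        candidate = video.with_suffix(ext)
        if candidate.is_file():
            return candidate
    return None


def subtitles_to_text(raw: str) -> str:
    """Drop cue indices, timecode lines and inline tags; keep the spoken text."""
    lines = []
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith("WEBVTT"):
            continue
        # Note: a spoken line made only of digits would also be dropped here.
        if line.isdigit() or "-->" in line:
            continue
        lines.append(TAG.sub("", line))
    return "\n".join(lines)
```

A caller would try find_sidecar first and move on to the embedded stream or Whisper only when it returns None.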
The transcript is written to {out_dir}/transcript.txt as plain text, with subtitle indices, timecodes, and inline tags stripped.

MANIFEST
Written at: {out_dir}/MANIFEST.txt

The manifest is a plain-text file summarising the analysis for the LLM that will read it. If --why was passed, a viewing intent: line is prepended so the model reads the frames and transcript through the specified analytical lens rather than producing a generic summary. If --keep-audio was used, an audio: line records the path to audio.m4a.

Audio preservation (optional)
Passing --keep-audio saves the complete original soundtrack to {out_dir}/audio.m4a. The implementation first attempts a lossless stream copy (works for AAC/ALAC sources) and falls back to a high-quality AAC re-encode at 192 kbps for other codecs (Opus, Vorbis, …).
The transcript captures the words — audio.m4a lets audio-capable models (Gemini, GPT-4o, …) actually hear the music, tone, and sound effects.
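The copy-then-re-encode fallback described above could be sketched as follows. The function names and the injectable run parameter are assumptions made for illustration (the runner is injectable so the logic can be exercised without ffmpeg installed); the ffmpeg options used are the standard -vn, -c:a and -b:a flags.

```python
import subprocess
from pathlib import Path
from typing import Callable, List


def _run(cmd: List[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(cmd, capture_output=True).returncode


def copy_cmd(source: Path, target: Path) -> List[str]:
    # Lossless: copy the audio stream as-is, dropping the video (-vn).
    return ["ffmpeg", "-y", "-i", str(source), "-vn", "-c:a", "copy", str(target)]


def reencode_cmd(source: Path, target: Path) -> List[str]:
    # Fallback: re-encode to AAC at 192 kbps.
    return ["ffmpeg", "-y", "-i", str(source), "-vn",
            "-c:a", "aac", "-b:a", "192k", str(target)]


def keep_audio(source: Path, out_dir: Path,
               run: Callable[[List[str]], int] = _run) -> str:
    """Try a stream copy first; return which strategy succeeded."""
    target = out_dir / "audio.m4a"
    if run(copy_cmd(source, target)) == 0:
        return "copy"
    if run(reencode_cmd(source, target)) == 0:
        return "aac"
    raise RuntimeError("could not extract audio")
```

Trying the stream copy first avoids a generation of lossy re-encoding whenever the source codec already fits the .m4a container.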