claude-real-video lets any vision-capable LLM actually watch a video. Point it at a YouTube URL or a local file and it extracts a frame at every scene change rather than on a rigid per-second quota, discards near-duplicates, produces a transcript (embedded subtitles when present, Whisper otherwise), and writes a clean output folder your LLM can read. Everything runs on your own machine: no video is uploaded, no credentials are shared.

Fixed-interval sampling vs. claude-real-video

Most “let an LLM watch a video” tools — including Gemini’s own pipeline — grab frames at a fixed interval (1 fps by default). That over-samples a static screencast and silently drops frames in a fast-cut reel. claude-real-video uses scene-change detection and sliding-window deduplication instead.
| | Fixed-interval sampling | claude-real-video |
|---|---|---|
| Frame selection | Every N seconds | Scene-change detection + density floor |
| Repeated shots (A-B-A cuts) | Sent again every time | Sliding-window dedup sends each shot once |
| Static slide (10 min) | ~600 near-identical frames | Collapses to 1 (dedup) |
| Fast-cut reels | Misses frames between samples | Catches each visual change |
| Audio | Often ignored | Whisper transcript with language detection |
| Where the video goes | Video often uploaded to the cloud | Stays on your machine |
| Input sources | Usually local file only | URL (via yt-dlp) or local file |

The result is fewer, more meaningful frames — cheaper context for the model and better understanding of what actually happened in the video.

Key capabilities

Scene-change detection

Uses ffmpeg’s scene filter combined with a configurable density floor to capture every visual transition — fast-cut reels and 10-minute static slides are both handled correctly.
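
As a rough illustration of how a scene filter and a density floor can be combined, the sketch below builds an ffmpeg command line in Python. It is not taken from claude-real-video's source: the function name, the 0.3 threshold, and the 10-second floor are assumptions chosen for the example. The `scene`, `t`, and `prev_selected_t` variables are part of ffmpeg's `select` filter.

```python
def build_scene_select_cmd(video_path, out_pattern,
                           scene_threshold=0.3, max_gap_s=10.0):
    """Return an ffmpeg argv that keeps scene-change frames plus a density floor.

    Hypothetical helper for illustration; the default values are assumptions,
    not claude-real-video's actual settings.
    """
    select_expr = (
        # Keep a frame when its scene-change score exceeds the threshold.
        f"gt(scene,{scene_threshold})"
        # Always keep the first frame (no frame has been selected yet).
        "+isnan(prev_selected_t)"
        # Density floor: keep a frame if max_gap_s seconds have passed
        # since the last kept frame, so static footage is still sampled.
        f"+gte(t-prev_selected_t,{max_gap_s})"
    )
    return [
        "ffmpeg", "-hide_banner", "-i", video_path,
        # Single quotes protect the commas inside the filter expression.
        "-vf", f"select='{select_expr}'",
        # Emit only the selected frames; newer ffmpeg builds prefer
        # "-fps_mode vfr" over the older "-vsync vfr".
        "-vsync", "vfr",
        "-q:v", "2",
        out_pattern,
    ]
```

The list can be passed to `subprocess.run(cmd, check=True)`, for example with `build_scene_select_cmd("talk.mp4", "frames/%05d.jpg")`. Lowering `scene_threshold` keeps more frames on subtle changes; lowering `max_gap_s` raises the minimum sampling density.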

Sliding-window deduplication

Compares real pixel differences (downscaled RGB) against the last N kept frames. A shot the model already saw does not come back after a cutaway — A-B-A alternation is suppressed.
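
The idea can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: frames are modelled as flat sequences of downscaled RGB values, and the distance metric (mean absolute difference), the window size, and the threshold are all assumptions made for the example.

```python
from collections import deque


def mean_abs_diff(a, b):
    """Mean absolute per-channel difference between two equal-size frames."""
    if len(a) != len(b):
        raise ValueError("frames must have the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def dedup_frames(frames, window=5, threshold=8.0):
    """Return the indices of frames to keep.

    A frame is dropped when it is closer than `threshold` to ANY of the
    last `window` kept frames, not just the immediately preceding one.
    `window` and `threshold` are illustrative values (assumptions).
    """
    recent = deque(maxlen=window)  # the last N kept frames
    kept = []
    for i, frame in enumerate(frames):
        if any(mean_abs_diff(frame, old) < threshold for old in recent):
            continue  # already seen within the window, e.g. the second A in A-B-A
        recent.append(frame)
        kept.append(i)
    return kept
```

With a window of 1 this degenerates to comparing only against the previous kept frame, which lets shot A return after every cutaway to B; a larger window is what suppresses the A-B-A pattern.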

Whisper transcription

Prefers subtitles already embedded in the video (faster and more accurate), and falls back to Whisper only when none exist. Supports automatic language detection or an explicit language code.
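
The subtitle-first fallback can be sketched as follows. The helper names are hypothetical and the decision logic is an assumption about how such a check might look, not the project's code. The input is the JSON that `ffprobe -show_streams -of json <file>` prints, and the Whisper calls (`whisper.load_model`, `model.transcribe`) come from the openai-whisper package.

```python
import json


def subtitle_stream_indices(ffprobe_json):
    """Return the indices of subtitle streams found in ffprobe's JSON output."""
    streams = json.loads(ffprobe_json).get("streams", [])
    return [s["index"] for s in streams if s.get("codec_type") == "subtitle"]


def choose_transcript_source(ffprobe_json):
    """Prefer embedded subtitles; fall back to Whisper when there are none."""
    if subtitle_stream_indices(ffprobe_json):
        return "embedded_subtitles"
    return "whisper"


def transcribe_with_whisper(audio_path, language=None, model_name="base"):
    """Transcribe with openai-whisper; language=None enables auto-detection.

    The "base" model is an assumption picked for the example.
    """
    import whisper  # optional dependency, imported only when actually needed

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, language=language)
    return result["language"], result["text"]
```

Passing an explicit code such as `language="ja"` skips detection, which can help on short or noisy clips where auto-detection is unreliable.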

Fully local processing

No frames, audio, or transcripts leave your machine. Works with login-gated sources via a Netscape cookie file — your credentials stay under your control.

Supported LLMs

claude-real-video produces output any vision-capable LLM can read. The output is just JPEG frames and a plain-text transcript — no proprietary format, no SDK required. Drop the frames and MANIFEST.txt into the chat window of your preferred model:
  • Claude (Claude.ai or API)
  • ChatGPT (GPT-4o and later)
  • Gemini (1.5 Pro / Flash and later)
  • Any other model that accepts image attachments and plain text
The MANIFEST.txt file includes the source URL, video duration, frame count, dedup statistics, and the full transcript — everything a model needs to reason about the video without any additional context.

Requirements

| Requirement | Notes |
|---|---|
| Python 3.10+ | Required |
| ffmpeg (on PATH) | Used for frame extraction and audio processing; not pip-installable |
| yt-dlp | Bundled as a dependency; used for URL sources (YouTube, Instagram, TikTok, …) |
| openai-whisper | Optional; required for audio transcription (pip install "claude-real-video[whisper]") |

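
A quick preflight check for these requirements might look like the sketch below. It uses only the Python standard library; the function name and the shape of the returned dictionary are hypothetical and not part of claude-real-video.

```python
import importlib.util
import shutil
import sys


def check_requirements(min_python=(3, 10)):
    """Report missing hard requirements and whether Whisper is installed."""
    problems = []
    if sys.version_info < min_python:
        wanted = ".".join(str(n) for n in min_python)
        problems.append(f"Python {wanted}+ is required")
    # ffmpeg is a system binary, so it has to be discoverable on PATH.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    # openai-whisper is optional and installs a module named "whisper".
    whisper_available = importlib.util.find_spec("whisper") is not None
    return {"problems": problems, "whisper_available": whisper_available}
```

An empty `problems` list means the hard requirements are met; `whisper_available` only indicates whether audio transcription via Whisper is possible.
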
Ready to get started? Head to Installation to install claude-real-video and set up ffmpeg.
