Overview
The Multi-Camera Video Synchronization system offers two distinct approaches for aligning videos from different cameras:
- Audio Sync: High-precision alignment using GCC-PHAT cross-correlation on audio tracks
- Visual Sync: Motion-based alignment using frame difference correlation
Method Selection
Choose your synchronization method based on your recording conditions:
- Audio Sync
- Visual Sync
Comparison
| Feature | Audio (GCC-PHAT) | Visual (Motion) |
|---|---|---|
| Precision | Sub-millisecond | 30-100ms |
| Speed | Fast (FFT-based) | Moderate (frame processing) |
| Requirements | Audio tracks | Visible motion |
| Robustness | High (with clean audio) | High (with visible motion) |
| Silent Videos | ❌ Not supported | ✅ Supported |
Pairwise + Global Optimization
Both methods use a robust two-stage approach:
- Pairwise Alignment: Compute offsets between all pairs of videos, generating N(N-1)/2 measurements with confidence scores
- Global Optimization: Use weighted least-squares to find globally consistent offsets that best satisfy all pairwise constraints
Why Pairwise? Using all pairs instead of a single reference video makes the system more robust when individual videos have degradation (noise, clipping, motion blur). If one video has issues, the other pairs compensate.
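The two stages can be illustrated with a minimal sketch. It assumes each pairwise measurement is a tuple `(i, j, offset_ij, confidence)` where `offset_ij` approximates `t[j] - t[i]` in seconds, and that the confidence score is used directly as the least-squares weight. The function name `solve_global_offsets`, the tuple layout, and the sign convention are hypothetical, and this sketch solves the linear system in closed form with NumPy, which may differ from the solver the project actually uses.

```python
import itertools
import numpy as np

def solve_global_offsets(n_videos, measurements):
    """Weighted least-squares fit of one offset per video.

    measurements: iterable of (i, j, offset_ij, confidence), where
    offset_ij is assumed to approximate t[j] - t[i] in seconds.
    """
    measurements = list(measurements)
    A = np.zeros((len(measurements), n_videos))
    b = np.zeros(len(measurements))
    w = np.zeros(len(measurements))
    for row, (i, j, offset, confidence) in enumerate(measurements):
        A[row, i] = -1.0
        A[row, j] = 1.0
        b[row] = offset
        w[row] = confidence
    sqrt_w = np.sqrt(w)
    # Offsets are only defined up to a constant, so pin video 0 at t = 0
    # by dropping its column from the system.
    A_weighted = A[:, 1:] * sqrt_w[:, None]
    b_weighted = b * sqrt_w
    rest, *_ = np.linalg.lstsq(A_weighted, b_weighted, rcond=None)
    return np.concatenate(([0.0], rest))

# Four videos produce N(N-1)/2 = 6 pairwise measurements.
true_t = np.array([0.0, 1.5, -0.7, 3.2])
pairs = list(itertools.combinations(range(4), 2))
measurements = [(i, j, true_t[j] - true_t[i], 1.0) for i, j in pairs]
# Simulate one degraded pair: a wrong offset reported with low confidence.
measurements[2] = (0, 3, 9.9, 0.01)
estimated = solve_global_offsets(4, measurements)
```

Because the degraded pair carries a small weight, the remaining pairs dominate the fit and the estimate stays close to the true offsets.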
Configuration
The synchronization method is configured in src/config.py.
The sync method must be set before starting the application. It is not configurable from the web UI.
Technical Architecture
Both methods follow this pipeline:
Feature Extraction
- Audio: Extract WAV tracks
- Visual: Extract motion energy timeseries
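As an illustration of the visual feature, the sketch below computes a motion energy timeseries as the mean absolute difference between consecutive grayscale frames. It accepts any iterable of 2-D arrays so that frames can be streamed one at a time. The function name `motion_energy` and the choice of mean absolute difference are assumptions, not necessarily the project's exact definition.

```python
import numpy as np

def motion_energy(frames):
    """Return one motion value per consecutive frame pair.

    frames: iterable of 2-D grayscale arrays with identical shape.
    Only the previous frame is kept in memory at any time.
    """
    energy = []
    prev = None
    for frame in frames:
        cur = np.asarray(frame, dtype=np.float32)
        if prev is not None:
            # Mean absolute pixel change between this frame and the last one.
            energy.append(float(np.mean(np.abs(cur - prev))))
        prev = cur
    return np.array(energy)

# Five tiny synthetic frames; only frame 3 differs from its neighbours.
clip = np.zeros((5, 4, 4), dtype=np.uint8)
clip[3] = 1
series = motion_energy(clip)
```

A clip with T frames yields T-1 values, and static scenes produce values near zero.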
Offset Computation
Both methods use cross-correlation to measure time shifts:
- Audio: GCC-PHAT in frequency domain (FFT-based)
- Visual: Standard cross-correlation on motion signals
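Both variants can be sketched in a few lines of NumPy. The function names `gcc_phat_offset` and `xcorr_offset`, their parameters, and the sign convention (a positive result means the first signal lags the second) are assumptions made for illustration. Treat this as a minimal sketch of the techniques named above rather than the project's implementation.

```python
import numpy as np

def gcc_phat_offset(sig, ref, sample_rate):
    """Estimate the delay of sig relative to ref, in seconds, via GCC-PHAT."""
    n = len(sig) + len(ref)
    nfft = 1 << (n - 1).bit_length()  # next power of two; zero-padding avoids wrap-around
    cross = np.fft.rfft(sig, n=nfft) * np.conj(np.fft.rfft(ref, n=nfft))
    # PHAT weighting: discard magnitude and keep only phase information.
    cross /= np.abs(cross) + 1e-12
    cc = np.fft.irfft(cross, n=nfft)
    max_shift = nfft // 2
    # Reorder so that index max_shift corresponds to zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = int(np.argmax(cc)) - max_shift
    return lag / sample_rate

def xcorr_offset(sig, ref, sample_rate):
    """Estimate the delay of sig relative to ref with plain cross-correlation."""
    sig = np.asarray(sig, dtype=float) - np.mean(sig)
    ref = np.asarray(ref, dtype=float) - np.mean(ref)
    cc = np.correlate(sig, ref, mode="full")
    # In "full" mode, index 0 corresponds to lag -(len(ref) - 1).
    lag = int(np.argmax(cc)) - (len(ref) - 1)
    return lag / sample_rate

rng = np.random.default_rng(0)

# Audio-like test: noise delayed by 25 samples at 1 kHz (0.025 s).
audio_ref = rng.standard_normal(4000)
audio_sig = np.concatenate((np.zeros(25), audio_ref))[:4000]
audio_offset = gcc_phat_offset(audio_sig, audio_ref, sample_rate=1000)

# Motion-like test: signal delayed by 6 frames at 30 fps (0.2 s).
motion_ref = rng.standard_normal(300)
motion_sig = np.concatenate((np.zeros(6), motion_ref))[:300]
motion_offset = xcorr_offset(motion_sig, motion_ref, sample_rate=30)
```

The PHAT weighting sharpens the correlation peak, which is one reason the audio path can resolve much finer offsets than a motion signal sampled at the video frame rate.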
Confidence Scoring
Each pairwise offset includes a confidence score (0.0 to 1.0).
Performance Characteristics
Computational Complexity
- Pairwise computation: O(N²) where N = number of videos
- Per-pair correlation: O(M log M) where M = signal length (FFT-based)
- Optimization: O(N) iterations, typically converges in <10 steps
Memory Usage
- Audio: Loads all WAV files into memory (~10MB per minute of mono audio)
- Visual: Processes video frame-by-frame (memory-efficient)
- Peak usage: Typically 200-500MB for 4 videos, 5 minutes each
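The audio figures above can be sanity-checked with simple arithmetic. The sketch below assumes audio is decoded to 48 kHz mono and held as 32-bit floats; both values are assumptions, since the actual sample rate and sample format are not stated here. The helper name `wav_memory_mb` is hypothetical.

```python
def wav_memory_mb(duration_s, sample_rate=48_000, bytes_per_sample=4, channels=1):
    """Approximate in-memory size of decoded PCM audio, in MB (10**6 bytes)."""
    return duration_s * sample_rate * bytes_per_sample * channels / 1e6

per_minute = wav_memory_mb(60)           # one minute of mono audio
four_videos = 4 * wav_memory_mb(5 * 60)  # 4 videos, 5 minutes each
```

Under these assumptions one minute of mono audio occupies about 11.5 MB and four five-minute tracks about 230 MB, which is in line with the estimates listed above.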
Typical Runtime
For 4 videos, 5 minutes each:
- Audio: 5-15 seconds (extraction: 3-5s, sync: 2-10s)
- Visual: 30-90 seconds (motion extraction: 20-60s, sync: 10-30s)
Error Handling
The system includes multiple robustness features.
Next Steps
- Audio Sync Details: Deep dive into the GCC-PHAT algorithm
- Visual Sync Details: Motion detection and correlation
- Offset Semantics: Understand positive/negative offsets