
Overview

The Multi-Camera Video Synchronization system offers two distinct approaches for aligning videos from different cameras:

Audio Sync

High-precision alignment using GCC-PHAT cross-correlation on audio tracks

Visual Sync

Motion-based alignment using frame difference correlation

Method Selection

Choose your synchronization method based on your recording conditions:

When to Use Audio Sync

  • Videos have audible audio tracks with shared sound events
  • High precision required (sub-millisecond accuracy)
  • Environment has ambient sound or speech
  • Cameras are in the same acoustic space

Limitations

  • Requires audio tracks on all videos
  • May fail with completely silent videos
  • Sensitive to audio clipping or severe noise

Comparison

| Feature | Audio (GCC-PHAT) | Visual (Motion) |
| --- | --- | --- |
| Precision | Sub-millisecond | 30-100 ms |
| Speed | Fast (FFT-based) | Moderate (frame processing) |
| Requirements | Audio tracks | Visible motion |
| Robustness | High (with clean audio) | High (with visible motion) |
| Silent Videos | ❌ Not supported | ✅ Supported |

Pairwise + Global Optimization

Both methods use the same robust three-stage approach:
1. Pairwise Alignment

Compute offsets between all pairs of videos, generating N(N-1)/2 measurements with confidence scores
# For 3 videos A, B, C, enumerate all N(N-1)/2 pairs
from itertools import combinations
pairs = list(combinations(["A", "B", "C"], 2))
# [('A', 'B'), ('A', 'C'), ('B', 'C')]; each pair yields (offset, confidence)
2. Global Optimization

Use weighted least-squares to find globally consistent offsets that best satisfy all pairwise constraints
# Minimize: Σ w_ij * (offset_j - offset_i - d_ij)²
# where d_ij is measured offset, w_ij is confidence
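As an illustration, this weighted least-squares step can be sketched with NumPy. The helper name `solve_global_offsets` and the anchoring of video 0 at offset zero are assumptions for the sketch, not the project's actual solver:

```python
import numpy as np

def solve_global_offsets(n, measurements):
    """Find per-video offsets t that best satisfy t[j] - t[i] ~ d_ij
    for every measured pair, weighted by confidence w_ij.
    measurements: list of (i, j, d_ij, w_ij) tuples."""
    rows, rhs, weights = [], [], []
    for i, j, d, w in measurements:
        row = np.zeros(n)
        row[i], row[j] = -1.0, 1.0      # encodes offset_j - offset_i
        rows.append(row)
        rhs.append(d)
        weights.append(w)
    # Anchor video 0 at offset 0 to remove the one-parameter ambiguity
    # (adding a constant to every offset leaves all pairwise terms unchanged)
    row0 = np.zeros(n)
    row0[0] = 1.0
    rows.append(row0)
    rhs.append(0.0)
    weights.append(1e6)
    A = np.asarray(rows)
    b = np.asarray(rhs)
    sw = np.sqrt(np.asarray(weights))   # weighted least squares via scaling
    t, *_ = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)
    return t

# Three mutually consistent measurements recover offsets [0, 2, 5]
meas = [(0, 1, 2.0, 1.0), (0, 2, 5.0, 1.0), (1, 2, 3.0, 1.0)]
print(solve_global_offsets(3, meas))
```

The anchor row is needed because only offset differences are observable; without it the system has infinitely many solutions.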
3. Outlier Detection

Flag inconsistent pairs where the optimized solution deviates significantly from measurements
Why Pairwise? Using all pairs instead of a single reference video makes the system more robust when individual videos have degradation (noise, clipping, motion blur). If one video has issues, the other pairs compensate.
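The deviation check in the outlier-detection stage can be sketched as follows; `flag_outliers` and the 0.05-second tolerance are illustrative names and values, not the project's actual ones:

```python
def flag_outliers(offsets, measurements, tol_sec=0.05):
    """Return pairs whose measured offset disagrees with the globally
    optimized solution by more than tol_sec seconds.
    offsets: optimized per-video offsets (seconds)
    measurements: list of (i, j, d_ij) pairwise measurements"""
    return [(i, j) for i, j, d in measurements
            if abs((offsets[j] - offsets[i]) - d) > tol_sec]

# The optimized offsets imply pair (0, 1) should measure 2.0 s,
# but it measured 2.5 s, so it gets flagged; pair (1, 2) is consistent.
print(flag_outliers([0.0, 2.0, 5.0], [(0, 1, 2.5), (1, 2, 3.0)]))  # [(0, 1)]
```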

Configuration

The synchronization method is configured in src/config.py:
# src/config.py
SYNC_METHOD = "audio"  # or "visual"
The sync method must be set before starting the application. It is not configurable from the web UI.
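A minimal sketch of how such a setting might be validated and dispatched at startup; `get_sync_backend` and the backend names are hypothetical, not part of the project's API:

```python
SYNC_METHOD = "audio"  # or "visual", as in src/config.py

def get_sync_backend(method):
    """Map the configured method name to a backend identifier
    (illustrative names only), rejecting anything else early."""
    backends = {"audio": "gcc_phat", "visual": "motion_correlation"}
    if method not in backends:
        raise ValueError(f"Unknown SYNC_METHOD: {method!r}")
    return backends[method]

print(get_sync_backend(SYNC_METHOD))  # gcc_phat
```

Failing fast on an unknown value keeps a typo in the config file from surfacing later as a confusing mid-pipeline error.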

Technical Architecture

Both methods follow this pipeline:

Feature Extraction

Audio: extract each video's audio track with ffmpeg:
ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 44100 -ac 1 audio.wav
Converts the video's audio to mono 16-bit PCM WAV for processing.

Visual: build a per-frame motion signal instead:
# Sketch (assumes frames are H x W x 3 uint8 arrays and a tuned threshold)
import numpy as np

motion_signal = []
previous_gray = None
for frame in video:
    gray = frame.mean(axis=2)                 # quick grayscale conversion
    if previous_gray is not None:
        diff = np.abs(gray - previous_gray)
        # fraction of pixels whose change exceeds the threshold
        motion_signal.append(np.count_nonzero(diff > threshold) / diff.size)
    previous_gray = gray
Creates a 1D signal representing motion intensity over time.

Offset Computation

Both methods use cross-correlation to measure time shifts:
  • Audio: GCC-PHAT in frequency domain (FFT-based)
  • Visual: Standard cross-correlation on motion signals
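As a sketch of the audio case, GCC-PHAT reduces to a few NumPy FFT calls: whiten the cross-power spectrum to unit magnitude so only phase remains, then invert and pick the correlation peak. The name `gcc_phat_offset` is illustrative, not the project's implementation; a positive result means the first signal lags the second:

```python
import numpy as np

def gcc_phat_offset(sig, ref, fs):
    """Estimate the delay (seconds) of sig relative to ref via GCC-PHAT."""
    n = len(sig) + len(ref)                # pad so circular correlation is safe
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.abs(cross) + 1e-12         # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    shift = int(np.argmax(np.abs(cc)))     # lag of the correlation peak
    if shift > n // 2:
        shift -= n                         # map upper half to negative lags
    return shift / fs

# A copy of the reference delayed by 50 samples at 1 kHz: offset is 0.05 s
rng = np.random.default_rng(0)
ref = rng.standard_normal(1000)
sig = np.concatenate([np.zeros(50), ref])
print(gcc_phat_offset(sig, ref, 1000))
```

The visual path applies the same correlation idea to the 1D motion signals, without the PHAT whitening.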
See the individual method pages (linked under Next Steps below) for the detailed algorithms.

Confidence Scoring

Each pairwise offset includes a confidence score (0.0 to 1.0):
confidence = peak_magnitude / (noise_floor + epsilon)
confidence = confidence / (confidence + 1.0)  # Normalize to [0, 1]
Pairs with confidence < 0.3 are flagged with warnings. The global optimization uses confidence as weights, giving less influence to uncertain measurements.
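The two formulas above can be wrapped in a small function; `confidence_score` is an illustrative name for this sketch:

```python
def confidence_score(peak_magnitude, noise_floor, epsilon=1e-9):
    """Peak-to-noise ratio squashed into [0, 1): higher, sharper
    correlation peaks over a quiet noise floor score closer to 1."""
    c = peak_magnitude / (noise_floor + epsilon)
    return c / (c + 1.0)

print(confidence_score(9.0, 1.0))   # strong, isolated peak: ~0.9
print(confidence_score(0.3, 1.0))   # peak barely above noise: below the 0.3 cutoff
```

The `c / (c + 1)` normalization maps any non-negative ratio into [0, 1), which makes the scores directly usable as least-squares weights.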

Performance Characteristics

Computational complexity:
  • Pairwise computation: O(N²) where N = number of videos
  • Per-pair correlation: O(M log M) where M = signal length (FFT-based)
  • Optimization: O(N) iterations, typically converges in <10 steps
Memory usage:
  • Audio: loads all WAV files into memory (~10 MB per minute of mono audio)
  • Visual: processes video frame-by-frame (memory-efficient)
  • Peak usage: typically 200-500 MB for 4 videos, 5 minutes each
Typical runtimes for 4 videos, 5 minutes each:
  • Audio: 5-15 seconds (extraction: 3-5s, sync: 2-10s)
  • Visual: 30-90 seconds (motion extraction: 20-60s, sync: 10-30s)

Error Handling

The system includes multiple robustness features:
# Low confidence warning
if confidence < 0.3:
    logger.warning("Low confidence (%.2f) - sync may be unreliable", confidence)

# Boundary check
if abs(offset_seconds) > max_offset_sec * 0.9:
    logger.warning("Offset near search boundary - may be truncated")
Manual Review Recommended: Always preview synchronized videos using the built-in multi-video player before exporting. Automated sync can fail in edge cases (e.g., no shared motion/audio, extreme camera angles).

Next Steps

Audio Sync Details

Deep dive into GCC-PHAT algorithm

Visual Sync Details

Motion detection and correlation

Offset Semantics

Understand positive/negative offsets
