Synchronization Methods

Overview

The Multi-Camera Video Synchronization system offers two distinct approaches for aligning videos from different cameras:

Audio Sync

High-precision alignment using GCC-PHAT cross-correlation on audio tracks

Visual Sync

Motion-based alignment using frame difference correlation

Method Selection

Choose your synchronization method based on your recording conditions:


When to Use Audio Sync

Videos have audible audio tracks with shared sound events

High precision required (sub-millisecond accuracy)

Environment has ambient sound or speech

Cameras are in the same acoustic space

Limitations

Requires audio tracks on all videos

May fail with completely silent videos

Sensitive to audio clipping or severe noise

Comparison

Feature	Audio (GCC-PHAT)	Visual (Motion)
Precision	Sub-millisecond	30-100 ms
Speed	Fast (FFT-based)	Moderate (frame processing)
Requirements	Audio tracks	Visible motion
Robustness	High (with clean audio)	High (with visible motion)
Silent Videos	❌ Not supported	✅ Supported

Pairwise + Global Optimization

Both methods use the same two-stage approach, pairwise alignment followed by global optimization, with a final outlier check:

Pairwise Alignment

Compute offsets between all pairs of videos, generating N(N-1)/2 measurements with confidence scores

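# For 3 videos: A, B, C
from itertools import combinations
videos = ["A", "B", "C"]
pairs = list(combinations(videos, 2))  # [("A", "B"), ("A", "C"), ("B", "C")]
# Each pair yields: (offset, confidence)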

Global Optimization

Use weighted least-squares to find globally consistent offsets that best satisfy all pairwise constraints

# Minimize: Σ w_ij * (offset_j - offset_i - d_ij)²
# where d_ij is measured offset, w_ij is confidence
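
The objective above can be solved as an ordinary linear least-squares problem. The following is a minimal sketch, not the system's actual solver: the function name `solve_global_offsets` and the `(i, j, d_ij, w_ij)` tuple layout are hypothetical. Because the objective only constrains differences between offsets, the sketch pins video 0 at offset 0 (an assumption) so the solution is unique.

```python
import numpy as np

def solve_global_offsets(n_videos, measurements):
    """Weighted least-squares offsets from pairwise measurements.

    measurements: iterable of (i, j, d_ij, w_ij), where d_ij approximates
    offset_j - offset_i and w_ij is the confidence weight.
    """
    rows, rhs = [], []
    for i, j, d_ij, w_ij in measurements:
        row = np.zeros(n_videos)
        row[j], row[i] = 1.0, -1.0      # encodes offset_j - offset_i
        scale = np.sqrt(w_ij)           # sqrt so the squared residual is weighted by w_ij
        rows.append(scale * row)
        rhs.append(scale * d_ij)
    # Drop column 0: video 0 is pinned at offset 0 (assumption, removes the free shift).
    A = np.array(rows)[:, 1:]
    b = np.array(rhs)
    solution, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.concatenate(([0.0], solution))

# Three videos with mutually consistent pairwise measurements.
measurements = [(0, 1, 1.5, 0.9), (0, 2, -0.5, 0.8), (1, 2, -2.0, 0.7)]
offsets = solve_global_offsets(3, measurements)
```

With consistent measurements the solver reproduces them exactly. With conflicting ones, low-confidence pairs pull the solution less than high-confidence pairs.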

Outlier Detection

Flag inconsistent pairs where the optimized solution deviates significantly from measurements
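
One way to express this check is to compare each measured offset with the offset implied by the optimized solution. This is a sketch only: the function name `flag_outliers` and the 50 ms tolerance are hypothetical, since the actual threshold is not stated here.

```python
def flag_outliers(offsets, measurements, tolerance_sec=0.05):
    """Return (i, j, residual) for pairs that disagree with the global solution.

    offsets: globally optimized offset per video, in seconds.
    measurements: iterable of (i, j, d_ij), with d_ij the measured offset_j - offset_i.
    tolerance_sec: hypothetical cut-off for "deviates significantly".
    """
    flagged = []
    for i, j, d_ij in measurements:
        residual = (offsets[j] - offsets[i]) - d_ij
        if abs(residual) > tolerance_sec:
            flagged.append((i, j, residual))
    return flagged

# The (1, 2) measurement is off by one second relative to the solution.
flagged = flag_outliers(
    [0.0, 1.5, -0.5],
    [(0, 1, 1.5), (0, 2, -0.5), (1, 2, -1.0)],
)
```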

Why Pairwise? Using all pairs instead of a single reference video makes the system more robust when individual videos have degradation (noise, clipping, motion blur). If one video has issues, the other pairs compensate.

Configuration

The synchronization method is configured in src/config.py:

# src/config.py
SYNC_METHOD = "audio"  # or "visual"

The sync method must be set before starting the application. It is not configurable from the web UI.
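
Since the value is read once at startup, it can help to fail fast on a typo. The check below is a hypothetical sketch. `validate_sync_method` and `VALID_SYNC_METHODS` are illustrative names and are not part of src/config.py as described here.

```python
VALID_SYNC_METHODS = ("audio", "visual")  # the two values documented above

def validate_sync_method(value):
    """Return the value unchanged if it names a known method, else raise."""
    if value not in VALID_SYNC_METHODS:
        raise ValueError(
            f"SYNC_METHOD must be one of {VALID_SYNC_METHODS}, got {value!r}"
        )
    return value

method = validate_sync_method("audio")
```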

Technical Architecture

Both methods follow this pipeline:

Feature Extraction

Audio: Extract WAV tracks

ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 44100 -ac 1 audio.wav

Converts video audio to mono 16-bit PCM WAV for processing.
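
The same command can be driven from Python with `subprocess`. This is a sketch under assumptions: the helper names are hypothetical, and the `-y` flag (overwrite the output file without prompting) is an addition not present in the command above.

```python
import subprocess

def build_extract_command(video_path, wav_path, sample_rate=44100):
    """Build the ffmpeg argument list for mono 16-bit PCM extraction."""
    return [
        "ffmpeg", "-y",            # -y: overwrite existing output (added here)
        "-i", video_path,
        "-vn",                     # drop the video stream
        "-acodec", "pcm_s16le",    # 16-bit little-endian PCM
        "-ar", str(sample_rate),   # resample to the target rate
        "-ac", "1",                # downmix to mono
        wav_path,
    ]

def extract_audio(video_path, wav_path):
    """Run ffmpeg; raises CalledProcessError if extraction fails."""
    subprocess.run(
        build_extract_command(video_path, wav_path),
        check=True,
        capture_output=True,
    )

command = build_extract_command("video.mp4", "audio.wav")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.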

Visual: Extract motion energy timeseries

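# Pseudo-code
previous_gray = None
for frame in video:
    gray = convert_to_grayscale(frame)
    if previous_gray is not None:
        diff = abs(gray - previous_gray)
        energy = sum(diff > threshold) / pixels  # fraction of changed pixels
        motion_signal.append(energy)
    previous_gray = gray  # compare the next frame against this one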

Creates a 1D signal representing motion intensity over time.
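
A runnable NumPy version of the same idea is sketched below. It assumes the frames are already decoded to 8-bit grayscale arrays, and the threshold of 25 is a hypothetical value. Frames are cast to a signed type before subtracting, because subtracting unsigned 8-bit arrays wraps around instead of going negative.

```python
import numpy as np

def motion_energy(frames, threshold=25):
    """Fraction of pixels that changed by more than `threshold` per frame step.

    frames: iterable of 2D uint8 grayscale arrays (assumed already decoded).
    Returns one value per consecutive frame pair.
    """
    signal = []
    previous_gray = None
    for frame in frames:
        gray = frame.astype(np.int16)   # signed, so the difference cannot wrap
        if previous_gray is not None:
            diff = np.abs(gray - previous_gray)
            signal.append(float(np.mean(diff > threshold)))
        previous_gray = gray
    return np.array(signal)

# Synthetic clip: nothing moves, then one row of a 4x4 frame brightens.
still = np.zeros((4, 4), dtype=np.uint8)
changed = still.copy()
changed[0, :] = 200
energy = motion_energy([still, still, changed, changed])
```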

Offset Computation

Both methods use cross-correlation to measure time shifts:

Audio: GCC-PHAT in frequency domain (FFT-based)
Visual: Standard cross-correlation on motion signals

See the Audio Sync Details and Visual Sync Details pages for the detailed algorithms.
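
As a rough illustration of the two correlation styles, the sketch below estimates a delay with GCC-PHAT and with plain time-domain cross-correlation. The function names are hypothetical, and the sign convention (positive means `sig` lags `ref`) is an assumption that may differ from the one used by the system.

```python
import numpy as np

def gcc_phat_offset(sig, ref, fs):
    """Delay of `sig` relative to `ref` in seconds, via GCC-PHAT."""
    n = len(sig) + len(ref)
    nfft = 1 << (n - 1).bit_length()        # next power of two, avoids circular wrap
    cross = np.fft.rfft(sig, n=nfft) * np.conj(np.fft.rfft(ref, n=nfft))
    cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=nfft)
    max_shift = nfft // 2
    # Reorder so index max_shift corresponds to zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (int(np.argmax(np.abs(cc))) - max_shift) / fs

def xcorr_offset(sig, ref, rate):
    """Delay of `sig` relative to `ref` in seconds, via plain cross-correlation."""
    sig = sig - np.mean(sig)                # remove DC so the peak reflects shape
    ref = ref - np.mean(ref)
    cc = np.correlate(sig, ref, mode="full")
    lag = int(np.argmax(cc)) - (len(ref) - 1)   # index len(ref)-1 is zero lag
    return lag / rate

# Synthetic check: `sig` is `ref` delayed by 25 samples at 1 kHz.
rng = np.random.default_rng(0)
ref = rng.standard_normal(2000)
sig = np.concatenate((np.zeros(25), ref))[:2000]
audio_offset = gcc_phat_offset(sig, ref, fs=1000.0)
visual_offset = xcorr_offset(sig, ref, rate=1000.0)
```

The PHAT weighting discards spectral magnitude, which tends to sharpen the correlation peak for broadband audio. The plain version keeps magnitude and suits the smoother, low-rate motion signal.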

Confidence Scoring

Each pairwise offset includes a confidence score (0.0 to 1.0):

confidence = peak_magnitude / (noise_floor + epsilon)
confidence = confidence / (confidence + 1.0)  # Normalize to [0, 1]

Pairs with confidence < 0.3 are flagged with warnings. The global optimization uses confidence as weights, giving less influence to uncertain measurements.
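
The two-line formula can be wrapped in a small function to see how the ratio maps onto the 0 to 1 scale. This is a sketch: the function name and the default `epsilon` are hypothetical, and how `peak_magnitude` and `noise_floor` are measured is not specified here, so both are taken as inputs.

```python
def confidence_score(peak_magnitude, noise_floor, epsilon=1e-12):
    """Map a peak-to-noise ratio onto [0, 1) using r / (r + 1)."""
    ratio = peak_magnitude / (noise_floor + epsilon)
    return ratio / (ratio + 1.0)

equal = confidence_score(1.0, 1.0)       # peak equals noise floor
borderline = confidence_score(3.0, 7.0)  # ratio of 3/7
```

Because the mapping is r / (r + 1), a peak equal to the noise floor scores 0.5, and the 0.3 warning threshold corresponds to a ratio of 3/7 (about 0.43).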

Performance Characteristics

Computational Complexity

Pairwise computation: O(N²) where N = number of videos
Per-pair correlation: O(M log M) where M = signal length (FFT-based)
Optimization: O(N) iterations, typically converges in <10 steps

Memory Usage

Audio: Loads all WAV files into memory (~10MB per minute of mono audio)
Visual: Processes video frame-by-frame (memory-efficient)
Peak usage: Typically 200-500MB for 4 videos, 5 minutes each
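
The per-minute audio figure can be sanity-checked with simple arithmetic. The sketch below assumes 44.1 kHz mono, matching the extraction command. The in-memory sample type is not stated here, so the 32-bit float case is an assumption: it gives a number close to the ~10MB quoted, while raw 16-bit PCM is about half that.

```python
def pcm_megabytes_per_minute(sample_rate=44100, bytes_per_sample=2, channels=1):
    """Uncompressed audio size for one minute, in megabytes (10^6 bytes)."""
    return sample_rate * bytes_per_sample * channels * 60 / 1e6

int16_mb = pcm_megabytes_per_minute(bytes_per_sample=2)    # 16-bit PCM as stored in the WAV
float32_mb = pcm_megabytes_per_minute(bytes_per_sample=4)  # assumed in-memory float32
```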

Typical Runtime

For 4 videos, 5 minutes each:

Audio: 5-15 seconds (extraction: 3-5s, sync: 2-10s)
Visual: 30-90 seconds (motion extraction: 20-60s, sync: 10-30s)

Error Handling

The system includes multiple robustness features:

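import logging

logger = logging.getLogger(__name__)

# Low confidence warning (confidence comes from the pairwise step)
if confidence < 0.3:
    logger.warning("Low confidence (%.2f) - sync may be unreliable", confidence)

# Boundary check: offset lies within 10% of the search limit max_offset_sec
if abs(offset_seconds) > max_offset_sec * 0.9:
    logger.warning("Offset near search boundary - may be truncated")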

Manual Review Recommended: Always preview synchronized videos using the built-in multi-video player before exporting. Automated sync can fail in edge cases (e.g., no shared motion/audio, extreme camera angles).

Next Steps

Audio Sync Details

Deep dive into the GCC-PHAT algorithm

Visual Sync Details

Motion detection and correlation

Offset Semantics

Understand positive/negative offsets
