
Overview

Visual synchronization aligns videos by correlating motion patterns across different camera views. Even when cameras capture the same scene from different angles, the timing of motion events (walking, gestures, objects moving) remains the same.
Key Insight: While the visual content differs across camera angles, the temporal occurrence of motion events is synchronized. A person raising their hand happens at the same instant across all cameras.

Algorithm Pipeline

1. Extract Motion Energy: convert each video into a 1D time series representing motion intensity at each frame.
2. Smooth & Normalize: apply temporal smoothing and normalize signals for robust correlation.
3. Pairwise Cross-Correlation: correlate all pairs of motion signals to find time offsets.
4. Global Optimization: solve for globally consistent offsets using weighted least squares.

Motion Energy Extraction

The core algorithm computes frame-to-frame differences:

Step-by-Step Process

# visual_sync.py:63-74
# Center crop to focus on main action area
if center_crop:
    h, w = frame.shape[:2]
    start_y, end_y = int(h * 0.25), int(h * 0.75)
    start_x, end_x = int(w * 0.25), int(w * 0.75)
    frame = frame[start_y:end_y, start_x:end_x]

# Downsample for speed (default 4x)
h, w = frame.shape[:2]
small_frame = cv2.resize(frame, (w // downsample, h // downsample))
gray = cv2.cvtColor(small_frame, cv2.COLOR_BGR2GRAY)

# Apply Gaussian blur to reduce noise
if blur_size > 0:
    gray = cv2.GaussianBlur(gray, (blur_size, blur_size), 0)

Parameters:
  • downsample (int, default: 4): Spatial downsampling factor (higher = faster, less detail)
  • blur_size (int, default: 5): Gaussian blur kernel size (reduces noise sensitivity)
  • center_crop (bool, default: true): Focus on center 50% of frame (ignores edges/timestamps)

# visual_sync.py:76-80
if prev_gray is not None:
    diff = cv2.absdiff(gray, prev_gray)
    _, thresh = cv2.threshold(diff, 15, 255, cv2.THRESH_BINARY)
    energy = np.sum(thresh) / (thresh.shape[0] * thresh.shape[1] * 255)
    motion_energy.append(energy)
This computes the fraction of pixels that changed significantly (by more than 15 gray levels) between consecutive frames.

Example Output:
  • Static scene: energy ≈ 0.01 (1% of pixels changed)
  • Person walking: energy ≈ 0.15 (15% of pixels changed)
  • Fast motion: energy ≈ 0.40 (40% of pixels changed)
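The same energy measure can be sketched with plain NumPy (a minimal stand-in for the cv2.absdiff/cv2.threshold calls above; the synthetic frames and square size are made up, the threshold of 15 follows the snippet):

```python
import numpy as np

def motion_energy(prev_gray: np.ndarray, gray: np.ndarray,
                  thresh: int = 15) -> float:
    """Fraction of pixels whose gray level changed by more than `thresh`."""
    diff = np.abs(gray.astype(np.int16) - prev_gray.astype(np.int16))
    changed = diff > thresh
    return changed.sum() / changed.size

# Two synthetic 100x100 frames: a 20x20 bright square moves 10 px right
prev = np.zeros((100, 100), dtype=np.uint8)
curr = np.zeros((100, 100), dtype=np.uint8)
prev[40:60, 40:60] = 200
curr[40:60, 50:70] = 200

print(motion_energy(prev, curr))  # 0.04: two 20x10 strips changed
```

A static pair of frames yields exactly 0, matching the "static scene" case above.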
# visual_sync.py:53-57
if step > 1 and frame_idx > 0:
    for _ in range(step - 1):
        if not cap.grab():
            break
        frame_idx += 1
Process every Nth frame (default step=3). For a 30fps video:
  • With step=3: Effective rate = 10fps
  • 5-minute video: 9000 samples → 3000 samples
Motion events (human gestures, walking) occur over multiple frames, so 10fps is sufficient for alignment while providing 3x speedup.
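The sampling arithmetic above can be written out as a quick sanity check (numbers match the 30fps, 5-minute example):

```python
fps = 30.0
duration_sec = 5 * 60
step = 3

total_frames = int(fps * duration_sec)       # frames in the source video
sampled = len(range(0, total_frames, step))  # keep every 3rd frame
effective_fps = fps / step

print(total_frames, sampled, effective_fps)  # 9000 3000 10.0
```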

Motion Signal Smoothing

Raw frame differences are noisy. Apply temporal smoothing:
# visual_sync.py:96-103
def smooth_motion_signal(signal: np.ndarray, 
                         fps: float,
                         window_sec: float = 0.2) -> np.ndarray:
    """Smooth the motion signal to reduce noise."""
    window_frames = int(window_sec * fps)
    if window_frames < 1:
        window_frames = 1
    return uniform_filter1d(signal, window_frames)

Parameters:
  • window_sec (float, default: 0.2): Smoothing window duration. At 10fps, 0.2s = 2-frame moving average.
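uniform_filter1d is a centered moving average; a NumPy-only equivalent can be sketched like this (an approximation of the SciPy call, not the library implementation; the noisy test signal is made up):

```python
import numpy as np

def smooth_moving_average(signal: np.ndarray, window: int) -> np.ndarray:
    """Centered moving average with edge padding (~ uniform_filter1d)."""
    window = max(window, 1)
    pad = window // 2
    padded = np.pad(signal, (pad, window - 1 - pad), mode='edge')
    kernel = np.ones(window) / window
    return np.convolve(padded, kernel, mode='valid')

noisy = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
print(smooth_moving_average(noisy, 2))  # alternation flattens toward 0.5
```

With window=2 (0.2s at 10fps, per the default above), frame-to-frame flicker averages out while slower motion trends survive.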

Cross-Correlation

Once motion signals are extracted, compute time offsets between pairs:
# visual_sync.py:105-133
def correlate_motion_signals(sig1: np.ndarray, sig2: np.ndarray,
                             fps: float,
                             max_offset_sec: float = 20.0) -> Tuple[float, float]:
    """Find time offset between two motion signals using cross-correlation."""
    # Normalize signals
    sig1_norm = (sig1 - np.mean(sig1)) / (np.std(sig1) + 1e-10)
    sig2_norm = (sig2 - np.mean(sig2)) / (np.std(sig2) + 1e-10)
    
    # Compute full cross-correlation
    cc = correlate(sig1_norm, sig2_norm, mode='full')
    
    # Constrain search to realistic lag range
    max_lag_frames = int(max_offset_sec * fps)
    center = len(sig2_norm) - 1
    search_start = max(0, center - max_lag_frames)
    search_end = min(len(cc), center + max_lag_frames)
    
    # Find peak in search region
    search_region = cc[search_start:search_end]
    lag_idx_local = np.argmax(search_region)
    lag_idx = search_start + lag_idx_local
    lag_frames = lag_idx - center
    offset_seconds = lag_frames / fps
    
    # Confidence scoring
    peak = cc[lag_idx]
    mean_cc = np.mean(np.abs(cc))
    std_cc = np.std(cc)
    confidence = (peak - mean_cc) / (std_cc + 1e-10)
    confidence = float(np.clip(confidence / 10.0, 0, 1))
    
    return offset_seconds, confidence
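To see the lag/sign convention in action, here is a minimal NumPy demo (np.correlate stands in for scipy.signal.correlate; the burst positions are made up). sig2 contains the same motion burst as sig1, delayed by 20 frames (2s at 10fps):

```python
import numpy as np

fps = 10.0
n = 200
sig1 = np.zeros(n)
sig2 = np.zeros(n)
sig1[50] = 1.0  # motion burst at frame 50
sig2[70] = 1.0  # same burst, 20 frames later

cc = np.correlate(sig1, sig2, mode='full')
lag_frames = np.argmax(cc) - (len(sig2) - 1)  # same centering as the source
offset_seconds = lag_frames / fps

print(offset_seconds)  # -2.0: sig1's burst happens 2s earlier than sig2's
```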

Confidence Interpretation

The confidence score measures how distinct the correlation peak is. A high score indicates:
  • A strong, unique correlation peak
  • Clear shared motion patterns
  • A reliable offset estimate

Correlation Function:
      *          ← Clear peak
     / \
____/   \____    ← Low baseline

Global Optimization

After computing all pairwise offsets, solve for globally consistent alignment:
# visual_sync.py:218-228
def residuals(offsets):
    res = []
    for (f1, f2), (offset, conf) in pairwise_offsets.items():
        i, j = file_to_idx[f1], file_to_idx[f2]
        # Offset j - Offset i should equal measured offset
        res.append(np.sqrt(conf) * (offsets[j] - offsets[i] - offset))
    return np.array(res)

x0 = np.zeros(n)
result = least_squares(residuals, x0, loss='soft_l1', f_scale=0.5)
offsets_opt = result.x - result.x[0]  # Anchor first file to t=0
Weighted Least-Squares: High-confidence pairs have more influence on the final solution. The soft_l1 loss function reduces impact of outliers (failed pairwise estimates).
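Because each residual is linear in the offsets, the solve can be sketched as a weighted linear least-squares problem (three hypothetical files with true offsets 0s, 1.5s, 3.0s; the pairwise measurements and confidences are made up, and np.linalg.lstsq stands in for scipy.optimize.least_squares):

```python
import numpy as np

# (file_i, file_j) -> (measured offset_j - offset_i, confidence)
pairwise = {
    (0, 1): (1.5, 0.9),
    (0, 2): (3.0, 0.8),
    (1, 2): (1.5, 0.6),
}
n = 3

A = np.zeros((len(pairwise), n))
b = np.zeros(len(pairwise))
for row, ((i, j), (offset, conf)) in enumerate(pairwise.items()):
    w = np.sqrt(conf)                # high-confidence pairs weigh more
    A[row, i], A[row, j] = -w, w     # offsets[j] - offsets[i] ~ offset
    b[row] = w * offset

x, *_ = np.linalg.lstsq(A, b, rcond=None)
offsets = x - x[0]                   # anchor first file to t = 0
print(offsets)                       # ≈ [0.0, 1.5, 3.0]
```

The robust soft_l1 loss in the source additionally down-weights large residuals, which this plain linear version omits.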

Performance Optimizations

Parallel Processing

# visual_sync.py:176-178
with ThreadPoolExecutor() as executor:
    results = list(executor.map(process_one, selected_files))
Motion extraction runs in parallel across all videos.

Frame Skipping

step=3  # Process every 3rd frame
# 30fps → 10fps effective sampling
3x speedup with negligible accuracy loss.

Spatial Downsampling

downsample=4  # 1920x1080 → 480x270
16x fewer pixels to process per frame.
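The pixel-count saving is easy to verify (stride slicing used here as a rough stand-in for cv2.resize):

```python
import numpy as np

frame = np.zeros((1080, 1920), dtype=np.uint8)  # grayscale Full HD frame
small = frame[::4, ::4]                          # ~ downsample=4

print(small.shape, frame.size // small.size)     # (270, 480) 16
```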

Center Cropping

center_crop=True  # Use center 50% of frame
Ignores edges (timestamps, UI elements), focuses on action.

Typical Runtime

For 4 videos, 1920x1080, 30fps, 5 minutes each:
| Stage                          | Duration |
| ------------------------------ | -------- |
| Motion extraction (parallel)   | 20-40s   |
| Pairwise correlation (6 pairs) | 5-10s    |
| Global optimization            | <1s      |
| Total                          | 25-50s   |

Visualization

The system generates diagnostic plots:
# visual_sync.py:135-154
def visualize_motion_signals(motion_signals: Dict[str, np.ndarray],
                             fps: float,
                             output_path: str):
    """Create a visualization of motion signals."""
    import matplotlib.pyplot as plt
    n = len(motion_signals)
    fig, axes = plt.subplots(n, 1, figsize=(14, 3*n), sharex=True)
    if n == 1:
        axes = [axes]  # subplots returns a bare Axes (not iterable) when n == 1
    
    for ax, (name, signal) in zip(axes, motion_signals.items()):
        time = np.arange(len(signal)) / fps
        ax.plot(time, signal, 'b-', linewidth=0.5)
        ax.set_ylabel('Motion')
        ax.set_title(name)
        ax.grid(True, alpha=0.3)
    
    plt.savefig(output_path, dpi=150)
    plt.close(fig)
Camera A: ____/\____/\/\/\_____/\____
Camera B: ____/\____/\/\/\_____/\____  (synchronized)
Camera C: __/\____/\/\/\_____/\______  (offset -2s)

Common Failure Modes

Symptom: Low confidence scores across all pairs
Cause: No significant motion in overlapping time periods
Solution:
  • Use audio sync if available
  • Manually clap or create a visible event at recording start
  • Ensure cameras have overlapping field of view with motion

Symptom: Moderate confidence, inconsistent pairwise offsets
Cause: Cameras point at completely different areas (no shared motion)
Solution:
  • Verify cameras capture the same scene
  • Use audio sync as an alternative
  • Ensure at least some overlapping view area

Symptom: Noisy motion signal, low correlation peaks
Cause: Fast motion at low framerate causes blur
Solution:
  • Increase blur_size to smooth more aggressively
  • Reduce step to sample more frames
  • Record at higher framerate (60fps recommended)

Advanced Configuration

Tune parameters in the source code:
# visual_sync.py:156-159
motion, eff_fps = extract_motion_energy(
    path, 
    downsample=4,      # Spatial downsampling
    blur_size=5,       # Gaussian blur kernel
    center_crop=True,  # Crop to center 50%
    step=3             # Temporal downsampling
)

# visual_sync.py:185-191
motion = smooth_motion_signal(
    motion, 
    eff_fps,
    window_sec=0.2     # Smoothing window
)

# visual_sync.py:157
offset, conf = correlate_motion_signals(
    sig1, sig2, 
    target_fps,
    max_offset_sec=20.0  # Maximum search range
)

Parameters:
  • downsample (int, default: 4): Spatial downsampling factor (1-8). Higher = faster but less accurate.
  • step (int, default: 3): Frame skip factor. Process every Nth frame.
  • blur_size (int, default: 5): Gaussian blur kernel (odd number). Larger = more noise reduction.
  • window_sec (float, default: 0.2): Smoothing window duration in seconds.
  • max_offset_sec (float, default: 20.0): Maximum expected time offset between videos.

Comparison to Audio Sync

| Aspect           | Visual (Motion)      | Audio (GCC-PHAT)       |
| ---------------- | -------------------- | ---------------------- |
| Precision        | ±30-100ms            | ±1-10ms                |
| Speed            | Moderate (30-60s)    | Fast (5-15s)           |
| Silent videos    | ✅ Works             | ❌ Requires audio      |
| Different angles | ⚠️ Needs shared view | ✅ Works anywhere      |
| Robustness       | High (motion-based)  | High (frequency-based) |
Best Practice: If videos have audio, use audio sync for higher precision. Use visual sync for silent recordings or as a validation method.

Source Code Reference

Key functions in src/visual_sync.py:
  • extract_motion_energy() - Line 33: Frame-by-frame motion extraction
  • smooth_motion_signal() - Line 96: Temporal smoothing
  • correlate_motion_signals() - Line 105: Cross-correlation
  • sync_videos_by_motion() - Line 156: Main entry point
  • visualize_motion_signals() - Line 135: Diagnostic plotting

Next Steps

Audio Sync

Learn about GCC-PHAT audio alignment

Offset Semantics

Understand how offsets are applied
