
Overview

Visual synchronization aligns videos by correlating motion patterns across different camera views. Even when cameras capture the same scene from different angles, the timing of motion events (walking, gestures, objects moving) remains the same.
Key Insight: While the visual content differs across camera angles, the temporal occurrence of motion events is synchronized. A person raising their hand happens at the same instant across all cameras.

Algorithm Pipeline

1. Extract Motion Energy: convert each video into a 1D time series representing motion intensity at each frame.
2. Smooth & Normalize: apply temporal smoothing and normalize signals for robust correlation.
3. Pairwise Cross-Correlation: correlate all pairs of motion signals to find time offsets.
4. Global Optimization: solve for globally consistent offsets using weighted least squares.

Motion Energy Extraction

The core algorithm computes frame-to-frame differences:

Step-by-Step Process

# visual_sync.py:63-74
# Center crop to focus on main action area
if center_crop:
    h, w = frame.shape[:2]
    start_y, end_y = int(h * 0.25), int(h * 0.75)
    start_x, end_x = int(w * 0.25), int(w * 0.75)
    frame = frame[start_y:end_y, start_x:end_x]

# Downsample for speed (default 4x)
h, w = frame.shape[:2]
small_frame = cv2.resize(frame, (w // downsample, h // downsample))
gray = cv2.cvtColor(small_frame, cv2.COLOR_BGR2GRAY)

# Apply Gaussian blur to reduce noise
if blur_size > 0:
    gray = cv2.GaussianBlur(gray, (blur_size, blur_size), 0)

Parameters:
  • downsample (int, default: 4): Spatial downsampling factor (higher = faster, less detail)
  • blur_size (int, default: 5): Gaussian blur kernel size (reduces noise sensitivity)
  • center_crop (bool, default: true): Focus on center 50% of frame (ignores edges/timestamps)

# visual_sync.py:76-80
if prev_gray is not None:
    diff = cv2.absdiff(gray, prev_gray)
    _, thresh = cv2.threshold(diff, 15, 255, cv2.THRESH_BINARY)
    energy = np.sum(thresh) / (thresh.shape[0] * thresh.shape[1] * 255)
    motion_energy.append(energy)
This computes the fraction of pixels that changed significantly (by more than 15 gray levels) between consecutive frames.

Example Output:
  • Static scene: energy ≈ 0.01 (1% of pixels changed)
  • Person walking: energy ≈ 0.15 (15% of pixels changed)
  • Fast motion: energy ≈ 0.40 (40% of pixels changed)
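The same energy measure can be sketched with plain NumPy (a minimal stand-in for the cv2.absdiff/cv2.threshold calls above; the synthetic frames and square size are made up, the threshold of 15 follows the snippet):

```python
import numpy as np

def motion_energy(prev_gray: np.ndarray, gray: np.ndarray,
                  thresh: int = 15) -> float:
    """Fraction of pixels whose gray level changed by more than `thresh`."""
    diff = np.abs(gray.astype(np.int16) - prev_gray.astype(np.int16))
    changed = diff > thresh
    return changed.sum() / changed.size

# Two synthetic 100x100 frames: a 20x20 bright square moves 10 px right
prev = np.zeros((100, 100), dtype=np.uint8)
curr = np.zeros((100, 100), dtype=np.uint8)
prev[40:60, 40:60] = 200
curr[40:60, 50:70] = 200

print(motion_energy(prev, curr))  # 0.04: two 20x10 strips changed
```

A static pair of frames yields exactly 0, matching the "static scene" case above.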
# visual_sync.py:53-57
if step > 1 and frame_idx > 0:
    for _ in range(step - 1):
        if not cap.grab():
            break
        frame_idx += 1
Process every Nth frame (default step=3). For a 30fps video:
  • With step=3: Effective rate = 10fps
  • 5-minute video: 9000 samples → 3000 samples
Motion events (human gestures, walking) occur over multiple frames, so 10fps is sufficient for alignment while providing 3x speedup.
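The sampling arithmetic above can be written out as a quick sanity check (numbers match the 30fps, 5-minute example):

```python
fps = 30.0
duration_sec = 5 * 60
step = 3

total_frames = int(fps * duration_sec)       # frames in the source video
sampled = len(range(0, total_frames, step))  # keep every 3rd frame
effective_fps = fps / step

print(total_frames, sampled, effective_fps)  # 9000 3000 10.0
```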

Motion Signal Smoothing

Raw frame differences are noisy. Apply temporal smoothing:
# visual_sync.py:96-103
def smooth_motion_signal(signal: np.ndarray, 
                         fps: float,
                         window_sec: float = 0.2) -> np.ndarray:
    """Smooth the motion signal to reduce noise."""
    window_frames = int(window_sec * fps)
    if window_frames < 1:
        window_frames = 1
    return uniform_filter1d(signal, window_frames)

Parameters:
  • window_sec (float, default: 0.2): Smoothing window duration. At 10fps, 0.2s = 2-frame moving average.
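uniform_filter1d is a centered moving average; a NumPy-only equivalent can be sketched like this (an approximation of the SciPy call, not the library implementation; the noisy test signal is made up):

```python
import numpy as np

def smooth_moving_average(signal: np.ndarray, window: int) -> np.ndarray:
    """Centered moving average with edge padding (~ uniform_filter1d)."""
    window = max(window, 1)
    pad = window // 2
    padded = np.pad(signal, (pad, window - 1 - pad), mode='edge')
    kernel = np.ones(window) / window
    return np.convolve(padded, kernel, mode='valid')

noisy = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
print(smooth_moving_average(noisy, 2))  # alternation flattens toward 0.5
```

With window=2 (0.2s at 10fps, per the default above), frame-to-frame flicker averages out while slower motion trends survive.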

Cross-Correlation

Once motion signals are extracted, compute time offsets between pairs:
# visual_sync.py:105-133
def correlate_motion_signals(sig1: np.ndarray, sig2: np.ndarray,
                             fps: float,
                             max_offset_sec: float = 20.0) -> Tuple[float, float]:
    """Find time offset between two motion signals using cross-correlation."""
    # Normalize signals
    sig1_norm = (sig1 - np.mean(sig1)) / (np.std(sig1) + 1e-10)
    sig2_norm = (sig2 - np.mean(sig2)) / (np.std(sig2) + 1e-10)
    
    # Compute full cross-correlation
    cc = correlate(sig1_norm, sig2_norm, mode='full')
    
    # Constrain search to realistic lag range
    max_lag_frames = int(max_offset_sec * fps)
    center = len(sig2_norm) - 1
    search_start = max(0, center - max_lag_frames)
    search_end = min(len(cc), center + max_lag_frames)
    
    # Find peak in search region
    search_region = cc[search_start:search_end]
    lag_idx_local = np.argmax(search_region)
    lag_idx = search_start + lag_idx_local
    lag_frames = lag_idx - center
    offset_seconds = lag_frames / fps
    
    # Confidence scoring
    peak = cc[lag_idx]
    mean_cc = np.mean(np.abs(cc))
    std_cc = np.std(cc)
    confidence = (peak - mean_cc) / (std_cc + 1e-10)
    confidence = float(np.clip(confidence / 10.0, 0, 1))
    
    return offset_seconds, confidence
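To see the lag/sign convention in action, here is a minimal NumPy demo (np.correlate stands in for scipy.signal.correlate; the burst positions are made up). sig2 contains the same motion burst as sig1, delayed by 20 frames (2s at 10fps):

```python
import numpy as np

fps = 10.0
n = 200
sig1 = np.zeros(n)
sig2 = np.zeros(n)
sig1[50] = 1.0  # motion burst at frame 50
sig2[70] = 1.0  # same burst, 20 frames later

cc = np.correlate(sig1, sig2, mode='full')
lag_frames = np.argmax(cc) - (len(sig2) - 1)  # same centering as the source
offset_seconds = lag_frames / fps

print(offset_seconds)  # -2.0: sig1's burst happens 2s earlier than sig2's
```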

Confidence Interpretation

The confidence score measures how distinct the correlation peak is. A high score indicates:
  • A strong, unique correlation peak
  • Clear shared motion patterns
  • A reliable offset estimate

Correlation Function:
      *          ← Clear peak
     / \
____/   \____    ← Low baseline

Global Optimization

After computing all pairwise offsets, solve for globally consistent alignment:
# visual_sync.py:218-228
def residuals(offsets):
    res = []
    for (f1, f2), (offset, conf) in pairwise_offsets.items():
        i, j = file_to_idx[f1], file_to_idx[f2]
        # Offset j - Offset i should equal measured offset
        res.append(np.sqrt(conf) * (offsets[j] - offsets[i] - offset))
    return np.array(res)

x0 = np.zeros(n)
result = least_squares(residuals, x0, loss='soft_l1', f_scale=0.5)
offsets_opt = result.x - result.x[0]  # Anchor first file to t=0
Weighted Least-Squares: High-confidence pairs have more influence on the final solution. The soft_l1 loss function reduces impact of outliers (failed pairwise estimates).
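Because each residual is linear in the offsets, the solve can be sketched as a weighted linear least-squares problem (three hypothetical files with true offsets 0s, 1.5s, 3.0s; the pairwise measurements and confidences are made up, and np.linalg.lstsq stands in for scipy.optimize.least_squares):

```python
import numpy as np

# (file_i, file_j) -> (measured offset_j - offset_i, confidence)
pairwise = {
    (0, 1): (1.5, 0.9),
    (0, 2): (3.0, 0.8),
    (1, 2): (1.5, 0.6),
}
n = 3

A = np.zeros((len(pairwise), n))
b = np.zeros(len(pairwise))
for row, ((i, j), (offset, conf)) in enumerate(pairwise.items()):
    w = np.sqrt(conf)                # high-confidence pairs weigh more
    A[row, i], A[row, j] = -w, w     # offsets[j] - offsets[i] ~ offset
    b[row] = w * offset

x, *_ = np.linalg.lstsq(A, b, rcond=None)
offsets = x - x[0]                   # anchor first file to t = 0
print(offsets)                       # ≈ [0.0, 1.5, 3.0]
```

The robust soft_l1 loss in the source additionally down-weights large residuals, which this plain linear version omits.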

Performance Optimizations

Parallel Processing

# visual_sync.py:176-178
with ThreadPoolExecutor() as executor:
    results = list(executor.map(process_one, selected_files))
Motion extraction runs in parallel across all videos.

Frame Skipping

step=3  # Process every 3rd frame
# 30fps → 10fps effective sampling
3x speedup with negligible accuracy loss.

Spatial Downsampling

downsample=4  # 1920x1080 → 480x270
16x fewer pixels to process per frame.
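The pixel-count saving is easy to verify (stride slicing used here as a rough stand-in for cv2.resize):

```python
import numpy as np

frame = np.zeros((1080, 1920), dtype=np.uint8)  # grayscale Full HD frame
small = frame[::4, ::4]                          # ~ downsample=4

print(small.shape, frame.size // small.size)     # (270, 480) 16
```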

Center Cropping

center_crop=True  # Use center 50% of frame
Ignores edges (timestamps, UI elements), focuses on action.

Typical Runtime

For 4 videos, 1920x1080, 30fps, 5 minutes each:
| Stage                          | Duration |
| ------------------------------ | -------- |
| Motion extraction (parallel)   | 20-40s   |
| Pairwise correlation (6 pairs) | 5-10s    |
| Global optimization            | <1s      |
| Total                          | 25-50s   |

Visualization

The system generates diagnostic plots:
# visual_sync.py:135-154
def visualize_motion_signals(motion_signals: Dict[str, np.ndarray],
                             fps: float,
                             output_path: str):
    """Create a visualization of motion signals."""
    import matplotlib.pyplot as plt
    n = len(motion_signals)
    fig, axes = plt.subplots(n, 1, figsize=(14, 3*n), sharex=True)
    if n == 1:
        axes = [axes]  # subplots returns a bare Axes (not iterable) when n == 1
    
    for ax, (name, signal) in zip(axes, motion_signals.items()):
        time = np.arange(len(signal)) / fps
        ax.plot(time, signal, 'b-', linewidth=0.5)
        ax.set_ylabel('Motion')
        ax.set_title(name)
        ax.grid(True, alpha=0.3)
    
    plt.savefig(output_path, dpi=150)
    plt.close(fig)
Camera A: ____/\____/\/\/\_____/\____
Camera B: ____/\____/\/\/\_____/\____  (synchronized)
Camera C: __/\____/\/\/\_____/\______  (offset -2s)

Common Failure Modes

Symptom: Low confidence scores across all pairs
Cause: No significant motion in overlapping time periods
Solution:
  • Use audio sync if available
  • Manually clap or create a visible event at recording start
  • Ensure cameras have overlapping field of view with motion

Symptom: Moderate confidence, inconsistent pairwise offsets
Cause: Cameras point at completely different areas (no shared motion)
Solution:
  • Verify cameras capture the same scene
  • Use audio sync as an alternative
  • Ensure at least some overlapping view area

Symptom: Noisy motion signal, low correlation peaks
Cause: Fast motion at low framerate causes blur
Solution:
  • Increase blur_size to smooth more aggressively
  • Reduce step to sample more frames
  • Record at higher framerate (60fps recommended)

Advanced Configuration

Tune parameters in the source code:
# visual_sync.py:156-159
motion, eff_fps = extract_motion_energy(
    path, 
    downsample=4,      # Spatial downsampling
    blur_size=5,       # Gaussian blur kernel
    center_crop=True,  # Crop to center 50%
    step=3             # Temporal downsampling
)

# visual_sync.py:185-191
motion = smooth_motion_signal(
    motion, 
    eff_fps,
    window_sec=0.2     # Smoothing window
)

# visual_sync.py:157
offset, conf = correlate_motion_signals(
    sig1, sig2, 
    target_fps,
    max_offset_sec=20.0  # Maximum search range
)

Parameters:
  • downsample (int, default: 4): Spatial downsampling factor (1-8). Higher = faster but less accurate.
  • step (int, default: 3): Frame skip factor. Process every Nth frame.
  • blur_size (int, default: 5): Gaussian blur kernel (odd number). Larger = more noise reduction.
  • window_sec (float, default: 0.2): Smoothing window duration in seconds.
  • max_offset_sec (float, default: 20.0): Maximum expected time offset between videos.

Comparison to Audio Sync

| Aspect           | Visual (Motion)      | Audio (GCC-PHAT)       |
| ---------------- | -------------------- | ---------------------- |
| Precision        | ±30-100ms            | ±1-10ms                |
| Speed            | Moderate (30-60s)    | Fast (5-15s)           |
| Silent videos    | ✅ Works             | ❌ Requires audio      |
| Different angles | ⚠️ Needs shared view | ✅ Works anywhere      |
| Robustness       | High (motion-based)  | High (frequency-based) |
Best Practice: If videos have audio, use audio sync for higher precision. Use visual sync for silent recordings or as a validation method.

Source Code Reference

Key functions in src/visual_sync.py:
  • extract_motion_energy() - Line 33: Frame-by-frame motion extraction
  • smooth_motion_signal() - Line 96: Temporal smoothing
  • correlate_motion_signals() - Line 105: Cross-correlation
  • sync_videos_by_motion() - Line 156: Main entry point
  • visualize_motion_signals() - Line 135: Diagnostic plotting

Next Steps

Audio Sync

Learn about GCC-PHAT audio alignment

Offset Semantics

Understand how offsets are applied
