
Overview

Audio synchronization uses GCC-PHAT (Generalized Cross-Correlation with Phase Transform) to align videos by correlating their audio tracks. This method provides sub-millisecond precision when videos share common sound events.
GCC-PHAT is a frequency-domain cross-correlation technique that emphasizes phase information while de-emphasizing magnitude. This makes it robust to differences in microphone placement, gain, and frequency response.

Algorithm: GCC-PHAT

Mathematical Foundation

Given two audio signals a(t) and b(t), GCC-PHAT computes:
1. FFT: A(f) = FFT[a(t)], B(f) = FFT[b(t)]
2. Cross-Power Spectrum: R(f) = A(f) · conj(B(f))
3. Phase Transform: R_PHAT(f) = R(f) / |R(f)|
4. IFFT: cc(τ) = IFFT[R_PHAT(f)]
5. Find peak: τ_offset = argmax(cc(τ))
The peak location in the correlation function cc(τ) indicates the time offset.
The division by magnitude |R(f)| normalizes each frequency bin to unit magnitude:
R_PHAT(f) = R(f) / |R(f)| = exp(i·θ(f))
This retains only phase information, which is more reliable than magnitude for time-delay estimation because:
  • Phase is consistent across microphone locations
  • Magnitude varies with distance, orientation, and frequency response
  • Phase differences directly encode time delays: Δφ = 2πf·Δt
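The five steps above can be sketched end-to-end in NumPy. This is an illustrative, self-contained version, not the project's API; the function name and test signal are assumptions:

```python
import numpy as np

def gcc_phat_offset(a, b, fs):
    # Pad so linear (not circular) correlation fits, then apply steps 1-5.
    n = 1 << (2 * max(len(a), len(b)) - 1).bit_length()
    A = np.fft.rfft(a, n=n)
    B = np.fft.rfft(b, n=n)
    R = A * np.conj(B)                          # cross-power spectrum
    R /= np.maximum(np.abs(R), 1e-12)           # phase transform: unit magnitude
    cc = np.fft.irfft(R, n=n)                   # correlation function
    cc = np.concatenate((cc[-(n // 2):], cc[:n // 2]))  # center the lags
    lag = int(np.argmax(np.abs(cc))) - n // 2
    return lag / fs

# Synthetic check: b is a delayed by 120 samples, so the raw peak
# lands at lag -120 (the production code then negates this, per its
# documented sign convention).
rng = np.random.default_rng(0)
fs = 8000
a = rng.standard_normal(fs)
b = np.concatenate((np.zeros(120), a))[:fs]
offset = gcc_phat_offset(a, b, fs)  # -120 / 8000 = -0.015 s
```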

Implementation

Core Function: compute_gcc_phat()

The complete algorithm in src/audio_sync.py:
# audio_sync.py:27-118
def compute_gcc_phat(sig_a: np.ndarray, sig_b: np.ndarray, fs: int, 
                     max_offset_sec: float = 10.0, 
                     window_sec: Optional[float] = None) -> Tuple[float, float]:
    """
    Compute time offset between two signals using GCC-PHAT.
    
    Args:
        sig_a: Reference signal
        sig_b: Signal to align
        fs: Sample rate (Hz)
        max_offset_sec: Maximum expected offset (default 10s)
        window_sec: Use only first N seconds for speed (default: use all)
    
    Returns:
        (offset_seconds, confidence_score)
        offset_seconds is the amount to add to sig_b timestamps to align to sig_a.
    """

Step-by-Step Breakdown

Step 1: Bandpass Filtering

# audio_sync.py:44-46
sos = butter(4, [300, 5000], btype='bandpass', fs=fs, output='sos')
sig_a = sosfilt(sos, sig_a)
sig_b = sosfilt(sos, sig_b)
Filter: 4th-order Butterworth bandpass, 300-5000 Hz. Focuses on the speech/ambient range and removes DC offset and high-frequency noise.
Why? Most relevant audio content (speech, footsteps, ambient sound) falls in this range. Filtering improves SNR and reduces the impact of low-frequency rumble and high-frequency noise.
Step 2: Windowing (Optional)

# audio_sync.py:48-51
if window_sec is not None:
    window_samples = int(window_sec * fs)
    sig_a = sig_a[:window_samples]
    sig_b = sig_b[:window_samples]
window_sec (float | None, default 30.0 at the pipeline entry point, None here): use only the first N seconds for speed. 30s is typically sufficient for alignment.
Trade-off: shorter windows mean faster computation, but require shared audio events within that window.
Step 3: Zero-Mean Normalization

# audio_sync.py:52-53
a = sig_a - np.mean(sig_a)
b = sig_b - np.mean(sig_b)
Removes DC bias before FFT.
Step 4: Zero-Padding to Power of 2

# audio_sync.py:54-55
max_len = max(len(a), len(b))
n = next_pow2(2 * max_len)  # Pad to avoid circular convolution
Why 2x? Cross-correlation requires 2·max_len - 1 output samples. Padding to next power of 2 enables efficient FFT.
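The snippet calls a helper next_pow2 that is not shown in the excerpt. A typical implementation (an assumption, not the project's verbatim code) is:

```python
def next_pow2(x: int) -> int:
    """Smallest power of two greater than or equal to x."""
    return 1 << (x - 1).bit_length()

# E.g. a 30 s window at 44.1 kHz has 1,323,000 samples;
# doubling and rounding up gives 2^22.
n = next_pow2(2 * 1_323_000)
```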
Step 5: FFT and Phase Transform

# audio_sync.py:56-62
A = fft(a, n=n)
B = fft(b, n=n)
R = A * np.conj(B)  # Cross-power spectrum
denom = np.abs(R)
denom[denom < 1e-8] = 1e-8  # Avoid division by zero
R_phat = R / denom  # Phase transform
cc = np.real(ifft(R_phat))  # Correlation function
This is the core GCC-PHAT operation.
Step 6: Lag Alignment

# audio_sync.py:63
cc = np.concatenate((cc[-(n//2):], cc[:n//2]))  # FFT shift
Rearranges correlation from [0, n) to [-n/2, n/2) for centered lags.
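For even n (always the case here, since n is a power of two), the concatenation is exactly np.fft.fftshift. A quick check with a stand-in array:

```python
import numpy as np

n = 8
cc = np.arange(float(n))                             # stand-in correlation output
centered = np.concatenate((cc[-(n // 2):], cc[:n // 2]))
assert np.array_equal(centered, np.fft.fftshift(cc))  # identical for even n
lags = np.arange(-n // 2, n // 2)                    # index i now means lag lags[i]
```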
Step 7: Constrained Peak Search

# audio_sync.py:65-72
max_lag_samples = int(max_offset_sec * fs)
center = n // 2
search_start = max(0, center - max_lag_samples)
search_end = min(len(cc), center + max_lag_samples)
search_region = cc[search_start:search_end]
lag_idx_local = np.argmax(np.abs(search_region))
lag_idx = search_start + lag_idx_local
max_offset_sec (float, default 10.0): search only within ±10s. Prevents false peaks at extreme lags.
Step 8: Sub-Sample Interpolation

# audio_sync.py:76-82
if 0 < lag_idx < len(cc) - 1:
    y1, y2, y3 = cc[lag_idx-1], cc[lag_idx], cc[lag_idx+1]
    denom_interp = 2*y2 - y1 - y3
    if abs(denom_interp) > 1e-8:
        delta = 0.5 * (y3 - y1) / denom_interp
        offset_seconds += delta / float(fs)
Parabolic interpolation refines the peak location to sub-sample precision.
At 44.1kHz sampling, each sample = 22.7μs. Interpolation achieves ~2-5μs precision.
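The three-point parabolic-vertex formula can be checked on an exact parabola (the helper name is illustrative):

```python
def parabolic_refine(y1, y2, y3):
    """Sub-sample offset of the vertex of the parabola through
    (-1, y1), (0, y2), (1, y3), where y2 is the discrete peak."""
    denom = 2 * y2 - y1 - y3
    return 0.5 * (y3 - y1) / denom if abs(denom) > 1e-8 else 0.0

# Parabola peaking at x = 0.3, sampled at integer positions:
y1, y2, y3 = (1 - (x - 0.3) ** 2 for x in (-1, 0, 1))
delta = parabolic_refine(y1, y2, y3)  # recovers 0.3 (exact for a true parabola)
```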
Step 9: Confidence Scoring

# audio_sync.py:84-98
peak = np.abs(cc[lag_idx])
window = int(0.01 * fs)  # 10ms exclusion window
exclude_start = max(0, lag_idx - window)
exclude_end = min(len(cc), lag_idx + window)

# Compute noise floor excluding peak region
mag = np.abs(cc)
noise_vals = np.concatenate((mag[:exclude_start], mag[exclude_end:]))
noise_floor = np.mean(noise_vals) if noise_vals.size > 0 else np.mean(mag)

# Normalized confidence
confidence = float(peak / (noise_floor + 1e-8))
confidence = confidence / (confidence + 1.0)  # Map to [0, 1]
Confidence Interpretation:
  • >0.7: Excellent (clear shared audio)
  • 0.3-0.7: Good (reliable alignment)
  • <0.3: Poor (low confidence warning issued)
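The scoring above can be condensed into a runnable sketch and exercised on a synthetic correlation with one sharp peak (function name and test data are illustrative):

```python
import numpy as np

def peak_confidence(cc, lag_idx, fs):
    window = int(0.01 * fs)                            # 10 ms exclusion around peak
    mag = np.abs(cc)
    lo, hi = max(0, lag_idx - window), min(len(cc), lag_idx + window)
    noise = np.concatenate((mag[:lo], mag[hi:]))
    floor = noise.mean() if noise.size > 0 else mag.mean()
    ratio = mag[lag_idx] / (floor + 1e-8)
    return float(ratio / (ratio + 1.0))                # map to [0, 1)

cc = np.full(1000, 0.01)   # flat noise floor
cc[500] = 1.0              # sharp, unambiguous peak
conf = peak_confidence(cc, 500, fs=8000)  # close to 1.0 -> "excellent"
```

A peak no higher than the noise floor scores near 0.5, and scores approach 1.0 only as the peak dominates the floor.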

Sign Convention

# audio_sync.py:118
return -offset_seconds, confidence
Note the negative sign. The offset represents how much to shift sig_b to align with sig_a. See Offset Semantics for details.

Pairwise Alignment

Instead of using a single reference video, the system computes all pairwise offsets:
# audio_sync.py:120-197
def compute_pairwise_offsets(audio_dir: str, 
                            max_offset_sec: float = 10.0,
                            window_sec: Optional[float] = 30.0,
                            min_confidence: float = 0.0) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """
    Compute offsets between all pairs of WAV files.
    
    Returns:
        Dict mapping (fileA, fileB) -> (offset_seconds, confidence)
    """

Process

# audio_sync.py:143-151
signals = {}
sample_rates = {}
for w in wavs:
    path = os.path.join(audio_dir, w)
    sig, sr = load_audio(path)
    signals[w] = sig
    sample_rates[w] = sr
All audio files loaded into memory for fast pairwise processing.
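The pairwise loop itself amounts to one measurement per unordered pair (N files give N·(N-1)/2 calls). A toy skeleton, with a stand-in measurement function in place of compute_gcc_phat:

```python
from itertools import combinations

def pairwise_offsets(signals, measure):
    """One measurement per unordered pair of files."""
    return {(a, b): measure(signals[a], signals[b])
            for a, b in combinations(sorted(signals), 2)}

# Toy stand-in: "signals" are start times and the offset is their difference.
toy = {'cam1.wav': 0.0, 'cam2.wav': 1.5, 'cam3.wav': -0.5}
offsets = pairwise_offsets(toy, lambda a, b: b - a)
# 3 files -> 3 pairs, e.g. ('cam1.wav', 'cam2.wav') -> 1.5
```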

Global Optimization

Raw pairwise offsets may be inconsistent (e.g., A→B = 1s, B→C = 2s, C→A = -2.5s: the cycle sums to 0.5s instead of 0, violating cycle consistency). Global optimization finds the best-fit offsets:
# audio_sync.py:199-239
def optimize_offsets(pairwise: Dict[Tuple[str, str], Tuple[float, float]], 
                     wavs: List[str]) -> Dict[str, float]:
    """
    Find globally consistent offsets using weighted least-squares.
    
    Minimizes: Σ w_AB * (offset_B - offset_A - d_AB)²
    """
    def residuals(offsets):
        res = []
        for (file_a, file_b), (d_ab, conf) in pairwise.items():
            i = file_to_idx[file_a]
            j = file_to_idx[file_b]
            # offset_B - offset_A should equal d_AB
            error = offsets[j] - offsets[i] - d_ab
            res.append(np.sqrt(conf) * error)  # Weight by sqrt(confidence)
        return np.array(res)
    
    x0 = np.zeros(n)
    result = least_squares(residuals, x0, loss='soft_l1', f_scale=0.1)
    offsets_opt = result.x - result.x[0]  # Anchor first file to 0

Optimization Details

For each pair (A, B) with measured offset d_AB and confidence w_AB:
Minimize: E = Σ w_AB · (offset_B - offset_A - d_AB)²
The residuals are linear in the unknown offsets, so scipy.optimize.least_squares solves this quickly and reliably, even with the robust loss applied.
res.append(np.sqrt(conf) * error)
Since we minimize squared residuals, weighting by √confidence gives effective weight of confidence in the objective.
High-confidence pairs (e.g., 0.9) have 3x more influence than low-confidence pairs (e.g., 0.3).
loss='soft_l1', f_scale=0.1
Soft L1 is a smooth approximation of the Huber loss: quadratic for small residuals, linear for large ones. In SciPy's parameterization, with z = (r / f_scale)²:
ρ(z) = 2·(√(1 + z) − 1)
Residuals much larger than f_scale (0.1s) grow only linearly in cost, so badly misaligned pairs are down-weighted rather than dominating the fit.
offsets_opt = result.x - result.x[0]
The first video is anchored to t=0 (arbitrary reference frame). All other offsets are relative to it.
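As a worked example, plain unweighted least-squares resolves the inconsistent cycle mentioned earlier (the real code additionally applies soft_l1 and confidence weights):

```python
import numpy as np
from scipy.optimize import least_squares

# Inconsistent cycle: A->B = 1s, B->C = 2s, C->A = -2.5s (sums to 0.5s, not 0).
pairs = {(0, 1): (1.0, 1.0), (1, 2): (2.0, 1.0), (2, 0): (-2.5, 1.0)}

def residuals(offsets):
    return np.array([np.sqrt(conf) * (offsets[j] - offsets[i] - d)
                     for (i, j), (d, conf) in pairs.items()])

result = least_squares(residuals, np.zeros(3))
opt = result.x - result.x[0]   # anchor the first file at t=0
# The 0.5 s inconsistency is split evenly across the three edges:
# opt ≈ [0, 5/6, 8/3] ≈ [0, 0.833, 2.667]
```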

Outlier Detection

After optimization, the system flags inconsistent pairwise measurements:
# utils.py (called from audio_sync.py:284)
outliers = detect_outliers(pairwise, optimized, threshold=0.5)
A pair (A, B) is flagged if:
|optimized[B] - optimized[A] - measured_offset_AB| > 0.5s
Outliers indicate pairwise sync failures (e.g., no shared audio between those two files). The global solution uses other pairs to infer reasonable offsets.
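Since detect_outliers lives in utils.py and is not shown here, a sketch of what such a helper might look like (names and test data are assumptions):

```python
def detect_outliers(pairwise, optimized, threshold=0.5):
    """Pairs whose measured offset disagrees with the global solution."""
    return [(a, b) for (a, b), (d, _conf) in pairwise.items()
            if abs(optimized[b] - optimized[a] - d) > threshold]

# The A-C measurement (5.2s) disagrees with the optimized solution (3.0s):
optimized = {'A': 0.0, 'B': 1.0, 'C': 3.0}
pairwise = {('A', 'B'): (1.0, 0.9), ('A', 'C'): (5.2, 0.4)}
flagged = detect_outliers(pairwise, optimized)  # [('A', 'C')]
```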

Performance Characteristics

Computational Complexity

FFT: O(N log N) where N = next_pow2(2 · window_samples)
For 30s window at 44.1kHz:
  • Samples: 30 × 44100 = 1,323,000
  • Padded: next_pow2(2 × 1,323,000) = 2^22 = 4,194,304
  • FFT ops: ~92 million (n log₂ n)
Typical runtime: 100-300ms per pair

Memory Usage

# 5-minute mono audio at 44.1 kHz:
samples = 5 * 60 * 44100      # 13,230,000
bytes = samples * 4           # float32: ~52.9 MB

# For 4 videos:
total_memory ≈ 4 × 53 MB ≈ 212 MB
All audio is loaded into memory for fast pairwise processing. For large datasets (>10 videos or >30 min each), consider reducing window_sec.

Robustness Features

  • Bandpass Filtering: 300-5000 Hz focus removes DC drift and high-frequency noise
  • Phase Transform: normalizes magnitude differences (mic placement, gain)
  • Confidence Thresholding: skips low-confidence pairs (default min=0.2)
  • Weighted Optimization: high-confidence pairs dominate the solution
  • Outlier-Robust Loss: soft L1 reduces the impact of misaligned pairs
  • Sample Rate Normalization: resamples all audio to a common rate

Common Issues

WARNING: Low confidence (0.25) - sync may be unreliable
Causes:
  • No shared audio events in the windowed segment
  • Severe audio clipping or distortion
  • Different acoustic environments (outdoor vs indoor)
Solutions:
  • Increase window_sec to 60s or use full audio (window_sec=None)
  • Verify audio tracks actually overlap in time
  • Use visual sync as fallback
WARNING: Offset (9.8s) near search boundary - may be truncated
Cause: True offset exceeds the max_offset_sec limit.
Solution: Increase max_offset_sec (default 10s):
estimate_offsets_robust(audio_dir, max_offset_sec=30.0)
ValueError: No valid pairwise offsets found - all pairs below confidence threshold
Cause: min_confidence too strict, or genuinely no shared audio.
Solutions:
  • Lower min_confidence from 0.2 to 0.1
  • Check that videos actually have audio (ffprobe -i video.mp4)
  • Use visual sync instead

Configuration

Entry Point

# audio_sync.py:241-291
def estimate_offsets_robust(audio_dir: str, 
                           max_offset_sec: float = 10.0,
                           window_sec: Optional[float] = 30.0,
                           min_confidence: float = 0.2,
                           outlier_threshold: float = 0.5) -> Dict[str, float]:
    """
    Robust offset estimation using pairwise alignment + global optimization.
    """

Parameters

  • audio_dir (str, required): Directory containing WAV files (extracted from videos)
  • max_offset_sec (float, default 10.0): Maximum expected offset between any two files. Search range = ±max_offset_sec.
  • window_sec (float | None, default 30.0): Use only first N seconds of audio for speed. Set to None to use full audio.
  • min_confidence (float, default 0.2): Skip pairs with confidence below this threshold. Range: 0.0 to 1.0.
  • outlier_threshold (float, default 0.5): Flag pairs with residual error > N seconds after optimization.

Source Code Reference

Key functions in src/audio_sync.py:
  • Line 27: compute_gcc_phat() - Core GCC-PHAT algorithm
  • Line 120: compute_pairwise_offsets() - All-pairs alignment
  • Line 199: optimize_offsets() - Global least-squares solver
  • Line 241: estimate_offsets_robust() - Main entry point

Next Steps

  • Visual Sync: learn about motion-based alignment
  • Offset Semantics: understand how offsets are applied
