
Overview

Audio synchronization uses GCC-PHAT (Generalized Cross-Correlation with Phase Transform) to align videos by correlating their audio tracks. This method provides sub-millisecond precision when videos share common sound events.
GCC-PHAT is a frequency-domain cross-correlation technique that emphasizes phase information while de-emphasizing magnitude. This makes it robust to differences in microphone placement, gain, and frequency response.

Algorithm: GCC-PHAT

Mathematical Foundation

Given two audio signals a(t) and b(t), GCC-PHAT computes:
1. FFT: A(f) = FFT[a(t)], B(f) = FFT[b(t)]
2. Cross-Power Spectrum: R(f) = A(f) · conj(B(f))
3. Phase Transform: R_PHAT(f) = R(f) / |R(f)|
4. IFFT: cc(τ) = IFFT[R_PHAT(f)]
5. Find peak: τ_offset = argmax(cc(τ))
The peak location in the correlation function cc(τ) indicates the time offset.
The division by magnitude |R(f)| normalizes each frequency bin to unit magnitude:
R_PHAT(f) = R(f) / |R(f)| = exp(i·θ(f))
This retains only phase information, which is more reliable than magnitude for time-delay estimation because:
  • Phase is consistent across microphone locations
  • Magnitude varies with distance, orientation, and frequency response
  • Phase differences directly encode time delays: Δφ = 2πf·Δt
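The five steps above can be sketched end-to-end in NumPy. This is an illustrative, self-contained version, not the project's API; the function name and test signal are assumptions:

```python
import numpy as np

def gcc_phat_offset(a, b, fs):
    # Pad so linear (not circular) correlation fits, then apply steps 1-5.
    n = 1 << (2 * max(len(a), len(b)) - 1).bit_length()
    A = np.fft.rfft(a, n=n)
    B = np.fft.rfft(b, n=n)
    R = A * np.conj(B)                          # cross-power spectrum
    R /= np.maximum(np.abs(R), 1e-12)           # phase transform: unit magnitude
    cc = np.fft.irfft(R, n=n)                   # correlation function
    cc = np.concatenate((cc[-(n // 2):], cc[:n // 2]))  # center the lags
    lag = int(np.argmax(np.abs(cc))) - n // 2
    return lag / fs

# Synthetic check: b is a delayed by 120 samples, so the raw peak
# lands at lag -120 (the production code then negates this, per its
# documented sign convention).
rng = np.random.default_rng(0)
fs = 8000
a = rng.standard_normal(fs)
b = np.concatenate((np.zeros(120), a))[:fs]
offset = gcc_phat_offset(a, b, fs)  # -120 / 8000 = -0.015 s
```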

Implementation

Core Function: compute_gcc_phat()

The complete algorithm in src/audio_sync.py:
# audio_sync.py:27-118
def compute_gcc_phat(sig_a: np.ndarray, sig_b: np.ndarray, fs: int, 
                     max_offset_sec: float = 10.0, 
                     window_sec: Optional[float] = None) -> Tuple[float, float]:
    """
    Compute time offset between two signals using GCC-PHAT.
    
    Args:
        sig_a: Reference signal
        sig_b: Signal to align
        fs: Sample rate (Hz)
        max_offset_sec: Maximum expected offset (default 10s)
        window_sec: Use only first N seconds for speed (default: use all)
    
    Returns:
        (offset_seconds, confidence_score)
        offset_seconds is the amount to add to sig_b timestamps to align to sig_a.
    """

Step-by-Step Breakdown

Step 1: Bandpass Filtering

# audio_sync.py:44-46
sos = butter(4, [300, 5000], btype='bandpass', fs=fs, output='sos')
sig_a = sosfilt(sos, sig_a)
sig_b = sosfilt(sos, sig_b)
Filter: 4th-order Butterworth bandpass, 300-5000 Hz. Focuses on the speech/ambient range and removes DC offset and high-frequency noise.
Why? Most relevant audio content (speech, footsteps, ambient sound) falls in this range. Filtering improves SNR and reduces the impact of low-frequency rumble and high-frequency noise.
Step 2: Windowing (Optional)

# audio_sync.py:48-51
if window_sec is not None:
    window_samples = int(window_sec * fs)
    sig_a = sig_a[:window_samples]
    sig_b = sig_b[:window_samples]
window_sec (float | None, default 30.0 at the pipeline entry point, None here): use only the first N seconds for speed. 30s is typically sufficient for alignment.
Trade-off: shorter windows mean faster computation, but require shared audio events within that window.
Step 3: Zero-Mean Normalization

# audio_sync.py:52-53
a = sig_a - np.mean(sig_a)
b = sig_b - np.mean(sig_b)
Removes DC bias before FFT.
Step 4: Zero-Padding to Power of 2

# audio_sync.py:54-55
max_len = max(len(a), len(b))
n = next_pow2(2 * max_len)  # Pad to avoid circular convolution
Why 2x? Cross-correlation requires 2·max_len - 1 output samples. Padding to next power of 2 enables efficient FFT.
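The snippet calls a helper next_pow2 that is not shown in the excerpt. A typical implementation (an assumption, not the project's verbatim code) is:

```python
def next_pow2(x: int) -> int:
    """Smallest power of two greater than or equal to x."""
    return 1 << (x - 1).bit_length()

# E.g. a 30 s window at 44.1 kHz has 1,323,000 samples;
# doubling and rounding up gives 2^22.
n = next_pow2(2 * 1_323_000)
```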
Step 5: FFT and Phase Transform

# audio_sync.py:56-62
A = fft(a, n=n)
B = fft(b, n=n)
R = A * np.conj(B)  # Cross-power spectrum
denom = np.abs(R)
denom[denom < 1e-8] = 1e-8  # Avoid division by zero
R_phat = R / denom  # Phase transform
cc = np.real(ifft(R_phat))  # Correlation function
This is the core GCC-PHAT operation.
Step 6: Lag Alignment

# audio_sync.py:63
cc = np.concatenate((cc[-(n//2):], cc[:n//2]))  # FFT shift
Rearranges correlation from [0, n) to [-n/2, n/2) for centered lags.
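For even n (always the case here, since n is a power of two), the concatenation is exactly np.fft.fftshift. A quick check with a stand-in array:

```python
import numpy as np

n = 8
cc = np.arange(float(n))                             # stand-in correlation output
centered = np.concatenate((cc[-(n // 2):], cc[:n // 2]))
assert np.array_equal(centered, np.fft.fftshift(cc))  # identical for even n
lags = np.arange(-n // 2, n // 2)                    # index i now means lag lags[i]
```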
Step 7: Constrained Peak Search

# audio_sync.py:65-72
max_lag_samples = int(max_offset_sec * fs)
center = n // 2
search_start = max(0, center - max_lag_samples)
search_end = min(len(cc), center + max_lag_samples)
search_region = cc[search_start:search_end]
lag_idx_local = np.argmax(np.abs(search_region))
lag_idx = search_start + lag_idx_local
max_offset_sec (float, default 10.0): search only within ±10s. Prevents false peaks at extreme lags.
Step 8: Sub-Sample Interpolation

# audio_sync.py:76-82
if 0 < lag_idx < len(cc) - 1:
    y1, y2, y3 = cc[lag_idx-1], cc[lag_idx], cc[lag_idx+1]
    denom_interp = 2*y2 - y1 - y3
    if abs(denom_interp) > 1e-8:
        delta = 0.5 * (y3 - y1) / denom_interp
        offset_seconds += delta / float(fs)
Parabolic interpolation refines the peak location to sub-sample precision.
At 44.1kHz sampling, each sample = 22.7μs. Interpolation achieves ~2-5μs precision.
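The three-point parabolic-vertex formula can be checked on an exact parabola (the helper name is illustrative):

```python
def parabolic_refine(y1, y2, y3):
    """Sub-sample offset of the vertex of the parabola through
    (-1, y1), (0, y2), (1, y3), where y2 is the discrete peak."""
    denom = 2 * y2 - y1 - y3
    return 0.5 * (y3 - y1) / denom if abs(denom) > 1e-8 else 0.0

# Parabola peaking at x = 0.3, sampled at integer positions:
y1, y2, y3 = (1 - (x - 0.3) ** 2 for x in (-1, 0, 1))
delta = parabolic_refine(y1, y2, y3)  # recovers 0.3 (exact for a true parabola)
```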
Step 9: Confidence Scoring

# audio_sync.py:84-98
peak = np.abs(cc[lag_idx])
window = int(0.01 * fs)  # 10ms exclusion window
exclude_start = max(0, lag_idx - window)
exclude_end = min(len(cc), lag_idx + window)

# Compute noise floor excluding peak region
mag = np.abs(cc)
noise_vals = np.concatenate((mag[:exclude_start], mag[exclude_end:]))
noise_floor = np.mean(noise_vals) if noise_vals.size > 0 else np.mean(mag)

# Normalized confidence
confidence = float(peak / (noise_floor + 1e-8))
confidence = confidence / (confidence + 1.0)  # Map to [0, 1]
Confidence Interpretation:
  • >0.7: Excellent (clear shared audio)
  • 0.3-0.7: Good (reliable alignment)
  • <0.3: Poor (low confidence warning issued)
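The scoring above can be condensed into a runnable sketch and exercised on a synthetic correlation with one sharp peak (function name and test data are illustrative):

```python
import numpy as np

def peak_confidence(cc, lag_idx, fs):
    window = int(0.01 * fs)                            # 10 ms exclusion around peak
    mag = np.abs(cc)
    lo, hi = max(0, lag_idx - window), min(len(cc), lag_idx + window)
    noise = np.concatenate((mag[:lo], mag[hi:]))
    floor = noise.mean() if noise.size > 0 else mag.mean()
    ratio = mag[lag_idx] / (floor + 1e-8)
    return float(ratio / (ratio + 1.0))                # map to [0, 1)

cc = np.full(1000, 0.01)   # flat noise floor
cc[500] = 1.0              # sharp, unambiguous peak
conf = peak_confidence(cc, 500, fs=8000)  # close to 1.0 -> "excellent"
```

A peak no higher than the noise floor scores near 0.5, and scores approach 1.0 only as the peak dominates the floor.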

Sign Convention

# audio_sync.py:118
return -offset_seconds, confidence
Note the negative sign. The offset represents how much to shift sig_b to align with sig_a. See Offset Semantics for details.

Pairwise Alignment

Instead of using a single reference video, the system computes all pairwise offsets:
# audio_sync.py:120-197
def compute_pairwise_offsets(audio_dir: str, 
                            max_offset_sec: float = 10.0,
                            window_sec: Optional[float] = 30.0,
                            min_confidence: float = 0.0) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """
    Compute offsets between all pairs of WAV files.
    
    Returns:
        Dict mapping (fileA, fileB) -> (offset_seconds, confidence)
    """

Process

# audio_sync.py:143-151
signals = {}
sample_rates = {}
for w in wavs:
    path = os.path.join(audio_dir, w)
    sig, sr = load_audio(path)
    signals[w] = sig
    sample_rates[w] = sr
All audio files loaded into memory for fast pairwise processing.
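The pairwise loop itself amounts to one measurement per unordered pair (N files give N·(N-1)/2 calls). A toy skeleton, with a stand-in measurement function in place of compute_gcc_phat:

```python
from itertools import combinations

def pairwise_offsets(signals, measure):
    """One measurement per unordered pair of files."""
    return {(a, b): measure(signals[a], signals[b])
            for a, b in combinations(sorted(signals), 2)}

# Toy stand-in: "signals" are start times and the offset is their difference.
toy = {'cam1.wav': 0.0, 'cam2.wav': 1.5, 'cam3.wav': -0.5}
offsets = pairwise_offsets(toy, lambda a, b: b - a)
# 3 files -> 3 pairs, e.g. ('cam1.wav', 'cam2.wav') -> 1.5
```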

Global Optimization

Raw pairwise offsets may be inconsistent (e.g., A→B = 1s, B→C = 2s, C→A = -2.5s: the cycle sums to 0.5s instead of 0, violating cycle consistency). Global optimization finds the best-fit offsets:
# audio_sync.py:199-239
def optimize_offsets(pairwise: Dict[Tuple[str, str], Tuple[float, float]], 
                     wavs: List[str]) -> Dict[str, float]:
    """
    Find globally consistent offsets using weighted least-squares.
    
    Minimizes: Σ w_AB * (offset_B - offset_A - d_AB)²
    """
    def residuals(offsets):
        res = []
        for (file_a, file_b), (d_ab, conf) in pairwise.items():
            i = file_to_idx[file_a]
            j = file_to_idx[file_b]
            # offset_B - offset_A should equal d_AB
            error = offsets[j] - offsets[i] - d_ab
            res.append(np.sqrt(conf) * error)  # Weight by sqrt(confidence)
        return np.array(res)
    
    x0 = np.zeros(n)
    result = least_squares(residuals, x0, loss='soft_l1', f_scale=0.1)
    offsets_opt = result.x - result.x[0]  # Anchor first file to 0

Optimization Details

For each pair (A, B) with measured offset d_AB and confidence w_AB:
Minimize: E = Σ w_AB · (offset_B - offset_A - d_AB)²
The residuals are linear in the unknown offsets, so scipy.optimize.least_squares solves this quickly and reliably, even with the robust loss applied.
res.append(np.sqrt(conf) * error)
Since we minimize squared residuals, weighting by √confidence gives effective weight of confidence in the objective.
High-confidence pairs (e.g., 0.9) have 3x more influence than low-confidence pairs (e.g., 0.3).
loss='soft_l1', f_scale=0.1
Soft L1 is a smooth approximation of the Huber loss: quadratic for small residuals, linear for large ones. In SciPy's parameterization, with z = (r / f_scale)²:
ρ(z) = 2·(√(1 + z) − 1)
Residuals much larger than f_scale (0.1s) grow only linearly in cost, so badly misaligned pairs are down-weighted rather than dominating the fit.
offsets_opt = result.x - result.x[0]
The first video is anchored to t=0 (arbitrary reference frame). All other offsets are relative to it.
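As a worked example, plain unweighted least-squares resolves the inconsistent cycle mentioned earlier (the real code additionally applies soft_l1 and confidence weights):

```python
import numpy as np
from scipy.optimize import least_squares

# Inconsistent cycle: A->B = 1s, B->C = 2s, C->A = -2.5s (sums to 0.5s, not 0).
pairs = {(0, 1): (1.0, 1.0), (1, 2): (2.0, 1.0), (2, 0): (-2.5, 1.0)}

def residuals(offsets):
    return np.array([np.sqrt(conf) * (offsets[j] - offsets[i] - d)
                     for (i, j), (d, conf) in pairs.items()])

result = least_squares(residuals, np.zeros(3))
opt = result.x - result.x[0]   # anchor the first file at t=0
# The 0.5 s inconsistency is split evenly across the three edges:
# opt ≈ [0, 5/6, 8/3] ≈ [0, 0.833, 2.667]
```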

Outlier Detection

After optimization, the system flags inconsistent pairwise measurements:
# utils.py (called from audio_sync.py:284)
outliers = detect_outliers(pairwise, optimized, threshold=0.5)
A pair (A, B) is flagged if:
|optimized[B] - optimized[A] - measured_offset_AB| > 0.5s
Outliers indicate pairwise sync failures (e.g., no shared audio between those two files). The global solution uses other pairs to infer reasonable offsets.
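Since detect_outliers lives in utils.py and is not shown here, a sketch of what such a helper might look like (names and test data are assumptions):

```python
def detect_outliers(pairwise, optimized, threshold=0.5):
    """Pairs whose measured offset disagrees with the global solution."""
    return [(a, b) for (a, b), (d, _conf) in pairwise.items()
            if abs(optimized[b] - optimized[a] - d) > threshold]

# The A-C measurement (5.2s) disagrees with the optimized solution (3.0s):
optimized = {'A': 0.0, 'B': 1.0, 'C': 3.0}
pairwise = {('A', 'B'): (1.0, 0.9), ('A', 'C'): (5.2, 0.4)}
flagged = detect_outliers(pairwise, optimized)  # [('A', 'C')]
```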

Performance Characteristics

Computational Complexity

FFT: O(N log N) where N = next_pow2(2 · window_samples)
For 30s window at 44.1kHz:
  • Samples: 30 × 44100 = 1,323,000
  • Padded: next_pow2(2 × 1,323,000) = 2^22 = 4,194,304
  • FFT ops: ~92 million (n log₂ n)
Typical runtime: 100-300ms per pair

Memory Usage

# 5-minute mono audio at 44.1 kHz:
samples = 5 * 60 * 44100      # 13,230,000
bytes = samples * 4           # float32: ~52.9 MB

# For 4 videos:
total_memory ≈ 4 × 53 MB ≈ 212 MB
All audio is loaded into memory for fast pairwise processing. For large datasets (>10 videos or >30 min each), consider reducing window_sec.

Robustness Features

  • Bandpass Filtering: 300-5000 Hz focus removes DC drift and high-frequency noise
  • Phase Transform: normalizes magnitude differences (mic placement, gain)
  • Confidence Thresholding: skips low-confidence pairs (default min=0.2)
  • Weighted Optimization: high-confidence pairs dominate the solution
  • Outlier-Robust Loss: soft L1 reduces the impact of misaligned pairs
  • Sample Rate Normalization: resamples all audio to a common rate

Common Issues

WARNING: Low confidence (0.25) - sync may be unreliable
Causes:
  • No shared audio events in the windowed segment
  • Severe audio clipping or distortion
  • Different acoustic environments (outdoor vs indoor)
Solutions:
  • Increase window_sec to 60s or use full audio (window_sec=None)
  • Verify audio tracks actually overlap in time
  • Use visual sync as fallback
WARNING: Offset (9.8s) near search boundary - may be truncated
Cause: True offset exceeds the max_offset_sec limit.
Solution: Increase max_offset_sec (default 10s):
estimate_offsets_robust(audio_dir, max_offset_sec=30.0)
ValueError: No valid pairwise offsets found - all pairs below confidence threshold
Cause: min_confidence too strict, or genuinely no shared audio.
Solutions:
  • Lower min_confidence from 0.2 to 0.1
  • Check that videos actually have audio (ffprobe -i video.mp4)
  • Use visual sync instead

Configuration

Entry Point

# audio_sync.py:241-291
def estimate_offsets_robust(audio_dir: str, 
                           max_offset_sec: float = 10.0,
                           window_sec: Optional[float] = 30.0,
                           min_confidence: float = 0.2,
                           outlier_threshold: float = 0.5) -> Dict[str, float]:
    """
    Robust offset estimation using pairwise alignment + global optimization.
    """

Parameters

  • audio_dir (str, required): Directory containing WAV files (extracted from videos)
  • max_offset_sec (float, default 10.0): Maximum expected offset between any two files. Search range = ±max_offset_sec.
  • window_sec (float | None, default 30.0): Use only first N seconds of audio for speed. Set to None to use full audio.
  • min_confidence (float, default 0.2): Skip pairs with confidence below this threshold. Range: 0.0 to 1.0.
  • outlier_threshold (float, default 0.5): Flag pairs with residual error > N seconds after optimization.

Source Code Reference

Key functions in src/audio_sync.py:
  • Line 27: compute_gcc_phat() - Core GCC-PHAT algorithm
  • Line 120: compute_pairwise_offsets() - All-pairs alignment
  • Line 199: optimize_offsets() - Global least-squares solver
  • Line 241: estimate_offsets_robust() - Main entry point

Next Steps

  • Visual Sync: learn about motion-based alignment
  • Offset Semantics: understand how offsets are applied
