Audio synchronization uses GCC-PHAT (Generalized Cross-Correlation with Phase Transform) to align videos by correlating their audio tracks. This method provides sub-millisecond precision when videos share common sound events.
GCC-PHAT is a frequency-domain cross-correlation technique that emphasizes phase information while de-emphasizing magnitude. This makes it robust to differences in microphone placement, gain, and frequency response.
```python
# audio_sync.py:27-118
def compute_gcc_phat(sig_a: np.ndarray,
                     sig_b: np.ndarray,
                     fs: int,
                     max_offset_sec: float = 10.0,
                     window_sec: Optional[float] = None) -> Tuple[float, float]:
    """
    Compute time offset between two signals using GCC-PHAT.

    Args:
        sig_a: Reference signal
        sig_b: Signal to align
        fs: Sample rate (Hz)
        max_offset_sec: Maximum expected offset (default 10s)
        window_sec: Use only first N seconds for speed (default: use all)

    Returns:
        (offset_seconds, confidence_score)
        offset_seconds is the amount to add to sig_b timestamps to align to sig_a.
    """
```
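The correlation core behind this signature can be sketched as follows. This is a minimal illustration of the GCC-PHAT math, not the file's actual body; the confidence score and the `max_offset_sec` clamp are omitted, and `gcc_phat_delay` is a name chosen here for illustration:

```python
import numpy as np

def gcc_phat_delay(sig: np.ndarray, refsig: np.ndarray, fs: int) -> float:
    """Delay (seconds) of `sig` relative to `refsig` via GCC-PHAT.

    A positive result means `sig` is a delayed copy of `refsig`.
    """
    n = len(sig) + len(refsig)        # zero-pad to avoid circular wrap-around
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(refsig, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-15            # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    # Rearrange so index 0 corresponds to lag -max_shift
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs
```

Because the phase transform whitens the spectrum, the correlation peak for broadband audio collapses to a near-impulse, which is what gives the method its precision.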
### 1. Bandpass Filtering

Focuses on the speech/ambient range, removing DC offset and high-frequency noise.

Why? Most relevant audio content (speech, footsteps, ambient sound) falls in this range. Filtering improves SNR and reduces the impact of low-frequency rumble and ultrasonic noise.
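The filtering step can be sketched with SciPy as follows; the 80–4000 Hz band, the filter order, and the `bandpass` name are illustrative assumptions, since the excerpt does not show the actual cutoffs:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(sig: np.ndarray, fs: int,
             low_hz: float = 80.0, high_hz: float = 4000.0) -> np.ndarray:
    """Zero-phase band-pass filter (cutoffs here are illustrative)."""
    # Second-order sections keep the filter numerically stable at low cutoffs.
    sos = butter(4, [low_hz, high_hz], btype='bandpass', fs=fs, output='sos')
    # Forward-backward filtering has zero phase shift, so it cannot bias
    # the measured time offset the way a causal filter's group delay could.
    return sosfiltfilt(sos, sig)
```

The zero-phase property matters here: any phase shift introduced by the filter would show up directly as an error in the estimated offset.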
### 2. Windowing (Optional)
```python
# audio_sync.py:48-51
if window_sec is not None:
    window_samples = int(window_sec * fs)
    sig_a = sig_a[:window_samples]
    sig_b = sig_b[:window_samples]
```
```python
# audio_sync.py:143-151
signals = {}
sample_rates = {}
for w in wavs:
    path = os.path.join(audio_dir, w)
    sig, sr = load_audio(path)
    signals[w] = sig
    sample_rates[w] = sr
```
All audio files loaded into memory for fast pairwise processing.
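`load_audio` itself is not shown in the excerpt; only its `(signal, sample_rate)` return shape is visible above. A plausible stdlib-only sketch for 16-bit PCM WAV input (the mono downmix and the normalization to [-1, 1] are assumptions):

```python
import wave
import numpy as np

def load_audio(path):
    """Hypothetical loader: 16-bit PCM WAV -> (float32 mono signal, sample rate).

    `path` may be a filename or a file-like object, as `wave.open` accepts both.
    """
    with wave.open(path, 'rb') as wf:
        sr = wf.getframerate()
        raw = wf.readframes(wf.getnframes())
        sig = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
        if wf.getnchannels() > 1:
            # Downmix interleaved channels to mono by averaging.
            sig = sig.reshape(-1, wf.getnchannels()).mean(axis=1)
    return sig, sr
```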
```python
# audio_sync.py:154-169
ref_sr = max(set(sample_rates.values()), key=list(sample_rates.values()).count)
for w in wavs:
    if sample_rates[w] != ref_sr:
        sig = signals[w]
        sr = sample_rates[w]
        duration = len(sig) / sr
        new_len = int(round(duration * ref_sr))
        signals[w] = np.interp(
            np.linspace(0, len(sig), new_len, endpoint=False),
            np.arange(len(sig)),
            sig
        ).astype(np.float32)
```
All signals resampled to most common sample rate (typically 44100 Hz or 48000 Hz) for consistent correlation.
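The interpolation above can be factored into a small helper for clarity (`resample_linear` is a name chosen here, not one from the file). Linear interpolation preserves timing well for this purpose, though a polyphase resampler such as `scipy.signal.resample_poly` would introduce less aliasing:

```python
import numpy as np

def resample_linear(sig: np.ndarray, sr: int, target_sr: int) -> np.ndarray:
    """Resample via linear interpolation, mirroring the np.interp approach above."""
    new_len = int(round(len(sig) / sr * target_sr))
    return np.interp(
        np.linspace(0, len(sig), new_len, endpoint=False),  # fractional indices
        np.arange(len(sig)),                                # original indices
        sig
    ).astype(np.float32)
```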
```python
# audio_sync.py:176-193
for i, file_a in enumerate(wavs):
    for j, file_b in enumerate(wavs[i+1:], start=i+1):
        sig_a = signals[file_a]
        sig_b = signals[file_b]
        offset, conf = compute_gcc_phat(
            sig_a, sig_b, ref_sr,
            max_offset_sec=max_offset_sec,
            window_sec=window_sec
        )
        if conf >= min_confidence:
            pairwise[(file_a, file_b)] = (offset, conf)
        else:
            logger.warning("SKIPPED (confidence=%.3f)", conf)
```
Raw pairwise offsets may be mutually inconsistent. For example, A→B = 1s, B→C = 2s, C→A = -2.5s violates cycle consistency: offsets around any closed loop must sum to zero, but these sum to 0.5s. Global optimization finds the best-fit offsets:
```python
# audio_sync.py:199-239
def optimize_offsets(pairwise: Dict[Tuple[str, str], Tuple[float, float]],
                     wavs: List[str]) -> Dict[str, float]:
    """
    Find globally consistent offsets using weighted least-squares.

    Minimizes: Σ w_AB * (offset_B - offset_A - d_AB)²
    """
    def residuals(offsets):
        res = []
        for (file_a, file_b), (d_ab, conf) in pairwise.items():
            i = file_to_idx[file_a]
            j = file_to_idx[file_b]
            # offset_B - offset_A should equal d_AB
            error = offsets[j] - offsets[i] - d_ab
            res.append(np.sqrt(conf) * error)  # Weight by sqrt(confidence)
        return np.array(res)

    x0 = np.zeros(n)
    result = least_squares(residuals, x0, loss='soft_l1', f_scale=0.1)
    offsets_opt = result.x - result.x[0]  # Anchor first file to 0
```
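A self-contained sketch of this step, with the names the excerpt leaves undefined (`file_to_idx`, `n`, and the return value) filled in as assumptions:

```python
import numpy as np
from scipy.optimize import least_squares

def solve_offsets(pairwise, wavs):
    """Globally consistent per-file offsets from noisy pairwise measurements."""
    file_to_idx = {w: i for i, w in enumerate(wavs)}
    n = len(wavs)

    def residuals(offsets):
        res = []
        for (file_a, file_b), (d_ab, conf) in pairwise.items():
            # offset_B - offset_A should equal the measured delta d_AB
            err = offsets[file_to_idx[file_b]] - offsets[file_to_idx[file_a]] - d_ab
            res.append(np.sqrt(conf) * err)   # weight residuals by confidence
        return np.array(res)

    # soft_l1 loss keeps a single bad pairwise measurement from
    # dragging every offset away from the consistent solution.
    result = least_squares(residuals, np.zeros(n), loss='soft_l1', f_scale=0.1)
    anchored = result.x - result.x[0]         # anchor the first file at 0
    return dict(zip(wavs, anchored))
```

When the pairwise deltas are consistent, the solver recovers them exactly; when a cycle fails to close, the robust loss spreads the discrepancy across the weakest measurements.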
Outliers indicate pairwise sync failures (e.g., no shared audio between those two files). The global solution uses other pairs to infer reasonable offsets.