The evaluation suite runs in four independent stages. Each stage produces artifacts used by subsequent stages, but you can re-run later stages without regenerating earlier outputs.

Prerequisites

Source Videos

Place 4 original videos (.mp4, .mov, or .avi) in evaluation/originals/.
  • Videos should be at least 10 seconds long
  • At least one video should contain audio for audio sync evaluation
  • Varied motion levels and audio characteristics enable better sensitivity analysis
Ensure FFmpeg is installed and accessible in your system PATH. The pipeline uses ffmpeg and ffprobe for video manipulation and metadata extraction.

Stage 1: Offset Generation

Generate synthetic test cases by applying known temporal shifts to each original video.

Command

python -m evaluation.offset_generation

What It Does

For each original video and each offset in the offset list [-1000, -500, -100, +100, +500, +1000] ms:
Prepends black frames and silent audio to simulate a delayed start.

FFmpeg Strategy:
  • Uses color and aevalsrc filters to generate black video + silence
  • Concatenates the padding with the original using filter_complex
  • Example: +500ms → prepend 0.5 seconds of black/silence
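The padding step can be sketched as a command builder. This is a hypothetical helper, not the script's actual code: the function name, resolution, and frame rate here are assumptions (the real pipeline would derive them from the source video via ffprobe).

```python
import subprocess


def build_pad_command(src, dst, offset_ms, width=1280, height=720, fps=30):
    """Build an ffmpeg command that prepends black video and silence.

    Hypothetical sketch: assumes a positive offset (delayed start) and
    that width/height/fps match the source video.
    """
    pad_sec = offset_ms / 1000.0
    filter_complex = (
        # Generate pad_sec seconds of black video and silence...
        f"color=c=black:s={width}x{height}:r={fps}:d={pad_sec}[blk];"
        f"aevalsrc=0:d={pad_sec}[sil];"
        # ...then concatenate the padding segment with the original.
        f"[blk][sil][0:v][0:a]concat=n=2:v=1:a=1[outv][outa]"
    )
    return [
        "ffmpeg", "-y", "-i", src,
        "-filter_complex", filter_complex,
        "-map", "[outv]", "-map", "[outa]",
        "-preset", "ultrafast",
        dst,
    ]


# To actually run it: subprocess.run(build_pad_command(...), check=True)
```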

Sensitivity Tag Computation

For each original video, the script computes three metadata tags:
1. Video Duration

Uses ffprobe to extract total duration in seconds.
2. Motion Level

Samples the first 300 frames, computes mean frame-to-frame pixel difference, normalized to [0, 1].
  • Low motion (< 0.05): Static shots, minimal movement
  • High motion (> 0.2): Fast-paced action, camera motion
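Assuming decoded grayscale frames as uint8 numpy arrays (e.g. from OpenCV), the motion tag can be computed roughly as follows. This is a sketch of the idea, not the pipeline's exact code:

```python
import numpy as np


def motion_level(frames, max_frames=300):
    """Mean frame-to-frame absolute pixel difference, scaled to [0, 1].

    Sketch: `frames` is an iterable of uint8 grayscale arrays; the
    pipeline's exact sampling and normalization may differ.
    """
    frames = list(frames)[:max_frames]
    diffs = [
        # Per-pair mean absolute difference, normalized by the 8-bit range.
        np.mean(np.abs(a.astype(np.float32) - b.astype(np.float32))) / 255.0
        for a, b in zip(frames, frames[1:])
    ]
    return float(np.mean(diffs)) if diffs else 0.0
```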
3. Audio Energy Level

Extracts the first 10 seconds of audio, computes RMS energy, normalized to [0, 1] (capped at typical speech RMS ~0.15).
  • Low energy (< 0.25): Quiet ambient sound
  • High energy (> 0.75): Loud speech, music, or effects
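The energy tag reduces to an RMS computation. A minimal sketch, assuming float samples in [-1, 1] and the ~0.15 speech-RMS cap described above (the pipeline's exact window and cap may differ):

```python
import numpy as np


def audio_energy_level(samples, cap=0.15):
    """RMS energy of float audio in [-1, 1], normalized by a typical
    speech RMS (~0.15) and clipped to [0, 1]. Illustrative sketch only.
    """
    rms = float(np.sqrt(np.mean(np.square(samples))))
    return min(rms / cap, 1.0)
```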

Output

video_id,synthetic_file_path,true_offset_ms,video_length_sec,motion_level,audio_energy_level
video_a,evaluation/synthetic/video_a_offset_-1000ms.mp4,-1000,45.2,0.123456,0.567890
video_a,evaluation/synthetic/video_a_offset_-500ms.mp4,-500,45.2,0.123456,0.567890
video_a,evaluation/synthetic/video_a_offset_-100ms.mp4,-100,45.2,0.123456,0.567890
video_a,evaluation/synthetic/video_a_offset_+100ms.mp4,100,45.2,0.123456,0.567890
video_a,evaluation/synthetic/video_a_offset_+500ms.mp4,500,45.2,0.123456,0.567890
video_a,evaluation/synthetic/video_a_offset_+1000ms.mp4,1000,45.2,0.123456,0.567890
# ... 18 more rows (3 additional videos × 6 offsets)
This stage takes ~2-5 minutes per video depending on length and encoding speed. The ultrafast preset is used to prioritize speed over compression.

Stage 2: Batch Synchronization

Run both audio (GCC-PHAT) and visual (motion-based) synchronization on every synthetic test case.

Command

python -m evaluation.run_batch

What It Does

For each row in synthetic_metadata.csv:
1. Load Test Case

  • Locate the original video in evaluation/originals/
  • Load the corresponding synthetic video from the metadata CSV
2. Run Audio Sync

  1. Extract audio from both videos using src.preprocess.extract_audio_from_videos
  2. Compute GCC-PHAT cross-correlation using src.audio_sync.estimate_offsets_robust
  3. Extract peak confidence score from the cross-correlation function
  4. Measure runtime and peak CPU/memory usage via ResourceMonitor
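The GCC-PHAT core of step 2 fits in a few lines of numpy. This is a minimal sketch; `src.audio_sync.estimate_offsets_robust` presumably layers robustness (windowing, outlier handling) on top of something like it:

```python
import numpy as np


def gcc_phat_offset(sig, ref, fs):
    """Estimate the delay of `sig` relative to `ref` in seconds using
    GCC-PHAT, plus the peak correlation value as a confidence score.
    Minimal sketch, not the project's actual implementation.
    """
    n = len(sig) + len(ref)
    # Cross-power spectrum with PHAT weighting: keep only phase.
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    R /= np.abs(R) + 1e-15
    cc = np.fft.irfft(R, n=n)
    # Re-center so index 0 corresponds to lag -max_shift.
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    peak = int(np.argmax(np.abs(cc)))
    confidence = float(np.abs(cc[peak]))
    return (peak - max_shift) / fs, confidence
```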
3. Run Visual Sync

  1. Extract motion energy timeseries from both videos
  2. Compute cross-correlation using src.visual_sync.sync_videos_by_motion
  3. Extract peak confidence score from motion correlation
  4. Save motion signals to diagnostics/*.npz for later visualization
  5. Measure runtime and peak CPU/memory usage
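The visual path correlates two 1-D motion-energy series. The sketch below is an illustrative stand-in for `src.visual_sync.sync_videos_by_motion`, assuming one motion value per frame:

```python
import numpy as np


def motion_offset(sig, ref, fps):
    """Lag (seconds) that best aligns two per-frame motion-energy
    series via cross-correlation, plus a rough confidence score.
    Sketch only; the real function's normalization may differ.
    """
    # Standardize so the correlation is amplitude-independent.
    sig = (sig - sig.mean()) / (sig.std() + 1e-12)
    ref = (ref - ref.mean()) / (ref.std() + 1e-12)
    cc = np.correlate(sig, ref, mode="full")
    lag = int(np.argmax(cc)) - (len(ref) - 1)
    confidence = float(cc.max() / min(len(sig), len(ref)))
    return lag / fps, confidence
```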
4. Record Results

Append two rows to results.csv (one for audio, one for visual) with:
  • estimated_offset_ms: Synchronization algorithm’s estimate
  • absolute_error_ms: |estimated_offset_ms - true_offset_ms|
  • confidence_score: Peak correlation value
  • runtime_seconds: Total processing time
  • peak_cpu_percent, peak_memory_mb: Resource usage

Resource Monitoring

The ResourceMonitor context manager samples CPU and memory usage every 200ms in a background thread:
evaluation/run_batch.py
with ResourceMonitor() as monitor:
    t0 = time.time()
    # ... run synchronization ...
    runtime = time.time() - t0

print(f"Peak CPU: {monitor.peak_cpu_percent}%")
print(f"Peak Memory: {monitor.peak_memory_mb} MB")

Output

video_id,true_offset_ms,method_type,estimated_offset_ms,absolute_error_ms,confidence_score,runtime_seconds,peak_cpu_percent,peak_memory_mb,video_length_sec,motion_level,audio_energy_level
video_a,-1000,audio,-998.5,1.5,0.9234,3.421,45.2,312.4,45.2,0.123456,0.567890
video_a,-1000,visual,-1003.2,3.2,0.8765,5.123,38.1,256.7,45.2,0.123456,0.567890
video_a,-500,audio,-501.1,1.1,0.9456,3.389,44.8,308.2,45.2,0.123456,0.567890
video_a,-500,visual,-497.8,2.2,0.8912,5.034,37.9,253.1,45.2,0.123456,0.567890
# ... 44 more rows
This stage takes ~5-10 minutes per test case depending on video length. The total runtime for 24 cases is typically 2-4 hours.

Stage 3: Metrics Computation

Aggregate results into statistical summaries across five categories.

Command

python -m evaluation.compute_metrics

What It Does

Per-method and per-offset analysis of synchronization error:
  • MAE (Mean Absolute Error)
  • RMSE (Root Mean Square Error)
  • Median Error
  • Max Error
  • Per-Offset MAE: Breakdown by each of the 6 offset magnitudes
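These aggregates can be reproduced from the `absolute_error_ms` column with numpy. A sketch whose dictionary keys mirror the field names used in the metrics output (the script's actual code may differ):

```python
import numpy as np


def accuracy_metrics(errors_ms):
    """Aggregate signed or absolute errors (ms) into summary statistics."""
    e = np.abs(np.asarray(errors_ms, dtype=float))
    return {
        "mae_ms": float(e.mean()),
        "rmse_ms": float(np.sqrt(np.mean(e ** 2))),
        "median_error_ms": float(np.median(e)),
        "max_error_ms": float(e.max()),
        "count": int(e.size),
    }
```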

Output

{
  "accuracy": {
    "audio": {
      "mae_ms": 12.34,
      "rmse_ms": 15.67,
      "median_error_ms": 10.5,
      "max_error_ms": 45.2,
      "count": 24,
      "mae_per_offset_ms": {
        "-1000": 8.2,
        "-500": 11.3,
        "-100": 15.7,
        "100": 14.2,
        "500": 10.8,
        "1000": 9.1
      }
    },
    "visual": { /* ... */ }
  },
  "cross_method_agreement": {
    "mean_audio_video_diff_ms": 23.45,
    "median_audio_video_diff_ms": 18.2,
    "pct_within_threshold": 87.5,
    "threshold_ms": 100.0,
    "n_pairs": 24
  },
  "confidence_validation": { /* ... */ },
  "efficiency": { /* ... */ },
  "resource_usage": { /* ... */ },
  "grouped_by_tag": { /* ... */ }
}
The summary table is printed to the console for quick reference. The full structured metrics are available in metrics_summary.json for programmatic access.
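The cross-method agreement block can be computed from paired per-case estimates. A sketch whose field names mirror the JSON excerpt above (the script's actual code may differ):

```python
import numpy as np


def cross_method_agreement(audio_est_ms, visual_est_ms, threshold_ms=100.0):
    """Per-case agreement between audio and visual offset estimates."""
    diff = np.abs(np.asarray(audio_est_ms) - np.asarray(visual_est_ms))
    return {
        "mean_audio_video_diff_ms": float(diff.mean()),
        "median_audio_video_diff_ms": float(np.median(diff)),
        # Share of cases where the two methods agree within the threshold.
        "pct_within_threshold": float(100.0 * np.mean(diff <= threshold_ms)),
        "threshold_ms": threshold_ms,
        "n_pairs": int(diff.size),
    }
```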

Stage 4: Visualization

Generate 8 types of publication-ready plots from the results.

Command

python -m evaluation.visualize_results

What It Does

Reads results.csv and produces high-DPI (300 DPI) PNG files in evaluation/plots/:

Error vs Offset

Grouped bar chart of MAE by offset magnitude and method

Confidence vs Error

Scatter plot with regression lines showing confidence reliability

Audio-Video Diff Histogram

Distribution of cross-method estimate differences

Runtime Comparison

Bar chart of mean runtime by method

Error Distribution

Boxplot of error by method and offset with overlaid scatter points

Resource Usage

Dual bar chart of peak CPU and memory by method

Motion Before/After

Per-case overlay of motion signals before and after alignment

Sync Timelines

Per-case timeline diagrams with offset arrows (pad/trim)

Output

evaluation/plots/
├── error_vs_offset.png
├── confidence_vs_error.png
├── audio_video_diff_histogram.png
├── runtime_comparison.png
├── error_distribution_boxplot.png
├── resource_usage.png
├── before_after/
│   ├── video_a_offset-1000.png
│   ├── video_a_offset-500.png
│   └── ... (24 files total)
└── timelines/
    ├── video_a_offset-1000.png
    ├── video_a_offset-500.png
    └── ... (24 files total)
All plots use a clean, publication-ready style with consistent colors (blue for audio, orange for visual) and disabled top/right spines for a modern look.

Troubleshooting

Symptom: FileNotFoundError: [Errno 2] No such file or directory: 'ffmpeg'
Solution: Install FFmpeg and ensure it’s in your system PATH:
  • macOS: brew install ffmpeg
  • Windows: Download from ffmpeg.org and add to PATH
  • Linux: sudo apt install ffmpeg
Symptom: RuntimeError: Original video has no audio stream
Solution: Audio sync requires both videos to have audio tracks. Either:
  • Use videos with audio for all source files, or
  • Skip audio-less videos (they’ll be omitted from audio method results)
Symptom: MemoryError or system slowdown during run_batch.py
Solution: The batch runner processes videos sequentially and cleans up temp files after each case. If memory usage is still high:
  • Close other applications
  • Reduce the number of source videos
  • Process videos in smaller batches by editing the metadata CSV
Symptom: Some plots are skipped or show no data
Solution:
  • before_after plots: Require diagnostics .npz files from run_batch.py. Re-run batch sync if missing.
  • Resource usage plot: Requires peak_cpu_percent and peak_memory_mb columns in results.csv. Re-run batch sync with the latest version.

Next Steps

Metrics Reference

Complete documentation of all computed metrics

Visualization Gallery

Detailed descriptions and examples of all plot types
