The evaluation suite runs in four independent stages. Each stage produces artifacts used by subsequent stages, but you can re-run later stages without regenerating earlier outputs.

Prerequisites

Source Videos

Place 4 original videos (.mp4, .mov, or .avi) in evaluation/originals/.
  • Videos should be at least 10 seconds long
  • At least one video should contain audio for audio sync evaluation
  • Varied motion levels and audio characteristics enable better sensitivity analysis
Ensure FFmpeg is installed and accessible in your system PATH. The pipeline uses ffmpeg and ffprobe for video manipulation and metadata extraction.

Stage 1: Offset Generation

Generate synthetic test cases by applying known temporal shifts to each original video.

Command

python -m evaluation.offset_generation

What It Does

For each original video and each offset in the offset list [-1000, -500, -100, +100, +500, +1000] ms:
Prepends black frames and silent audio to simulate a delayed start.

FFmpeg Strategy:
  • Uses color and aevalsrc filters to generate black video + silence
  • Concatenates the padding with the original using filter_complex
  • Example: +500ms → prepend 0.5 seconds of black/silence
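The padding step can be sketched as a command builder. This is a hypothetical helper, not the script's actual code: the function name, resolution, and frame rate here are assumptions (the real pipeline would derive them from the source video via ffprobe).

```python
import subprocess


def build_pad_command(src, dst, offset_ms, width=1280, height=720, fps=30):
    """Build an ffmpeg command that prepends black video and silence.

    Hypothetical sketch: assumes a positive offset (delayed start) and
    that width/height/fps match the source video.
    """
    pad_sec = offset_ms / 1000.0
    filter_complex = (
        # Generate pad_sec seconds of black video and silence...
        f"color=c=black:s={width}x{height}:r={fps}:d={pad_sec}[blk];"
        f"aevalsrc=0:d={pad_sec}[sil];"
        # ...then concatenate the padding segment with the original.
        f"[blk][sil][0:v][0:a]concat=n=2:v=1:a=1[outv][outa]"
    )
    return [
        "ffmpeg", "-y", "-i", src,
        "-filter_complex", filter_complex,
        "-map", "[outv]", "-map", "[outa]",
        "-preset", "ultrafast",
        dst,
    ]


# To actually run it: subprocess.run(build_pad_command(...), check=True)
```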

Sensitivity Tag Computation

For each original video, the script computes three metadata tags:
1. Video Duration

Uses ffprobe to extract total duration in seconds.
2. Motion Level

Samples the first 300 frames, computes mean frame-to-frame pixel difference, normalized to [0, 1].
  • Low motion (< 0.05): Static shots, minimal movement
  • High motion (> 0.2): Fast-paced action, camera motion
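Assuming decoded grayscale frames as uint8 numpy arrays (e.g. from OpenCV), the motion tag can be computed roughly as follows. This is a sketch of the idea, not the pipeline's exact code:

```python
import numpy as np


def motion_level(frames, max_frames=300):
    """Mean frame-to-frame absolute pixel difference, scaled to [0, 1].

    Sketch: `frames` is an iterable of uint8 grayscale arrays; the
    pipeline's exact sampling and normalization may differ.
    """
    frames = list(frames)[:max_frames]
    diffs = [
        # Per-pair mean absolute difference, normalized by the 8-bit range.
        np.mean(np.abs(a.astype(np.float32) - b.astype(np.float32))) / 255.0
        for a, b in zip(frames, frames[1:])
    ]
    return float(np.mean(diffs)) if diffs else 0.0
```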
3. Audio Energy Level

Extracts the first 10 seconds of audio, computes RMS energy, normalized to [0, 1] (capped at typical speech RMS ~0.15).
  • Low energy (< 0.25): Quiet ambient sound
  • High energy (> 0.75): Loud speech, music, or effects
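The energy tag reduces to an RMS computation. A minimal sketch, assuming float samples in [-1, 1] and the ~0.15 speech-RMS cap described above (the pipeline's exact window and cap may differ):

```python
import numpy as np


def audio_energy_level(samples, cap=0.15):
    """RMS energy of float audio in [-1, 1], normalized by a typical
    speech RMS (~0.15) and clipped to [0, 1]. Illustrative sketch only.
    """
    rms = float(np.sqrt(np.mean(np.square(samples))))
    return min(rms / cap, 1.0)
```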

Output

video_id,synthetic_file_path,true_offset_ms,video_length_sec,motion_level,audio_energy_level
video_a,evaluation/synthetic/video_a_offset_-1000ms.mp4,-1000,45.2,0.123456,0.567890
video_a,evaluation/synthetic/video_a_offset_-500ms.mp4,-500,45.2,0.123456,0.567890
video_a,evaluation/synthetic/video_a_offset_-100ms.mp4,-100,45.2,0.123456,0.567890
video_a,evaluation/synthetic/video_a_offset_+100ms.mp4,100,45.2,0.123456,0.567890
video_a,evaluation/synthetic/video_a_offset_+500ms.mp4,500,45.2,0.123456,0.567890
video_a,evaluation/synthetic/video_a_offset_+1000ms.mp4,1000,45.2,0.123456,0.567890
# ... 18 more rows (3 additional videos × 6 offsets)
This stage takes ~2-5 minutes per video depending on length and encoding speed. The ultrafast preset is used to prioritize speed over compression.

Stage 2: Batch Synchronization

Run both audio (GCC-PHAT) and visual (motion-based) synchronization on every synthetic test case.

Command

python -m evaluation.run_batch

What It Does

For each row in synthetic_metadata.csv:
1. Load Test Case

  • Locate the original video in evaluation/originals/
  • Load the corresponding synthetic video from the metadata CSV
2. Run Audio Sync

  1. Extract audio from both videos using src.preprocess.extract_audio_from_videos
  2. Compute GCC-PHAT cross-correlation using src.audio_sync.estimate_offsets_robust
  3. Extract peak confidence score from the cross-correlation function
  4. Measure runtime and peak CPU/memory usage via ResourceMonitor
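The GCC-PHAT core of step 2 fits in a few lines of numpy. This is a minimal sketch; `src.audio_sync.estimate_offsets_robust` presumably layers robustness (windowing, outlier handling) on top of something like it:

```python
import numpy as np


def gcc_phat_offset(sig, ref, fs):
    """Estimate the delay of `sig` relative to `ref` in seconds using
    GCC-PHAT, plus the peak correlation value as a confidence score.
    Minimal sketch, not the project's actual implementation.
    """
    n = len(sig) + len(ref)
    # Cross-power spectrum with PHAT weighting: keep only phase.
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    R /= np.abs(R) + 1e-15
    cc = np.fft.irfft(R, n=n)
    # Re-center so index 0 corresponds to lag -max_shift.
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    peak = int(np.argmax(np.abs(cc)))
    confidence = float(np.abs(cc[peak]))
    return (peak - max_shift) / fs, confidence
```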
3. Run Visual Sync

  1. Extract motion energy timeseries from both videos
  2. Compute cross-correlation using src.visual_sync.sync_videos_by_motion
  3. Extract peak confidence score from motion correlation
  4. Save motion signals to diagnostics/*.npz for later visualization
  5. Measure runtime and peak CPU/memory usage
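The visual path correlates two 1-D motion-energy series. The sketch below is an illustrative stand-in for `src.visual_sync.sync_videos_by_motion`, assuming one motion value per frame:

```python
import numpy as np


def motion_offset(sig, ref, fps):
    """Lag (seconds) that best aligns two per-frame motion-energy
    series via cross-correlation, plus a rough confidence score.
    Sketch only; the real function's normalization may differ.
    """
    # Standardize so the correlation is amplitude-independent.
    sig = (sig - sig.mean()) / (sig.std() + 1e-12)
    ref = (ref - ref.mean()) / (ref.std() + 1e-12)
    cc = np.correlate(sig, ref, mode="full")
    lag = int(np.argmax(cc)) - (len(ref) - 1)
    confidence = float(cc.max() / min(len(sig), len(ref)))
    return lag / fps, confidence
```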
4. Record Results

Append two rows to results.csv (one for audio, one for visual) with:
  • estimated_offset_ms: Synchronization algorithm’s estimate
  • absolute_error_ms: |estimated_offset_ms - true_offset_ms|
  • confidence_score: Peak correlation value
  • runtime_seconds: Total processing time
  • peak_cpu_percent, peak_memory_mb: Resource usage

Resource Monitoring

The ResourceMonitor context manager samples CPU and memory usage every 200ms in a background thread:
evaluation/run_batch.py
with ResourceMonitor() as monitor:
    t0 = time.time()
    # ... run synchronization ...
    runtime = time.time() - t0

print(f"Peak CPU: {monitor.peak_cpu_percent}%")
print(f"Peak Memory: {monitor.peak_memory_mb} MB")

Output

video_id,true_offset_ms,method_type,estimated_offset_ms,absolute_error_ms,confidence_score,runtime_seconds,peak_cpu_percent,peak_memory_mb,video_length_sec,motion_level,audio_energy_level
video_a,-1000,audio,-998.5,1.5,0.9234,3.421,45.2,312.4,45.2,0.123456,0.567890
video_a,-1000,visual,-1003.2,3.2,0.8765,5.123,38.1,256.7,45.2,0.123456,0.567890
video_a,-500,audio,-501.1,1.1,0.9456,3.389,44.8,308.2,45.2,0.123456,0.567890
video_a,-500,visual,-497.8,2.2,0.8912,5.034,37.9,253.1,45.2,0.123456,0.567890
# ... 44 more rows
This stage takes ~5-10 minutes per test case depending on video length. The total runtime for 24 cases is typically 2-4 hours.

Stage 3: Metrics Computation

Aggregate results into statistical summaries across five categories.

Command

python -m evaluation.compute_metrics

What It Does

Per-method and per-offset analysis of synchronization error:
  • MAE (Mean Absolute Error)
  • RMSE (Root Mean Square Error)
  • Median Error
  • Max Error
  • Per-Offset MAE: Breakdown by each of the 6 offset magnitudes
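These aggregates can be reproduced from the `absolute_error_ms` column with numpy. A sketch whose dictionary keys mirror the field names used in the metrics output (the script's actual code may differ):

```python
import numpy as np


def accuracy_metrics(errors_ms):
    """Aggregate signed or absolute errors (ms) into summary statistics."""
    e = np.abs(np.asarray(errors_ms, dtype=float))
    return {
        "mae_ms": float(e.mean()),
        "rmse_ms": float(np.sqrt(np.mean(e ** 2))),
        "median_error_ms": float(np.median(e)),
        "max_error_ms": float(e.max()),
        "count": int(e.size),
    }
```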

Output

{
  "accuracy": {
    "audio": {
      "mae_ms": 12.34,
      "rmse_ms": 15.67,
      "median_error_ms": 10.5,
      "max_error_ms": 45.2,
      "count": 24,
      "mae_per_offset_ms": {
        "-1000": 8.2,
        "-500": 11.3,
        "-100": 15.7,
        "100": 14.2,
        "500": 10.8,
        "1000": 9.1
      }
    },
    "visual": { /* ... */ }
  },
  "cross_method_agreement": {
    "mean_audio_video_diff_ms": 23.45,
    "median_audio_video_diff_ms": 18.2,
    "pct_within_threshold": 87.5,
    "threshold_ms": 100.0,
    "n_pairs": 24
  },
  "confidence_validation": { /* ... */ },
  "efficiency": { /* ... */ },
  "resource_usage": { /* ... */ },
  "grouped_by_tag": { /* ... */ }
}
The summary table is printed to the console for quick reference. The full structured metrics are available in metrics_summary.json for programmatic access.
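The cross-method agreement block can be computed from paired per-case estimates. A sketch whose field names mirror the JSON excerpt above (the script's actual code may differ):

```python
import numpy as np


def cross_method_agreement(audio_est_ms, visual_est_ms, threshold_ms=100.0):
    """Per-case agreement between audio and visual offset estimates."""
    diff = np.abs(np.asarray(audio_est_ms) - np.asarray(visual_est_ms))
    return {
        "mean_audio_video_diff_ms": float(diff.mean()),
        "median_audio_video_diff_ms": float(np.median(diff)),
        # Share of cases where the two methods agree within the threshold.
        "pct_within_threshold": float(100.0 * np.mean(diff <= threshold_ms)),
        "threshold_ms": threshold_ms,
        "n_pairs": int(diff.size),
    }
```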

Stage 4: Visualization

Generate 8 types of publication-ready plots from the results.

Command

python -m evaluation.visualize_results

What It Does

Reads results.csv and produces high-DPI (300 DPI) PNG files in evaluation/plots/:

Error vs Offset

Grouped bar chart of MAE by offset magnitude and method

Confidence vs Error

Scatter plot with regression lines showing confidence reliability

Audio-Video Diff Histogram

Distribution of cross-method estimate differences

Runtime Comparison

Bar chart of mean runtime by method

Error Distribution

Boxplot of error by method and offset with overlaid scatter points

Resource Usage

Dual bar chart of peak CPU and memory by method

Motion Before/After

Per-case overlay of motion signals before and after alignment

Sync Timelines

Per-case timeline diagrams with offset arrows (pad/trim)

Output

evaluation/plots/
├── error_vs_offset.png
├── confidence_vs_error.png
├── audio_video_diff_histogram.png
├── runtime_comparison.png
├── error_distribution_boxplot.png
├── resource_usage.png
├── before_after/
│   ├── video_a_offset-1000.png
│   ├── video_a_offset-500.png
│   └── ... (24 files total)
└── timelines/
    ├── video_a_offset-1000.png
    ├── video_a_offset-500.png
    └── ... (24 files total)
All plots use a clean, publication-ready style with consistent colors (blue for audio, orange for visual) and disabled top/right spines for a modern look.

Troubleshooting

Symptom: FileNotFoundError: [Errno 2] No such file or directory: 'ffmpeg'
Solution: Install FFmpeg and ensure it’s in your system PATH:
  • macOS: brew install ffmpeg
  • Windows: Download from ffmpeg.org and add to PATH
  • Linux: sudo apt install ffmpeg
Symptom: RuntimeError: Original video has no audio stream
Solution: Audio sync requires both videos to have audio tracks. Either:
  • Use videos with audio for all source files, or
  • Skip audio-less videos (they’ll be omitted from audio method results)
Symptom: MemoryError or system slowdown during run_batch.py
Solution: The batch runner processes videos sequentially and cleans up temp files after each case. If memory usage is still high:
  • Close other applications
  • Reduce the number of source videos
  • Process videos in smaller batches by editing the metadata CSV
Symptom: Some plots are skipped or show no data
Solution:
  • before_after plots: Require diagnostics .npz files from run_batch.py. Re-run batch sync if missing.
  • Resource usage plot: Requires peak_cpu_percent and peak_memory_mb columns in results.csv. Re-run batch sync with the latest version.

Next Steps

Metrics Reference

Complete documentation of all computed metrics

Visualization Gallery

Detailed descriptions and examples of all plot types
