The visualize_results.py script generates publication-ready plots at 300 DPI with a clean, modern style. All plots use consistent colors: blue for audio, orange for visual.

Quick Reference

  • Error vs Offset: Grouped bar chart — MAE by offset magnitude
  • Confidence vs Error: Scatter plot — Confidence score reliability
  • Audio-Video Diff Histogram: Histogram — Cross-method agreement distribution
  • Runtime Comparison: Bar chart — Mean runtime by method
  • Error Distribution Boxplot: Boxplot + scatter — Error distribution by method and offset
  • Resource Usage: Dual bar chart — Peak CPU and memory
  • Motion Before/After: Per-case overlay — Motion signals pre/post alignment
  • Sync Timelines: Per-case diagram — Timeline bars with offset arrows

Plot Style

All plots use a custom matplotlib style for consistency:
visualize_results.py:54-71
plt.rcParams.update({
    "font.family": "sans-serif",
    "font.size": 11,
    "axes.titlesize": 13,
    "axes.facecolor": "#FAFAFA",
    "axes.grid": True,
    "grid.alpha": 0.3,
    "grid.linestyle": "--",
    "axes.spines.top": False,
    "axes.spines.right": False,
})
Colors:
  • Audio: #2196F3 (Material Blue)
  • Visual: #FF9800 (Material Orange)
  • Neutral: #7E57C2 (Material Purple for cross-method plots)

1. Error vs Offset

File: error_vs_offset.png
Description: Grouped bar chart showing MAE for each true offset magnitude, split by method.

Use Cases

  • Identify if accuracy degrades at extreme offsets (e.g., ±1000 ms)
  • Compare audio vs visual performance across offset ranges
  • Detect asymmetry (e.g., better performance on positive vs negative offsets)

Implementation

visualize_results.py:78-114
def plot_error_vs_offset(df: pd.DataFrame, output_dir: str):
    offsets = sorted(df["true_offset_ms"].unique())
    x = np.arange(len(offsets))
    width = 0.35
    
    for i, method in enumerate(("audio", "visual")):
        mdf = df[df["method_type"] == method]
        maes = []
        for off in offsets:
            subset = mdf[mdf["true_offset_ms"] == off]
            maes.append(subset["absolute_error_ms"].mean() if not subset.empty else 0)
        ax.bar(x + (i - 0.5) * width, maes, width, label=method.capitalize(), color=COLORS[method])

Interpretation

MAE is roughly constant across all offsets → method is robust to offset magnitude.
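The bar heights can be reproduced straight from results.csv with a pandas groupby. A minimal sketch, assuming the column names used in the snippet above (`true_offset_ms`, `method_type`, `absolute_error_ms`):

```python
import pandas as pd

# Toy results frame using the column names from the snippet above
df = pd.DataFrame({
    "true_offset_ms":    [-500, -500, 500, 500],
    "method_type":       ["audio", "visual", "audio", "visual"],
    "absolute_error_ms": [12.0, 40.0, 8.0, 36.0],
})

# One MAE value per (offset, method) pair -- exactly the bar heights of the chart
mae = (
    df.groupby(["true_offset_ms", "method_type"])["absolute_error_ms"]
      .mean()
      .unstack("method_type")
)
```

The resulting frame (offsets as rows, methods as columns) is also a convenient table for reports.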

2. Confidence vs Error

File: confidence_vs_error.png
Description: Scatter plot of confidence score vs absolute error with linear regression lines.

Use Cases

  • Validate that confidence scores reliably predict error magnitude
  • Identify outliers (high confidence but high error, or vice versa)
  • Compare confidence calibration between audio and visual methods

Implementation

visualize_results.py:121-167
def plot_confidence_vs_error(df: pd.DataFrame, output_dir: str):
    for method in ("audio", "visual"):
        ax.scatter(mdf["confidence_score"], mdf["absolute_error_ms"], 
                   label=method.capitalize(), color=COLORS[method], alpha=0.6)
        
        # Regression line
        z = np.polyfit(confs, errors, 1)
        p = np.poly1d(z)
        ax.plot(xs, p(xs), "--", color=COLORS[method], alpha=0.7)
        
        # Annotate Pearson r
        r = np.corrcoef(confs, errors)[0, 1]
        ax.annotate(f"{method}: r={r:.3f}", ...)

Interpretation

Pearson r = -0.52 (audio)
→ Moderate-to-strong negative correlation
→ Higher confidence tends to predict lower error
→ Use confidence for filtering or weighted averaging
Outliers to Investigate:
  • High confidence, high error: False positive — method is confident but wrong (e.g., periodic motion creates multiple correlation peaks)
  • Low confidence, low error: False negative — method is uncertain but correct (e.g., weak signal but still recoverable)
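Both the Pearson r annotation and the outlier categories can be checked numerically. A minimal sketch on synthetic data, where `conf` and `err` stand in for the `confidence_score` and `absolute_error_ms` columns:

```python
import numpy as np

rng = np.random.default_rng(0)
conf = rng.uniform(0.2, 1.0, 200)                        # stand-in for confidence_score
err = 100.0 * (1.0 - conf) + rng.normal(0.0, 5.0, 200)   # error shrinks as confidence grows

# Pearson r, as annotated on the plot
r = np.corrcoef(conf, err)[0, 1]

# "High confidence, high error" suspects: top-quartile confidence AND top-quartile error
hi_conf = conf >= np.quantile(conf, 0.75)
hi_err = err >= np.quantile(err, 0.75)
suspects = np.flatnonzero(hi_conf & hi_err)
```

With well-calibrated confidence, `suspects` stays small; a growing list is the "false positive" failure mode described above.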

3. Audio-Video Diff Histogram

File: audio_video_diff_histogram.png
Description: Histogram of |audio_estimate - visual_estimate| across all test cases.

Use Cases

  • Visualize cross-method agreement distribution
  • Identify if disagreement is centered around a systematic bias or scattered
  • Assess feasibility of hybrid strategies (e.g., average both estimates if diff < threshold)

Implementation

visualize_results.py:174-212
def plot_audio_video_diff_histogram(df: pd.DataFrame, output_dir: str):
    diffs = np.abs(audio_df.loc[common].values - visual_df.loc[common].values)
    
    ax.hist(diffs, bins=20, color="#7E57C2", alpha=0.85)
    ax.axvline(np.mean(diffs), color="#D32F2F", linestyle="--", 
               label=f"Mean = {np.mean(diffs):.1f} ms")
    ax.axvline(np.median(diffs), color="#388E3C", linestyle="--",
               label=f"Median = {np.median(diffs):.1f} ms")

Interpretation

Strong agreement → both methods likely correct → safe to average estimates or use either method
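The diff distribution can also be computed without the plot, e.g. to drive a hybrid-averaging threshold. A minimal sketch, assuming per-method estimates keyed by a hypothetical `test_case` column:

```python
import pandas as pd

# Toy per-method estimates; "test_case" is a hypothetical key column
df = pd.DataFrame({
    "test_case":           ["a", "a", "b", "b", "c"],
    "method_type":         ["audio", "visual", "audio", "visual", "audio"],
    "estimated_offset_ms": [100.0, 104.0, -500.0, -512.0, 250.0],
})

# One row per case, one column per method; cases missing a method become NaN
wide = df.pivot(index="test_case", columns="method_type", values="estimated_offset_ms")
diffs = (wide["audio"] - wide["visual"]).abs().dropna()  # |audio - visual| per case
```

Case "c" has no visual estimate, so it drops out, mirroring the `common` index intersection in the snippet above.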

4. Runtime Comparison

File: runtime_comparison.png
Description: Bar chart of mean runtime per method with standard deviation error bars.

Use Cases

  • Compare efficiency between audio and visual methods
  • Estimate total pipeline execution time
  • Identify if runtime variance is high (may indicate video-dependent bottlenecks)

Implementation

visualize_results.py:219-260
def plot_runtime_comparison(df: pd.DataFrame, output_dir: str):
    methods = []
    means = []
    stds = []
    
    for method in ("audio", "visual"):
        methods.append(method.capitalize())
        means.append(mdf["runtime_seconds"].mean())
        stds.append(mdf["runtime_seconds"].std())
    
    ax.bar(methods, means, yerr=stds, color=colors, capsize=5)

Interpretation

Typical Results:
  • Audio: 2-5 seconds (FFmpeg extraction + GCC-PHAT)
  • Visual: 3-10 seconds (frame extraction + motion correlation)
If audio is significantly slower, check if FFmpeg is using hardware acceleration. If visual is significantly slower, consider reducing frame sampling rate or correlation window.
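The bar heights and error bars are just per-method mean and sample standard deviation. A minimal sketch, assuming the `method_type` and `runtime_seconds` columns shown above:

```python
import pandas as pd

# Toy results frame with the columns used by the script
df = pd.DataFrame({
    "method_type":     ["audio"] * 3 + ["visual"] * 3,
    "runtime_seconds": [2.0, 3.0, 4.0, 5.0, 7.0, 9.0],
})

# Bar height = mean, error bar = sample standard deviation (pandas default, ddof=1)
stats = df.groupby("method_type")["runtime_seconds"].agg(["mean", "std"])
```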

5. Error Distribution Boxplot

File: error_distribution_boxplot.png
Description: Side-by-side boxplots of absolute error grouped by true offset and method, with overlaid scatter points.

Use Cases

  • Visualize error distribution shape (median, quartiles, outliers)
  • Compare variability between methods
  • Identify offset-specific failure modes

Implementation

visualize_results.py:267-348
def plot_error_distribution(df: pd.DataFrame, output_dir: str):
    # Boxplots for audio and visual at each offset
    bp_audio = ax.boxplot(audio_data, positions=positions - box_width / 2, 
                          patch_artist=True, showfliers=False)
    bp_visual = ax.boxplot(visual_data, positions=positions + box_width / 2,
                           patch_artist=True, showfliers=False)
    
    # Overlay scatter points with jitter
    for i, off in enumerate(offsets):
        jitter = np.random.default_rng(42).uniform(-0.06, 0.06, size=len(a))
        ax.scatter(np.full_like(a, i - box_width / 2) + jitter, a, ...)

Interpretation

Low variance → consistent performance → method is stable
Reading Boxplots:
  • Box: Interquartile range (25th to 75th percentile)
  • Horizontal line: Median (50th percentile)
  • Whiskers: Extend to the most extreme data points within 1.5 × IQR of the box edges (points beyond would normally be drawn as fliers, but showfliers=False hides them here; the overlaid scatter shows every point instead)
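The same statistics the boxplot encodes can be computed directly, which is handy for tables or regression tests. A minimal sketch on toy error values:

```python
import numpy as np

errors = np.array([5.0, 8.0, 9.0, 10.0, 11.0, 12.0, 15.0, 40.0])  # toy absolute errors (ms)

q1, med, q3 = np.percentile(errors, [25, 50, 75])
iqr = q3 - q1

# Whisker reach: matplotlib draws each whisker to the most extreme point
# within 1.5 * IQR of the box; anything beyond is a flier
lo_bound = q1 - 1.5 * iqr
hi_bound = q3 + 1.5 * iqr
fliers = errors[(errors < lo_bound) | (errors > hi_bound)]
```

Here the 40 ms case falls outside the upper whisker bound and would be a flier.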

6. Resource Usage

File: resource_usage.png
Description: Dual bar chart (side-by-side) showing peak CPU% and peak memory (MB) by method.

Use Cases

  • Ensure pipeline fits within system constraints
  • Identify resource bottlenecks (CPU-bound vs memory-bound)
  • Compare resource efficiency between methods

Implementation

visualize_results.py:355-407
def plot_resource_usage(df: pd.DataFrame, output_dir: str):
    fig, (ax_cpu, ax_mem) = plt.subplots(1, 2, figsize=FIGSIZE_WIDE)
    
    for ax, col, ylabel, title, fmt in [
        (ax_cpu, "peak_cpu_percent", "Peak CPU (%)", ..., "{:.1f}%"),
        (ax_mem, "peak_memory_mb", "Peak Memory (MB)", ..., "{:.0f} MB"),
    ]:
        ax.bar(methods, means, yerr=stds, color=colors, capsize=5)

Interpretation

Peak CPU: 45-60%
→ Single-threaded NumPy/SciPy operations
→ Consider multi-threading or GPU-accelerated cross-correlation
If peak memory exceeds 1 GB, check if:
  • Very long videos are being processed without downsampling
  • Correlation window is too large (e.g., max_offset_sec > 30)
  • Temporary files are not being cleaned up (check /tmp)

7. Motion Before/After Overlay

Files: before_after/*.png (one per test case)
Description: Two-panel plot showing original vs synthetic motion signals before alignment (top) and after applying the estimated offset (bottom).

Use Cases

  • Visually validate that alignment improves signal overlap
  • Debug cases where visual sync fails (e.g., periodic motion, low signal-to-noise ratio)
  • Generate figures for publications or presentations

Implementation

visualize_results.py:414-484
def plot_motion_before_after(df, output_dir, diagnostics_dir):
    # Load motion signals from diagnostics .npz files
    data = np.load(npz_path)
    original = data["original"]
    synthetic = data["synthetic"]
    est_offset_ms = float(data["estimated_offset_ms"])
    
    # "After" = shift synthetic backwards by estimated offset
    shift_sec = est_offset_ms / 1000.0
    t_synth_aligned = t_synth - shift_sec
    
    # Plot before (raw) and after (aligned)
    ax_before.plot(t_orig, original, label="Original")
    ax_before.plot(t_synth, synthetic, label="Synthetic (raw)")
    
    ax_after.plot(t_orig, original, label="Original")
    ax_after.plot(t_synth_aligned, synthetic, label="Synthetic (aligned)")

Interpretation

After plot shows strong peak overlap → visual sync correctly identified the offset
Requires Diagnostics: This plot requires .npz files generated by run_batch.py. If missing, re-run batch synchronization with the latest version.
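The "shift backwards by the estimated offset" step can be sanity-checked on synthetic signals: if the estimate is right, resampling the shifted signal onto the original time grid should make the two curves coincide. A minimal sketch (signal shapes and sampling rate are illustrative, not taken from the pipeline):

```python
import numpy as np

fs = 100.0                      # Hz, assumed sampling rate of the motion signals
t = np.arange(0.0, 5.0, 1.0 / fs)

# One motion burst at t = 2.0 s; the synthetic copy is delayed by 300 ms
true_offset_ms = 300.0
original = np.exp(-((t - 2.0) ** 2) / 0.02)
synthetic = np.exp(-((t - 2.0 - true_offset_ms / 1000.0) ** 2) / 0.02)

# "After" alignment: shift the synthetic time axis backwards by the estimate
est_offset_ms = 300.0           # pretend the estimator recovered the offset exactly
t_aligned = t - est_offset_ms / 1000.0

# Resample onto the original grid and measure overlap before vs after
realigned = np.interp(t, t_aligned, synthetic)
err_before = np.abs(original - synthetic).max()
err_after = np.abs(original - realigned).max()
```

With a correct estimate, `err_after` collapses to near zero while `err_before` stays large, which is exactly the "strong peak overlap" to look for in the bottom panel.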

8. Sync Timelines

Files: timelines/*.png (one per test case)
Description: Timeline diagram showing original and synthetic video bars, with arrows indicating true offset and per-method estimated offsets.

Use Cases

  • Visualize the temporal relationship between original and synthetic videos
  • Compare how audio and visual methods estimated the offset
  • Annotate pad vs trim operations for clarity

Implementation

visualize_results.py:491-597
def plot_sync_timelines(df, output_dir):
    # Original bar (always starts at 0)
    ax.barh(y_orig, vid_len, left=0, color="#78909C", label="Original")
    
    # Synthetic bar (shifted by true offset)
    synth_start = true_offset_sec
    ax.barh(y_synth, vid_len, left=synth_start, color="#B0BEC5", label="Synthetic")
    
    # Arrow for true offset
    ax.annotate("", xy=(synth_start, y), xytext=(0, y), 
                arrowprops=dict(arrowstyle="->", color="#D32F2F"))
    
    # Arrows for estimated offsets (audio & visual)
    for method, est_sec in method_ests.items():
        ax.annotate("", xy=(est_sec, y), xytext=(0, y),
                    arrowprops=dict(arrowstyle="->", color=COLORS[method], linestyle="--"))

Diagram Components

1. Gray Bars
   Horizontal bars represent video duration. Original is always at y=1.0, synthetic at y=0.0.
2. Red Arrow (Solid)
   True offset — shows ground truth shift applied during offset generation.
     • Rightward arrow: Positive offset (padding)
     • Leftward arrow: Negative offset (trimming)
3. Blue/Orange Arrows (Dashed)
   Estimated offsets from audio (blue) and visual (orange) sync.
     • Length of arrow = magnitude of estimated offset
     • Annotation shows pad or trim based on sign

Interpretation

Dashed arrows overlap with solid red arrow → method correctly estimated the offset
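The pad/trim annotation is a pure function of the offset's sign. A hypothetical helper (`offset_action` is not a function in the script) mirroring that rule:

```python
# Hypothetical helper mirroring the annotation rule from "Diagram Components":
# positive offset -> synthetic starts later (pad),
# negative offset -> synthetic starts earlier (trim).
def offset_action(offset_ms: float) -> str:
    if offset_ms > 0:
        return f"pad {offset_ms:.0f} ms"
    if offset_ms < 0:
        return f"trim {-offset_ms:.0f} ms"
    return "in sync"
```

For example, `offset_action(500)` yields "pad 500 ms" and `offset_action(-100)` yields "trim 100 ms".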

Output Directory Structure

evaluation/plots/
├── error_vs_offset.png
├── confidence_vs_error.png
├── audio_video_diff_histogram.png
├── runtime_comparison.png
├── error_distribution_boxplot.png
├── resource_usage.png
├── before_after/
│   ├── video_a_offset-1000.png
│   ├── video_a_offset-500.png
│   ├── video_a_offset-100.png
│   ├── video_a_offset+100.png
│   ├── video_a_offset+500.png
│   ├── video_a_offset+1000.png
│   └── ... (18 more files for other videos)
└── timelines/
    ├── video_a_offset-1000.png
    ├── video_a_offset-500.png
    └── ... (24 files total)
All plots are saved at 300 DPI for print-quality output. Total disk usage: ~10-20 MB for 24 test cases.

Customization

Changing Colors

visualize_results.py:44
COLORS = {"audio": "#2196F3", "visual": "#FF9800"}

Changing DPI

visualize_results.py:45
DPI = 300  # Increase to 600 for higher resolution

Changing Plot Size

visualize_results.py:46-47
FIGSIZE_WIDE = (10, 5)    # Width, height in inches
FIGSIZE_SQUARE = (7, 6)

Disabling Specific Plots

Comment out the corresponding function call in generate_plots():
visualize_results.py:618-625
def generate_plots(results_csv, plots_dir):
    # ... setup ...
    
    plot_error_vs_offset(df, plots_dir)
    plot_confidence_vs_error(df, plots_dir)
    # plot_audio_video_diff_histogram(df, plots_dir)  # Disabled
    plot_runtime_comparison(df, plots_dir)
    # ...

Next Steps

  • Metrics Reference: Understand how metrics are computed from results.csv
  • Workflow Guide: Return to the step-by-step pipeline instructions
