The visualize_results.py script generates publication-ready plots at 300 DPI with a clean, modern style. All plots use consistent colors: blue for audio, orange for visual.

Quick Reference

  • Error vs Offset: Grouped bar chart — MAE by offset magnitude
  • Confidence vs Error: Scatter plot — Confidence score reliability
  • Audio-Video Diff Histogram: Histogram — Cross-method agreement distribution
  • Runtime Comparison: Bar chart — Mean runtime by method
  • Error Distribution Boxplot: Boxplot + scatter — Error distribution by method and offset
  • Resource Usage: Dual bar chart — Peak CPU and memory
  • Motion Before/After: Per-case overlay — Motion signals pre/post alignment
  • Sync Timelines: Per-case diagram — Timeline bars with offset arrows

Plot Style

All plots use a custom matplotlib style for consistency:
visualize_results.py:54-71
plt.rcParams.update({
    "font.family": "sans-serif",
    "font.size": 11,
    "axes.titlesize": 13,
    "axes.facecolor": "#FAFAFA",
    "axes.grid": True,
    "grid.alpha": 0.3,
    "grid.linestyle": "--",
    "axes.spines.top": False,
    "axes.spines.right": False,
})
Colors:
  • Audio: #2196F3 (Material Blue)
  • Visual: #FF9800 (Material Orange)
  • Neutral: #7E57C2 (Material Purple for cross-method plots)

1. Error vs Offset

File: error_vs_offset.png
Description: Grouped bar chart showing MAE for each true offset magnitude, split by method.

Use Cases

  • Identify if accuracy degrades at extreme offsets (e.g., ±1000 ms)
  • Compare audio vs visual performance across offset ranges
  • Detect asymmetry (e.g., better performance on positive vs negative offsets)

Implementation

visualize_results.py:78-114
def plot_error_vs_offset(df: pd.DataFrame, output_dir: str):
    offsets = sorted(df["true_offset_ms"].unique())
    x = np.arange(len(offsets))
    width = 0.35
    
    for i, method in enumerate(("audio", "visual")):
        mdf = df[df["method_type"] == method]
        maes = []
        for off in offsets:
            subset = mdf[mdf["true_offset_ms"] == off]
            maes.append(subset["absolute_error_ms"].mean() if not subset.empty else 0)
        ax.bar(x + (i - 0.5) * width, maes, width, label=method.capitalize(), color=COLORS[method])

Interpretation

MAE is roughly constant across all offsets → method is robust to offset magnitude.
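The bar heights can be reproduced straight from results.csv with a pandas groupby. A minimal sketch, assuming the column names used in the snippet above (`true_offset_ms`, `method_type`, `absolute_error_ms`):

```python
import pandas as pd

# Toy results frame using the column names from the snippet above
df = pd.DataFrame({
    "true_offset_ms":    [-500, -500, 500, 500],
    "method_type":       ["audio", "visual", "audio", "visual"],
    "absolute_error_ms": [12.0, 40.0, 8.0, 36.0],
})

# One MAE value per (offset, method) pair -- exactly the bar heights of the chart
mae = (
    df.groupby(["true_offset_ms", "method_type"])["absolute_error_ms"]
      .mean()
      .unstack("method_type")
)
```

The resulting frame (offsets as rows, methods as columns) is also a convenient table for reports.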

2. Confidence vs Error

File: confidence_vs_error.png
Description: Scatter plot of confidence score vs absolute error with linear regression lines.

Use Cases

  • Validate that confidence scores reliably predict error magnitude
  • Identify outliers (high confidence but high error, or vice versa)
  • Compare confidence calibration between audio and visual methods

Implementation

visualize_results.py:121-167
def plot_confidence_vs_error(df: pd.DataFrame, output_dir: str):
    for method in ("audio", "visual"):
        ax.scatter(mdf["confidence_score"], mdf["absolute_error_ms"], 
                   label=method.capitalize(), color=COLORS[method], alpha=0.6)
        
        # Regression line
        z = np.polyfit(confs, errors, 1)
        p = np.poly1d(z)
        ax.plot(xs, p(xs), "--", color=COLORS[method], alpha=0.7)
        
        # Annotate Pearson r
        r = np.corrcoef(confs, errors)[0, 1]
        ax.annotate(f"{method}: r={r:.3f}", ...)

Interpretation

Pearson r = -0.52 (audio)
→ Moderate-to-strong negative correlation
→ Higher confidence tends to predict lower error
→ Use confidence for filtering or weighted averaging
Outliers to Investigate:
  • High confidence, high error: False positive — method is confident but wrong (e.g., periodic motion creates multiple correlation peaks)
  • Low confidence, low error: False negative — method is uncertain but correct (e.g., weak signal but still recoverable)
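Both the Pearson r annotation and the outlier categories can be checked numerically. A minimal sketch on synthetic data, where `conf` and `err` stand in for the `confidence_score` and `absolute_error_ms` columns:

```python
import numpy as np

rng = np.random.default_rng(0)
conf = rng.uniform(0.2, 1.0, 200)                        # stand-in for confidence_score
err = 100.0 * (1.0 - conf) + rng.normal(0.0, 5.0, 200)   # error shrinks as confidence grows

# Pearson r, as annotated on the plot
r = np.corrcoef(conf, err)[0, 1]

# "High confidence, high error" suspects: top-quartile confidence AND top-quartile error
hi_conf = conf >= np.quantile(conf, 0.75)
hi_err = err >= np.quantile(err, 0.75)
suspects = np.flatnonzero(hi_conf & hi_err)
```

With well-calibrated confidence, `suspects` stays small; a growing list is the "false positive" failure mode described above.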

3. Audio-Video Diff Histogram

File: audio_video_diff_histogram.png
Description: Histogram of |audio_estimate - visual_estimate| across all test cases.

Use Cases

  • Visualize cross-method agreement distribution
  • Identify if disagreement is centered around a systematic bias or scattered
  • Assess feasibility of hybrid strategies (e.g., average both estimates if diff < threshold)

Implementation

visualize_results.py:174-212
def plot_audio_video_diff_histogram(df: pd.DataFrame, output_dir: str):
    diffs = np.abs(audio_df.loc[common].values - visual_df.loc[common].values)
    
    ax.hist(diffs, bins=20, color="#7E57C2", alpha=0.85)
    ax.axvline(np.mean(diffs), color="#D32F2F", linestyle="--", 
               label=f"Mean = {np.mean(diffs):.1f} ms")
    ax.axvline(np.median(diffs), color="#388E3C", linestyle="--",
               label=f"Median = {np.median(diffs):.1f} ms")

Interpretation

Strong agreement → both methods likely correct → safe to average estimates or use either method
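The diff distribution can also be computed without the plot, e.g. to drive a hybrid-averaging threshold. A minimal sketch, assuming per-method estimates keyed by a hypothetical `test_case` column:

```python
import pandas as pd

# Toy per-method estimates; "test_case" is a hypothetical key column
df = pd.DataFrame({
    "test_case":           ["a", "a", "b", "b", "c"],
    "method_type":         ["audio", "visual", "audio", "visual", "audio"],
    "estimated_offset_ms": [100.0, 104.0, -500.0, -512.0, 250.0],
})

# One row per case, one column per method; cases missing a method become NaN
wide = df.pivot(index="test_case", columns="method_type", values="estimated_offset_ms")
diffs = (wide["audio"] - wide["visual"]).abs().dropna()  # |audio - visual| per case
```

Case "c" has no visual estimate, so it drops out, mirroring the `common` index intersection in the snippet above.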

4. Runtime Comparison

File: runtime_comparison.png
Description: Bar chart of mean runtime per method with standard deviation error bars.

Use Cases

  • Compare efficiency between audio and visual methods
  • Estimate total pipeline execution time
  • Identify if runtime variance is high (may indicate video-dependent bottlenecks)

Implementation

visualize_results.py:219-260
def plot_runtime_comparison(df: pd.DataFrame, output_dir: str):
    methods = []
    means = []
    stds = []
    
    for method in ("audio", "visual"):
        methods.append(method.capitalize())
        means.append(mdf["runtime_seconds"].mean())
        stds.append(mdf["runtime_seconds"].std())
    
    ax.bar(methods, means, yerr=stds, color=colors, capsize=5)

Interpretation

Typical Results:
  • Audio: 2-5 seconds (FFmpeg extraction + GCC-PHAT)
  • Visual: 3-10 seconds (frame extraction + motion correlation)
If audio is significantly slower, check if FFmpeg is using hardware acceleration. If visual is significantly slower, consider reducing frame sampling rate or correlation window.
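The bar heights and error bars are just per-method mean and sample standard deviation. A minimal sketch, assuming the `method_type` and `runtime_seconds` columns shown above:

```python
import pandas as pd

# Toy results frame with the columns used by the script
df = pd.DataFrame({
    "method_type":     ["audio"] * 3 + ["visual"] * 3,
    "runtime_seconds": [2.0, 3.0, 4.0, 5.0, 7.0, 9.0],
})

# Bar height = mean, error bar = sample standard deviation (pandas default, ddof=1)
stats = df.groupby("method_type")["runtime_seconds"].agg(["mean", "std"])
```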

5. Error Distribution Boxplot

File: error_distribution_boxplot.png
Description: Side-by-side boxplots of absolute error grouped by true offset and method, with overlaid scatter points.

Use Cases

  • Visualize error distribution shape (median, quartiles, outliers)
  • Compare variability between methods
  • Identify offset-specific failure modes

Implementation

visualize_results.py:267-348
def plot_error_distribution(df: pd.DataFrame, output_dir: str):
    # Boxplots for audio and visual at each offset
    bp_audio = ax.boxplot(audio_data, positions=positions - box_width / 2, 
                          patch_artist=True, showfliers=False)
    bp_visual = ax.boxplot(visual_data, positions=positions + box_width / 2,
                           patch_artist=True, showfliers=False)
    
    # Overlay scatter points with jitter
    for i, off in enumerate(offsets):
        jitter = np.random.default_rng(42).uniform(-0.06, 0.06, size=len(a))
        ax.scatter(np.full_like(a, i - box_width / 2) + jitter, a, ...)

Interpretation

Low variance → consistent performance → method is stable
Reading Boxplots:
  • Box: Interquartile range (25th to 75th percentile)
  • Horizontal line: Median (50th percentile)
  • Whiskers: Extend to the most extreme data points within 1.5 × IQR of the box edges (points beyond would normally be drawn as fliers, but showfliers=False hides them here; the overlaid scatter shows every point instead)
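The same statistics the boxplot encodes can be computed directly, which is handy for tables or regression tests. A minimal sketch on toy error values:

```python
import numpy as np

errors = np.array([5.0, 8.0, 9.0, 10.0, 11.0, 12.0, 15.0, 40.0])  # toy absolute errors (ms)

q1, med, q3 = np.percentile(errors, [25, 50, 75])
iqr = q3 - q1

# Whisker reach: matplotlib draws each whisker to the most extreme point
# within 1.5 * IQR of the box; anything beyond is a flier
lo_bound = q1 - 1.5 * iqr
hi_bound = q3 + 1.5 * iqr
fliers = errors[(errors < lo_bound) | (errors > hi_bound)]
```

Here the 40 ms case falls outside the upper whisker bound and would be a flier.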

6. Resource Usage

File: resource_usage.png
Description: Dual bar chart (side-by-side) showing peak CPU% and peak memory (MB) by method.

Use Cases

  • Ensure pipeline fits within system constraints
  • Identify resource bottlenecks (CPU-bound vs memory-bound)
  • Compare resource efficiency between methods

Implementation

visualize_results.py:355-407
def plot_resource_usage(df: pd.DataFrame, output_dir: str):
    fig, (ax_cpu, ax_mem) = plt.subplots(1, 2, figsize=FIGSIZE_WIDE)
    
    for ax, col, ylabel, title, fmt in [
        (ax_cpu, "peak_cpu_percent", "Peak CPU (%)", ..., "{:.1f}%"),
        (ax_mem, "peak_memory_mb", "Peak Memory (MB)", ..., "{:.0f} MB"),
    ]:
        ax.bar(methods, means, yerr=stds, color=colors, capsize=5)

Interpretation

Peak CPU: 45-60%
→ Single-threaded NumPy/SciPy operations
→ Consider multi-threading or GPU-accelerated cross-correlation
If peak memory exceeds 1 GB, check if:
  • Very long videos are being processed without downsampling
  • Correlation window is too large (e.g., max_offset_sec > 30)
  • Temporary files are not being cleaned up (check /tmp)

7. Motion Before/After Overlay

Files: before_after/*.png (one per test case)
Description: Two-panel plot showing original vs synthetic motion signals before alignment (top) and after applying the estimated offset (bottom).

Use Cases

  • Visually validate that alignment improves signal overlap
  • Debug cases where visual sync fails (e.g., periodic motion, low signal-to-noise ratio)
  • Generate figures for publications or presentations

Implementation

visualize_results.py:414-484
def plot_motion_before_after(df, output_dir, diagnostics_dir):
    # Load motion signals from diagnostics .npz files
    data = np.load(npz_path)
    original = data["original"]
    synthetic = data["synthetic"]
    est_offset_ms = float(data["estimated_offset_ms"])
    
    # "After" = shift synthetic backwards by estimated offset
    shift_sec = est_offset_ms / 1000.0
    t_synth_aligned = t_synth - shift_sec
    
    # Plot before (raw) and after (aligned)
    ax_before.plot(t_orig, original, label="Original")
    ax_before.plot(t_synth, synthetic, label="Synthetic (raw)")
    
    ax_after.plot(t_orig, original, label="Original")
    ax_after.plot(t_synth_aligned, synthetic, label="Synthetic (aligned)")

Interpretation

After plot shows strong peak overlap → visual sync correctly identified the offset
Requires Diagnostics: This plot requires .npz files generated by run_batch.py. If missing, re-run batch synchronization with the latest version.
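The "shift backwards by the estimated offset" step can be sanity-checked on synthetic signals: if the estimate is right, resampling the shifted signal onto the original time grid should make the two curves coincide. A minimal sketch (signal shapes and sampling rate are illustrative, not taken from the pipeline):

```python
import numpy as np

fs = 100.0                      # Hz, assumed sampling rate of the motion signals
t = np.arange(0.0, 5.0, 1.0 / fs)

# One motion burst at t = 2.0 s; the synthetic copy is delayed by 300 ms
true_offset_ms = 300.0
original = np.exp(-((t - 2.0) ** 2) / 0.02)
synthetic = np.exp(-((t - 2.0 - true_offset_ms / 1000.0) ** 2) / 0.02)

# "After" alignment: shift the synthetic time axis backwards by the estimate
est_offset_ms = 300.0           # pretend the estimator recovered the offset exactly
t_aligned = t - est_offset_ms / 1000.0

# Resample onto the original grid and measure overlap before vs after
realigned = np.interp(t, t_aligned, synthetic)
err_before = np.abs(original - synthetic).max()
err_after = np.abs(original - realigned).max()
```

With a correct estimate, `err_after` collapses to near zero while `err_before` stays large, which is exactly the "strong peak overlap" to look for in the bottom panel.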

8. Sync Timelines

Files: timelines/*.png (one per test case)
Description: Timeline diagram showing original and synthetic video bars, with arrows indicating true offset and per-method estimated offsets.

Use Cases

  • Visualize the temporal relationship between original and synthetic videos
  • Compare how audio and visual methods estimated the offset
  • Annotate pad vs trim operations for clarity

Implementation

visualize_results.py:491-597
def plot_sync_timelines(df, output_dir):
    # Original bar (always starts at 0)
    ax.barh(y_orig, vid_len, left=0, color="#78909C", label="Original")
    
    # Synthetic bar (shifted by true offset)
    synth_start = true_offset_sec
    ax.barh(y_synth, vid_len, left=synth_start, color="#B0BEC5", label="Synthetic")
    
    # Arrow for true offset
    ax.annotate("", xy=(synth_start, y), xytext=(0, y), 
                arrowprops=dict(arrowstyle="->", color="#D32F2F"))
    
    # Arrows for estimated offsets (audio & visual)
    for method, est_sec in method_ests.items():
        ax.annotate("", xy=(est_sec, y), xytext=(0, y),
                    arrowprops=dict(arrowstyle="->", color=COLORS[method], linestyle="--"))

Diagram Components

1. Gray Bars
   Horizontal bars represent video duration. Original is always at y=1.0, synthetic at y=0.0.
2. Red Arrow (Solid)
   True offset — shows ground truth shift applied during offset generation.
     • Rightward arrow: Positive offset (padding)
     • Leftward arrow: Negative offset (trimming)
3. Blue/Orange Arrows (Dashed)
   Estimated offsets from audio (blue) and visual (orange) sync.
     • Length of arrow = magnitude of estimated offset
     • Annotation shows pad or trim based on sign

Interpretation

Dashed arrows overlap with solid red arrow → method correctly estimated the offset
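The pad/trim annotation is a pure function of the offset's sign. A hypothetical helper (`offset_action` is not a function in the script) mirroring that rule:

```python
# Hypothetical helper mirroring the annotation rule from "Diagram Components":
# positive offset -> synthetic starts later (pad),
# negative offset -> synthetic starts earlier (trim).
def offset_action(offset_ms: float) -> str:
    if offset_ms > 0:
        return f"pad {offset_ms:.0f} ms"
    if offset_ms < 0:
        return f"trim {-offset_ms:.0f} ms"
    return "in sync"
```

For example, `offset_action(500)` yields "pad 500 ms" and `offset_action(-100)` yields "trim 100 ms".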

Output Directory Structure

evaluation/plots/
├── error_vs_offset.png
├── confidence_vs_error.png
├── audio_video_diff_histogram.png
├── runtime_comparison.png
├── error_distribution_boxplot.png
├── resource_usage.png
├── before_after/
│   ├── video_a_offset-1000.png
│   ├── video_a_offset-500.png
│   ├── video_a_offset-100.png
│   ├── video_a_offset+100.png
│   ├── video_a_offset+500.png
│   ├── video_a_offset+1000.png
│   └── ... (18 more files for other videos)
└── timelines/
    ├── video_a_offset-1000.png
    ├── video_a_offset-500.png
    └── ... (24 files total)
All plots are saved at 300 DPI for print-quality output. Total disk usage: ~10-20 MB for 24 test cases.

Customization

Changing Colors

visualize_results.py:44
COLORS = {"audio": "#2196F3", "visual": "#FF9800"}

Changing DPI

visualize_results.py:45
DPI = 300  # Increase to 600 for higher resolution

Changing Plot Size

visualize_results.py:46-47
FIGSIZE_WIDE = (10, 5)    # Width, height in inches
FIGSIZE_SQUARE = (7, 6)

Disabling Specific Plots

Comment out the corresponding function call in generate_plots():
visualize_results.py:618-625
def generate_plots(results_csv, plots_dir):
    # ... setup ...
    
    plot_error_vs_offset(df, plots_dir)
    plot_confidence_vs_error(df, plots_dir)
    # plot_audio_video_diff_histogram(df, plots_dir)  # Disabled
    plot_runtime_comparison(df, plots_dir)
    # ...

Next Steps

  • Metrics Reference: Understand how metrics are computed from results.csv
  • Workflow Guide: Return to the step-by-step pipeline instructions
