
Overview

The compute_metrics module analyzes the results CSV produced by run_batch and computes a comprehensive set of performance metrics: accuracy (MAE, RMSE), cross-method agreement, confidence validation, efficiency, resource usage, and sensitivity analysis grouped by video characteristics.

Functions

compute_metrics

compute_metrics(
    results_csv: str = RESULTS_CSV,
    metrics_dir: str = METRICS_DIR,
) -> dict
Compute all evaluation metrics from results CSV.
Parameters:
  • results_csv (str, default: "evaluation/results/results.csv"): Path to the results CSV generated by run_batch.
  • metrics_dir (str, default: "evaluation/metrics"): Output directory for the metrics JSON file.
Returns: Dictionary containing all computed metrics.
Output Files:
  • metrics_dir/metrics_summary.json: Complete metrics in JSON format
  • Stdout: Human-readable summary table
Metrics Computed:
  1. Accuracy (per method):
    • Mean Absolute Error (MAE)
    • Root Mean Square Error (RMSE)
    • Median Error
    • Maximum Error
    • MAE per offset magnitude
  2. Cross-Method Agreement:
    • Mean |audio − visual| difference
    • Median |audio − visual| difference
    • Percentage within threshold (default: 100 ms)
    • Number of comparable pairs
  3. Confidence Validation (per method):
    • Pearson correlation between confidence and error
    • MAE for all samples
    • MAE after filtering bottom 20% confidence
    • Confidence threshold used for filtering
  4. Efficiency (per method):
    • Mean runtime
    • Median runtime
    • Total runtime
    • Runtime per video minute
  5. Resource Usage (per method):
    • Mean peak CPU percentage
    • Max peak CPU percentage
    • Mean peak memory (MB)
    • Max peak memory (MB)
  6. Grouped Metrics (by sensitivity tags):
    • MAE and RMSE grouped by:
      • Video length: <30s, 30-60s, 60-120s, >120s
      • Motion level: low, medium, high, very_high
      • Audio energy level: low, medium, high, very_high
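As a rough sketch, the per-method accuracy metrics listed above can be derived from the results table as follows. The column names `method`, `predicted_offset_ms`, and `true_offset_ms` are illustrative assumptions; the actual schema is whatever run_batch writes to the CSV.

```python
import numpy as np
import pandas as pd

def accuracy_metrics(df: pd.DataFrame, method: str) -> dict:
    """Per-method accuracy metrics (column names are illustrative)."""
    sub = df[df["method"] == method]
    # Absolute error between predicted and ground-truth offsets, in ms.
    errors = (sub["predicted_offset_ms"] - sub["true_offset_ms"]).abs()
    return {
        "mae_ms": float(errors.mean()),
        "rmse_ms": float(np.sqrt((errors ** 2).mean())),
        "median_error_ms": float(errors.median()),
        "max_error_ms": float(errors.max()),
        "count": int(len(errors)),
    }
```

The per-offset breakdown follows the same pattern with an extra `groupby` on the injected offset column.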

Helper Functions

_safe_pearson

_safe_pearson(x: np.ndarray, y: np.ndarray) -> float
Compute Pearson correlation coefficient, returning 0.0 for degenerate cases.
Parameters:
  • x (np.ndarray): First array.
  • y (np.ndarray): Second array.
Returns: Correlation coefficient in [-1, 1], or 0.0 if inputs are degenerate.
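A minimal sketch of such a guard (not the module's actual implementation): degenerate inputs are fewer than two points, mismatched lengths, zero variance, or a non-finite result.

```python
import numpy as np

def safe_pearson(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson r, falling back to 0.0 for degenerate inputs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.size < 2 or y.size < 2 or x.size != y.size:
        return 0.0
    # A constant array has zero variance, making r undefined.
    if np.std(x) == 0.0 or np.std(y) == 0.0:
        return 0.0
    r = float(np.corrcoef(x, y)[0, 1])
    return r if np.isfinite(r) else 0.0
```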

_safe_float

_safe_float(val) -> float
Convert to float, replacing NaN/Inf with 0.0 and rounding to 4 decimal places.
Parameters:
  • val (any): Value to convert.
Returns: Sanitized float value.
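A sketch of this sanitizer (the handling of unconvertible values is an assumption; the documented behavior only covers NaN/Inf):

```python
import math

def safe_float(val) -> float:
    """Coerce to float; NaN/Inf become 0.0, rounded to 4 decimal places."""
    try:
        f = float(val)
    except (TypeError, ValueError):
        # Assumption: unconvertible values also map to 0.0.
        return 0.0
    if not math.isfinite(f):
        return 0.0
    return round(f, 4)
```

This keeps the metrics JSON strictly valid, since NaN and Infinity are not legal JSON values.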

_print_summary

_print_summary(metrics: dict)
Print a human-readable summary table to stdout.
Parameters:
  • metrics (dict): Metrics dictionary returned by compute_metrics.

Usage Example

import logging
from evaluation.compute_metrics import compute_metrics

logging.basicConfig(level=logging.INFO)

# Compute metrics from results
metrics = compute_metrics(
    results_csv="evaluation/results/results.csv",
    metrics_dir="evaluation/metrics",
)

# Access specific metrics
print(f"Audio MAE: {metrics['accuracy']['audio']['mae_ms']:.2f} ms")
print(f"Visual MAE: {metrics['accuracy']['visual']['mae_ms']:.2f} ms")
print(f"Cross-method agreement: {metrics['cross_method_agreement']['pct_within_threshold']:.1f}%")

CLI Usage

python -m evaluation.compute_metrics
Outputs:
  • Metrics JSON to evaluation/metrics/metrics_summary.json
  • Human-readable summary table to stdout

Configuration

Default Paths

BASE_DIR = os.path.dirname(os.path.abspath(__file__))
RESULTS_CSV = os.path.join(BASE_DIR, "results", "results.csv")
METRICS_DIR = os.path.join(BASE_DIR, "metrics")

Constants

AGREEMENT_THRESHOLD_MS = 100.0  # Threshold for cross-method agreement reporting
CONFIDENCE_FILTER_QUANTILE = 0.20  # Bottom 20% confidence filtering
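The quantile constant drives the confidence-validation filter: rows below the 20th percentile of confidence are dropped, and MAE is recomputed on the rest. A sketch, assuming illustrative column names `confidence` and `abs_error_ms`:

```python
import pandas as pd

CONFIDENCE_FILTER_QUANTILE = 0.20  # bottom 20% of confidence scores dropped

def filtered_mae(df: pd.DataFrame) -> dict:
    """MAE before and after dropping low-confidence rows (columns illustrative)."""
    threshold = df["confidence"].quantile(CONFIDENCE_FILTER_QUANTILE)
    kept = df[df["confidence"] >= threshold]
    return {
        "mae_all_ms": float(df["abs_error_ms"].mean()),
        "mae_filtered_ms": float(kept["abs_error_ms"].mean()),
        "confidence_threshold": float(threshold),
    }
```

If filtering lowers MAE, the confidence scores carry real signal about prediction quality.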

Example Output

JSON Structure

{
  "accuracy": {
    "audio": {
      "mae_ms": 42.35,
      "rmse_ms": 58.12,
      "median_error_ms": 31.50,
      "max_error_ms": 215.00,
      "count": 36,
      "mae_per_offset_ms": {
        "-1000": 38.20,
        "-500": 28.45,
        "100": 15.30,
        "500": 45.60,
        "1000": 52.10
      }
    },
    "visual": { /* ... */ }
  },
  "cross_method_agreement": {
    "mean_audio_video_diff_ms": 65.23,
    "median_audio_video_diff_ms": 48.15,
    "pct_within_threshold": 72.5,
    "threshold_ms": 100.0,
    "n_pairs": 36
  },
  "confidence_validation": { /* ... */ },
  "efficiency": { /* ... */ },
  "resource_usage": { /* ... */ },
  "grouped_by_tag": { /* ... */ }
}
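To reload a saved summary later (for example, in a notebook), a small helper like the following is enough; the default path shown matches this module's default output location.

```python
import json
from pathlib import Path

def load_metrics(path: str = "evaluation/metrics/metrics_summary.json") -> dict:
    """Reload a previously written metrics summary for downstream analysis."""
    return json.loads(Path(path).read_text())
```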

Console Output

======================================================================
  EVALUATION METRICS SUMMARY
======================================================================

── Accuracy ───────────────────────────────────────────
  [AUDIO]
    MAE:    42.35 ms
    RMSE:   58.12 ms
    Median: 31.50 ms
    Max:    215.00 ms
    Count:  36
    Per-offset MAE:
      -1000 ms → 38.20 ms error
       -500 ms → 28.45 ms error
        100 ms → 15.30 ms error
        500 ms → 45.60 ms error
       1000 ms → 52.10 ms error

  [VISUAL]
    ...

── Cross-Method Agreement ─────────────────────────────
  Mean |audio − visual| diff: 65.23 ms
  % within 100ms:         72.5%

── Confidence Validation ──────────────────────────────
  [AUDIO]
    Pearson(confidence, error): -0.4523
    MAE (all):                  42.35 ms
    MAE (filtered, top 80%):    35.12 ms

── Efficiency ─────────────────────────────────────────
  [AUDIO]
    Mean runtime: 3.45s
    Per video-min: 2.12s

── Resource Usage ─────────────────────────────────────
  [AUDIO]
    Mean peak CPU:    45.2%
    Max  peak CPU:    78.5%
    Mean peak memory: 256.3 MB
    Max  peak memory: 412.8 MB

======================================================================

Notes

  • Error Metrics: All error metrics are in milliseconds for easy interpretation.
  • Confidence Correlation: Negative Pearson correlation indicates higher confidence correlates with lower error (desired behavior).
  • Filtering: Confidence filtering removes the bottom 20% of results by confidence score, demonstrating the value of confidence metrics.
  • Sensitivity Analysis: Grouped metrics reveal performance patterns across different video characteristics.
