
Overview

The compute_metrics module analyzes the results CSV produced by run_batch and computes a comprehensive set of performance metrics: accuracy (MAE, RMSE), cross-method agreement, confidence validation, efficiency, resource usage, and sensitivity analysis grouped by video characteristics.

Functions

compute_metrics

compute_metrics(
    results_csv: str = RESULTS_CSV,
    metrics_dir: str = METRICS_DIR,
) -> dict
Compute all evaluation metrics from results CSV.
Parameters:
  • results_csv (str, default: "evaluation/results/results.csv"): Path to the results CSV generated by run_batch.
  • metrics_dir (str, default: "evaluation/metrics"): Output directory for the metrics JSON file.
Returns: Dictionary containing all computed metrics.
Output Files:
  • metrics_dir/metrics_summary.json: Complete metrics in JSON format
  • Stdout: Human-readable summary table
Metrics Computed:
  1. Accuracy (per method):
    • Mean Absolute Error (MAE)
    • Root Mean Square Error (RMSE)
    • Median Error
    • Maximum Error
    • MAE per offset magnitude
  2. Cross-Method Agreement:
    • Mean |audio − visual| difference
    • Median |audio − visual| difference
    • Percentage within threshold (default: 100 ms)
    • Number of comparable pairs
  3. Confidence Validation (per method):
    • Pearson correlation between confidence and error
    • MAE for all samples
    • MAE after filtering bottom 20% confidence
    • Confidence threshold used for filtering
  4. Efficiency (per method):
    • Mean runtime
    • Median runtime
    • Total runtime
    • Runtime per video minute
  5. Resource Usage (per method):
    • Mean peak CPU percentage
    • Max peak CPU percentage
    • Mean peak memory (MB)
    • Max peak memory (MB)
  6. Grouped Metrics (by sensitivity tags):
    • MAE and RMSE grouped by:
      • Video length: <30s, 30-60s, 60-120s, >120s
      • Motion level: low, medium, high, very_high
      • Audio energy level: low, medium, high, very_high
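As a rough sketch, the per-method accuracy metrics listed above can be derived from the results table as follows. The column names `method`, `predicted_offset_ms`, and `true_offset_ms` are illustrative assumptions; the actual schema is whatever run_batch writes to the CSV.

```python
import numpy as np
import pandas as pd

def accuracy_metrics(df: pd.DataFrame, method: str) -> dict:
    """Per-method accuracy metrics (column names are illustrative)."""
    sub = df[df["method"] == method]
    # Absolute error between predicted and ground-truth offsets, in ms.
    errors = (sub["predicted_offset_ms"] - sub["true_offset_ms"]).abs()
    return {
        "mae_ms": float(errors.mean()),
        "rmse_ms": float(np.sqrt((errors ** 2).mean())),
        "median_error_ms": float(errors.median()),
        "max_error_ms": float(errors.max()),
        "count": int(len(errors)),
    }
```

The per-offset breakdown follows the same pattern with an extra `groupby` on the injected offset column.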

Helper Functions

_safe_pearson

_safe_pearson(x: np.ndarray, y: np.ndarray) -> float
Compute Pearson correlation coefficient, returning 0.0 for degenerate cases.
Parameters:
  • x (np.ndarray): First array.
  • y (np.ndarray): Second array.
Returns: Correlation coefficient in [-1, 1], or 0.0 if inputs are degenerate.
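A minimal sketch of such a guard (not the module's actual implementation): degenerate inputs are fewer than two points, mismatched lengths, zero variance, or a non-finite result.

```python
import numpy as np

def safe_pearson(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson r, falling back to 0.0 for degenerate inputs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.size < 2 or y.size < 2 or x.size != y.size:
        return 0.0
    # A constant array has zero variance, making r undefined.
    if np.std(x) == 0.0 or np.std(y) == 0.0:
        return 0.0
    r = float(np.corrcoef(x, y)[0, 1])
    return r if np.isfinite(r) else 0.0
```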

_safe_float

_safe_float(val) -> float
Convert to float, replacing NaN/Inf with 0.0 and rounding to 4 decimal places.
Parameters:
  • val (any): Value to convert.
Returns: Sanitized float value.
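A sketch of this sanitizer (the handling of unconvertible values is an assumption; the documented behavior only covers NaN/Inf):

```python
import math

def safe_float(val) -> float:
    """Coerce to float; NaN/Inf become 0.0, rounded to 4 decimal places."""
    try:
        f = float(val)
    except (TypeError, ValueError):
        # Assumption: unconvertible values also map to 0.0.
        return 0.0
    if not math.isfinite(f):
        return 0.0
    return round(f, 4)
```

This keeps the metrics JSON strictly valid, since NaN and Infinity are not legal JSON values.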

_print_summary

_print_summary(metrics: dict)
Print a human-readable summary table to stdout.
Parameters:
  • metrics (dict): Metrics dictionary returned by compute_metrics.

Usage Example

import logging
from evaluation.compute_metrics import compute_metrics

logging.basicConfig(level=logging.INFO)

# Compute metrics from results
metrics = compute_metrics(
    results_csv="evaluation/results/results.csv",
    metrics_dir="evaluation/metrics",
)

# Access specific metrics
print(f"Audio MAE: {metrics['accuracy']['audio']['mae_ms']:.2f} ms")
print(f"Visual MAE: {metrics['accuracy']['visual']['mae_ms']:.2f} ms")
print(f"Cross-method agreement: {metrics['cross_method_agreement']['pct_within_threshold']:.1f}%")

CLI Usage

python -m evaluation.compute_metrics
Outputs:
  • Metrics JSON to evaluation/metrics/metrics_summary.json
  • Human-readable summary table to stdout

Configuration

Default Paths

BASE_DIR = os.path.dirname(os.path.abspath(__file__))
RESULTS_CSV = os.path.join(BASE_DIR, "results", "results.csv")
METRICS_DIR = os.path.join(BASE_DIR, "metrics")

Constants

AGREEMENT_THRESHOLD_MS = 100.0  # Threshold for cross-method agreement reporting
CONFIDENCE_FILTER_QUANTILE = 0.20  # Bottom 20% confidence filtering
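The quantile constant drives the confidence-validation filter: rows below the 20th percentile of confidence are dropped, and MAE is recomputed on the rest. A sketch, assuming illustrative column names `confidence` and `abs_error_ms`:

```python
import pandas as pd

CONFIDENCE_FILTER_QUANTILE = 0.20  # bottom 20% of confidence scores dropped

def filtered_mae(df: pd.DataFrame) -> dict:
    """MAE before and after dropping low-confidence rows (columns illustrative)."""
    threshold = df["confidence"].quantile(CONFIDENCE_FILTER_QUANTILE)
    kept = df[df["confidence"] >= threshold]
    return {
        "mae_all_ms": float(df["abs_error_ms"].mean()),
        "mae_filtered_ms": float(kept["abs_error_ms"].mean()),
        "confidence_threshold": float(threshold),
    }
```

If filtering lowers MAE, the confidence scores carry real signal about prediction quality.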

Example Output

JSON Structure

{
  "accuracy": {
    "audio": {
      "mae_ms": 42.35,
      "rmse_ms": 58.12,
      "median_error_ms": 31.50,
      "max_error_ms": 215.00,
      "count": 36,
      "mae_per_offset_ms": {
        "-1000": 38.20,
        "-500": 28.45,
        "100": 15.30,
        "500": 45.60,
        "1000": 52.10
      }
    },
    "visual": { /* ... */ }
  },
  "cross_method_agreement": {
    "mean_audio_video_diff_ms": 65.23,
    "median_audio_video_diff_ms": 48.15,
    "pct_within_threshold": 72.5,
    "threshold_ms": 100.0,
    "n_pairs": 36
  },
  "confidence_validation": { /* ... */ },
  "efficiency": { /* ... */ },
  "resource_usage": { /* ... */ },
  "grouped_by_tag": { /* ... */ }
}
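To reload a saved summary later (for example, in a notebook), a small helper like the following is enough; the default path shown matches this module's default output location.

```python
import json
from pathlib import Path

def load_metrics(path: str = "evaluation/metrics/metrics_summary.json") -> dict:
    """Reload a previously written metrics summary for downstream analysis."""
    return json.loads(Path(path).read_text())
```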

Console Output

======================================================================
  EVALUATION METRICS SUMMARY
======================================================================

── Accuracy ───────────────────────────────────────────
  [AUDIO]
    MAE:    42.35 ms
    RMSE:   58.12 ms
    Median: 31.50 ms
    Max:    215.00 ms
    Count:  36
    Per-offset MAE:
      -1000 ms → 38.20 ms error
       -500 ms → 28.45 ms error
        100 ms → 15.30 ms error
        500 ms → 45.60 ms error
       1000 ms → 52.10 ms error

  [VISUAL]
    ...

── Cross-Method Agreement ─────────────────────────────
  Mean |audio − visual| diff: 65.23 ms
  % within 100ms:         72.5%

── Confidence Validation ──────────────────────────────
  [AUDIO]
    Pearson(confidence, error): -0.4523
    MAE (all):                  42.35 ms
    MAE (filtered, top 80%):    35.12 ms

── Efficiency ─────────────────────────────────────────
  [AUDIO]
    Mean runtime: 3.45s
    Per video-min: 2.12s

── Resource Usage ─────────────────────────────────────
  [AUDIO]
    Mean peak CPU:    45.2%
    Max  peak CPU:    78.5%
    Mean peak memory: 256.3 MB
    Max  peak memory: 412.8 MB

======================================================================

Notes

  • Error Metrics: All error metrics are in milliseconds for easy interpretation.
  • Confidence Correlation: Negative Pearson correlation indicates higher confidence correlates with lower error (desired behavior).
  • Filtering: Confidence filtering removes the bottom 20% of results by confidence score, demonstrating the value of confidence metrics.
  • Sensitivity Analysis: Grouped metrics reveal performance patterns across different video characteristics.
