The `SemanticAnalysis` module provides functions for correlating signals, clustering related signals, and propagating labels across the signal space.

## Functions
### subset_selection

**Parameters:**
- Timer instance for performance tracking
- Signal dictionary from `generate_signals()` (nested dict by arbitration ID)
- Path to pickle file for caching the subset DataFrame
- If True, regenerate the subset even if the pickle file exists
- Fraction of non-static signals to include in the subset (0.0 to 1.0)

**Returns:** DataFrame with signals as columns (keyed by `(arb_id, start, stop)` tuples) and timestamps as rows, re-indexed to a common time base using nearest-neighbor interpolation.

**Process:**
- Identifies all non-static signals
- Ranks signals by Shannon Index (descending)
- Selects the top `subset_size * 100`% of signals
- Re-indexes all signals to the longest time series using nearest-neighbor interpolation
- Returns a DataFrame suitable for correlation analysis
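A minimal sketch of the selection-and-reindexing idea, assuming signals are stored as pandas Series keyed by `(arb_id, start, stop)` tuples (function and helper names here are illustrative; the real function also handles the Timer and pickle caching):

```python
import numpy as np
import pandas as pd

def shannon_index(values, bins=64):
    """Histogram-based Shannon entropy of a signal (higher = more informative)."""
    counts, _ = np.histogram(values, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

def select_subset(signals, subset_size=0.25):
    """signals maps (arb_id, start, stop) -> pandas Series indexed by timestamp."""
    # Keep only non-static signals, ranked by Shannon Index (descending)
    ranked = sorted(
        (k for k, s in signals.items() if s.nunique() > 1),
        key=lambda k: shannon_index(signals[k].to_numpy()),
        reverse=True,
    )
    chosen = ranked[: max(1, int(len(ranked) * subset_size))]
    # Re-index everything onto the longest time series, nearest-neighbor fill
    base = max((signals[k].index for k in chosen), key=len)
    return pd.DataFrame({k: signals[k].reindex(base, method="nearest") for k in chosen})
```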
### subset_correlation

**Parameters:**
- DataFrame from `subset_selection()` with signals as columns
- Path to CSV file for caching the correlation matrix
- If True, recompute the correlation even if the CSV exists

**Returns:** Correlation matrix with signal tuples as both row and column indices (a symmetric matrix with 1.0 on the diagonal).

**Process:** Uses `pandas.DataFrame.corr()` to compute Pearson correlation coefficients.
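Since the computation reduces to pandas' built-in Pearson correlation, the shape of the result is easy to see on a toy DataFrame (column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],   # b = 2a: perfectly correlated with a
    "c": [4.0, 3.0, 2.0, 1.0],   # decreases as a increases
})
corr = df.corr()  # symmetric Pearson matrix with 1.0 on the diagonal
print(corr.loc["a", "b"])  # 1.0
print(corr.loc["a", "c"])  # -1.0
```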
### greedy_signal_clustering

**Parameters:**
- Correlation matrix from `subset_correlation()`
- Minimum absolute correlation coefficient for clustering (0.0 to 1.0)
- If True, allow signals to belong to multiple clusters (fuzzy clustering)

**Returns:** Dictionary mapping cluster IDs (int) to lists of signal keys `[(arb_id, start, stop), ...]`.

**Process:**
- Iterates through the correlation matrix (upper triangle)
- For each correlation ≥ threshold:
  - If both signals are unlabeled: create a new cluster
  - If one is labeled: add the unlabeled signal to the existing cluster(s)
  - If both are labeled:
    - Fuzzy mode: create a bridge cluster if they share no common clusters
    - Non-fuzzy mode: skip (already clustered)
- Removes duplicate clusters
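The non-fuzzy path above can be sketched as follows (a simplified illustration, not the module's implementation; the fuzzy branch and duplicate removal are omitted):

```python
import pandas as pd

def greedy_cluster(corr, threshold=0.9):
    """Greedy clustering over the upper triangle of a correlation matrix.
    First encounter wins; no backtracking (non-fuzzy mode)."""
    clusters = {}   # cluster id -> list of signal keys
    label = {}      # signal key -> cluster id
    next_id = 0
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if abs(corr.loc[a, b]) < threshold:
                continue
            if a not in label and b not in label:
                clusters[next_id] = [a, b]       # both unlabeled: new cluster
                label[a] = label[b] = next_id
                next_id += 1
            elif a in label and b not in label:
                clusters[label[a]].append(b)     # extend a's cluster
                label[b] = label[a]
            elif b in label and a not in label:
                clusters[label[b]].append(a)     # extend b's cluster
                label[a] = label[b]
            # both labeled: skip (non-fuzzy mode)
    return clusters
```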
### label_propagation

**Parameters:**
- Timer instance for performance tracking
- Path to pickle file for the updated cluster dictionary
- Path to pickle file for the complete signals DataFrame
- Path to CSV file for the complete correlation matrix
- Complete signal dictionary from `generate_signals()`
- Initial cluster dictionary from `greedy_signal_clustering()`
- Minimum correlation for label propagation
- If True, regenerate all data even if cached files exist

**Returns:** Tuple containing:
- `df`: DataFrame with all non-static signals (columns) and a common time index (rows)
- `correlation_matrix`: full correlation matrix for all signals
- `cluster_dict`: updated cluster dictionary with propagated labels

**Process:**
- Combines all non-static signals into one DataFrame
- Re-indexes all signals to a common time base (the largest index)
- Computes the full correlation matrix
- For each signal pair with correlation ≥ threshold:
  - If one is clustered and one isn't: add the unclustered signal to the cluster
  - If both are clustered: skip (already labeled)
  - If neither is clustered: skip (no label to propagate)
- Drops rows/columns with NaN values
- Returns the updated DataFrame, correlation matrix, and expanded clusters
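The propagation step itself, which extends existing clusters but never creates new ones, might be sketched like this (names hypothetical, caching and the DataFrame assembly omitted):

```python
import pandas as pd

def propagate_labels(corr, clusters, threshold=0.9):
    """Extend existing clusters to unclustered signals; never create new clusters."""
    clusters = {cid: list(members) for cid, members in clusters.items()}
    labeled = {sig: cid for cid, members in clusters.items() for sig in members}
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if abs(corr.loc[a, b]) < threshold:
                continue
            if a in labeled and b not in labeled:
                clusters[labeled[a]].append(b)   # propagate a's label to b
                labeled[b] = labeled[a]
            elif b in labeled and a not in labeled:
                clusters[labeled[b]].append(a)   # propagate b's label to a
                labeled[a] = labeled[b]
            # both clustered, or neither clustered: skip
    return clusters
```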
### j1979_signal_labeling

**Parameters:**
- Timer instance for performance tracking
- Path to pickle file for the J1979 correlation matrix
- DataFrame of signals from `label_propagation()`
- J1979 dictionary from the PreProcessor
- Signal dictionary (modified in place with J1979 labels)
- Minimum absolute correlation for labeling
- If True, regenerate the correlation even if a cached file exists

**Returns:** Tuple containing:
- `signal_dict`: updated signal dictionary with `j1979_title` and `j1979_pcc` attributes
- `correlation_matrix`: correlation matrix between signals and J1979 PIDs (signals as rows, PIDs as columns)

**Process:**
- Aligns signal and J1979 time ranges (uses the overlapping interval)
- Re-indexes the J1979 data to match signal timestamps
- Concatenates signals and J1979 data into one DataFrame
- Computes the correlation matrix
- For each signal:
  - Finds the J1979 PID with the highest absolute correlation
  - If correlation ≥ threshold: adds `j1979_title` and `j1979_pcc` to the Signal object
- Returns the updated signal dictionary and correlation matrix
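The matching logic can be illustrated with a small sketch, assuming both inputs are DataFrames on comparable time indices (function name and return shape are hypothetical; the real function writes `j1979_title`/`j1979_pcc` onto Signal objects in place):

```python
import numpy as np
import pandas as pd

def best_pid_labels(signals, j1979, threshold=0.85):
    """For each signal column, find the J1979 PID with the highest |correlation|."""
    # Align J1979 samples to the signal timestamps (nearest-neighbor)
    aligned = j1979.reindex(signals.index, method="nearest")
    combined = pd.concat([signals, aligned], axis=1)
    corr = combined.corr().loc[signals.columns, aligned.columns]
    labels = {}
    for sig in signals.columns:
        pid = corr.loc[sig].abs().idxmax()      # strongest PID match
        pcc = corr.loc[sig, pid]
        if abs(pcc) >= threshold:
            labels[sig] = (pid, pcc)            # would become j1979_title / j1979_pcc
    return labels
```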
## Usage Example

### Complete Analysis Pipeline
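A self-contained end-to-end sketch on synthetic data, showing the correlate-then-greedy-cluster flow (signal names and the 0.9 threshold are illustrative; the real pipeline calls the module functions above with timers and caching):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
t = pd.Index(np.arange(0.0, 20.0, 0.1))

# Synthetic signal subset: two linearly related signals plus unrelated noise.
# Column names mimic the (arb_id, start, stop) convention as strings.
speed = pd.Series(np.sin(t / 3.0), index=t)
df = pd.DataFrame({
    "0x244:0-16": speed,
    "0x3E9:0-16": speed * 0.5 + 1.0,                   # linear function of speed
    "0x191:8-16": pd.Series(rng.random(len(t)), index=t),
})

corr = df.corr()                      # correlation step

clusters, label, next_id = {}, {}, 0  # greedy clustering step
cols = list(corr.columns)
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if abs(corr.loc[a, b]) < 0.9:
            continue
        if a not in label and b not in label:
            clusters[next_id] = [a, b]
            label[a] = label[b] = next_id
            next_id += 1
        elif a in label and b not in label:
            clusters[label[a]].append(b); label[b] = label[a]
        elif b in label and a not in label:
            clusters[label[b]].append(a); label[a] = label[b]

print(clusters)  # the two speed-derived signals land in one cluster
```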
## Algorithm Details

### Shannon Index Selection

Shannon entropy measures a signal's information content:

- High entropy: signal values are uniformly distributed (maximum information)
- Low entropy: signal values are concentrated (less informative)
- Subset selection prioritizes high-entropy signals for computational efficiency
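The high- vs. low-entropy distinction is easy to demonstrate with a histogram-based estimate (a small illustration, not the module's exact entropy routine):

```python
import numpy as np

def shannon_entropy(values, bins=16):
    """Histogram-based Shannon entropy in bits."""
    counts, _ = np.histogram(values, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
uniform = rng.uniform(0.0, 1.0, 10_000)        # spread evenly: high entropy
concentrated = rng.normal(0.5, 0.05, 10_000)   # tightly clustered: lower entropy

print(shannon_entropy(uniform), shannon_entropy(concentrated))
```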
### Greedy Clustering

The algorithm is "greedy" because it:

- Processes correlations in matrix order (not an optimal ordering)
- Makes immediate decisions without backtracking
- Creates clusters on first encounter

Properties:

- Fast: O(n²) for n signals
- Simple: no hyperparameters except the threshold
- Non-deterministic: cluster IDs depend on iteration order
### Fuzzy vs. Non-Fuzzy

**Fuzzy labeling** (`fuzzy_labeling=True`):

- Signals can belong to multiple clusters
- Captures overlapping semantics (e.g., gear position correlates with both speed and RPM)
- Creates bridge clusters between related groups

**Non-fuzzy labeling** (`fuzzy_labeling=False`):

- Each signal belongs to at most one cluster
- Simpler interpretation
- May miss overlapping relationships
### Label Propagation

Inspired by semi-supervised learning:

- Labeled set: the high-entropy subset with initial clusters
- Unlabeled set: the remaining signals
- Propagation: assign labels based on correlation with labeled signals
- No new clusters: unlike `greedy_signal_clustering`, this step only extends existing clusters
## Performance Considerations

- **Subset size**: smaller subsets (0.1–0.3) are faster but may miss rare signals
- **Correlation threshold**: higher thresholds (0.85–0.95) create tighter, more meaningful clusters
- **Caching**: all functions support pickle/CSV caching for fast re-runs
- **Memory**: the full correlation matrix is O(n²) for n signals; subset selection reduces this