The PreProcessor class imports CAN data from CSV/TSV files and generates dictionaries of ArbID and J1979 objects for pipeline processing.
Class: PreProcessor
Constructor
Path to the CAN data file (CSV/TSV format with timestamp, arbitration ID, DLC, and data bytes)
Path for pickled arbitration ID dictionary output
Path for pickled J1979 dictionary output
- data: DataFrame - Imported CAN data
- import_time: float - Time taken for CSV import
- dictionary_time: float - Time taken for dictionary generation
- total_time: float - Total processing time
Methods
import_csv
Timer instance for performance tracking
Path to the CSV file to import
- Skips first 7 header rows
- Converts hexadecimal strings to integers for ID and data bytes
- Converts timestamp format (removes trailing characters)
- Sets time as the DataFrame index
- Handles malformed timestamps gracefully
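The import steps above can be sketched with only the standard library. This is a minimal illustration, not the class's own code: the helper name parse_can_rows, the tab delimiter, the trailing "s" on timestamps, and the choice to skip rows with malformed timestamps are all assumptions. A pandas version would typically pass skiprows and converters to read_csv instead.

```python
import csv
import io

HEADER_ROWS = 7  # the first 7 rows are header material and are skipped


def parse_can_rows(text, delimiter="\t"):
    """Hypothetical helper: parse CAN rows into a dict keyed by timestamp."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = {}
    for line_no, fields in enumerate(reader):
        if line_no < HEADER_ROWS or not fields:
            continue
        try:
            # Assumed timestamp format: a float followed by a unit suffix.
            time = float(fields[0].rstrip("s"))
        except ValueError:
            continue  # assumption: malformed timestamps are skipped
        arb_id = int(fields[1], 16)  # hex string -> integer
        dlc = int(fields[2])
        data = [int(b, 16) for b in fields[3:11] if b != ""]
        rows[time] = {"id": arb_id, "dlc": dlc, "data": data}
    return rows


sample = "\n".join(
    ["header"] * HEADER_ROWS
    + ["0.001s\t7E8\t8\t03\t41\t0C\t1A\tF8\t00\t00\t00",
       "bad\t7E8\t8\t03\t41\t0D\t20\t00\t00\t00\t00",
       "0.002s\t1A0\t2\tFF\t10"]
)
parsed = parse_can_rows(sample)
```

Keying the result by timestamp mirrors the use of time as the index. Unlike a DataFrame index, a plain dict cannot hold repeated timestamps, so this sketch keeps only the last row for any repeated key.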
generate_j1979_dictionary
DataFrame containing J1979 response messages (from arbitration ID 0x7E8/2024)
Dictionary mapping PID values (from byte 2) to J1979 objects
generate_arb_id_dictionary
Timer instance for performance tracking
Normalization function to apply to TANG values (e.g., sklearn.preprocessing.minmax_scale)
Multiplier to convert time units (e.g., 1000 for seconds to milliseconds)
Z-score for confidence interval in frequency analysis (e.g., 1.645 for 90% confidence)
Threshold for determining if signals are synchronous (in time units after conversion)
If True, regenerate dictionaries even if pickled files exist
Tuple containing:
- id_dictionary: Dictionary mapping arbitration IDs to ArbID objects
- j1979_dictionary: Dictionary mapping PIDs to J1979 objects
- Checks for existing pickled files and loads them if force=False
- Imports CSV data if needed
- Filters out specific IDs:
  - 0x7DF (2015): J1979 requests - ignored
  - 0x7E8 (2024): J1979 responses - processed separately
- Validates each arbitration ID:
  - Skips IDs with variable DLC
  - Removes padding bytes beyond DLC length
  - Corrects duplicate timestamps
- Generates binary matrices and TANG for each ID
- Analyzes transmission frequencies
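To show how the time multiplier, z-score, and synchronous threshold parameters could fit together, here is a hedged sketch of a frequency analysis for one arbitration ID. The helper name analyze_frequency, its argument names, and the rule "synchronous when the confidence interval is narrower than the threshold" are assumptions for illustration. The source only states that a confidence interval and a threshold are used.

```python
import math
import statistics


def analyze_frequency(timestamps, time_conversion=1000, z_score=1.645,
                      synchronous_threshold=0.5):
    """Hypothetical helper: confidence interval for the mean send interval."""
    # Gaps between consecutive messages, converted to the target time unit.
    intervals = [(b - a) * time_conversion
                 for a, b in zip(timestamps, timestamps[1:])]
    mean = statistics.mean(intervals)
    std = statistics.stdev(intervals)      # sample standard deviation
    sem = std / math.sqrt(len(intervals))  # standard error of the mean
    ci = (mean - z_score * sem, mean + z_score * sem)
    # Assumption: an ID counts as synchronous when the interval is narrow.
    synchronous = (ci[1] - ci[0]) <= synchronous_threshold
    return mean, ci, synchronous


# Messages sent roughly every 10 ms with slight jitter (timestamps in seconds).
times = [0.000, 0.010, 0.0201, 0.030, 0.0399, 0.050]
mean_ms, ci_ms, is_sync = analyze_frequency(times)
```

With a multiplier of 1000, the mean, interval bounds, and threshold are all in milliseconds, which matches the note that the threshold is given in time units after conversion.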
Usage Example
Implementation Notes
Data Cleaning
- Variable DLC: Arbitration IDs with inconsistent Data Length Codes are excluded from analysis
- Padding Removal: Data bytes beyond the DLC are automatically removed
- Duplicate Timestamps: Repeated index values are resolved by dropping the duplicate rows
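The three cleaning rules above can be sketched for the messages of a single arbitration ID. The helper name clean_messages, the tuple layout of its input, and the choice to keep the first row for a repeated timestamp are assumptions. The source does not say which duplicate is kept.

```python
def clean_messages(messages):
    """Hypothetical helper. `messages` is a list of (timestamp, dlc, data_bytes)."""
    dlcs = {dlc for _, dlc, _ in messages}
    if len(dlcs) != 1:
        return None  # variable (or missing) DLC: exclude this arbitration ID
    dlc = dlcs.pop()
    cleaned = {}
    for timestamp, _, data in messages:
        # Assumption: keep the first row seen for each timestamp.
        if timestamp not in cleaned:
            cleaned[timestamp] = data[:dlc]  # drop padding beyond the DLC
    return cleaned


fixed = clean_messages([
    (0.1, 2, [1, 2, 0, 0]),
    (0.1, 2, [9, 9, 0, 0]),  # duplicate timestamp, dropped
    (0.2, 2, [3, 4, 0, 0]),
])
variable = clean_messages([(0.1, 2, [1, 2]), (0.2, 3, [1, 2, 3])])
```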
J1979 Detection
- Automatically detects J1979 diagnostic responses on arbitration ID 0x7E8 (2024)
- Groups responses by PID (byte 2)
- Request messages on 0x7DF (2015) are ignored
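The grouping described above can be sketched as follows. The helper name group_responses_by_pid and the (arb_id, data_bytes) input layout are assumptions for illustration. The real method works on a DataFrame and builds J1979 objects rather than plain lists.

```python
from collections import defaultdict

J1979_REQUEST_ID = 0x7DF   # 2015, ignored
J1979_RESPONSE_ID = 0x7E8  # 2024, grouped by PID


def group_responses_by_pid(frames):
    """Hypothetical helper. `frames` is an iterable of (arb_id, data_bytes)."""
    by_pid = defaultdict(list)
    for arb_id, data in frames:
        if arb_id != J1979_RESPONSE_ID:
            continue  # requests and all non-J1979 traffic are skipped here
        pid = data[2]  # byte 2 carries the PID in a response frame
        by_pid[pid].append(data)
    return dict(by_pid)


frames = [
    (0x7DF, [0x02, 0x01, 0x0C, 0, 0, 0, 0, 0]),        # request for PID 0x0C
    (0x7E8, [0x04, 0x41, 0x0C, 0x1A, 0xF8, 0, 0, 0]),  # response, PID 0x0C
    (0x7E8, [0x03, 0x41, 0x0D, 0x20, 0, 0, 0, 0]),     # response, PID 0x0D
    (0x7E8, [0x04, 0x41, 0x0C, 0x1B, 0x10, 0, 0, 0]),  # response, PID 0x0C
]
responses = group_responses_by_pid(frames)
```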
Performance
- Uses pickle files for caching to speed up repeated runs
- Set force=True to regenerate from raw data
- Timer tracks import time, dictionary creation time, and per-ID processing time
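The caching behavior can be illustrated with a small load-or-build pattern around pickle. The helper name load_or_build, the build callback, and the file name used here are assumptions. Only the force flag and the use of pickle files come from the description above.

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build, force=False):
    """Hypothetical helper: return the cached object, or build and cache it."""
    if not force and os.path.isfile(cache_path):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    result = build()
    with open(cache_path, "wb") as handle:
        pickle.dump(result, handle)
    return result


calls = []


def build_dictionary():
    calls.append(1)  # record each expensive rebuild
    return {0x1A0: "placeholder"}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "id_dictionary.p")
    first = load_or_build(path, build_dictionary)              # builds
    second = load_or_build(path, build_dictionary)             # loads cache
    third = load_or_build(path, build_dictionary, force=True)  # rebuilds
```

Pickle files should only be loaded from trusted locations, since unpickling can execute arbitrary code.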