Overview
The preprocessing phase is the first stage of the CAN reverse engineering pipeline. It imports raw CAN log data from CSV files, converts it to Pandas DataFrames, transforms hex values to binary matrices, and calculates the TANG (Transition Analysis Numerical Gradient) for each Arbitration ID.
Key Responsibilities
PreProcessor Class
The PreProcessor class (defined in PreProcessor.py) orchestrates the entire preprocessing phase.
Initialization
Path to the input CAN log file in TSV format
Path where the Arbitration ID dictionary will be pickled
Path where J1979 diagnostic signals will be pickled
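The two output paths imply a load-or-regenerate caching pattern. The following is a minimal sketch of that pattern using the standard pickle module. The helper name `load_or_build` and its `force` argument are hypothetical and are not taken from PreProcessor.py.

```python
import os
import pickle


def load_or_build(path, build, force=False):
    """Return the cached object at `path`, or build and cache it.

    `build` is any zero-argument callable that produces the object.
    `force=True` ignores an existing pickle and rebuilds from scratch.
    """
    if not force and os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    result = build()
    with open(path, "wb") as handle:
        pickle.dump(result, handle)
    return result
```

With this shape, the expensive import and analysis run only when the pickle is missing or a rebuild is requested explicitly.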
CSV Import
The import_csv() method reads CAN log files with automatic type conversion.
File Format
Expected CSV format (whitespace-delimited, skips first 7 header rows):
Conversion Functions
The import process applies converters to each column (see PreProcessor.py:36-37):
The CSV reader automatically skips 7 header rows and uses whitespace as the delimiter.
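As an illustration, the import described above could look roughly like the sketch below. The column names, the sample rows, and the `hex_to_int` converter are assumptions made for this example; the actual layout and converters in PreProcessor.py may differ.

```python
import io

import pandas as pd

# Hypothetical log: 7 header rows, then whitespace-delimited rows of
# <time> <arbitration id (hex)> <dlc> <b0> ... <b7>.
SAMPLE_LOG = "\n".join(
    [f"# header line {i}" for i in range(1, 8)]
    + [
        "0.000 0C9 8 00 00 00 00 00 00 00 01",
        "0.010 0C9 8 00 00 00 00 00 00 00 03",
        "0.020 7E8 8 04 41 0C 1A F8 00 00 00",
    ]
)

BYTE_COLS = [f"b{i}" for i in range(8)]
COLUMNS = ["time", "id", "dlc"] + BYTE_COLS


def hex_to_int(text):
    """Convert a hex string such as '0C9' to an integer."""
    return int(text, 16)


# One converter per column, applied while the file is parsed.
converters = {"time": float, "id": hex_to_int, "dlc": int}
converters.update({col: hex_to_int for col in BYTE_COLS})

frame = pd.read_csv(
    io.StringIO(SAMPLE_LOG),  # a file path works the same way
    sep=r"\s+",               # whitespace-delimited
    skiprows=7,               # skip the 7 header rows
    header=None,
    names=COLUMNS,
    converters=converters,
    index_col="time",
)
```

The resulting DataFrame is indexed by timestamp, with the Arbitration ID and payload bytes already stored as integers.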
Binary Matrix Generation
Each Arbitration ID’s hex payload is converted to a binary matrix in ArbID.generate_binary_matrix_and_tang() (ArbID.py:29-59).
Process
- Rows = individual CAN messages (time samples)
- Columns = bit positions (DLC × 8 bits)
- Values = 0 or 1 (binary representation)
A message with DLC=8 produces 64 columns (8 bytes × 8 bits). If DLC < 8, unused bytes are automatically dropped.
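The row and column layout above can be sketched with NumPy as follows. This is an illustrative version and not the code in ArbID.py; the function name `payloads_to_bits` is hypothetical, and it assumes each payload is already a sequence of integer byte values.

```python
import numpy as np


def payloads_to_bits(payloads, dlc):
    """Convert payload bytes to a (messages x dlc*8) matrix of 0/1 values.

    Only the first `dlc` bytes of each payload are kept, so a DLC below 8
    yields fewer columns. Bits are ordered most significant first.
    """
    data = np.asarray(payloads, dtype=np.uint8)[:, :dlc]
    return np.unpackbits(data, axis=1)


bits = payloads_to_bits(
    [
        [0x01, 0xFF, 0, 0, 0, 0, 0, 0],
        [0x80, 0x00, 0, 0, 0, 0, 0, 0],
    ],
    dlc=2,
)
```

Here `bits` has 2 rows (messages) and 16 columns (2 bytes × 8 bits).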
TANG Calculation
The Transition Analysis Numerical Gradient (TANG) measures how frequently each bit position changes between consecutive messages.
Implementation
From ArbID.py:52-57:
Interpretation
- High TANG values (close to 1.0): Bit position changes frequently → likely part of a dynamic signal
- Low TANG values (close to 0.0): Bit position rarely changes → likely padding or static data
- Zero TANG value: Bit never changes → padding bit
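A minimal sketch of the calculation, assuming the binary matrix is a NumPy array of 0/1 values: XOR each row with the next, count the flips per column, then scale so the most active bit is 1.0. Dividing by the maximum count is an assumption chosen for this example; the normalization used in ArbID.py may differ.

```python
import numpy as np


def tang(bits):
    """Return a per-bit transition score in the range [0.0, 1.0]."""
    bits = np.asarray(bits, dtype=np.uint8)
    if bits.shape[0] < 2:
        # A single message has no transitions to count.
        return np.zeros(bits.shape[1], dtype=float)
    # XOR of consecutive rows is 1 wherever a bit flipped.
    flips = np.bitwise_xor(bits[:-1], bits[1:]).sum(axis=0).astype(float)
    peak = flips.max()
    # An all-zero result means no bit ever changed (a static ID).
    return flips / peak if peak > 0 else flips


scores = tang(
    [
        [0, 0, 0],
        [0, 1, 0],
        [0, 0, 1],
        [0, 1, 0],
    ]
)
```

In this example the first bit never changes (score 0.0), the second flips on every message (score 1.0), and the third flips twice out of three transitions.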
J1979 Signal Detection
J1979 (OBD-II) diagnostic signals use specific Arbitration IDs:
- 0x7DF (2015): Diagnostic requests (ignored)
- 0x7E8 (2024): Diagnostic responses (extracted)
From PreProcessor.py:96-108:
The J1979 class (in J1979.py) processes common diagnostic PIDs:
- 0x0C: Engine RPM
- 0x0D: Vehicle Speed (km/h)
- 0x11: Throttle Position (%)
- 0x61: Demand Torque (%)
- 0x62: Actual Torque (%)
- 0x63: Reference Torque (Nm)
- 0x8E: Engine Friction Torque (%)
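To make the extraction concrete, here is a hedged sketch that decodes three of the PIDs listed above from a 0x7E8 response. It assumes the common single-frame layout (length byte, 0x41 for a mode 01 response, the PID, then data bytes) and uses the standard SAE J1979 scaling formulas. The function name `decode_j1979` is hypothetical, and J1979.py may structure this differently.

```python
RESPONSE_ID = 0x7E8   # diagnostic responses
MODE_01_REPLY = 0x41  # reply to a mode 01 (current data) request


def decode_j1979(arb_id, payload):
    """Return (title, value) for a supported PID, or None otherwise."""
    if arb_id != RESPONSE_ID or len(payload) < 4:
        return None
    if payload[1] != MODE_01_REPLY:
        return None
    pid, a = payload[2], payload[3]
    b = payload[4] if len(payload) > 4 else 0
    if pid == 0x0C:
        return ("Engine RPM", (256 * a + b) / 4)
    if pid == 0x0D:
        return ("Vehicle Speed (km/h)", a)
    if pid == 0x11:
        return ("Throttle Position (%)", a * 100 / 255)
    return None


rpm = decode_j1979(0x7E8, [0x04, 0x41, 0x0C, 0x1A, 0xF8, 0x00, 0x00, 0x00])
```

Requests on 0x7DF and frames from any other Arbitration ID return None, which mirrors the "ignored" behavior described above.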
Transmission Frequency Analysis
The analyze_transmission_frequency() method calculates statistical properties of message timing (ArbID.py:61-89).
Metrics Calculated
Synchronous Detection
An Arbitration ID is marked as synchronous if:
Configuration Parameters
All parameters are defined in Main.py:64-68:
Multiplier to convert time units (default: seconds to milliseconds)
Z-score for the confidence interval calculation (for example, 1.645 corresponds to 90% confidence)
Common values:
- 1.28 → 80% confidence
- 1.645 → 90% confidence
- 1.96 → 95% confidence
- 2.33 → 98% confidence
- 2.58 → 99% confidence
Maximum mean-to-CI ratio for a signal to be considered synchronous
Lower values = stricter synchronicity requirement
If true, regenerates all data from scratch instead of loading cached pickles
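The timing parameters above can be tied together in a short sketch. It computes the intervals between consecutive messages, their mean and standard deviation, and a normal-approximation confidence interval. The argument names, the default values, and the exact ratio (full interval width divided by the mean) are assumptions for illustration only; the precise rule in ArbID.py may differ.

```python
import math
import statistics


def analyze_intervals(timestamps, time_conversion=1000.0,
                      z_score=1.645, sync_threshold=0.1):
    """Summarize message timing for one Arbitration ID.

    Returns None when there are too few messages to estimate a spread.
    """
    intervals = [
        (later - earlier) * time_conversion
        for earlier, later in zip(timestamps, timestamps[1:])
    ]
    if len(intervals) < 2:
        return None
    mean = statistics.mean(intervals)
    std = statistics.stdev(intervals)
    # Margin of error for the mean interval at the chosen confidence level.
    margin = z_score * std / math.sqrt(len(intervals))
    # Assumed ratio: a narrow interval relative to the mean means periodic.
    ratio = (2 * margin) / mean if mean > 0 else math.inf
    return {
        "freq_mean": mean,
        "freq_std": std,
        "freq_ci": (mean - margin, mean + margin),
        "synchronous": ratio <= sync_threshold,
    }


steady = analyze_intervals([0.00, 0.01, 0.02, 0.03, 0.04])
jittery = analyze_intervals([0.00, 0.01, 0.05, 0.06, 0.20])
```

Under these assumptions, the evenly spaced timestamps are classified as synchronous, while the irregular ones are not.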
Usage Example
From Main.py:83-90:
Output Data Structures
Arbitration ID Dictionary
Keyed by Arbitration ID (int), values are ArbID objects with:
- id: Arbitration ID (int)
- dlc: Data Length Code (int)
- original_data: Pandas DataFrame (time-indexed hex values)
- boolean_matrix: NumPy ndarray (binary representation)
- tang: NumPy ndarray (normalized transition frequencies)
- static: Boolean (True if no bit ever changes)
- freq_mean, freq_std, freq_ci: Transmission frequency statistics
- synchronous: Boolean (True if messages are periodic)
- tokenization: List of token tuples (populated in lexical analysis)
- padding: List of padding bit indices (populated in lexical analysis)
J1979 Dictionary
Keyed by PID (int), values are J1979 objects with:
- pid: Parameter ID (int)
- title: Human-readable name (str)
- data: Pandas Series (time-indexed converted values)
Edge Cases Handled
Truncated Logs
Malformed final lines with incomplete time values are skipped (PreProcessor.py:29-33).
See Also
- Lexical Analysis - Next stage: tokenization
- ArbID Class - Complete ArbID object reference
- J1979 Class - OBD-II diagnostic signal handling