The LexicalAnalysis module provides functions for tokenizing CAN message payloads and generating time-series signals from bit-level patterns.
## Functions
### tokenize_dictionary

Parameters:
- Timer instance for performance tracking
- Dictionary mapping arbitration IDs to `ArbID` objects (from PreProcessor)
- If True, re-tokenize even if tokenization has already been performed
- If True, include padding bits (with near-zero transition frequency) in tokens
- If True, merge adjacent tokens with similar transition frequencies
- Maximum TANG distance between adjacent bits for merging (normalized 0-1 scale)
Behavior:
- Skips static arbitration IDs (all bits constant)
- Calls `get_composition()` to identify tokens
- Optionally calls `merge_tokens()` to combine adjacent tokens
- Stores tokenization results in each `ArbID` object's `tokenization` attribute
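The driver flow above can be sketched as follows. This is a hypothetical, self-contained illustration, not the module's actual code: the `ArbID` shape and function signature are assumptions, and the token assignment is a trivial stand-in for the real `get_composition()` / `merge_tokens()` calls.

```python
from dataclasses import dataclass, field

@dataclass
class ArbID:
    static: bool = False                  # True when every payload bit is constant
    tokenization: list = field(default_factory=list)

def tokenize_dictionary(arb_ids, force=False):
    for arb_id in arb_ids.values():
        if arb_id.static:
            continue                      # static IDs have no boundaries to find
        if arb_id.tokenization and not force:
            continue                      # keep an existing tokenization unless forced
        arb_id.tokenization = [(0, 7)]    # stand-in for get_composition()/merge_tokens()
```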
### get_composition

Parameters:
- The arbitration ID object to tokenize (modified in-place)
- If True, include padding bits (transition frequency ≤ 0.000001) in clustering
- Maximum allowed inversion in TANG values within a token (0.0 = strictly monotonic)
Behavior:
- Iterates through the TANG (Transition Aggregation N-Gram) bit by bit
- Identifies padding bits (transition frequency ≤ 0.000001)
- Groups contiguous bits with monotonic transition frequencies
- Detects endianness (big-endian vs little-endian) from the first two bits
- Allows small inversions if `max_inversion_distance` > 0
- Stores tokens as `(start_bit_index, end_bit_index)` tuples in `arb_id.tokenization`
Endianness:
- Big-endian: TANG values increase across the token
- Little-endian: TANG values decrease across the token
- Determined from the relationship between the first two bits in each token
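A minimal sketch of the clustering described above, assuming a strictly monotonic grouping (i.e. `max_inversion_distance` = 0) and a padding threshold of 0.000001; the real `get_composition()` operates on an `ArbID` in-place and may differ in detail.

```python
PADDING_THRESHOLD = 0.000001

def compose_tokens(tang, include_padding=False):
    """Group contiguous bit positions with monotonic TANG into (start, end) tokens."""
    tokens, start, direction = [], None, 0
    for i, freq in enumerate(tang):
        if freq <= PADDING_THRESHOLD and not include_padding:
            if start is not None:
                tokens.append((start, i - 1))   # a padding bit closes the open token
            start, direction = None, 0
            continue
        if start is None:
            start = i                           # first bit of a new token
            continue
        step = freq - tang[i - 1]
        if direction == 0:
            # Endianness from the first two bits: increasing = big-endian
            direction = 1 if step > 0 else -1
        elif step != 0 and (step > 0) != (direction > 0):
            tokens.append((start, i - 1))       # monotonicity broken: start a new token
            start, direction = i, 0
    if start is not None:
        tokens.append((start, len(tang) - 1))
    return tokens
```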
### merge_tokens

Parameters:
- The arbitration ID object with tokens to merge (modified in-place)
- Maximum TANG difference between adjacent token boundaries for merging
Behavior:
- Only merges tokens that are adjacent (no gap between them)
- Compares `abs(tang[token1_end] - tang[token2_start])` to the threshold
- Updates the tokenization list by replacing merged tokens
- Skips static arbitration IDs and those with fewer than 2 tokens
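The boundary comparison above can be sketched as a pure function over the token list; this is an assumed illustration rather than the module's in-place implementation.

```python
def merge_adjacent(tokens, tang, max_distance=0.1):
    """Merge adjacent (start, end) tokens whose boundary TANG values are similar."""
    if len(tokens) < 2:
        return tokens                       # mirrors the "fewer than 2 tokens" skip
    merged = [tokens[0]]
    for start, end in tokens[1:]:
        prev_start, prev_end = merged[-1]
        adjacent = start == prev_end + 1    # no gap between the two tokens
        if adjacent and abs(tang[prev_end] - tang[start]) <= max_distance:
            merged[-1] = (prev_start, end)  # replace the pair with one merged token
        else:
            merged.append((start, end))
    return merged
```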
### generate_signals

Generates Signal objects from tokenized arbitration IDs.

Parameters:
- Timer instance for performance tracking
- Dictionary of tokenized `ArbID` objects
- Path to pickle file for caching the signal dictionary
- Normalization function (e.g., `sklearn.preprocessing.minmax_scale`)
- If True, regenerate signals even if the pickle file exists
Returns a nested dictionary: `{arb_id: {(arb_id, start, stop): Signal, ...}, ...}`

Structure:
- Outer key: arbitration ID (int)
- Inner key: `(arb_id, start_bit, stop_bit)` tuple
- Value: `Signal` object with normalized time series
Behavior:
- For each token in each non-static arbitration ID:
  - Extract bit columns from the boolean matrix
  - Convert binary rows to unsigned integers
  - Create a pandas Series with the original timestamps
  - Apply the normalization strategy
  - Calculate signal metadata (Shannon entropy, etc.)
- Returns a nested dictionary keyed by arbitration ID and signal boundaries
## Usage Example

### Complete Pipeline Example
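A self-contained sketch of the whole pipeline (payload matrix → TANG → token → signal) on synthetic data. This is a simplified stand-in for the module's API, not its actual functions: the clustering is reduced to "one run of non-padding bits" and the normalization is inlined.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
timestamps = np.arange(100) * 0.01                     # 100 messages, 10 ms apart
bits = rng.integers(0, 2, size=(100, 8)).astype(bool)  # 8-bit payload matrix
bits[:, 0] = False                                     # bit 0 is padding (never changes)

# TANG: per-bit transition frequency across consecutive messages
tang = np.abs(np.diff(bits.astype(int), axis=0)).mean(axis=0)

# Token = contiguous run of non-padding bits (simplified clustering)
nonpad = np.flatnonzero(tang > 1e-6)
token = (int(nonpad[0]), int(nonpad[-1]))

# Signal: binary rows of the token's bit columns -> unsigned integers
cols = bits[:, token[0]:token[1] + 1].astype(int)
weights = 2 ** np.arange(cols.shape[1])[::-1]          # big-endian bit weights
signal = pd.Series(cols @ weights, index=timestamps)

# Min-max normalize the time series to [0, 1]
normalized = (signal - signal.min()) / (signal.max() - signal.min())
```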
## Algorithm Details

### TANG-Based Tokenization
The tokenization algorithm uses the Transition Aggregation N-Gram (TANG) to identify signal boundaries:

1. TANG Values: Each bit position has a transition frequency (0.0 to 1.0)
   - 0.0: constant/padding bit
   - 1.0: bit that changes every message
   - 0.5: bit that changes approximately every other message
2. Clustering Logic:
   - Bits with monotonically increasing/decreasing TANG values form a token
   - Endianness is determined by the direction (increasing = big-endian)
   - Small inversions are allowed if `max_inversion_distance` > 0
3. Token Merging:
   - Adjacent tokens with similar boundary TANG values are merged
   - Reduces over-segmentation from noise in the TANG
Binary to Integer Conversion
For each token:- Extract bit columns from boolean matrix
- Concatenate bits row-by-row into binary strings
- Convert each binary string to unsigned integer
- Create time-indexed series preserving original timestamps
- Normalize using provided strategy (e.g., min-max scaling)
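The conversion steps above can be sketched directly; the string-concatenation approach and inlined min-max scaling are assumptions about the implementation.

```python
import numpy as np
import pandas as pd

bool_matrix = np.array([[1, 0, 1],
                        [0, 1, 1],
                        [1, 1, 0]], dtype=bool)   # rows = messages, cols = token bits
timestamps = [0.00, 0.01, 0.02]

# Concatenate each row's bits into a binary string, then parse as an unsigned int
values = [int("".join(str(int(b)) for b in row), 2) for row in bool_matrix]
series = pd.Series(values, index=timestamps)      # time-indexed signal

# Min-max normalize to [0, 1] (e.g. sklearn.preprocessing.minmax_scale)
normalized = (series - series.min()) / (series.max() - series.min())
```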
## Performance Considerations

- Caching: Signal generation results are cached in pickle files for fast reloading
- In-place Modification: `get_composition()` and `merge_tokens()` modify `ArbID` objects directly
- Iteration Tracking: The Timer records per-token processing time for performance analysis
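The caching behavior can be sketched as a generic load-or-generate helper; the function name and file layout here are assumptions, not the module's actual interface.

```python
import os
import pickle

def load_or_generate(path, generate, force=False):
    """Return cached signals from `path`, regenerating when forced or absent."""
    if not force and os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)      # fast path: reload the cached dictionary
    signals = generate()               # slow path: rebuild the signal dictionary
    with open(path, "wb") as f:
        pickle.dump(signals, f)        # cache for the next run
    return signals
```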