What is CAN Payload Reverse Engineering?
CAN payload reverse engineering is the process of extracting semantic meaning from proprietary Controller Area Network messages without access to the original protocol documentation. This is necessary because automotive OEMs treat their CAN signal definitions as trade secrets.
This research was developed by Dr. Brent Stone at the Air Force Institute of Technology (AFIT) as part of a Ph.D. dissertation titled “Enabling Auditing and Intrusion Detection for Proprietary Controller Area Networks.”
The Challenge
Modern vehicles transmit hundreds of signals over CAN buses, but only a small subset follows public standards like J1979.
Problems to Solve
- Unknown Signal Boundaries: Which bits belong to which signal?
- Mixed Encodings: Signals use different byte orders (endianness)
- Variable Bit Lengths: Signals range from 1 to 64 bits
- Overlapping Signals: Multiple signals may share payload bytes
- No Ground Truth: Limited labeled data for validation
TANG: Transition Analysis Numerical Gradient
Concept
TANG is a novel metric that measures the frequency of bit transitions in a time series of CAN messages. It’s based on the hypothesis that bits belonging to the same numerical signal will exhibit similar transition patterns.
Algorithm
Convert to Binary Matrix
Transform hex payload to binary matrix (rows = messages, columns = bit positions).
Calculate Transitions
Apply XOR operation between consecutive rows to detect bit changes.
XOR returns 1 when bits differ between consecutive messages, effectively counting transitions at each bit position.
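The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the pipeline's actual code: the function names `payloads_to_bits` and `tang` are hypothetical, and dividing the flip count by the number of consecutive message pairs is an assumed normalization, chosen so that every score falls between 0 and 1.

```python
import numpy as np

def payloads_to_bits(payloads_hex):
    """Hypothetical helper: equal-length hex payloads -> 0/1 matrix.

    Rows are messages, columns are bit positions (most significant bit first).
    """
    raw = np.array([list(bytes.fromhex(p)) for p in payloads_hex], dtype=np.uint8)
    return np.unpackbits(raw, axis=1)

def tang(bits):
    """Per-bit transition score for a message time series.

    XOR of consecutive rows marks every bit flip; summing down each column
    counts flips per bit position. Dividing by the number of consecutive
    pairs is an assumption made for this sketch.
    """
    flips = np.bitwise_xor(bits[1:], bits[:-1])
    return flips.sum(axis=0) / max(len(bits) - 1, 1)

# A 16-bit counter that increments by one in every message.
payloads = ["0000", "0001", "0002", "0003", "0004"]
bits = payloads_to_bits(payloads)
scores = tang(bits)
print(scores)
```

In this toy capture the least significant bit flips in every message, the next bit flips half as often, and the unused upper bits never change, which is the rising pattern the tokenization step looks for.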
Interpretation
| TANG Value | Interpretation |
|---|---|
| 0.0 | Padding or static bit |
| 0.1 - 0.3 | Slowly changing signal (e.g., temperature) |
| 0.4 - 0.7 | Moderately dynamic signal (e.g., speed) |
| 0.8 - 1.0 | Highly dynamic signal (e.g., RPM, steering angle) |
Example: Tokenization Using TANG
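A simplified sketch of how TANG values might be turned into signal boundaries follows. The function name `tokenize_by_tang` and the exact splitting rule are assumptions made for illustration: it treats zero-TANG bits as padding and assumes a big-endian numerical signal, whose TANG rises from the most significant bit to the least significant bit, so a token ends wherever TANG drops by more than the allowed inversion distance.

```python
def tokenize_by_tang(tang, max_inversion_distance=0.2):
    """Group adjacent bit positions into inclusive (start, end) tokens.

    Simplified rule (an assumption, not the pipeline's exact logic):
    - a bit with TANG == 0 is padding and closes any open token;
    - otherwise the token keeps growing until TANG falls by more than
      max_inversion_distance relative to the previous bit.
    """
    tokens = []
    start = None
    for i, value in enumerate(tang):
        if value == 0:
            if start is not None:
                tokens.append((start, i - 1))
                start = None
            continue
        if start is None:
            start = i
        elif tang[i - 1] - value > max_inversion_distance:
            # Large drop: the previous bit was the end of a signal.
            tokens.append((start, i - 1))
            start = i
    if start is not None:
        tokens.append((start, len(tang) - 1))
    return tokens

# Two padding bits, a 4-bit signal, then a noisier 4-bit signal.
example = [0.0, 0.0, 0.1, 0.25, 0.5, 1.0, 0.05, 0.2, 0.15, 0.6]
tokens = tokenize_by_tang(example)
strict_tokens = tokenize_by_tang(example, max_inversion_distance=0.01)
print(tokens)
print(strict_tokens)
```

With the default tolerance, the small dip from 0.2 to 0.15 stays inside one token; with a much stricter tolerance the same dip splits the second signal in two.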
The `max_inversion_distance` parameter (default: 0.2) allows for small deviations in the TANG gradient, accommodating noise and rounding in the data.
Shannon Index: Signal Entropy
Concept
The Shannon Index (also known as Shannon entropy) measures the diversity of values in a signal’s time series. It helps distinguish dynamic signals from static or slowly-changing ones.
Formula
$$H = -\sum_{i=1}^{n} p_i \log p_i$$
Where:
- $H$ = Shannon Index
- $p_i$ = Proportion of samples with value $i$
- $n$ = Number of unique values
Implementation
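A minimal sketch of the calculation, assuming the signal is available as a sequence of decoded integer values. The function name `shannon_index` is hypothetical, and the natural logarithm is an assumption; a different base only rescales the result.

```python
import math
from collections import Counter

def shannon_index(values):
    """Shannon Index of a sequence: -sum(p_i * log(p_i)) over unique values."""
    counts = Counter(values)
    total = len(values)
    # p_i is the proportion of samples that take the i-th unique value.
    return -sum((c / total) * math.log(c / total) for c in counts.values())

static_signal = [5, 5, 5, 5]     # one unique value
dynamic_signal = [1, 2, 3, 4]    # four equally likely values
print(shannon_index(static_signal))
print(shannon_index(dynamic_signal))
```

A constant signal scores zero, and a signal whose values are spread evenly over many distinct levels scores highest.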
Use in Pipeline
The Shannon Index drives the subset selection phase. Focusing on high-entropy signals reduces computational cost and improves clustering quality by prioritizing the most informative data.
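Subset selection can be sketched as a simple ranking. The function name `select_subset`, the dictionary input, and the signal names are hypothetical; the 25% default mirrors the fraction mentioned in the label propagation section.

```python
import math

def select_subset(signal_entropy, fraction=0.25):
    """Return the names of the highest-entropy signals.

    signal_entropy maps a signal name to its Shannon Index. At least one
    signal is always kept (a choice made for this sketch).
    """
    ranked = sorted(signal_entropy, key=signal_entropy.get, reverse=True)
    keep = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:keep]

# Hypothetical entropy values for eight tokenized signals.
entropy = {
    "sig_a": 4.1, "sig_b": 0.2, "sig_c": 3.7, "sig_d": 0.0,
    "sig_e": 1.5, "sig_f": 0.9, "sig_g": 2.2, "sig_h": 0.4,
}
subset = select_subset(entropy)
print(subset)
```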
Correlation-Based Clustering
Hypothesis
Signals that represent the same physical phenomenon (e.g., vehicle speed) should be highly correlated even if they appear in different arbitration IDs (Arb IDs) or use different encodings.
Pearson Correlation Coefficient
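The pipeline uses Pearson’s r to measure linear relationships between two signals $x$ and $y$ sampled at the same points in time:
$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$
Here $\bar{x}$ and $\bar{y}$ are the sample means. Values range from -1 (perfect negative correlation) to +1 (perfect positive correlation).
Greedy Clustering Algorithm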
The pipeline implements a greedy agglomerative clustering approach.
Apply Rules
- If both signals unlabeled → Create new cluster
- If one labeled → Add unlabeled signal to existing cluster
- If both labeled (different clusters) → Merge clusters (if fuzzy labeling)
- If both labeled (same cluster) → Skip
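The rules above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's code: the function name `greedy_cluster`, the 0.85 threshold, processing pairs from most to least correlated, and the requirement that all series are already aligned to a common length are all choices made for this example.

```python
from itertools import combinations

import numpy as np

def greedy_cluster(series, threshold=0.85, fuzzy_labeling=True):
    """Map signal name -> cluster id using pairwise Pearson correlation.

    series maps a name to a 1-D array; all arrays must have equal length.
    """
    pairs = []
    for a, b in combinations(list(series), 2):
        r = np.corrcoef(series[a], series[b])[0, 1]
        if r >= threshold:
            pairs.append((r, a, b))

    labels = {}
    next_id = 0
    for r, a, b in sorted(pairs, reverse=True):  # strongest pairs first
        la, lb = labels.get(a), labels.get(b)
        if la is None and lb is None:
            labels[a] = labels[b] = next_id      # both unlabeled: new cluster
            next_id += 1
        elif la is None:
            labels[a] = lb                       # one labeled: join its cluster
        elif lb is None:
            labels[b] = la
        elif la != lb and fuzzy_labeling:
            for name in list(labels):            # different clusters: merge
                if labels[name] == lb:
                    labels[name] = la
        # both already in the same cluster: skip
    return labels

t = np.linspace(0, 10, 200)
signals = {
    "speed_a": np.sin(t),
    "speed_b": 2.0 * np.sin(t) + 1.0,   # same shape, different scale and offset
    "rpm_a": np.cos(3 * t),
    "rpm_b": 0.5 * np.cos(3 * t),
    "ramp": t,                          # unrelated to the others
}
labels = greedy_cluster(signals)
print(labels)
```

Signals that never exceed the threshold with any partner, such as `ramp` here, are simply left out of the result.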
Label Propagation
After clustering the top 25% of signals, the pipeline propagates labels to the remaining signals. Label propagation uses the same correlation threshold but only extends existing clusters; it never creates new ones. This prevents low-entropy signals from polluting the cluster structure.
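A minimal sketch of label propagation, assuming cluster labels already exist for the high-entropy subset. The function name `propagate_labels`, the 0.85 threshold, and the rule of attaching a signal to the cluster of its single best-correlated member are assumptions for illustration.

```python
import numpy as np

def propagate_labels(clustered, labels, remaining, threshold=0.85):
    """Extend existing clusters to remaining signals; never create new ones.

    clustered: name -> series for already-labeled signals
    labels:    name -> cluster id for those signals
    remaining: name -> series for signals outside the clustered subset
    """
    extended = dict(labels)
    for name, values in remaining.items():
        best_r, best_cluster = threshold, None
        for other, cluster_id in labels.items():
            r = np.corrcoef(values, clustered[other])[0, 1]
            if r >= best_r:
                best_r, best_cluster = r, cluster_id
        if best_cluster is not None:
            extended[name] = best_cluster
    return extended

t = np.linspace(0, 10, 200)
clustered = {"speed_a": np.sin(t), "rpm_a": np.cos(3 * t)}
labels = {"speed_a": 0, "rpm_a": 1}
remaining = {"speed_c": 3.0 * np.sin(t) - 2.0, "ramp": t}
extended = propagate_labels(clustered, labels, remaining)
print(extended)
```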
J1979 as Ground Truth
Strategy
Public J1979 diagnostic signals provide labeled ground truth for validation.
Example Results
A proprietary signal in Arb ID 0x245, bits 24-39, might show:
- Correlation with J1979 PID 0x0C (Engine RPM): r = 0.94
- Auto-labeled as “Engine RPM”
A match like this illustrates how automatic labeling lets the pipeline identify proprietary representations of standardized vehicle parameters.
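Auto-labeling against J1979 can be sketched as picking the best-correlated reference series. Everything here is synthetic and hypothetical: the function name `auto_label`, the 0.85 threshold, and the generated data are assumptions, and real captures would first need the proprietary and J1979 series resampled onto a common time base.

```python
import numpy as np

def auto_label(signal, j1979_series, threshold=0.85):
    """Return (label, r) for the best-correlated J1979 series.

    The label is None when no reference reaches the threshold.
    """
    scores = {
        name: float(np.corrcoef(signal, reference)[0, 1])
        for name, reference in j1979_series.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:
        return best, scores[best]
    return None, scores[best]

# Synthetic drive: RPM and speed follow different curves.
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 200)
j1979 = {
    "Engine RPM": 800 + 2000 * (1 + np.sin(t)),
    "Vehicle Speed": 50 + 40 * np.cos(0.5 * t),
}
# Hypothetical proprietary signal: RPM in different units plus a little noise.
proprietary = 4 * j1979["Engine RPM"] + rng.normal(0, 20, size=t.size)
label, r = auto_label(proprietary, j1979)
print(label)
```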
Real-World Applications
1. Vehicle Security Auditing
Identify anomalous CAN messages by comparing observed signals against known clusters.
2. Aftermarket Diagnostics
Extract proprietary signals not available via J1979:
- Tire pressure monitoring system (TPMS) data
- Advanced driver assistance system (ADAS) status
- Battery management system (BMS) metrics in EVs
3. Forensic Analysis
Reconstruct vehicle behavior from CAN logs after incidents:
- Brake application timing
- Steering angle changes
- Driver attention monitoring
4. Performance Tuning
Access hidden parameters for optimization:
- Turbocharger boost pressure
- Fuel injector timing
- Transmission shift points
Limitations and Assumptions
Assumptions
- Continuous Numerical Signals: The pipeline assumes signals represent continuous numerical values (e.g., speed, RPM) rather than discrete states or bitfields
- Consistent Encoding: Signal definitions remain constant throughout the capture session
- Sufficient Variability: Signals must change during capture to compute meaningful TANG and Shannon Index values
- Linear Relationships: Correlation clustering works best for linearly related signals
Limitations
- Static Signals: Constant values (e.g., VIN, firmware version) produce zero TANG and are ignored
- Event-Driven Messages: Rare events may lack sufficient samples for statistical analysis
- Complex Encodings: Non-linear encodings (e.g., logarithmic, lookup tables) may reduce correlation
- Bitfields: Packed boolean flags within a byte are difficult to separate
- Cryptographic Obfuscation: Encrypted or authenticated payloads cannot be analyzed
For best results, capture CAN data during dynamic driving conditions (acceleration, braking, turning) to maximize signal variability.
Validation Challenges
Without OEM documentation, validation relies on:
- Correlation with J1979 ground truth (when available)
- Manual inspection of signal plots
- Domain knowledge of vehicle behavior
- Consistency across multiple capture sessions
Parameter Tuning
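Key parameters that affect pipeline performance include `max_inversion_distance` (TANG tokenization), the fraction of high-entropy signals selected for clustering, the correlation threshold, and whether fuzzy labeling is enabled.
Research Background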
This work builds on prior research in:
- Automated protocol reverse engineering (Polyglot, AutoFormat)
- CAN intrusion detection (Entropy-based anomaly detection)
- Time series clustering (Hierarchical clustering, DTW)
Key Innovation
The key innovation is the combination of:
- TANG for bit-level signal boundary detection
- Shannon Index for entropy-based filtering
- Correlation clustering for semantic grouping
- J1979 ground truth for validation
Next Steps
- Learn about CAN Protocol fundamentals
- Review the Pipeline Architecture
- Try the Getting Started guide with your own CAN data
- Read the full dissertation: AFIT-END-DS-18-D-003.pdf