The feature_extractor module computes 20 statistical features from network flows for machine learning classification and anomaly detection. Features include timing metrics, flow statistics, payload size distributions, and entropy measures.
Functions
extract_features()
Extract all 20 features from a single FlowRecord.
Parameters:
- flow (FlowRecord): Flow record to analyze
Returns:
- dict: Dictionary with 20 feature values plus 5-tuple identifiers
Timing Features (6)
- mean_iat (float): Mean inter-arrival time between packets in seconds
- std_iat (float): Standard deviation of inter-arrival times
- min_iat (float): Minimum inter-arrival time
- max_iat (float): Maximum inter-arrival time
- burstiness (float): Coefficient of variation (std/mean) of IAT; high = bursty, low = regular
- iat_autocorr (float): Lag-1 autocorrelation of IAT series; detects periodic patterns
Flow Features (5)
- flow_duration_s (float): Total flow duration in seconds
- total_bytes (int): Total bytes transferred
- total_packets (int): Total packet count
- bytes_per_second (float): Throughput in bytes/second
- packets_per_second (float): Packet rate in packets/second
Size Features (5)
- payload_len_mean (float): Mean payload size per packet
- payload_len_std (float): Standard deviation of payload sizes
- payload_len_min (float): Minimum payload size
- payload_len_max (float): Maximum payload size
- shannon_entropy (float): Shannon entropy of payload size distribution (0-8 bits)
Network Identifiers (5)
- src_ip (str): Source IP address
- dst_ip (str): Destination IP address
- src_port (int): Source port
- dst_port (int): Destination port
- protocol (str): Protocol name
extract_all()
Load a .flows file and extract features for every flow.
Parameters:
- flows_file (str): Path to the JSON Lines .flows file from flow_parser
Returns:
- list[dict]: List of feature dictionaries, one per flow
Raises:
- FileNotFoundError: If the flows file does not exist
Behavior:
- Reads flows line by line (JSON Lines format)
- Deserializes each line to a FlowRecord
- Calls extract_features() for each flow
- Logs extraction statistics
- Returns an empty list with a warning if no features are extracted
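The read-and-extract loop described above can be sketched with the standard library alone. This is an illustration under assumptions, not the module's code: `extract_all_sketch` and its `extract` callable are hypothetical names, each line is parsed into a plain dict rather than a FlowRecord, and skipping blank lines is a choice made here.

```python
import json
import logging
from pathlib import Path
from typing import Callable

logger = logging.getLogger("feature_extractor_sketch")  # hypothetical logger name


def extract_all_sketch(flows_file: str, extract: Callable[[dict], dict]) -> list[dict]:
    """Read a JSON Lines file and apply `extract` to every record."""
    path = Path(flows_file)
    if not path.exists():
        raise FileNotFoundError(f"Flows file not found: {flows_file}")

    features: list[dict] = []
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # assumption: blank lines are ignored
            record = json.loads(line)  # the documented module builds a FlowRecord here
            features.append(extract(record))

    if features:
        logger.info("Extracted features for %d flows", len(features))
    else:
        logger.warning("No features extracted from %s", flows_file)
    return features
```

Passing the extractor as a callable keeps the sketch independent of the FlowRecord type, whose fields are not listed in this document.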
save_features()
Write feature vectors to both CSV and JSON Lines formats.
Parameters:
- features (list[dict]): Feature dictionaries to save
- output_file (str): Output path (typically with a .csv extension)
Returns:
- None
Behavior:
- Creates parent directories if needed
- Writes CSV file with header row (uses first feature dict for column names)
- Writes JSON Lines file (replaces .csv with .json in the filename)
- Logs save statistics with both output paths
Output files:
- {output_file} (CSV format with header)
- {output_file_without_.csv}.json (JSON Lines format)
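A minimal sketch of the dual-format write, assuming the output path ends in .csv. The name `save_features_sketch` is hypothetical, and returning early on an empty list is an assumption; the documented behavior for empty input is not stated here.

```python
import csv
import json
from pathlib import Path


def save_features_sketch(features: list[dict], output_file: str) -> None:
    """Write feature dicts to a CSV file and a sibling JSON Lines file."""
    csv_path = Path(output_file)
    json_path = csv_path.with_suffix(".json")  # features.csv -> features.json
    csv_path.parent.mkdir(parents=True, exist_ok=True)

    if not features:
        return  # assumption: nothing is written when there are no rows

    # Column names come from the first feature dict, as described above.
    fieldnames = list(features[0].keys())
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(features)

    # One JSON object per line.
    with json_path.open("w", encoding="utf-8") as handle:
        for row in features:
            handle.write(json.dumps(row) + "\n")
```

Because the header is taken from the first dict, `csv.DictWriter` raises `ValueError` if a later dict carries a key the first one lacks, so the sketch relies on every feature dict having the same keys.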
shannon_entropy()
Compute Shannon entropy of a byte sequence.
Parameters:
- data (bytes): Byte sequence to analyze
Returns:
- float: Entropy in bits (0.0 to 8.0 for byte data)
Algorithm:
- Counts byte value frequencies (0-255)
- Computes probability distribution
- Calculates entropy using Shannon formula
- Returns 0.0 for empty input
Interpretation:
- 0.0: All bytes identical (no randomness)
- 8.0: Uniform distribution (maximum randomness)
- TLS/encrypted traffic typically 7.5-8.0 bits
- Plaintext HTTP typically 4.5-6.0 bits
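The counting steps above correspond to the standard Shannon formula H = sum(p * log2(1/p)) over the observed values. The following is a minimal sketch of that formula, not the module's implementation; `shannon_entropy_sketch` is a hypothetical name.

```python
import math
from collections import Counter
from typing import Iterable


def shannon_entropy_sketch(data: Iterable[int]) -> float:
    """Return the Shannon entropy, in bits, of a sequence of integer symbols."""
    counts = Counter(data)  # iterating over bytes yields ints in 0-255
    total = sum(counts.values())
    if total == 0:
        return 0.0  # empty input
    # p = c / total, and log2(1 / p) = log2(total / c)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


identical = shannon_entropy_sketch(b"\x00" * 64)     # one distinct value
uniform = shannon_entropy_sketch(bytes(range(256)))  # every byte value once
```

The function accepts any iterable of integers, so the same counting approach can be applied to a list of payload sizes as well as to raw bytes.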
Feature Computation Details
Burstiness
Coefficient of variation of inter-arrival times (std_iat / mean_iat):
- < 0.5: Regular, periodic traffic (e.g., beacons)
- 0.5-1.0: Moderate variability (e.g., interactive sessions)
- > 1.0: Bursty traffic (e.g., file transfers)
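As a rough illustration of how this ratio can be computed from packet timestamps, the sketch below derives inter-arrival times and divides their population standard deviation by their mean. All function names are hypothetical, and returning 0.0 when the mean is zero or the list is empty is an assumption of this sketch.

```python
import math


def inter_arrival_times(timestamps: list[float]) -> list[float]:
    """Differences between consecutive packet timestamps, in seconds."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


def mean(values: list[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def population_std(values: list[float]) -> float:
    """Population standard deviation (divides by N, not N - 1)."""
    if not values:
        return 0.0
    mu = mean(values)
    return math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))


def burstiness(iats: list[float]) -> float:
    """Coefficient of variation of the inter-arrival times."""
    mu = mean(iats)
    if mu == 0.0:
        return 0.0  # assumption: no meaningful ratio without a positive mean
    return population_std(iats) / mu


regular = burstiness(inter_arrival_times([0.0, 1.0, 2.0, 3.0]))  # evenly spaced packets
uneven = burstiness([1.0, 3.0])                                   # mean 2.0, std 1.0
```

Evenly spaced packets give a ratio of zero, which falls in the "regular, periodic" band listed above.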
IAT Autocorrelation
Lag-1 autocorrelation detects periodicity in packet timing:
- Close to 1.0: Strong positive correlation (periodic beaconing)
- Close to 0.0: No correlation (random timing)
- Close to -1.0: Alternating pattern
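One common estimator for lag-1 autocorrelation divides the sum of products of adjacent deviations from the mean by the sum of squared deviations. The sketch below uses that estimator as an assumption; this document does not state which formula _iat_autocorr() applies, and the 0.0 fallbacks for short or constant series are choices made here.

```python
def lag1_autocorr(series: list[float]) -> float:
    """Lag-1 autocorrelation of a series, in the range [-1.0, 1.0]."""
    n = len(series)
    if n < 3:
        return 0.0  # assumption: too few points to estimate a correlation
    mu = sum(series) / n
    deviations = [x - mu for x in series]
    denominator = sum(d * d for d in deviations)
    if denominator == 0.0:
        return 0.0  # constant series: variance is zero, correlation undefined
    numerator = sum(a * b for a, b in zip(deviations, deviations[1:]))
    return numerator / denominator


alternating = lag1_autocorr([1.0, 3.0, 1.0, 3.0, 1.0, 3.0])  # strongly negative
trending = lag1_autocorr([1.0, 2.0, 3.0, 4.0, 5.0])          # positive
```

An alternating short/long pattern yields a value near -1.0, matching the last entry in the list above.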
Payload Entropy Approximation
Since FlowRecord stores payload sizes (not raw payloads), entropy is computed from the size distribution.
Zero Division Handling
The module uses _SAFE_DIVISOR = 1e-9 to prevent division by zero.
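A sketch of how such a substitution might be applied to the rate features follows. The helper names `safe_div` and `rate_features` are hypothetical; only the constant name and its value come from this document.

```python
_SAFE_DIVISOR = 1e-9  # value quoted in this document


def safe_div(numerator: float, denominator: float) -> float:
    """Divide, substituting a tiny divisor when the denominator is zero."""
    return numerator / (denominator if denominator != 0 else _SAFE_DIVISOR)


def rate_features(total_bytes: int, total_packets: int, flow_duration_s: float) -> dict:
    """Throughput and packet rate for one flow."""
    return {
        "bytes_per_second": safe_div(total_bytes, flow_duration_s),
        "packets_per_second": safe_div(total_packets, flow_duration_s),
    }


normal = rate_features(total_bytes=1000, total_packets=10, flow_duration_s=2.0)
instant = rate_features(total_bytes=100, total_packets=1, flow_duration_s=0.0)
```

With this approach a zero-duration flow produces very large rates rather than an exception, so downstream consumers may want to clip or flag such values.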
Helper Functions
_mean()
Compute arithmetic mean of a list.
_std()
Compute population standard deviation.
_iat_autocorr()
Compute lag-1 autocorrelation of inter-arrival times.
Command-Line Usage
- --input: Input .flows file from flow_parser (required)
- --output: Output .csv file path (required); JSON also written alongside
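The two options could be declared with `argparse` as sketched below. That the module uses `argparse` at all is an assumption, and `build_parser` is a hypothetical name; only the flag names and their required status come from this document.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Declare the two required command-line options."""
    parser = argparse.ArgumentParser(description="Extract features from a .flows file")
    parser.add_argument("--input", required=True,
                        help="Input .flows file from flow_parser")
    parser.add_argument("--output", required=True,
                        help="Output .csv file path; JSON is also written alongside")
    return parser


def parse_cli(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse argv (defaults to sys.argv[1:]) into a namespace."""
    return build_parser().parse_args(argv)


args = parse_cli(["--input", "capture.flows", "--output", "features.csv"])
```

A real entry point would then pass `args.input` to extract_all() and the result, together with `args.output`, to save_features().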
Machine Learning Integration
The extracted features are designed for classification tasks.
Feature Importance for C2 Detection
Typical feature rankings for detecting C2 beaconing:
- mean_iat: Most discriminative; beacons have consistent intervals
- std_iat: Low variance indicates regular callbacks
- burstiness: Beacons have low burstiness (less than 0.5)
- iat_autocorr: High positive correlation for periodic beacons
- payload_len_mean: Beacons often have small, consistent payloads
- bytes_per_second: Low throughput distinguishes C2 from data exfiltration
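To hand the ranked features to a classifier, the CSV output can be loaded into a numeric matrix. The sketch below uses only the standard library and leaves the choice of classifier open; `RANKED_FEATURES` and `load_feature_matrix` are hypothetical names, and the column names are those listed above.

```python
import csv
from typing import Sequence

# Column order follows the ranking listed above.
RANKED_FEATURES = (
    "mean_iat",
    "std_iat",
    "burstiness",
    "iat_autocorr",
    "payload_len_mean",
    "bytes_per_second",
)


def load_feature_matrix(csv_path: str,
                        columns: Sequence[str] = RANKED_FEATURES) -> list[list[float]]:
    """Read the features CSV and return one row of floats per flow."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # CSV values arrive as strings, so convert each selected column to float.
        return [[float(row[name]) for name in columns] for row in reader]
```

The identifier columns (src_ip, dst_ip, ports, protocol) are left out of the matrix here; they can be read separately when predictions need to be mapped back to flows.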
Constants
_SAFE_DIVISOR
Small value substituted for zero denominators (1e-9).
_PROTO_MAP
IP protocol number to name mapping (inherited from flow_parser).
Requirements
- Depends on: common.logger, telemetry.flow_parser
Performance Notes
- Feature extraction is O(n) where n = packet_count per flow
- Shannon entropy computation is O(m) where m = len(payload_sizes)
- All features computed in a single pass over flow data
- No external ML libraries required for extraction
Notes
- Empty flows (zero packets) result in zero/default feature values
- Entropy approximation is specific to encrypted traffic analysis
- CSV and JSON outputs contain identical data in different formats
- Feature vectors are independent and can be processed in parallel