Overview
The preprocessing stage cleans and standardizes raw NBA player data. The Preprocessor class handles type conversions, missing value imputation, and outlier detection to prepare data for feature engineering.
Preprocessor Class
Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/preprocessing/core.py:7
Initialization
Random seed for reproducible preprocessing operations
Strategy for handling missing numeric values. Options:
- 'median': Fill with column median (default, robust to outliers)
- 'mean': Fill with column mean
- Other: Fill with 0
Core Methods
clean()
Performs comprehensive data cleaning and type conversion.
Raw DataFrame containing NBA player data
~/workspace/source/NBA Data Preprocessing/task/pipeline/preprocessing/core.py:12-24
1. Date Parsing
- Converts birth dates from 'MM/DD/YY' format
- Converts draft years from 'YYYY' format
- Invalid dates become NaT (coerced)
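The date handling described above can be sketched with pandas as follows. This is an illustration only: the column names `b_day` and `draft_year` and the sample values are assumptions, not taken from the source file.

```python
import pandas as pd

# Hypothetical column names and values, for illustration only.
raw = pd.DataFrame({
    "b_day": ["06/13/97", "12/30/84", "not a date"],
    "draft_year": ["2017", "2003", None],
})

parsed = raw.copy()
# errors="coerce" turns unparseable or missing values into NaT instead of raising.
parsed["b_day"] = pd.to_datetime(parsed["b_day"], format="%m/%d/%y", errors="coerce")
parsed["draft_year"] = pd.to_datetime(parsed["draft_year"], format="%Y", errors="coerce")

print(parsed.dtypes)
```

Coercing keeps the pipeline running on dirty input; the resulting NaT values can then be counted or imputed in a later step.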
2. Team Handling
- Players without teams are labeled as 'No Team'
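A minimal sketch of this labeling step, assuming the team lives in a column named `team` (a hypothetical name):

```python
import pandas as pd

# "team" is an assumed column name; missing entries stand for players without a team.
teams = pd.DataFrame({"team": ["Lakers", None, "Celtics"]})
teams["team"] = teams["team"].fillna("No Team")

print(teams["team"].tolist())
```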
3. Height Conversion
- Extracts metric height from dual format strings
- Converts to float for numeric operations
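As a sketch of the extraction, suppose the dual format looks like `"6-10 / 2.08"` (feet-inches, then meters). That layout is an assumption; the source does not show the raw strings.

```python
import pandas as pd

# Assumed raw layout: "<feet>-<inches> / <meters>".
heights = pd.Series(["6-10 / 2.08", "6-2 / 1.88", None])

# Keep the part after the slash, trim whitespace, and convert to float.
# Missing inputs propagate as NaN.
height_m = heights.str.split("/").str[1].str.strip().astype(float)

print(height_m)
```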
4. Weight Conversion
- Extracts metric weight
- Removes unit suffix
- Converts to float
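The three weight steps could look like this, assuming raw strings shaped like `"190 lbs / 86.2 kg"` (an assumed layout, not confirmed by the source):

```python
import pandas as pd

# Assumed raw layout: "<pounds> lbs / <kilograms> kg".
weights = pd.Series(["190 lbs / 86.2 kg", "240 lbs / 108.9 kg"])

weight_kg = (
    weights.str.split("/").str[1]        # extract the metric part
    .str.replace("kg", "", regex=False)  # remove the unit suffix
    .str.strip()
    .astype(float)                       # convert to float
)

print(weight_kg.tolist())
```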
5. Salary Parsing
6. Country Normalization
- Binary categorization: USA vs international players
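One way to express the binary split is shown below. The label `"Not-USA"` is a placeholder chosen for this sketch; the source does not state which label the class uses for international players.

```python
import pandas as pd

countries = pd.Series(["USA", "Spain", "USA", "Canada"])

# Keep "USA" where the condition holds; replace every other value
# with a single placeholder label (hypothetical).
country_group = countries.where(countries == "USA", "Not-USA")

print(country_group.tolist())
```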
7. Draft Round Handling
- Standardizes undrafted players to round '0'
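A sketch of the standardization, assuming undrafted players appear in the raw data as the string `"Undrafted"` (the actual raw marker is not shown in the source):

```python
import pandas as pd

# "Undrafted" is an assumed raw marker.
draft_round = pd.Series(["1", "2", "Undrafted"])
draft_round = draft_round.replace("Undrafted", "0")

print(draft_round.tolist())
```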
The clean() method automatically calls handle_missing() to impute missing values after transformations.
handle_missing()
Imputes missing values using the configured strategy.
Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/preprocessing/core.py:26-41
DataFrame potentially containing missing values
Numeric Columns
- Median (default): Robust to outliers, recommended for skewed distributions
- Mean: Appropriate for normally distributed data
- Zero: Fallback for any other strategy value
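The three numeric strategies can be sketched as a small standalone function. The name `impute_numeric` and its signature are hypothetical and only mirror the behavior listed above; this is not the code from core.py.

```python
import pandas as pd

def impute_numeric(df: pd.DataFrame, strategy: str = "median") -> pd.DataFrame:
    """Fill missing numeric values; hypothetical helper mirroring the listed strategies."""
    out = df.copy()  # leave the caller's DataFrame untouched
    for col in out.select_dtypes(include="number").columns:
        if strategy == "median":
            fill = out[col].median()
        elif strategy == "mean":
            fill = out[col].mean()
        else:
            fill = 0  # fallback for any other strategy value
        out[col] = out[col].fillna(fill)
    return out

salaries = pd.DataFrame({"salary": [1.0, 2.0, None, 100.0]})
by_median = impute_numeric(salaries, "median")
by_mean = impute_numeric(salaries, "mean")
by_zero = impute_numeric(salaries, "anything-else")
```

With one large value in the column, the median fill stays at 2.0 while the mean fill is pulled up to about 34.3, which is why the median is the safer default for skewed data.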
Categorical Columns
detect_outliers_iqr()
Detects outliers using the Interquartile Range (IQR) method.
Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/preprocessing/core.py:43-51
DataFrame to check for outliers (only numeric columns are analyzed)
IQR multiplier for outlier threshold. Common values:
- 1.5: Standard outlier detection (default)
- 3.0: Extreme outlier detection (more conservative)
Boolean mask (True = outlier)
Algorithm:
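As a minimal sketch of the standard IQR rule (not necessarily the exact code in core.py): compute Q1 and Q3 per numeric column, set IQR = Q3 - Q1, and flag values below Q1 - k × IQR or above Q3 + k × IQR. The function name `iqr_outlier_mask`, the parameter name `k`, and the choice to return a boolean DataFrame are assumptions.

```python
import pandas as pd

def iqr_outlier_mask(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Return a boolean DataFrame (True = outlier) for numeric columns only."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Comparisons align the per-column bounds with the DataFrame's columns.
    return (numeric < lower) | (numeric > upper)

stats = pd.DataFrame({
    "player": ["a", "b", "c", "d", "e", "f"],
    "points": [10, 12, 11, 13, 12, 60],
})
mask = iqr_outlier_mask(stats)
```

For the `points` column, Q1 = 11.25 and Q3 = 12.75, so with k = 1.5 the bounds are 9 and 15 and only the value 60 is flagged.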
Data Flow
The preprocessing stage follows this flow:
Integration with Pipeline
The preprocessing stage integrates with the streaming engine:
Location: ~/workspace/source/NBA Data Preprocessing/task/pipeline/streaming/engine.py:37
~/workspace/source/NBA Data Preprocessing/task/pipeline/streaming/engine.py:79
~/workspace/source/NBA Data Preprocessing/task/pipeline/streaming/engine.py:426
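The engine code at these locations is not shown on this page. As a generic sketch of how a cleaning step can be applied chunk by chunk, the following uses pandas' `chunksize` reader; the helper `fill_salary`, the inline CSV, and the column names are all hypothetical and do not reflect the engine's real interface.

```python
import io
import pandas as pd

# Inline CSV standing in for a data source; names and values are made up.
csv_text = "player,salary\nA,1.0\nB,\nC,3.0\nD,5.0\n"

def fill_salary(chunk: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical per-chunk step: median-fill the salary column."""
    out = chunk.copy()
    out["salary"] = out["salary"].fillna(out["salary"].median())
    return out

# chunksize=2 yields DataFrames of at most two rows each.
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=2)
cleaned = pd.concat([fill_salary(c) for c in chunks], ignore_index=True)

print(cleaned)
```

Note that statistics are computed per chunk in this sketch: player B is filled with 1.0 (the median of the first chunk), whereas the median over all rows would be 3.0. Whether the real engine shares statistics across chunks is not stated here.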
Performance Considerations
Time Complexity
- clean(): O(n) where n = number of rows
- handle_missing(): O(n × m) where m = number of columns
- detect_outliers_iqr(): O(n × k) where k = number of numeric columns
Memory Efficiency
All methods return copies of DataFrames, preserving the original data.
Streaming Compatibility
✅ All preprocessing methods work seamlessly with chunked data streams.
Best Practices
Choose the Right Strategy
Use 'median' for skewed salary distributions, 'mean' for normally distributed physical measurements.
Monitor Outliers
Track outlier rates over time to detect data quality issues early.
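One simple way to track this is to compute the share of flagged values per batch and compare it across batches. The helper `outlier_rate` and the batch labels below are hypothetical; the IQR rule is the standard one with multiplier `k`.

```python
import pandas as pd

def outlier_rate(values: pd.Series, k: float = 1.5) -> float:
    """Fraction of values outside [Q1 - k*IQR, Q3 + k*IQR]; hypothetical helper."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return float(mask.mean())

# Made-up batches, e.g. one per ingestion run.
batches = {
    "week_1": pd.Series([10, 11, 12, 12, 13, 60]),
    "week_2": pd.Series([10, 11, 12, 12, 13, 14]),
}
rates = {name: outlier_rate(s) for name, s in batches.items()}

print(rates)
```

A sudden jump in the rate between batches is a cue to inspect the upstream data before trusting downstream features.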
Validate Transformations
Always inspect a sample of cleaned data before processing the full dataset.
Document Assumptions
Record why you chose specific cleaning strategies for reproducibility.
Next Steps
Feature Engineering
Build derived features from cleaned data
Validation
Validate data quality and detect drift