Every project in this repository passes raw data through the same family of preprocessing steps before any model sees it. The preprocessing pipeline lives in
src/Processing/preprocessing.py and is responsible for converting messy, real-world input into a clean, fixed-width numeric vector that the trained model expects. The same logic runs both at training time (in the notebook) and at inference time (in the Flask API), ensuring that the feature representation is always consistent.
Preprocessing steps at a glance
| Preprocessing Technique | When to Use | Implementation |
|---|---|---|
| Drop high-missing columns | Columns with >40% missing values add noise without signal | df.drop(columns=[...]) |
| Fill remaining missing values | Sparse missingness that can be imputed statistically | fillna(mean) / fillna(mode) or hardcoded defaults |
| Drop irrelevant identifiers | Columns like Id that carry no predictive signal | df.drop(columns=["Id", ...]) |
| One-hot encode categoricals | Convert string categories to binary dummy columns | pd.get_dummies() / manual mapping |
| Convert booleans to int | sklearn models require numeric input | .astype(int) |
| Train / test split (80/20) | Hold out evaluation data before any fitting | train_test_split(..., test_size=0.2) |
| Feature scaling | Distance-based or gradient-based models are sensitive to scale | StandardScaler / MinMaxScaler |
Step-by-step breakdown
1. Dropping columns with high missing rate
Columns where more than roughly 40% of rows are NaN are removed before any imputation. Keeping them would force the model to learn from predominantly filled-in guesses, which degrades generalization. In the House Price project several such high-missingness columns are dropped outright; a minimal sketch of the pattern appears below. Rows where the target variable (SalePrice) is missing are also removed, because a training example without a label cannot contribute to supervised learning.
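A minimal sketch of this step, assuming the training data is a pandas DataFrame loaded from a CSV (the train.csv path and variable names are illustrative, not the repository's actual code):

```python
import pandas as pd

df = pd.read_csv("train.csv")  # illustrative path

# Drop any column whose missing-value rate exceeds 40%.
missing_rate = df.isna().mean()
high_missing = missing_rate[missing_rate > 0.40].index
df = df.drop(columns=high_missing)

# Drop rows where the target itself is missing -- an unlabeled row
# cannot be used for supervised training.
df = df.dropna(subset=["SalePrice"])
```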
2. Filling remaining missing values
After the high-missingness columns are gone, sparse gaps in the remaining columns are filled with statistical defaults. For inference, hardcoded dataset-level mean and mode values are used so that the live API does not need access to the training set.
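A sketch of both modes, assuming numeric columns take the column mean and categorical columns the column mode; the DEFAULTS dict, its values, and the apply_defaults helper are hypothetical stand-ins for the constants baked into the API, and df continues from the previous sketch:

```python
# Training time: fill gaps with statistics computed from the data itself.
for col in df.select_dtypes(include="number").columns:
    df[col] = df[col].fillna(df[col].mean())
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].fillna(df[col].mode()[0])

# Inference time: hardcoded dataset-level constants, so the live API
# never needs the training set. Keys and values here are hypothetical.
DEFAULTS = {"LotFrontage": 70.0, "MSZoning": "RL"}

def apply_defaults(request_data: dict) -> dict:
    """Overlay user-supplied values on top of the hardcoded defaults."""
    return {**DEFAULTS, **{k: v for k, v in request_data.items() if v is not None}}
```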
3. Dropping irrelevant columns
Identifier columns (Id), columns that are largely redundant with other features (YearRemodAdd), and columns with excessive cardinality that were not selected during feature engineering (Exterior1st, BsmtFinSF2) are removed. This reduces dimensionality and prevents the model from overfitting to noise. The columns that survive this step form the fixed feature set the House Price model is trained on.
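The drop itself is a one-liner, continuing with the df from the earlier sketches; errors="ignore" is an added convenience so the call stays safe if a column was already removed by the high-missingness step:

```python
# Remove identifiers, redundant columns, and unselected high-cardinality
# columns in one call. errors="ignore" tolerates already-dropped names.
df = df.drop(
    columns=["Id", "YearRemodAdd", "Exterior1st", "BsmtFinSF2"],
    errors="ignore",
)
```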
4. One-hot encoding categorical features
String-valued categorical columns are expanded into binary dummy columns. Each unique category value becomes its own column, set to 1 if that category applies and 0 otherwise. One category per group is omitted (the reference category) to avoid multicollinearity.

Category validation is applied before encoding: if Gemini returns an unrecognized value for a categorical column, the pipeline falls back to the default category rather than raising an error.
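A sketch of both the encoding and the validation fallback. The ALLOWED_CATEGORIES mapping, its default-category convention, and the validate_category helper are illustrative assumptions, not the repository's actual tables:

```python
import pandas as pd

# Hypothetical whitelist per categorical column; by convention here,
# the first entry is the default (fallback) category.
ALLOWED_CATEGORIES = {"MSZoning": ["RL", "RM", "FV", "RH", "C (all)"]}

def validate_category(column: str, value: str) -> str:
    """Fall back to the default category when the value is unrecognized."""
    allowed = ALLOWED_CATEGORIES[column]
    return value if value in allowed else allowed[0]

# drop_first=True omits one reference category per group, avoiding
# perfect multicollinearity among the dummy columns.
df = pd.get_dummies(df, drop_first=True)
```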
5. Converting boolean features to int
pd.get_dummies() produces boolean (True/False) columns by default in recent pandas versions. sklearn estimators require numeric input, so boolean columns are cast to int (1/0). The build_one_hot_vector function constructs its vector directly as dtype=float, so no explicit boolean conversion is needed on that path.
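A minimal cast, again continuing with the df from the previous sketch:

```python
# Find the boolean dummy columns and cast them to 0/1 integers so that
# sklearn estimators receive purely numeric input.
bool_cols = df.select_dtypes(include="bool").columns
df[bool_cols] = df[bool_cols].astype(int)
```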
6. Train / test split (80/20)
The dataset is split into 80% training and 20% testing before any model is fitted. The test set is held out entirely and is only used for final evaluation, preventing data leakage.
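A sketch of the split, assuming SalePrice is the target; the X/y names and random_state value are illustrative, though fixing a seed does keep the split reproducible:

```python
from sklearn.model_selection import train_test_split

# Separate features from the target, then hold out 20% for evaluation.
X = df.drop(columns=["SalePrice"])
y = df["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```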
7. Feature scaling
Tree-based models are scale-invariant, but linear models (Linear Regression, Ridge, Lasso) and distance-based models converge faster and perform better when features are on a comparable scale. Where scaling is applied,
StandardScaler (zero mean, unit variance) is the default choice.
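A sketch of the scaling step, continuing from the split above. The scaler is fit on the training split only and then reused on the test split, so test-set statistics never leak into training:

```python
from sklearn.preprocessing import StandardScaler

# Fit on the training split only, then reuse the learned mean/variance
# to transform the held-out test split.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```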