
Every project in this repository passes raw data through the same family of preprocessing steps before any model sees it. The preprocessing pipeline lives in src/Processing/preprocessing.py and is responsible for converting messy, real-world input into a clean, fixed-width numeric vector that the trained model expects. The same logic runs both at training time (in the notebook) and at inference time (in the Flask API), ensuring that the feature representation is always consistent.

## Preprocessing steps at a glance

| Preprocessing technique | When to use | Implementation |
| --- | --- | --- |
| Drop high-missing columns | Columns with >40% missing values add noise without signal | `df.drop(columns=[...])` |
| Fill remaining missing values | Sparse missingness that can be imputed statistically | `fillna(mean)` / `fillna(mode)` or hardcoded defaults |
| Drop irrelevant identifiers | Columns like `Id` that carry no predictive signal | `df.drop(columns=["Id", ...])` |
| One-hot encode categoricals | Convert string categories to binary dummy columns | `pd.get_dummies()` / manual mapping |
| Convert booleans to int | sklearn models require numeric input | `.astype(int)` |
| Train / test split (80/20) | Hold out evaluation data before any fitting | `train_test_split(..., test_size=0.2)` |
| Feature scaling | Distance-based or gradient-based models are sensitive to scale | `StandardScaler` / `MinMaxScaler` |

## Step-by-step breakdown

Columns where more than roughly 40% of rows are NaN are removed before any imputation. Keeping them would force the model to learn from predominantly filled-in guesses, which degrades generalization.

In the House Price project the following columns are dropped outright:
```python
COLUMNS_TO_DROP = ["Id", "YearRemodAdd", "Exterior1st", "BsmtFinSF2"]

df = df.drop(columns=COLUMNS_TO_DROP)
```
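The drop list above is hardcoded, but the >40% rule it encodes can also be derived programmatically. A small sketch (the helper name `high_missing_columns` is ours, not the repo's):

```python
import pandas as pd

def high_missing_columns(df: pd.DataFrame, threshold: float = 0.4) -> list[str]:
    """Return columns whose fraction of NaN rows exceeds `threshold`."""
    missing_frac = df.isna().mean()          # per-column NaN fraction
    return missing_frac[missing_frac > threshold].index.tolist()
```

Running this once against the training set and reviewing the output is how a static list like `COLUMNS_TO_DROP` is typically produced.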
Rows where the target variable (SalePrice) is missing are also removed, because a training example without a label cannot contribute to supervised learning.
```python
df = df.dropna(subset=["SalePrice"])
```
After the high-missingness columns are gone, sparse gaps in the remaining columns are filled with statistical defaults. For inference, hardcoded dataset-level mean and mode values are used so that the live API does not need access to the training set.
```python
# Defaults derived from training set statistics
DEFAULTS = {
    "MSSubClass": 20,       # mode
    "MSZoning": "RL",       # mode
    "LotArea": 9500,        # mean
    "LotConfig": "Inside",  # mode
    "BldgType": "1Fam",     # mode
    "OverallCond": 5,       # mode
    "YearBuilt": 1975,      # mean
    "TotalBsmtSF": 900,     # mean
}

def fill_missing(raw: dict) -> dict:
    clean = {}
    clean["MSSubClass"]  = safe_int(raw.get("MSSubClass"),  DEFAULTS["MSSubClass"])
    clean["LotArea"]     = safe_int(raw.get("LotArea"),     DEFAULTS["LotArea"])
    clean["OverallCond"] = safe_int(raw.get("OverallCond"), DEFAULTS["OverallCond"])
    clean["YearBuilt"]   = safe_int(raw.get("YearBuilt"),   DEFAULTS["YearBuilt"])
    clean["TotalBsmtSF"] = safe_float(raw.get("TotalBsmtSF"), DEFAULTS["TotalBsmtSF"])
    clean["MSZoning"]    = safe_str(raw.get("MSZoning"),    DEFAULTS["MSZoning"])
    clean["LotConfig"]   = safe_str(raw.get("LotConfig"),   DEFAULTS["LotConfig"])
    clean["BldgType"]    = safe_str(raw.get("BldgType"),    DEFAULTS["BldgType"])
    return clean
```
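The `safe_int`, `safe_float`, and `safe_str` helpers are referenced but not shown on this page. A minimal sketch of the behavior they need (coerce the raw value, fall back to the default on any failure) might look like:

```python
def safe_int(value, default: int) -> int:
    """Coerce to int; return the default if coercion fails."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default

def safe_float(value, default: float) -> float:
    """Coerce to float; return the default if coercion fails."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default

def safe_str(value, default: str) -> str:
    """Keep non-empty strings; fall back to the default otherwise."""
    return value if isinstance(value, str) and value.strip() else default
```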
Identifier columns (Id), columns that are largely redundant with other features (YearRemodAdd), and columns with excessive cardinality that were not selected during feature engineering (Exterior1st, BsmtFinSF2) are removed. This reduces dimensionality and prevents the model from overfitting to noise.

The selected feature set for the House Price model after dropping is:
```python
FINAL_FEATURE_ORDER = [
    "MSSubClass", "LotArea", "OverallCond", "YearBuilt", "TotalBsmtSF",
    "MSZoning_FV", "MSZoning_RH", "MSZoning_RL", "MSZoning_RM",
    "LotConfig_CulDSac", "LotConfig_FR2", "LotConfig_FR3", "LotConfig_Inside",
    "BldgType_2fmCon", "BldgType_Duplex", "BldgType_Twnhs", "BldgType_TwnhsE",
]
```
String-valued categorical columns are expanded into binary dummy columns. Each unique category value becomes its own column, set to 1 if that category applies and 0 otherwise. One category per group is omitted (the reference category) to avoid multicollinearity.
```python
import numpy as np

def build_one_hot_vector(clean: dict) -> np.ndarray:
    row = {col: 0 for col in FINAL_FEATURE_ORDER}

    # Numeric base features pass through unchanged
    row["MSSubClass"]  = clean["MSSubClass"]
    row["LotArea"]     = clean["LotArea"]
    row["OverallCond"] = clean["OverallCond"]
    row["YearBuilt"]   = clean["YearBuilt"]
    row["TotalBsmtSF"] = clean["TotalBsmtSF"]

    # MSZoning: one of FV, RH, RL, RM
    row[f"MSZoning_{clean['MSZoning']}"] = 1

    # LotConfig: one of CulDSac, FR2, FR3, Inside
    row[f"LotConfig_{clean['LotConfig']}"] = 1

    # BldgType: 1Fam is the reference category (all zeros)
    if clean["BldgType"] != "1Fam":
        row[f"BldgType_{clean['BldgType']}"] = 1

    vector = [row[col] for col in FINAL_FEATURE_ORDER]
    return np.array(vector, dtype=float).reshape(1, -1)
```
Category validation is applied before encoding. If Gemini returns an unrecognized value for a categorical column, the pipeline falls back to the default category rather than raising an error.
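The validation itself is not shown on this page; a sketch of the described fallback, with the allowed sets inferred from FINAL_FEATURE_ORDER (the helper name and structure are assumptions, not the repo's actual code):

```python
# Categories the one-hot encoder knows how to handle
ALLOWED_CATEGORIES = {
    "MSZoning": {"FV", "RH", "RL", "RM"},
    "LotConfig": {"CulDSac", "FR2", "FR3", "Inside"},
    "BldgType": {"1Fam", "2fmCon", "Duplex", "Twnhs", "TwnhsE"},
}

# Fallback category per column (matches the DEFAULTS dict above)
DEFAULT_CATEGORY = {"MSZoning": "RL", "LotConfig": "Inside", "BldgType": "1Fam"}

def validate_category(column: str, value: str) -> str:
    """Return the value if recognized, else the column's default category."""
    return value if value in ALLOWED_CATEGORIES[column] else DEFAULT_CATEGORY[column]
```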
pd.get_dummies() produces boolean (True/False) columns by default in recent pandas versions. sklearn estimators require numeric input, so boolean columns are cast to int (1/0).
```python
# During training in the notebook
df = pd.get_dummies(df, columns=["MSZoning", "LotConfig", "BldgType"])
bool_cols = df.select_dtypes(include="bool").columns
df[bool_cols] = df[bool_cols].astype(int)
```
At inference time the build_one_hot_vector function constructs the vector directly as dtype=float, so no explicit boolean conversion is needed.
The dataset is split into 80% training and 20% testing before any model is fitted. The test set is held out entirely and is only used for final evaluation, preventing data leakage.
```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["SalePrice"])
y = df["SalePrice"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```
Tree-based models are scale-invariant, but linear models (Linear Regression, Ridge, Lasso) and distance-based models converge faster and perform better when features are on a comparable scale. Where scaling is applied, StandardScaler (zero mean, unit variance) is the default choice.
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)
```
Always call fit_transform on training data and transform (not fit_transform) on test data and live inference inputs. Fitting the scaler on test data would leak distribution information from the hold-out set.
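How the fitted scaler reaches the Flask API is not shown on this page. One common approach, assumed here, is to serialize it with joblib at training time and load it at inference time, so the live service reuses the training-set statistics without refitting:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training matrix (LotArea, YearBuilt) standing in for X_train
X_train = np.array([[9500.0, 1975.0], [8000.0, 1960.0], [11000.0, 2001.0]])

scaler = StandardScaler().fit(X_train)               # fit on training data only
path = os.path.join(tempfile.gettempdir(), "scaler.joblib")
joblib.dump(scaler, path)                            # persisted at training time

# At inference time: load and transform, never refit
loaded = joblib.load(path)
x_live = loaded.transform(np.array([[9500.0, 1975.0]]))
```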
