
Every project in this repository passes raw data through the same family of preprocessing steps before any model sees it. The preprocessing pipeline lives in src/Processing/preprocessing.py and is responsible for converting messy, real-world input into a clean, fixed-width numeric vector that the trained model expects. The same logic runs both at training time (in the notebook) and at inference time (in the Flask API), ensuring that the feature representation is always consistent.

## Preprocessing steps at a glance

| Preprocessing technique | When to use | Implementation |
| --- | --- | --- |
| Drop high-missing columns | Columns with >40% missing values add noise without signal | `df.drop(columns=[...])` |
| Fill remaining missing values | Sparse missingness that can be imputed statistically | `fillna(mean)` / `fillna(mode)` or hardcoded defaults |
| Drop irrelevant identifiers | Columns like `Id` that carry no predictive signal | `df.drop(columns=["Id", ...])` |
| One-hot encode categoricals | Convert string categories to binary dummy columns | `pd.get_dummies()` / manual mapping |
| Convert booleans to int | sklearn models require numeric input | `.astype(int)` |
| Train / test split (80/20) | Hold out evaluation data before any fitting | `train_test_split(..., test_size=0.2)` |
| Feature scaling | Distance-based or gradient-based models are sensitive to scale | `StandardScaler` / `MinMaxScaler` |

## Step-by-step breakdown

Columns where more than roughly 40% of rows are NaN are removed before any imputation. Keeping them would force the model to learn from predominantly filled-in guesses, which degrades generalization.

In the House Price project the following columns are dropped outright:
```python
COLUMNS_TO_DROP = ["Id", "YearRemodAdd", "Exterior1st", "BsmtFinSF2"]

df = df.drop(columns=COLUMNS_TO_DROP)
```
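The drop list above is hardcoded, but the >40% rule it encodes can also be derived programmatically. A small sketch (the helper name `high_missing_columns` is ours, not the repo's):

```python
import pandas as pd

def high_missing_columns(df: pd.DataFrame, threshold: float = 0.4) -> list[str]:
    """Return columns whose fraction of NaN rows exceeds `threshold`."""
    missing_frac = df.isna().mean()          # per-column NaN fraction
    return missing_frac[missing_frac > threshold].index.tolist()
```

Running this once against the training set and reviewing the output is how a static list like `COLUMNS_TO_DROP` is typically produced.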
Rows where the target variable (SalePrice) is missing are also removed, because a training example without a label cannot contribute to supervised learning.
```python
df = df.dropna(subset=["SalePrice"])
```
After the high-missingness columns are gone, sparse gaps in the remaining columns are filled with statistical defaults. For inference, hardcoded dataset-level mean and mode values are used so that the live API does not need access to the training set.
```python
# Defaults derived from training set statistics
DEFAULTS = {
    "MSSubClass": 20,       # mode
    "MSZoning": "RL",       # mode
    "LotArea": 9500,        # mean
    "LotConfig": "Inside",  # mode
    "BldgType": "1Fam",     # mode
    "OverallCond": 5,       # mode
    "YearBuilt": 1975,      # mean
    "TotalBsmtSF": 900,     # mean
}

def fill_missing(raw: dict) -> dict:
    clean = {}
    clean["MSSubClass"]  = safe_int(raw.get("MSSubClass"),  DEFAULTS["MSSubClass"])
    clean["LotArea"]     = safe_int(raw.get("LotArea"),     DEFAULTS["LotArea"])
    clean["OverallCond"] = safe_int(raw.get("OverallCond"), DEFAULTS["OverallCond"])
    clean["YearBuilt"]   = safe_int(raw.get("YearBuilt"),   DEFAULTS["YearBuilt"])
    clean["TotalBsmtSF"] = safe_float(raw.get("TotalBsmtSF"), DEFAULTS["TotalBsmtSF"])
    clean["MSZoning"]    = safe_str(raw.get("MSZoning"),    DEFAULTS["MSZoning"])
    clean["LotConfig"]   = safe_str(raw.get("LotConfig"),   DEFAULTS["LotConfig"])
    clean["BldgType"]    = safe_str(raw.get("BldgType"),    DEFAULTS["BldgType"])
    return clean
```
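The `safe_int`, `safe_float`, and `safe_str` helpers are referenced but not shown on this page. A minimal sketch of the behavior they need (coerce the raw value, fall back to the default on any failure) might look like:

```python
def safe_int(value, default: int) -> int:
    """Coerce to int; return the default if coercion fails."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default

def safe_float(value, default: float) -> float:
    """Coerce to float; return the default if coercion fails."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default

def safe_str(value, default: str) -> str:
    """Keep non-empty strings; fall back to the default otherwise."""
    return value if isinstance(value, str) and value.strip() else default
```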
Identifier columns (Id), columns that are largely redundant with other features (YearRemodAdd), and columns with excessive cardinality that were not selected during feature engineering (Exterior1st, BsmtFinSF2) are removed. This reduces dimensionality and prevents the model from overfitting to noise.

The selected feature set for the House Price model after dropping is:
```python
FINAL_FEATURE_ORDER = [
    "MSSubClass", "LotArea", "OverallCond", "YearBuilt", "TotalBsmtSF",
    "MSZoning_FV", "MSZoning_RH", "MSZoning_RL", "MSZoning_RM",
    "LotConfig_CulDSac", "LotConfig_FR2", "LotConfig_FR3", "LotConfig_Inside",
    "BldgType_2fmCon", "BldgType_Duplex", "BldgType_Twnhs", "BldgType_TwnhsE",
]
```
String-valued categorical columns are expanded into binary dummy columns. Each unique category value becomes its own column, set to 1 if that category applies and 0 otherwise. One category per group is omitted (the reference category) to avoid multicollinearity.
```python
import numpy as np

def build_one_hot_vector(clean: dict) -> np.ndarray:
    row = {col: 0 for col in FINAL_FEATURE_ORDER}

    # Numeric base features pass through unchanged
    row["MSSubClass"]  = clean["MSSubClass"]
    row["LotArea"]     = clean["LotArea"]
    row["OverallCond"] = clean["OverallCond"]
    row["YearBuilt"]   = clean["YearBuilt"]
    row["TotalBsmtSF"] = clean["TotalBsmtSF"]

    # MSZoning: one of FV, RH, RL, RM
    row[f"MSZoning_{clean['MSZoning']}"] = 1

    # LotConfig: one of CulDSac, FR2, FR3, Inside
    row[f"LotConfig_{clean['LotConfig']}"] = 1

    # BldgType: 1Fam is the reference category (all zeros)
    if clean["BldgType"] != "1Fam":
        row[f"BldgType_{clean['BldgType']}"] = 1

    vector = [row[col] for col in FINAL_FEATURE_ORDER]
    return np.array(vector, dtype=float).reshape(1, -1)
```
Category validation is applied before encoding. If Gemini returns an unrecognized value for a categorical column, the pipeline falls back to the default category rather than raising an error.
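The validation itself is not shown on this page; a sketch of the described fallback, with the allowed sets inferred from FINAL_FEATURE_ORDER (the helper name and structure are assumptions, not the repo's actual code):

```python
# Categories the one-hot encoder knows how to handle
ALLOWED_CATEGORIES = {
    "MSZoning": {"FV", "RH", "RL", "RM"},
    "LotConfig": {"CulDSac", "FR2", "FR3", "Inside"},
    "BldgType": {"1Fam", "2fmCon", "Duplex", "Twnhs", "TwnhsE"},
}

# Fallback category per column (matches the DEFAULTS dict above)
DEFAULT_CATEGORY = {"MSZoning": "RL", "LotConfig": "Inside", "BldgType": "1Fam"}

def validate_category(column: str, value: str) -> str:
    """Return the value if recognized, else the column's default category."""
    return value if value in ALLOWED_CATEGORIES[column] else DEFAULT_CATEGORY[column]
```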
pd.get_dummies() produces boolean (True/False) columns by default in recent pandas versions. sklearn estimators require numeric input, so boolean columns are cast to int (1/0).
```python
# During training in the notebook
df = pd.get_dummies(df, columns=["MSZoning", "LotConfig", "BldgType"])
bool_cols = df.select_dtypes(include="bool").columns
df[bool_cols] = df[bool_cols].astype(int)
```
At inference time the build_one_hot_vector function constructs the vector directly as dtype=float, so no explicit boolean conversion is needed.
The dataset is split into 80% training and 20% testing before any model is fitted. The test set is held out entirely and is only used for final evaluation, preventing data leakage.
```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["SalePrice"])
y = df["SalePrice"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```
Tree-based models are scale-invariant, but linear models (Linear Regression, Ridge, Lasso) and distance-based models converge faster and perform better when features are on a comparable scale. Where scaling is applied, StandardScaler (zero mean, unit variance) is the default choice.
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)
```
Always call fit_transform on training data and transform (not fit_transform) on test data and live inference inputs. Fitting the scaler on test data would leak distribution information from the hold-out set.
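How the fitted scaler reaches the Flask API is not shown on this page. One common approach, assumed here, is to serialize it with joblib at training time and load it at inference time, so the live service reuses the training-set statistics without refitting:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training matrix (LotArea, YearBuilt) standing in for X_train
X_train = np.array([[9500.0, 1975.0], [8000.0, 1960.0], [11000.0, 2001.0]])

scaler = StandardScaler().fit(X_train)               # fit on training data only
path = os.path.join(tempfile.gettempdir(), "scaler.joblib")
joblib.dump(scaler, path)                            # persisted at training time

# At inference time: load and transform, never refit
loaded = joblib.load(path)
x_live = loaded.transform(np.array([[9500.0, 1975.0]]))
```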
