Data preprocessing is a critical step that transforms raw patient data into a format suitable for machine learning. The diabetes prediction system applies a consistent preprocessing pipeline across all three phases.
Pipeline Steps: Categorical Encoding → Feature Scaling → Resampling (training only)
Goal: Convert mixed-type patient data into normalized numeric features
While the project uses label encoding, one-hot encoding is another option:
Label Encoding (Used)
One-Hot Encoding (Alternative)
```python
# Single column with numeric codes
smoking_history: [0, 1, 2, 3, 4, 5]

# Pros:
# - Compact (1 column)
# - Works well with tree-based models
# - Simple implementation

# Cons:
# - Implies an ordinal relationship
# - Not ideal for linear models
```
For RandomForest: Label encoding works fine because decision trees don't assume ordinal relationships.
For Linear/Logistic Regression: One-hot encoding is preferred, since numeric codes would impose a spurious ordering on the categories.
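For comparison, here is what the one-hot alternative would look like with pandas. This is a minimal sketch; the column name and category values are illustrative, not the project's actual schema:

```python
import pandas as pd

# Hypothetical patient records (illustrative categories)
df = pd.DataFrame({'smoking_history': ['never', 'current', 'former', 'never']})

# One-hot encoding: one binary column per category, no implied ordering
encoded = pd.get_dummies(df, columns=['smoking_history'])
print(list(encoded.columns))
# ['smoking_history_current', 'smoking_history_former', 'smoking_history_never']
```

The trade-off is the reverse of label encoding: more columns, but safe for linear models.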
Surprising Fact: RandomForest is scale-invariant - it doesn't actually need feature scaling!
Decision trees split on thresholds, so the scale doesn't matter:
`if bmi > 27.5` works the same as `if bmi_scaled > 0.0`
So why does the project scale features?
1. SMOTEENN requirement: The resampling algorithm uses distance metrics that ARE scale-sensitive
2. Future model flexibility: If you switch to a different model (SVM, Neural Network), scaling is already done
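The scale-invariance claim can be checked directly: fitting the same forest on raw and standardized features produces identical predictions, because standardization is a monotone transform that preserves the ordering of every split threshold. A minimal sketch on synthetic stand-in data (not the project's diabetes dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, purely for demonstration
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

X_scaled = StandardScaler().fit_transform(X)

# Same seed, raw vs. scaled features: the trees find equivalent splits
clf_raw = RandomForestClassifier(random_state=42).fit(X, y)
clf_scaled = RandomForestClassifier(random_state=42).fit(X_scaled, y)

print((clf_raw.predict(X) == clf_scaled.predict(X_scaled)).all())  # True
```

So the scaling step in this project exists for SMOTEENN and future models, not for the forest itself.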
Critical Distinction: fit_transform() for training, transform() for prediction
Training (fit_transform)
Prediction (transform)
Common Mistake
```python
# TRAINING: Learn statistics AND transform
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Scaler learns:
# - Mean of each feature from training data
# - Std dev of each feature from training data
# Then applies the transformation using those statistics
```
```python
# PREDICTION: Use training statistics
X_test_scaled = scaler.transform(X_test)

# Scaler uses:
# - Mean from TRAINING data (not test data)
# - Std dev from TRAINING data (not test data)
# This ensures consistent scaling
```
```python
# WRONG: Don't do this!
scaler_test = StandardScaler()
X_test_scaled = scaler_test.fit_transform(X_test)

# Problem:
# - Uses test data statistics
# - Different scale than training
# - Leads to poor predictions
# - Data leakage
```
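The fit/transform distinction is easy to verify numerically. A minimal sketch with illustrative values, showing that `transform()` always applies the statistics memorized at fit time:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (illustrative values only)
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[20.0]])

scaler = StandardScaler()
scaler.fit(X_train)

# The scaler memorizes TRAINING statistics
print(scaler.mean_)  # [20.]

# transform() centers test data using the training mean,
# so a test value equal to the training mean maps to 0
print(scaler.transform(X_test))  # [[0.]]
```

If the test set were fit separately, the same value could land anywhere on the scaled axis, which is exactly the leakage problem described above.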
Problem in Phase 2 & 3: The prediction scripts create a NEW scaler:
```python
# predict.py (INCORRECT)
scaler = StandardScaler()
Xts = scaler.fit_transform(Xts)  # Uses test data statistics!
```
This is incorrect and may reduce accuracy.
Recommended Fix: Save the scaler during training:
```python
# train.py
import pickle
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Save scaler with model
with open("model.pkl", "wb") as f:
    pickle.dump({'model': m, 'scaler': scaler}, f)
```
Load the scaler during prediction:
```python
# predict.py
import pickle

# Load scaler and model
with open("model.pkl", "rb") as f:
    saved = pickle.load(f)
m = saved['model']
scaler = saved['scaler']

# Use TRAINING scaler
Xts = scaler.transform(Xts)  # Correct!
```
Possible Cause: Different scaling for training vs prediction
Solution: Use the SAME scaler for both:
```python
# Save the training scaler
with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

# Load it in prediction
with open('scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)
X_scaled = scaler.transform(X)  # Not fit_transform!
```
KeyError during encoding
Cause: Input has a category not in the encoding dictionary
Example: Gender = "Non-binary" but the dictionary only has Female/Male/Other
Solution: Add validation or handle unknown values:
```python
# Validation approach
valid_values = set(gender_dict.keys())
invalid = set(data['gender']) - valid_values
if invalid:
    raise ValueError(f"Unknown gender values: {invalid}")

# Or map unknown values to "Other"
data['gender'] = data['gender'].apply(
    lambda x: x if x in gender_dict else 'Other')
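If the project moved from a hand-built encoding dictionary to scikit-learn, `OrdinalEncoder` can handle unknown categories declaratively via `handle_unknown='use_encoded_value'`. A minimal sketch, not the project's current code:

```python
from sklearn.preprocessing import OrdinalEncoder

# Unseen categories are mapped to a sentinel value instead of raising
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
enc.fit([['Female'], ['Male'], ['Other']])

# Known category -> its learned code; unseen category -> -1
print(enc.transform([['Male'], ['Non-binary']]))  # [[ 1.] [-1.]]
```

Like the pickled scaler, the fitted encoder would need to be saved at training time and reloaded for prediction so both phases share one category mapping.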