The diabetes prediction dataset suffers from class imbalance - there are far more patients without diabetes than with diabetes. This page explains why imbalance is a problem and how SMOTEENN addresses it.
Class Imbalance: The dataset has an approximately 10:1 ratio of non-diabetic to diabetic patients. Without intervention, the model would be biased toward predicting “no diabetes” for everyone.
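To see that skew concretely, here is a quick sketch. The counts are illustrative, chosen to match the ~10:1 ratio and the 91.5% majority share used later on this page:

```python
from collections import Counter

# Hypothetical label vector: 0 = no diabetes, 1 = diabetes
y = [0] * 91_500 + [1] * 8_500

counts = Counter(y)
ratio = counts[0] / counts[1]
print(counts[0], counts[1])  # 91500 8500
print(round(ratio, 1))       # 10.8
```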
Synthetic samples are plausible patients because they:
Fall within the feature space of real diabetes patients
Maintain correlations between features (e.g., high BMI with high glucose)
Don’t simply duplicate existing samples
Example:
```python
# Real patient:
{"age": 50, "bmi": 32, "HbA1c": 6.5}

# Synthetic patient (interpolated):
{"age": 51, "bmi": 33, "HbA1c": 6.6}

# Both are realistic diabetes patients
```
Expands Decision Boundaries
Synthetic samples help the model learn the full range of diabetic patient characteristics:
```
Before SMOTE:
Model sees 8,500 diabetes examples
→ Limited view of diabetic patient space

After SMOTE:
Model sees 50,000+ diabetes examples
→ Comprehensive view of diabetic patient space
```
Prevents Overfitting to Majority
With balanced classes, the model can’t achieve high accuracy by ignoring the minority class.
```
Imbalanced:
"Predict all no diabetes" → 91.5% accuracy

Balanced:
"Predict all no diabetes" → 50% accuracy (useless)
```
Forces the model to learn discriminative patterns for both classes.
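A tiny demonstration of why raw accuracy misleads here. The labels are synthetic, matching the 91.5% majority share above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 915 non-diabetic, 85 diabetic
y_test = np.array([0] * 915 + [1] * 85)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_test)

print(accuracy_score(y_test, y_pred))  # 0.915, yet it catches zero diabetics
print(recall_score(y_test, y_pred))    # 0.0
```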
After SMOTE oversamples, ENN cleans up noisy and borderline samples:
1. For Each Sample
Examine each sample in the dataset (both classes)
2. Check Neighbors
Find the k nearest neighbors (default k=3)
```python
# Example: check this patient
patient_X = ["Female", 45, 0, 0, "never", 28, 5.8, 110]  # diabetes=1

# Find its 3 nearest neighbors:
# neighbor_1: diabetes=0  <- Different class!
# neighbor_2: diabetes=0  <- Different class!
# neighbor_3: diabetes=1  <- Same class

# Majority class among neighbors: 0 (no diabetes)
```
3. Remove if Mismatched
If the majority of neighbors have a DIFFERENT class, remove the sample:
```python
if majority_neighbor_class != sample_class:
    remove_sample(patient_X)

# Reasoning: this patient sits in a "no diabetes" region
# but is labeled "diabetes" → likely noise or an outlier
```
```python
# Possible noise:
{"age": 25, "bmi": 20, "HbA1c": 4.8, "glucose": 85, "diabetes": 1}

# All features suggest no diabetes, but the label says diabetes.
# Could be a data entry error.
# ENN removes this to prevent confusing the model.
```
Clarifies Class Boundaries
Borderline cases that overlap between classes are removed:
```
Feature Space (O = no diabetes, X = diabetes):

Before ENN:  OOOOOOOO XXXOXX OOOO
                      ^^^^^^ overlapping region

After ENN:   OOOOOOOO XXXXXX
                      ^^^^^^ clearer boundary
```
Improves model’s ability to distinguish classes.
Prevents SMOTE Artifacts
SMOTE can create synthetic samples in noisy regions:
```python
# If SMOTE interpolates between two borderline samples,
# it might create ambiguous synthetic samples.
# ENN cleans these up after SMOTE runs.
```
```python
SMOTEENN(
    random_state=42,           # For reproducibility
    sampling_strategy='auto',  # Balance toward a 1:1 ratio
    smote=None,                # Use default SMOTE
    enn=None,                  # Use default ENN
    n_jobs=None,               # Single-threaded
)
```
```python
# Alternative: class weights (no resampling)
from sklearn.ensemble import RandomForestClassifier

# Automatically weight classes by inverse frequency
model = RandomForestClassifier(class_weight='balanced')
model.fit(X, y)  # No need for resampling

# Or manual weights
model = RandomForestClassifier(class_weight={0: 1, 1: 10})
```
1. Scale Features
SMOTEENN uses distance metrics, so feature scaling is essential.
2. Set Random State
```python
smote_enn = SMOTEENN(random_state=42)
```
Ensures reproducible results across runs.
3. Only Resample Training Data
```python
from sklearn.model_selection import train_test_split

# Split first
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Then resample ONLY the training set
X_train_res, y_train_res = smote_enn.fit_resample(X_train, y_train)

# The test set keeps its original distribution
```
4. Evaluate on the Imbalanced Test Set
```python
# Train on the balanced data
model.fit(X_train_res, y_train_res)

# Test on imbalanced data (the real-world distribution)
predictions = model.predict(X_test)

# Use metrics that account for imbalance
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))
```
Accuracy is misleading with imbalanced data. Use these instead:
Confusion Matrix
Precision & Recall
ROC-AUC
PR-AUC
```python
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)
#             Predicted
#             No    Yes
# Actual No | TN  | FP |
#        Yes| FN  | TP |

TN, FP, FN, TP = cm.ravel()
```
```python
# Precision: of predicted diabetes cases, what % truly have it?
precision = TP / (TP + FP)

# Recall (sensitivity): of actual diabetes cases, what % did we catch?
recall = TP / (TP + FN)

# F1-score: harmonic mean of precision and recall
f1 = 2 * (precision * recall) / (precision + recall)
```
For diabetes screening, prioritize high recall (catch all cases).
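One common way to trade precision for extra recall without retraining is to lower the decision threshold on predicted probabilities. A sketch on toy data; the 0.2 threshold is illustrative, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

proba = model.predict_proba(X)[:, 1]
default_pred = (proba >= 0.5).astype(int)
screening_pred = (proba >= 0.2).astype(int)  # lower threshold flags more patients

# Recall can only rise (or stay equal) as the threshold drops
print(recall_score(y, default_pred), recall_score(y, screening_pred))
```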
```python
from sklearn.metrics import roc_auc_score, roc_curve

# Area under the ROC curve
auc = roc_auc_score(y_true, y_pred_proba)

# Interpretation:
# 0.5     = random guessing
# 1.0     = perfect classifier
# 0.8-0.9 = good
```
```python
from sklearn.metrics import average_precision_score

# Precision-Recall AUC (better suited to imbalanced data)
pr_auc = average_precision_score(y_true, y_pred_proba)

# More informative than ROC-AUC when classes are imbalanced
```