
Overview

Phase 1 uses a Jupyter notebook (Diabetes_Prediction.ipynb) for interactive data exploration, visualization, model training, and evaluation. This phase is perfect for understanding the dataset, experimenting with different approaches, and developing the initial model.
Best For: Data scientists and ML engineers who want to explore the data, visualize patterns, and iterate on model development.

Platform: Google Colab (free, no local setup required)

Prerequisites

1. Kaggle Account

Create a free account at kaggle.com

2. Kaggle API Credentials

  1. Go to your Kaggle account settings
  2. Scroll to the “API” section
  3. Click “Create New API Token”
  4. Download kaggle.json (you’ll upload this to Colab)

3. Google Colab Access

Navigate to Google Colab - no installation needed

Getting Started

1. Open Notebook in Colab

  1. Go to Google Colab
  2. Click File → Upload notebook
  3. Navigate to ~/workspace/source/fase-1/Diabetes_Prediction.ipynb
  4. Upload the file
Alternatively, if using Colab’s GitHub integration:
File → Open notebook → GitHub tab

2. Upload Kaggle Credentials

When you run the first code cell, you’ll need to upload your kaggle.json file:
# Move the credentials into place and download the dataset from Kaggle
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d iammustafatz/diabetes-prediction-dataset
!unzip -q diabetes-prediction-dataset.zip -d .
The notebook will prompt you to upload kaggle.json.

3. Run All Cells

Execute the notebook cells sequentially:
Runtime → Run all
Or use Shift + Enter to run cells one by one.

Notebook Structure

The notebook is organized into logical sections:

1. Overview

The notebook begins with a description of the dataset:
“The following data is a collection of medical and demographic data from patients, along with their diabetes status (positive or negative). This dataset includes 100,000 rows and 9 features:”
Features listed:
  1. gender
  2. age
  3. hypertension
  4. heart_disease
  5. smoking_history
  6. bmi (body mass index)
  7. HbA1c_level
  8. blood_glucose_level
  9. diabetes (target variable)
“Healthcare professionals may find this data useful in identifying patients at risk of developing diabetes and in developing personalized treatment plans.”

2. Data Download

# Download data from Kaggle
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d iammustafatz/diabetes-prediction-dataset
!unzip -q diabetes-prediction-dataset.zip -d .

3. Import Libraries

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from imblearn.combine import SMOTEENN
The notebook uses:
  • pandas for data manipulation
  • matplotlib/seaborn for visualization
  • sklearn for ML algorithms
  • imblearn for handling imbalanced data

4. Load and Explore Data

# Load the data
data = pd.read_csv("diabetes_prediction_dataset.csv")
data = data[:100000]
print(data.shape)  # (100000, 9)
data.head()
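Beyond head(), the usual first checks are the shape, the dtypes, and missing-value counts. A self-contained sketch, using two made-up rows in place of the real CSV file:

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for diabetes_prediction_dataset.csv
csv = io.StringIO(
    "gender,age,hypertension,heart_disease,smoking_history,bmi,"
    "HbA1c_level,blood_glucose_level,diabetes\n"
    "Female,80.0,0,1,never,25.19,6.6,140,0\n"
    "Male,28.0,0,0,never,27.32,5.7,158,0\n"
)
data = pd.read_csv(csv)

shape = data.shape                    # (rows, columns)
missing = data.isnull().sum().sum()   # total count of missing cells
```

On the real dataset, `data.shape` should report (100000, 9) and the missing count should be 0.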

5. Visualize Class Distribution

# Verify the distribution of classes
sns.countplot(data=data, x='diabetes', palette=['skyblue', 'lightcoral'], hue='diabetes')
plt.legend(['No diabetes', 'Diabetes'])
plt.title('Diabetes Distribution')
plt.show()
This visualization reveals significant class imbalance - there are far more patients without diabetes than with diabetes. This is addressed later using SMOTEENN.

6. Encode Categorical Variables

# Encode categorical variables (gender and smoking_history) as numeric
le = LabelEncoder()
data['gender'] = le.fit_transform(data['gender'])
data['smoking_history'] = le.fit_transform(data['smoking_history'])

# Verify encoding
data.head()
While the notebook uses LabelEncoder, the production code (Phase 2 & 3) uses explicit dictionaries for more control:
gender_dict = {'Female': 0, 'Male': 1, 'Other': 2}
smoking_history_dict = {'No Info': 0, 'current': 1, 'ever': 2, 'former': 3, 'never': 4, 'not current': 5}
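Applied with pandas, the explicit mappings look like this; a minimal sketch using a made-up two-row frame in place of the real dataset:

```python
import pandas as pd

gender_dict = {'Female': 0, 'Male': 1, 'Other': 2}
smoking_history_dict = {'No Info': 0, 'current': 1, 'ever': 2,
                        'former': 3, 'never': 4, 'not current': 5}

# Tiny hypothetical frame standing in for the real data
df = pd.DataFrame({'gender': ['Female', 'Male'],
                   'smoking_history': ['never', 'No Info']})

# .map() replaces each category with its fixed numeric code
df['gender'] = df['gender'].map(gender_dict)
df['smoking_history'] = df['smoking_history'].map(smoking_history_dict)
```

Unlike LabelEncoder, the codes never depend on which categories happen to appear in the data, so training and inference stay consistent.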

7. Split Features and Target

# Divide data into features (X) and target (y)
X = data.drop('diabetes', axis=1)
y = data['diabetes']  # a Series avoids sklearn's column-vector warning

8. Train-Test Split

# Split 70% for training and validation, 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)
Split Ratio: 70% training, 30% testing

Stratification: The notebook doesn’t explicitly use stratification, but you could add stratify=y to ensure balanced splits.
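With stratify=y, both splits keep exactly the same class ratio as the full dataset. A sketch on a toy imbalanced label vector (made-up data, not the diabetes set):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 90 negatives, 10 positives (10% positive rate)
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=42)
# Stratification keeps the 10% positive rate: 7 of 70 and 3 of 30
```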

9. Feature Scaling

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
StandardScaler normalizes features to have:
  • Mean = 0
  • Standard deviation = 1
This is crucial for algorithms sensitive to feature scales. Note that the notebook applies SMOTEENN (next section) to the unscaled X_train, so the RandomForest is actually trained on unscaled features; tree-based models are insensitive to scale, so results are unaffected, but for scale-sensitive models you would resample X_train_scaled instead.
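As a quick check, standardization can be reproduced by hand on a small made-up array; NumPy’s default population standard deviation (ddof=0) matches what StandardScaler uses:

```python
import numpy as np

# z = (x - mean) / std, the same transform StandardScaler.fit_transform applies
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
z = (x - x.mean()) / x.std()
```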

10. Handle Imbalanced Data

# Apply oversampling and undersampling with SMOTEENN
smote_enn = SMOTEENN(random_state=42)
X_resampled, y_resampled = smote_enn.fit_resample(X_train, y_train)
SMOTEENN combines two techniques:
  1. SMOTE (Synthetic Minority Over-sampling Technique):
    • Creates synthetic samples of the minority class (diabetes=1)
    • Interpolates between existing minority samples
  2. ENN (Edited Nearest Neighbors):
    • Removes noisy samples from both classes
    • Cleans up overlap between classes
Result: A more balanced dataset with cleaner decision boundaries. See Imbalanced Data Handling for more details.

11. Train Model

# Train the model
model = RandomForestClassifier()
model.fit(X_resampled, y_resampled)
RandomForest is an ensemble learning method that:
  • Builds multiple decision trees
  • Aggregates their predictions (majority voting)
  • Handles non-linear relationships well
  • Is relatively resistant to overfitting
  • Works well out-of-the-box with default parameters
Default Parameters Used:
  • n_estimators=100 (number of trees)
  • max_depth=None (nodes expanded until pure)
  • min_samples_split=2
  • random_state=None (random)
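For reproducible runs you can pass these defaults explicitly along with a fixed random_state. A sketch on synthetic data (make_classification stands in for the resampled training set, which isn’t reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced stand-in for the resampled training data
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.9, 0.1], random_state=42)

# Same values as the defaults, but spelled out and seeded
model = RandomForestClassifier(n_estimators=100, max_depth=None,
                               min_samples_split=2, random_state=42)
model.fit(X, y)
```

With random_state fixed, retraining on the same data yields the same forest, which makes metric comparisons between experiments meaningful.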

12. Make Predictions

# Scale test data
X_test_scaled = scaler.transform(X_test)

# Make predictions
y_pred = model.predict(X_test_scaled)

13. Evaluate Model

# Generate classification report
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))
This produces a detailed report with:
  • Precision: What % of positive predictions were correct?
  • Recall: What % of actual positives were found?
  • F1-Score: Harmonic mean of precision and recall
  • Support: Number of samples in each class
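These metrics are simple ratios of confusion-matrix counts. A worked sketch with hypothetical counts for the positive (diabetes=1) class:

```python
# Hypothetical confusion-matrix counts: true positives, false positives, false negatives
tp, fp, fn = 80, 20, 10

precision = tp / (tp + fp)  # share of positive predictions that were correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
```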

Key Concepts Demonstrated

Data Exploration

Loading, inspecting, and visualizing the dataset to understand patterns and distributions

Preprocessing

Encoding categorical variables and scaling numeric features for model training

Class Imbalance

Using SMOTEENN to create a balanced training set from imbalanced medical data

Model Training

Training RandomForestClassifier and evaluating performance with classification metrics

Experimentation Ideas

The notebook is perfect for trying different approaches:
Try other classifiers:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier

# Logistic Regression
model = LogisticRegression(max_iter=1000)
model.fit(X_resampled, y_resampled)

# Support Vector Machine
model = SVC(kernel='rbf')
model.fit(X_resampled, y_resampled)

# Naive Bayes
model = GaussianNB()
model.fit(X_resampled, y_resampled)

# Gradient Boosting
model = GradientBoostingClassifier(n_estimators=100)
model.fit(X_resampled, y_resampled)

Sample Output

When you run the notebook, you’ll see:

Data Preview

   gender   age  hypertension  heart_disease smoking_history    bmi  HbA1c_level  blood_glucose_level  diabetes
0  Female  80.0             0              1           never  25.19          6.6                  140         0
1  Female  54.0             0              0         No Info  27.32          6.6                   80         0
2    Male  28.0             0              0           never  27.32          5.7                  158         0
3  Female  36.0             0              0         current  23.45          5.0                  155         0
4    Male  76.0             1              1         current  20.14          4.8                  155         0

Class Distribution Plot

A bar chart showing the imbalance between diabetes=0 and diabetes=1 classes.

Classification Report

              precision    recall  f1-score   support

           0       0.96      0.97      0.97     28500
           1       0.85      0.82      0.83      1500

    accuracy                           0.95     30000
   macro avg       0.91      0.89      0.90     30000
weighted avg       0.95      0.95      0.95     30000
Actual numbers will vary based on the random train-test split.

Advantages of Phase 1

  • Run code cells individually
  • See immediate visual feedback
  • Experiment without affecting production code
  • Easy to share results with team
  • Runs entirely in Google Colab
  • No need to install Python or dependencies
  • Free GPU/TPU access available
  • Cloud storage integration
  • Markdown cells explain each step
  • Code and results in one place
  • Perfect for presentations and reports
  • Educational for understanding ML workflow

Limitations

Phase 1 is not suitable for:
  • Production deployments
  • Automated pipelines
  • Serving predictions to end users
  • Batch processing large files
For these use cases, proceed to Phase 2 (CLI) or Phase 3 (API).

Next Steps

1. Experiment

Try different models, hyperparameters, and resampling techniques in the notebook

2. Move to CLI

Once satisfied with the model, proceed to Phase 2 for command-line tools

3. Deploy API

For production use, implement Phase 3 REST API

Phase 2: CLI

Command-line tools for batch predictions

Phase 3: API

REST API for production deployments

Model Architecture

Deep dive into RandomForest and preprocessing

Imbalanced Data

Understanding SMOTEENN technique
