Dataset

Dataset Overview

The Diabetes Prediction Dataset is a collection of medical and demographic data from patients, along with their diabetes status. It is intended to help healthcare professionals identify patients at risk of developing diabetes.

Source: Kaggle - Diabetes Prediction Dataset
Size: 100,000 patient records
Features: 8 input features + 1 target variable

Dataset Structure

The dataset contains 100,000 rows and 9 columns with a mix of categorical, numeric, and binary features:

# Dataset shape
(100000, 9)

# Data types
gender                 object   # Categorical
age                   float64   # Numeric
hypertension            int64   # Binary
heart_disease           int64   # Binary  
smoking_history        object   # Categorical
bmi                   float64   # Numeric
HbA1c_level           float64   # Numeric
blood_glucose_level     int64   # Numeric
diabetes                int64   # Target (Binary)

Sample Data

Here’s what the raw data looks like:

gender   age  hypertension  heart_disease smoking_history    bmi  HbA1c_level  blood_glucose_level  diabetes
Female  80.0             0              1           never  25.19          6.6                  140         0
Female  54.0             0              0         No Info  27.32          6.6                   80         0
  Male  28.0             0              0           never  27.32          5.7                  158         0
Female  36.0             0              0         current  23.45          5.0                  155         0
  Male  76.0             1              1         current  20.14          4.8                  155         0
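To reproduce the shape and dtype listings on a small scale, the five sample rows above can be loaded with pandas. This is a minimal sketch: the inline CSV text is a stand-in for the real file, and the comment shows how the same call would look with the file name given under Dataset Files.

```python
import io

import pandas as pd

# The five sample rows shown above, written out as CSV text.
SAMPLE_CSV = """gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
Female,80.0,0,1,never,25.19,6.6,140,0
Female,54.0,0,0,No Info,27.32,6.6,80,0
Male,28.0,0,0,never,27.32,5.7,158,0
Female,36.0,0,0,current,23.45,5.0,155,0
Male,76.0,1,1,current,20.14,4.8,155,0
"""

# For the real file, pass the path instead:
#   df = pd.read_csv("diabetes_prediction_dataset.csv")
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.shape)   # (rows, columns)
print(df.dtypes)  # dtype inferred for each column
```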

Feature Descriptions

1. Gender (Categorical)

Patient’s biological gender.

Values:

Female (encoded as 0)
Male (encoded as 1)
Other (encoded as 2)

Encoding Used:

gender_dict = {'Female': 0, 'Male': 1, 'Other': 2}

2. Age (Numeric)

Patient’s age in years.

Type: float64
Range: Varies from young adults to elderly patients
Example: 28.0, 36.0, 54.0, 76.0, 80.0

3. Hypertension (Binary)

Indicates whether the patient has hypertension (high blood pressure).

Values:

0: No hypertension
1: Has hypertension

Significance: Hypertension is a known risk factor for diabetes

4. Heart Disease (Binary)

Indicates whether the patient has been diagnosed with heart disease.

Values:

0: No heart disease
1: Has heart disease

Significance: Cardiovascular conditions often correlate with metabolic disorders

5. Smoking History (Categorical)

Patient’s smoking status and history.

Values:

No Info (encoded as 0) - No information available
current (encoded as 1) - Currently smokes
ever (encoded as 2) - Has smoked at some point
former (encoded as 3) - Former smoker
never (encoded as 4) - Never smoked
not current (encoded as 5) - Not currently smoking

Encoding Used:

smoking_history_dict = {
    'No Info': 0, 
    'current': 1, 
    'ever': 2, 
    'former': 3, 
    'never': 4, 
    'not current': 5
}

6. BMI - Body Mass Index (Numeric)

A measure of body fat based on height and weight.

Type: float64
Formula: weight (kg) / height² (m²)
Example Values: 20.14, 23.45, 25.19, 27.32, 32.27

Interpretation:

< 18.5: Underweight
18.5-24.9: Normal weight
25.0-29.9: Overweight
≥ 30.0: Obese
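The formula and categories above can be sketched as two small functions. The function names are illustrative and not part of the project, and values between 24.9 and 25.0 (which the table leaves unassigned) are treated here as normal weight.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Map a BMI value to the categories listed above."""
    if value < 18.5:
        return "Underweight"
    if value < 25.0:
        return "Normal weight"
    if value < 30.0:
        return "Overweight"
    return "Obese"


# Example values taken from the list above.
for value in (20.14, 25.19, 32.27):
    print(value, bmi_category(value))
```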

7. HbA1c Level (Numeric)

Hemoglobin A1c level - a measure of average blood glucose over the past 2-3 months.

Type: float64
Unit: Percentage (%)
Example Values: 4.8, 5.0, 5.7, 6.2, 6.6

Clinical Significance:

< 5.7%: Normal
5.7-6.4%: Prediabetes
≥ 6.5%: Diabetes
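As a rough illustration of the ranges above, the cut-offs can be stored as a sorted tuple and looked up with `bisect`. The names `HBA1C_CUTOFFS`, `HBA1C_LABELS`, and `hba1c_category` are hypothetical helpers for this sketch, not part of the project.

```python
from bisect import bisect_right

# Lower bounds of the "Prediabetes" and "Diabetes" ranges, in percent.
HBA1C_CUTOFFS = (5.7, 6.5)
HBA1C_LABELS = ("Normal", "Prediabetes", "Diabetes")


def hba1c_category(percent: float) -> str:
    """Return the clinical range an HbA1c percentage falls into."""
    # bisect_right counts how many cut-offs are <= percent,
    # which is exactly the index of the matching label.
    return HBA1C_LABELS[bisect_right(HBA1C_CUTOFFS, percent)]


# Example values taken from the list above.
for percent in (4.8, 5.7, 6.2, 6.6):
    print(percent, hba1c_category(percent))
```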

8. Blood Glucose Level (Numeric)

Current blood glucose (sugar) level measurement.

Type: int64
Unit: mg/dL (milligrams per deciliter)
Example Values: 80, 140, 155, 158, 220

Clinical Significance:

< 100 mg/dL: Normal (fasting)
100-125 mg/dL: Prediabetes (fasting)
≥ 126 mg/dL: Diabetes (fasting)
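The cut-offs above apply to fasting measurements, and this page does not say whether blood_glucose_level was measured fasting. A quick check against the five sample rows suggests the column should not be read as a diagnosis on its own. The sketch below uses only values from the Sample Data listing.

```python
# (blood_glucose_level, diabetes) pairs from the five sample rows.
sample = [(140, 0), (80, 0), (158, 0), (155, 0), (155, 0)]

FASTING_DIABETES_CUTOFF = 126  # mg/dL, from the list above

# Rows at or above the fasting cut-off that are still labelled non-diabetic.
above_cutoff_not_diabetic = [
    glucose
    for glucose, diabetes in sample
    if glucose >= FASTING_DIABETES_CUTOFF and diabetes == 0
]

print(len(above_cutoff_not_diabetic), "of", len(sample))  # 4 of 5
```

Four of the five sample rows sit at or above 126 mg/dL yet carry diabetes = 0, so a single threshold on this column does not reproduce the target label.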

9. Diabetes (Target Variable)

Whether the patient has been diagnosed with diabetes.

Type: Binary (int64)

Values:

0: No diabetes
1: Has diabetes

Distribution: The dataset is imbalanced with significantly more negative cases

Data Quality

Completeness

The dataset has no missing values across all 100,000 records:

# All columns have 100,000 non-null values
gender               100000 non-null
age                  100000 non-null
hypertension         100000 non-null
heart_disease        100000 non-null
smoking_history      100000 non-null
bmi                  100000 non-null
HbA1c_level          100000 non-null
blood_glucose_level  100000 non-null
diabetes             100000 non-null

Memory Usage

The dataset is memory-efficient:

# Total memory usage
memory usage: 6.9+ MB
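Both checks can be repeated directly in pandas. The "+" in the reported figure means it is a lower bound: by default pandas does not measure the contents of object (string) columns, and passing `deep=True` includes them. The sketch below uses a three-row stand-in frame, so its numbers are not the dataset's.

```python
import pandas as pd

# Tiny stand-in frame; replace with the DataFrame loaded from the CSV.
df = pd.DataFrame({
    "gender": ["Female", "Male", "Female"],
    "age": [80.0, 28.0, 36.0],
    "diabetes": [0, 0, 0],
})

# Number of missing values in each column.
missing_per_column = df.isna().sum()

# Total bytes used; deep=True also measures the string contents.
total_bytes = df.memory_usage(deep=True).sum()

print(missing_per_column)
print(f"{total_bytes / 1024 ** 2:.4f} MiB")
```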

Class Imbalance

The dataset exhibits significant class imbalance, a common challenge in medical datasets.

The number of patients without diabetes (class 0) far exceeds those with diabetes (class 1). This imbalance can cause models to be biased toward predicting the majority class.
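To see why this matters, it helps to compute the class shares and the accuracy of a model that always predicts the majority class. The labels below are hypothetical (a 90/10 split chosen for illustration) and are not the dataset's actual distribution.

```python
from collections import Counter

# Hypothetical labels for illustration only: 9 negatives for every positive.
y = [0] * 90 + [1] * 10

counts = Counter(y)
total = sum(counts.values())
shares = {label: count / total for label, count in counts.items()}

# Accuracy of a model that always predicts the most common class.
majority_label, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / total

print(shares)             # {0: 0.9, 1: 0.1}
print(baseline_accuracy)  # 0.9
```

A classifier that never predicts diabetes would score 90% accuracy on these labels while missing every positive case, which is why accuracy alone is a weak metric here. On a pandas Series, `y.value_counts(normalize=True)` gives the same shares.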

Solution: SMOTEENN Resampling

The project addresses this using SMOTEENN (SMOTE + Edited Nearest Neighbors):

from imblearn.combine import SMOTEENN

# Apply over and undersampling with SMOTEENN
smote_enn = SMOTEENN(random_state=42)
X_resampled, y_resampled = smote_enn.fit_resample(X_train, y_train)

This technique:

Oversamples the minority class (diabetes=1) using SMOTE
Undersamples by removing noisy samples using Edited Nearest Neighbors
Results in a more balanced training set

For more details on how imbalanced data is handled, see Imbalanced Data Handling.

Data Preprocessing Pipeline

Categorical Encoding

Convert gender and smoking_history to numeric codes:

gender_dict = {'Female': 0, 'Male': 1, 'Other': 2}
smoking_history_dict = {
    'No Info': 0, 'current': 1, 'ever': 2, 
    'former': 3, 'never': 4, 'not current': 5
}
data = data.replace({
    'gender': gender_dict, 
    'smoking_history': smoking_history_dict
})
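One thing to keep in mind with `replace` is that any value missing from the dictionaries is left unchanged, so a misspelled or unexpected category would pass through as a string. The variant below is a sketch, not the project's code. It uses `Series.map` and raises an error when a column contains a value that has no code. The three-row frame is a stand-in for the real data.

```python
import pandas as pd

gender_dict = {"Female": 0, "Male": 1, "Other": 2}
smoking_history_dict = {
    "No Info": 0, "current": 1, "ever": 2,
    "former": 3, "never": 4, "not current": 5,
}

# Stand-in rows; replace with the DataFrame loaded from the CSV.
data = pd.DataFrame({
    "gender": ["Female", "Male", "Other"],
    "smoking_history": ["never", "No Info", "not current"],
})

mappings = {"gender": gender_dict, "smoking_history": smoking_history_dict}

encoded = data.copy()
for column, mapping in mappings.items():
    # Fail loudly if the column holds a category with no assigned code.
    unknown = set(encoded[column].unique()) - set(mapping)
    if unknown:
        raise ValueError(f"Unmapped values in {column!r}: {sorted(unknown)}")
    encoded[column] = encoded[column].map(mapping).astype("int64")

print(encoded)
```

The integer codes are arbitrary labels: smoking_history code 4 ("never") is not "more" than code 1 ("current"), which can matter for models that treat inputs as ordered quantities.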

Feature Scaling

Standardize all features to zero mean and unit variance using StandardScaler:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
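If the data is split into training and test sets (as the `X_train, y_train` call in the Class Imbalance section suggests), a common variant is to fit the scaler on the training rows only and reuse its statistics for the test rows, so that no information from the test set influences preprocessing. The sketch below shows that ordering on hypothetical random data. The 80/20 split and the array shapes are assumptions, not details from the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # hypothetical features
y = np.array([0] * 180 + [1] * 20)                   # hypothetical labels

# Stratify so both splits keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.mean(axis=0).round(6))  # approximately 0 per column
print(X_train_scaled.std(axis=0).round(6))   # approximately 1 per column
```

In this ordering, resampling with SMOTEENN would be applied to `X_train_scaled, y_train` only, leaving the test set with its original class distribution.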

Resampling

Balance the dataset using SMOTEENN:

from imblearn.combine import SMOTEENN

smote_enn = SMOTEENN(random_state=42)
X_balanced, y_balanced = smote_enn.fit_resample(X_scaled, y)

Accessing the Dataset

Prerequisites

You need Kaggle API credentials to download the dataset:

1. Create a Kaggle account.

2. Generate an API token:

   Go to your Kaggle account settings
   Scroll to the “API” section
   Click “Create New API Token”
   This downloads kaggle.json with your credentials

3. Use in Phase 1: upload kaggle.json to Google Colab when running the Phase 1 notebook.

Download Commands

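# Set up Kaggle credentials
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
# Restrict permissions so the Kaggle CLI does not warn about a readable key
!chmod 600 ~/.kaggle/kaggle.json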

# Download dataset
!kaggle datasets download -d iammustafatz/diabetes-prediction-dataset
!unzip -q diabetes-prediction-dataset.zip -d .

Dataset Files

After downloading and unzipping, you’ll have:

diabetes_prediction_dataset.csv - The full dataset with 100,000 records

For Phase 2 (CLI) and Phase 3 (API), you’ll need to split this into:

train.csv - Training data (with diabetes column)
test.csv - Test data (optionally without diabetes column for predictions)
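One way to produce these two files is a stratified split with scikit-learn. This is a sketch under assumptions: the 80/20 ratio, the random seed, and the temporary output directory are illustrative choices, and the small frame stands in for the full CSV.

```python
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame; replace with pd.read_csv("diabetes_prediction_dataset.csv").
df = pd.DataFrame({
    "age": [float(a) for a in range(20, 40)],
    "bmi": [20.0 + i * 0.5 for i in range(20)],
    "diabetes": [0] * 16 + [1] * 4,
})

# 80/20 split, stratified so both files keep the same class ratio.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["diabetes"], random_state=42
)

out_dir = Path(tempfile.mkdtemp())  # hypothetical output location
train_df.to_csv(out_dir / "train.csv", index=False)

# Drop the target column to produce a prediction-only test file.
test_df.drop(columns=["diabetes"]).to_csv(out_dir / "test.csv", index=False)

print(len(train_df), len(test_df))  # 16 4
```

Keep a copy of the test labels somewhere if you want to evaluate predictions later, since this version of test.csv no longer contains them.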

Next Steps

Patient Features: Detailed medical interpretation of each feature
Data Preprocessing: Learn about the preprocessing pipeline
Imbalanced Data: Understanding the SMOTEENN resampling technique
Quick Start: Start making predictions with the dataset
