Dataset Overview

The Diabetes Prediction Dataset is a collection of medical and demographic data from patients, along with their diabetes status. This dataset is specifically designed for healthcare professionals to identify patients at risk of developing diabetes.
Source: Kaggle - Diabetes Prediction Dataset
Size: 100,000 patient records
Features: 8 input features + 1 target variable

Dataset Structure

The dataset contains 100,000 rows and 9 columns with a mix of categorical, numeric, and binary features:
# Dataset shape
(100000, 9)

# Data types
gender                 object   # Categorical
age                   float64   # Numeric
hypertension            int64   # Binary
heart_disease           int64   # Binary  
smoking_history        object   # Categorical
bmi                   float64   # Numeric
HbA1c_level           float64   # Numeric
blood_glucose_level     int64   # Numeric
diabetes                int64   # Target (Binary)

Sample Data

Here’s what the raw data looks like:
   gender   age  hypertension  heart_disease smoking_history    bmi  HbA1c_level  blood_glucose_level  diabetes
0  Female  80.0             0              1           never  25.19          6.6                  140         0
1  Female  54.0             0              0         No Info  27.32          6.6                   80         0
2    Male  28.0             0              0           never  27.32          5.7                  158         0
3  Female  36.0             0              0         current  23.45          5.0                  155         0
4    Male  76.0             1              1         current  20.14          4.8                  155         0
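The rows above can be loaded and inspected with pandas. This is a minimal sketch: to stay self-contained it reads the five sample rows from an in-memory string, and the variable names (`SAMPLE_CSV`, `df`) are illustrative. With the downloaded file you would pass the CSV path instead.

```python
import io

import pandas as pd

# First five rows of the dataset, copied from the sample shown above.
SAMPLE_CSV = """gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
Female,80.0,0,1,never,25.19,6.6,140,0
Female,54.0,0,0,No Info,27.32,6.6,80,0
Male,28.0,0,0,never,27.32,5.7,158,0
Female,36.0,0,0,current,23.45,5.0,155,0
Male,76.0,1,1,current,20.14,4.8,155,0
"""

# With the real file: df = pd.read_csv("diabetes_prediction_dataset.csv")
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.shape)   # (rows, columns)
print(df.dtypes)  # per-column data types
```

Note that "No Info" is read as an ordinary string value, not as a missing value.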

Feature Descriptions

gender
Patient’s biological gender.
Values:
  • Female (encoded as 0)
  • Male (encoded as 1)
  • Other (encoded as 2)
Encoding Used:
gender_dict = {'Female': 0, 'Male': 1, 'Other': 2}

age
Patient’s age in years.
Type: Float64
Range: Varies from young adults to elderly patients
Example: 28.0, 36.0, 54.0, 76.0, 80.0

hypertension
Indicates whether the patient has hypertension (high blood pressure).
Values:
  • 0: No hypertension
  • 1: Has hypertension
Significance: Hypertension is a known risk factor for diabetes

heart_disease
Indicates whether the patient has been diagnosed with heart disease.
Values:
  • 0: No heart disease
  • 1: Has heart disease
Significance: Cardiovascular conditions often correlate with metabolic disorders

smoking_history
Patient’s smoking status and history.
Values:
  • No Info (encoded as 0) - No information available
  • current (encoded as 1) - Currently smokes
  • ever (encoded as 2) - Has smoked at some point
  • former (encoded as 3) - Former smoker
  • never (encoded as 4) - Never smoked
  • not current (encoded as 5) - Not currently smoking
Encoding Used:
smoking_history_dict = {
    'No Info': 0, 
    'current': 1, 
    'ever': 2, 
    'former': 3, 
    'never': 4, 
    'not current': 5
}

bmi
A measure of body fat based on height and weight.
Type: Float64
Formula: weight (kg) / height² (m²)
Example Values: 20.14, 23.45, 25.19, 27.32, 32.27
Interpretation:
  • < 18.5: Underweight
  • 18.5-24.9: Normal weight
  • 25.0-29.9: Overweight
  • ≥ 30.0: Obese
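As a worked example of the formula and bands above, the following sketch computes BMI and maps it to a category. The function names and the 70 kg / 1.75 m patient are hypothetical. The bands are treated as half-open intervals, so a value such as 24.95 falls under "Normal weight".

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Map a BMI value to the interpretation bands listed above."""
    if value < 18.5:
        return "Underweight"
    if value < 25.0:
        return "Normal weight"
    if value < 30.0:
        return "Overweight"
    return "Obese"


# Hypothetical patient: 70 kg, 1.75 m tall.
value = bmi(70, 1.75)
print(round(value, 2), bmi_category(value))
```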

HbA1c_level
Hemoglobin A1c level - a measure of average blood glucose over the past 2-3 months.
Type: Float64
Unit: Percentage (%)
Example Values: 4.8, 5.0, 5.7, 6.2, 6.6
Clinical Significance:
  • < 5.7%: Normal
  • 5.7-6.4%: Prediabetes
  • ≥ 6.5%: Diabetes

blood_glucose_level
Current blood glucose (sugar) level measurement.
Type: Int64
Unit: mg/dL (milligrams per deciliter)
Example Values: 80, 140, 155, 158, 220
Clinical Significance:
  • < 100 mg/dL: Normal (fasting)
  • 100-125 mg/dL: Prediabetes (fasting)
  • ≥ 126 mg/dL: Diabetes (fasting)
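The HbA1c and glucose thresholds above can be expressed as a small lookup. This is an illustrative sketch only: the helper name `classify` and the constants are assumptions. The page does not state whether the recorded glucose values are fasting measurements, so the resulting labels are for orientation, not diagnosis.

```python
from bisect import bisect_right

LABELS = ("Normal", "Prediabetes", "Diabetes")
HBA1C_CUTOFFS = (5.7, 6.5)    # percent
GLUCOSE_CUTOFFS = (100, 126)  # mg/dL, fasting thresholds


def classify(value: float, cutoffs: tuple) -> str:
    """Return the label of the band that contains value.

    Bands are half-open: a value equal to a cutoff belongs to the higher band.
    """
    return LABELS[bisect_right(cutoffs, value)]


print(classify(5.7, HBA1C_CUTOFFS))
print(classify(140, GLUCOSE_CUTOFFS))
```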

diabetes
Whether the patient has been diagnosed with diabetes.
Type: Binary (Int64)
Values:
  • 0: No diabetes
  • 1: Has diabetes
Distribution: The dataset is imbalanced with significantly more negative cases

Data Quality

The dataset has no missing values across all 100,000 records:
# All columns have 100,000 non-null values
gender               100000 non-null
age                  100000 non-null
hypertension         100000 non-null
heart_disease        100000 non-null
smoking_history      100000 non-null
bmi                  100000 non-null
HbA1c_level          100000 non-null
blood_glucose_level  100000 non-null
diabetes             100000 non-null
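A check like the one above could be reproduced as follows. The helper name `missing_report` and the three-row demo frame are made up for illustration; with the real data you would pass the loaded DataFrame. The sketch also counts "No Info" entries separately, since pandas treats that string as a regular category rather than a missing value.

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.Series:
    """Count missing (NaN/None) entries per column."""
    return df.isna().sum()


# Tiny illustrative frame with one deliberately missing age.
demo = pd.DataFrame({
    "age": [80.0, None, 28.0],
    "smoking_history": ["never", "No Info", "current"],
})

report = missing_report(demo)
print(report)

# "No Info" is not NaN, so it has to be counted explicitly.
no_info = int((demo["smoking_history"] == "No Info").sum())
print("No Info entries:", no_info)
```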

Class Imbalance

The dataset exhibits significant class imbalance, a common challenge in medical datasets.
Patients without diabetes (class 0) far outnumber those with diabetes (class 1), which can bias models toward predicting the majority class.
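The degree of imbalance can be measured directly from the target column. The sketch below uses made-up labels (nine negatives, one positive) purely to show the calculation; `class_balance` is a hypothetical helper, and with the real data you would pass the `diabetes` column.

```python
import pandas as pd


def class_balance(y: pd.Series) -> pd.DataFrame:
    """Per-class counts and shares of the total."""
    counts = y.value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Made-up labels for illustration only, not the real distribution.
y_demo = pd.Series([0] * 9 + [1] * 1, name="diabetes")

balance = class_balance(y_demo)
ratio = balance["count"].max() / balance["count"].min()
print(balance)
print("majority/minority ratio:", ratio)
```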

Solution: SMOTEENN Resampling

The project addresses this using SMOTEENN (SMOTE + Edited Nearest Neighbors):
from imblearn.combine import SMOTEENN

# Apply over and undersampling with SMOTEENN
smote_enn = SMOTEENN(random_state=42)
X_resampled, y_resampled = smote_enn.fit_resample(X_train, y_train)
This technique:
  1. Oversamples the minority class (diabetes=1) using SMOTE
  2. Undersamples by removing noisy samples using Edited Nearest Neighbors
  3. Results in a more balanced training set
For more details on how imbalanced data is handled, see Imbalanced Data Handling.
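To make the oversampling step more concrete, here is a simplified NumPy sketch of the interpolation idea behind SMOTE: a synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbours. It is not imbalanced-learn's implementation, the function name `smote_like_sample` is hypothetical, and the Edited Nearest Neighbors cleaning step is not shown.

```python
import numpy as np


def smote_like_sample(X_minority: np.ndarray, k: int,
                      rng: np.random.Generator) -> np.ndarray:
    """Create one synthetic point between a random minority sample and
    one of its k nearest minority neighbours (assumes k < len(X_minority))."""
    i = rng.integers(len(X_minority))
    x = X_minority[i]
    # Euclidean distance from x to every minority sample.
    dists = np.linalg.norm(X_minority - x, axis=1)
    # Nearest neighbours, excluding the chosen sample itself.
    order = np.argsort(dists)
    neighbours = order[order != i][:k]
    x_nn = X_minority[rng.choice(neighbours)]
    lam = rng.random()  # interpolation weight in [0, 1)
    return x + lam * (x_nn - x)


# Four made-up minority samples on the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
rng = np.random.default_rng(42)
new_point = smote_like_sample(X_min, k=2, rng=rng)
print(new_point)
```

Because the new point is a convex combination of two existing samples, it always lies between them in feature space.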

Data Preprocessing Pipeline

1. Categorical Encoding

Convert gender and smoking_history to numeric codes:
gender_dict = {'Female': 0, 'Male': 1, 'Other': 2}
smoking_history_dict = {
    'No Info': 0, 'current': 1, 'ever': 2, 
    'former': 3, 'never': 4, 'not current': 5
}
data = data.replace({
    'gender': gender_dict, 
    'smoking_history': smoking_history_dict
})
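One possible variant of this step uses `Series.map` and fails loudly when a category has no code, which helps catch typos or new values in fresh data. This is a sketch, not the project's code: `encode_categoricals` and the two-row demo frame are hypothetical, while the dictionaries are the ones shown above.

```python
import pandas as pd

gender_dict = {'Female': 0, 'Male': 1, 'Other': 2}
smoking_history_dict = {
    'No Info': 0, 'current': 1, 'ever': 2,
    'former': 3, 'never': 4, 'not current': 5
}


def encode_categoricals(data: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with gender and smoking_history mapped to integer codes."""
    encoded = data.copy()
    mappings = (("gender", gender_dict),
                ("smoking_history", smoking_history_dict))
    for column, mapping in mappings:
        # Reject values that have no code instead of silently producing NaN.
        unknown = set(encoded[column].unique()) - set(mapping)
        if unknown:
            names = sorted(map(str, unknown))
            raise ValueError(f"unmapped values in {column!r}: {names}")
        encoded[column] = encoded[column].map(mapping).astype("int64")
    return encoded


# Made-up rows for illustration.
demo = pd.DataFrame({
    "gender": ["Female", "Male"],
    "smoking_history": ["never", "No Info"],
})
encoded_demo = encode_categoricals(demo)
print(encoded_demo)
```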
2. Feature Scaling

Standardize all features (zero mean, unit variance) using StandardScaler:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
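The arithmetic behind standardization is z = (x - mean) / std, computed per feature. The NumPy sketch below uses made-up values and assumes a separate train/test split, which the snippet above does not show. It computes the statistics on the training rows only and reuses them for the test rows, a common way to avoid leaking test-set information into preprocessing.

```python
import numpy as np

# Made-up feature values (columns could be, e.g., age and glucose).
X_train = np.array([[20.0, 80.0], [30.0, 120.0], [40.0, 160.0]])
X_test = np.array([[50.0, 200.0]])

# Per-feature statistics from the training data only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std  # reuse the training statistics

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

With scikit-learn, the equivalent pattern is `scaler.fit_transform(X_train)` followed by `scaler.transform(X_test)`.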
3. Resampling

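Balance the classes using SMOTEENN:
from imblearn.combine import SMOTEENN

# If a held-out test set is used, resample only the training portion
# so that evaluation reflects the original class distribution.
smote_enn = SMOTEENN(random_state=42)
X_balanced, y_balanced = smote_enn.fit_resample(X_scaled, y)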

Accessing the Dataset

Prerequisites

You need Kaggle API credentials to download the dataset:
1. Create Kaggle Account

Sign up at kaggle.com if you don’t have an account.
2. Generate API Token

  1. Go to your Kaggle account settings
  2. Scroll to the “API” section
  3. Click “Create New API Token”
  4. This downloads kaggle.json, which contains your credentials
3. Use in Phase 1

Upload kaggle.json to Google Colab when running the Phase 1 notebook.

Download Commands

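# Set up Kaggle credentials
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
# Restrict permissions so the Kaggle CLI does not warn about a readable token
!chmod 600 ~/.kaggle/kaggle.json

# Download and extract the dataset
!kaggle datasets download -d iammustafatz/diabetes-prediction-dataset
!unzip -q diabetes-prediction-dataset.zip -d .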

Dataset Files

After downloading and unzipping, you’ll have:
  • diabetes_prediction_dataset.csv - The full dataset with 100,000 records
For Phase 2 (CLI) and Phase 3 (API), you’ll need to split this into:
  • train.csv - Training data (with diabetes column)
  • test.csv - Test data (optionally without diabetes column for predictions)
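One way to produce these two files is a stratified split that keeps the class ratio similar in both parts. The sketch below is an assumption about how the split could be done, not the project's documented procedure: `stratified_split`, the 20% test fraction, and the 20-row demo frame are all hypothetical. scikit-learn's `train_test_split` with `stratify=` would be a comparable alternative.

```python
import pandas as pd


def stratified_split(df: pd.DataFrame, target: str = "diabetes",
                     test_frac: float = 0.2, seed: int = 42):
    """Split df into train/test, sampling test_frac of each class for test.

    Assumes df has a unique index.
    """
    test = df.groupby(target).sample(frac=test_frac, random_state=seed)
    train = df.drop(index=test.index)
    return train, test


# Made-up frame: 15 negatives and 5 positives.
demo = pd.DataFrame({
    "bmi": [float(i) for i in range(20)],
    "diabetes": [0] * 15 + [1] * 5,
})
train, test = stratified_split(demo)
print(len(train), len(test))

# With the real data, the parts could then be written out, for example:
# train.to_csv("train.csv", index=False)
# test.drop(columns="diabetes").to_csv("test.csv", index=False)
```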

Next Steps

Patient Features

Detailed medical interpretation of each feature

Data Preprocessing

Learn about the preprocessing pipeline

Imbalanced Data

Understanding SMOTEENN resampling technique

Quick Start

Start making predictions with the dataset
