Dataset Overview
The Diabetes Prediction Dataset is a collection of medical and demographic data from patients, along with their diabetes status. This dataset is specifically designed for healthcare professionals to identify patients at risk of developing diabetes.
Source: Kaggle - Diabetes Prediction Dataset
Size: 100,000 patient records
Features: 8 input features + 1 target variable
Dataset Structure
The dataset contains 100,000 rows and 9 columns with a mix of categorical, numeric, and binary features.
Sample Data
Here’s what the raw data looks like:
Feature Descriptions
1. Gender (Categorical)
Patient’s biological gender.
Values:
- Female (encoded as 0)
- Male (encoded as 1)
- Other (encoded as 2)
2. Age (Numeric)
Patient’s age in years.
Type: Float64
Range: Varies from young adults to elderly patients
Example: 28.0, 36.0, 54.0, 76.0, 80.0
3. Hypertension (Binary)
Indicates whether the patient has hypertension (high blood pressure).
Values:
- 0: No hypertension
- 1: Has hypertension
4. Heart Disease (Binary)
Indicates whether the patient has been diagnosed with heart disease.
Values:
- 0: No heart disease
- 1: Has heart disease
5. Smoking History (Categorical)
Patient’s smoking status and history.
Values:
- No Info (encoded as 0) - No information available
- current (encoded as 1) - Currently smokes
- ever (encoded as 2) - Has smoked at some point
- former (encoded as 3) - Former smoker
- never (encoded as 4) - Never smoked
- not current (encoded as 5) - Not currently smoking
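The integer codes listed for Gender and Smoking History can be applied with plain lookup tables. The following is a minimal sketch, not the project's own preprocessing code: the column names `gender` and `smoking_history` and the helper `encode_record` are assumptions made for illustration.

```python
# Code tables copied from the feature descriptions above.
GENDER_CODES = {"Female": 0, "Male": 1, "Other": 2}
SMOKING_CODES = {
    "No Info": 0,
    "current": 1,
    "ever": 2,
    "former": 3,
    "never": 4,
    "not current": 5,
}


def encode_record(record):
    """Return a copy of `record` with both categorical fields replaced by integer codes.

    The keys "gender" and "smoking_history" are assumed column names.
    An unknown label raises KeyError instead of being silently mapped.
    """
    encoded = dict(record)
    encoded["gender"] = GENDER_CODES[record["gender"]]
    encoded["smoking_history"] = SMOKING_CODES[record["smoking_history"]]
    return encoded


# Hypothetical record, made up for illustration.
example = encode_record({"gender": "Female", "smoking_history": "never", "age": 54.0})
```

Both tables assign codes in the sorted order of the labels, which is also the order a label encoder that sorts its classes would produce.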
6. BMI - Body Mass Index (Numeric)
A measure of body fat based on height and weight.
Type: Float64
Formula: weight (kg) / height² (m²)
Example Values: 20.14, 23.45, 25.19, 27.32, 32.27
Interpretation:
- < 18.5: Underweight
- 18.5-24.9: Normal weight
- 25.0-29.9: Overweight
- ≥ 30.0: Obese
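As a small worked example of the formula and the interpretation bands above, the sketch below computes a BMI and maps it to a category. The function names are hypothetical, and the bands are treated as half-open intervals (for instance, 24.95 counts as Normal weight), which is an assumption about how the gaps between the listed ranges should be handled.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight (kg) divided by height (m) squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


def bmi_category(value):
    """Map a BMI value to the interpretation bands listed above."""
    if value < 18.5:
        return "Underweight"
    if value < 25.0:
        return "Normal weight"
    if value < 30.0:
        return "Overweight"
    return "Obese"


# 70 kg at 1.75 m gives roughly 22.86, which falls in the Normal weight band.
example_bmi = bmi(70, 1.75)
example_category = bmi_category(example_bmi)
```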
7. HbA1c Level (Numeric)
Hemoglobin A1c level, a measure of average blood glucose over the past 2-3 months.
Type: Float64
Unit: Percentage (%)
Example Values: 4.8, 5.0, 5.7, 6.2, 6.6
Clinical Significance:
- < 5.7%: Normal
- 5.7-6.4%: Prediabetes
- ≥ 6.5%: Diabetes
8. Blood Glucose Level (Numeric)
Current blood glucose (sugar) level measurement.
Type: Int64
Unit: mg/dL (milligrams per deciliter)
Example Values: 80, 140, 155, 158, 220
Clinical Significance:
- < 100 mg/dL: Normal (fasting)
- 100-125 mg/dL: Prediabetes (fasting)
- ≥ 126 mg/dL: Diabetes (fasting)
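The HbA1c and fasting glucose thresholds share the same three-band shape, so one lookup can serve both. The sketch below is illustrative only: the names `classify`, `HBA1C_CUTOFFS`, and `FASTING_GLUCOSE_CUTOFFS` are hypothetical, and the glucose bands apply only if the measurement was taken fasting, which the dataset description does not confirm.

```python
from bisect import bisect_right

LABELS = ("Normal", "Prediabetes", "Diabetes")

# Lower bounds of the Prediabetes and Diabetes bands, taken from the lists above.
HBA1C_CUTOFFS = (5.7, 6.5)            # percent
FASTING_GLUCOSE_CUTOFFS = (100, 126)  # mg/dL


def classify(value, cutoffs):
    """Return the band label for `value`; a value equal to a cutoff falls in the higher band."""
    return LABELS[bisect_right(cutoffs, value)]


hba1c_band = classify(6.2, HBA1C_CUTOFFS)
glucose_band = classify(140, FASTING_GLUCOSE_CUTOFFS)
```

These bands describe single measurements and are meant as reading aids for the features, not as a substitute for the `diabetes` target column.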
9. Diabetes (Target Variable)
Whether the patient has been diagnosed with diabetes.
Type: Binary (Int64)
Values:
- 0: No diabetes
- 1: Has diabetes
Data Quality
The dataset has no missing values across all 100,000 records.
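One way to verify a completeness claim like this is to count empty cells per column. The sketch below uses only the standard library. The column names and the two sample rows are made up for illustration, and the second row is deliberately left incomplete to show what a detected gap looks like. On the real file, every count should be zero.

```python
import csv
import io


def count_missing(rows):
    """Count empty or absent cells per column in an iterable of dict rows."""
    missing = {}
    for row in rows:
        for column, value in row.items():
            missing.setdefault(column, 0)
            if value is None or not value.strip():
                missing[column] += 1
    return missing


# Illustrative stand-in for the real CSV; the second row lacks a bmi value on purpose.
SAMPLE_CSV = """gender,age,bmi,diabetes
Female,54.0,27.32,0
Male,28.0,,0
"""

sample_missing = count_missing(csv.DictReader(io.StringIO(SAMPLE_CSV)))
```

To run the same check on the downloaded file, pass `csv.DictReader(open(path, newline=""))` in place of the in-memory sample.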
Class Imbalance
The dataset exhibits significant class imbalance, a common challenge in medical datasets.
Solution: SMOTEENN Resampling
The project addresses this using SMOTEENN (SMOTE + Edited Nearest Neighbors):
- Oversamples the minority class (diabetes=1) using SMOTE
- Undersamples by removing noisy samples using Edited Nearest Neighbors
- Results in a more balanced training set
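To make the two steps concrete, here is a simplified pure-Python sketch of the idea. It is not the project's code and not a library implementation (packages such as imbalanced-learn provide a `SMOTEENN` class for real use). It assumes binary labels, numeric feature tuples, at least two minority samples, and a majority-vote cleaning rule. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def nearest(points, i, k):
    """Indices of the k points closest to points[i] (squared Euclidean), excluding i."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(points[i], p)), j)
        for j, p in enumerate(points)
        if j != i
    )
    return [j for _, j in dists[:k]]


def smote(X, y, minority=1, k=3, seed=0):
    """Append interpolated minority samples until both classes are the same size."""
    rng = random.Random(seed)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    n_new = len(y) - 2 * len(minority_pts)
    X_out, y_out = list(X), list(y)
    for _ in range(max(n_new, 0)):
        i = rng.randrange(len(minority_pts))
        j = rng.choice(nearest(minority_pts, i, k))
        gap = rng.random()  # position along the segment between the two points
        base, neighbour = minority_pts[i], minority_pts[j]
        X_out.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
        y_out.append(minority)
    return X_out, y_out


def edited_nearest_neighbours(X, y, k=3):
    """Keep only samples whose label matches the most common label among their k neighbours."""
    kept = [
        i for i in range(len(X))
        if Counter(y[j] for j in nearest(X, i, k)).most_common(1)[0][0] == y[i]
    ]
    return [X[i] for i in kept], [y[i] for i in kept]


# Toy data, made up for illustration: 40 majority points near (0, 0), 8 minority near (10, 10).
rng = random.Random(42)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
X += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(8)]
y = [0] * 40 + [1] * 8

X_over, y_over = smote(X, y)                               # oversampling step
X_bal, y_bal = edited_nearest_neighbours(X_over, y_over)   # cleaning step
print(Counter(y), Counter(y_over), Counter(y_bal))
```

Resampling of this kind is normally applied to the training split only, so that evaluation data keeps the original class distribution.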
For more details on how imbalanced data is handled, see Imbalanced Data Handling.
Data Preprocessing Pipeline
Accessing the Dataset
Prerequisites
You need Kaggle API credentials to download the dataset:
Create Kaggle Account
Sign up at kaggle.com if you don’t have an account.
Generate API Token
- Go to your Kaggle account settings
- Scroll to “API” section
- Click “Create New API Token”
- This downloads kaggle.json with your credentials
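Once the token is downloaded, the Kaggle command-line client conventionally looks for it at `~/.kaggle/kaggle.json` and expects it to be readable only by its owner. The sketch below shows one way to put the file in place. The helper `install_kaggle_token` is hypothetical, and the permission step assumes a POSIX system.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_token(downloaded_token, kaggle_dir=None):
    """Copy a downloaded kaggle.json into the Kaggle config directory.

    `kaggle_dir` defaults to ~/.kaggle; pass another directory to try this out
    without touching your real configuration.
    """
    kaggle_dir = Path(kaggle_dir) if kaggle_dir else Path.home() / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)
    destination = kaggle_dir / "kaggle.json"
    shutil.copyfile(downloaded_token, destination)
    # Owner read/write only (equivalent to chmod 600), so the key is not world-readable.
    os.chmod(destination, stat.S_IRUSR | stat.S_IWUSR)
    return destination
```

A typical call would be `install_kaggle_token(Path.home() / "Downloads" / "kaggle.json")`, assuming the browser saved the token to the Downloads folder.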
Download Commands
Dataset Files
After downloading and unzipping, you’ll have:
- diabetes_prediction_dataset.csv - The full dataset with 100,000 records
- train.csv - Training data (with diabetes column)
- test.csv - Test data (optionally without diabetes column for predictions)
Next Steps
Patient Features
Detailed medical interpretation of each feature
Data Preprocessing
Learn about the preprocessing pipeline
Imbalanced Data
Understanding SMOTEENN resampling technique
Quick Start
Start making predictions with the dataset