Project Overview
The Diabetes Prediction ML project is a complete machine learning solution that predicts diabetes risk using patient medical and demographic data. The project demonstrates a full ML development lifecycle, from exploration to production deployment.
Key Features
Machine Learning Model
RandomForestClassifier trained on 100,000 patient records, with SMOTEENN resampling to handle the imbalanced classes
Three Development Phases
Progressive evolution from Jupyter notebook exploration to REST API deployment
Docker Containerization
Fully containerized deployments for both CLI and API interfaces
FastAPI REST API
Production-ready API with automatic Swagger documentation
Dataset Characteristics
The project uses the Diabetes Prediction Dataset from Kaggle, containing:
- 100,000 patient records
- 8 input features (medical and demographic)
- 1 target variable (diabetes status: 0 or 1)
The dataset exhibits significant class imbalance, with far fewer positive diabetes cases than negative ones. This is addressed with the SMOTEENN resampling technique.
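
A quick way to quantify the imbalance before resampling is to count the labels. The sketch below is illustrative only: `class_balance` is a hypothetical helper, and the toy labels are made up for the example rather than taken from the dataset.

```python
from collections import Counter


def class_balance(labels):
    """Return the share of each class label (0 = no diabetes, 1 = diabetes)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy labels for illustration only; not taken from the dataset.
toy_labels = [0] * 9 + [1]
balance = class_balance(toy_labels)
print(balance)  # {0: 0.9, 1: 0.1}
```
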
Patient Features
The model analyzes the following patient characteristics:

| Feature | Type | Description | Example Values |
|---|---|---|---|
| gender | Categorical | Patient’s gender | Female, Male, Other |
| age | Numeric | Patient’s age in years | 36, 54, 80 |
| hypertension | Binary | Presence of hypertension | 0 (no), 1 (yes) |
| heart_disease | Binary | Presence of heart disease | 0 (no), 1 (yes) |
| smoking_history | Categorical | Smoking status | never, current, former, ever, not current, No Info |
| bmi | Numeric | Body Mass Index | 23.45, 27.32, 32.27 |
| HbA1c_level | Numeric | Hemoglobin A1c level (%) | 5.0, 6.2, 6.6 |
| blood_glucose_level | Numeric | Blood glucose level (mg/dL) | 140, 158, 220 |
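
The feature table can be mirrored as a typed record, which makes it easier to catch malformed input before it reaches a model. This is a minimal sketch, not the project's own code: `PatientRecord` is a hypothetical name, the allowed category sets are assumed to be exactly the example values listed in the table, and the non-negativity check is an added assumption.

```python
from dataclasses import dataclass

# Assumption: the example values in the feature table are the full category sets.
GENDERS = {"Female", "Male", "Other"}
SMOKING_HISTORY = {"never", "current", "former", "ever", "not current", "No Info"}


@dataclass(frozen=True)
class PatientRecord:
    """One patient row, using the eight feature names from the table."""

    gender: str
    age: float
    hypertension: int
    heart_disease: int
    smoking_history: str
    bmi: float
    HbA1c_level: float
    blood_glucose_level: float

    def __post_init__(self) -> None:
        if self.gender not in GENDERS:
            raise ValueError(f"unexpected gender: {self.gender!r}")
        if self.smoking_history not in SMOKING_HISTORY:
            raise ValueError(f"unexpected smoking_history: {self.smoking_history!r}")
        # Binary flags must be 0 (no) or 1 (yes).
        for name in ("hypertension", "heart_disease"):
            if getattr(self, name) not in (0, 1):
                raise ValueError(f"{name} must be 0 or 1")
        # Assumption: numeric measurements are never negative.
        for name in ("age", "bmi", "HbA1c_level", "blood_glucose_level"):
            if getattr(self, name) < 0:
                raise ValueError(f"{name} must be non-negative")


# Values picked from the table's example column.
record = PatientRecord(
    gender="Female",
    age=54,
    hypertension=0,
    heart_disease=0,
    smoking_history="never",
    bmi=27.32,
    HbA1c_level=6.6,
    blood_glucose_level=140,
)
```

Constructing a record with an out-of-range value, such as `hypertension=2`, raises a `ValueError` instead of passing silently into the pipeline.
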
Technical Stack
- Core Libraries
- API Framework
- Utilities
Development Phases
Phase 1: Exploration
Interactive Jupyter notebook for data exploration, model training, and evaluation in Google Colab.
Best for: Understanding the dataset, experimenting with features, and initial model development.
Phase 2: CLI Tools
Docker-based command-line interface with separate train.py and predict.py scripts.
Best for: Batch predictions, automated pipelines, and local development.
Model Performance
The RandomForestClassifier is trained with the following preprocessing pipeline:
- Categorical Encoding: Gender and smoking history converted to numeric codes
- Feature Scaling: StandardScaler standardization (zero mean, unit variance) applied to all features
- Resampling: SMOTEENN (SMOTE + Edited Nearest Neighbors) to handle class imbalance
- Training: RandomForestClassifier with default hyperparameters
Use Cases
Healthcare professionals and researchers can use this system to:
- Risk Assessment: Identify patients at high risk of developing diabetes
- Early Detection: Screen large populations for diabetes indicators
- Treatment Planning: Develop personalized prevention strategies
- Research: Explore relationships between medical/demographic factors and diabetes likelihood
Next Steps
Quick Start
Get up and running with your first prediction in minutes
Dataset Details
Deep dive into the dataset structure and characteristics
Model Architecture
Understand the RandomForest model and preprocessing pipeline
Docker Setup
Deploy the application using Docker containers