Project Overview

The Diabetes Prediction ML project is a complete machine learning solution that predicts diabetes risk using patient medical and demographic data. The project demonstrates a full ML development lifecycle from exploration to production deployment.

Key Features

Machine Learning Model

RandomForestClassifier trained on 100,000 patient records, with SMOTEENN resampling to handle class imbalance

Three Development Phases

Progressive evolution from Jupyter notebook exploration to REST API deployment

Docker Containerization

Fully containerized deployments for both CLI and API interfaces

FastAPI REST API

Production-ready API with automatic Swagger documentation

Dataset Characteristics

The project uses the Diabetes Prediction Dataset from Kaggle, containing:
  • 100,000 patient records
  • 8 input features (medical and demographic)
  • 1 target variable (diabetes status: 0 or 1)
The dataset exhibits significant class imbalance, with far fewer positive diabetes cases than negative cases. The project addresses this with the SMOTEENN resampling technique.
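As a rough illustration of how the imbalance could be measured before resampling, the sketch below counts labels with the standard library. The `class_balance` helper and the toy label list are hypothetical and not taken from the project; the real positive rate should be computed from the dataset itself.

```python
from collections import Counter


def class_balance(labels):
    """Return the share of each class label as a fraction of all rows."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy labels for illustration only (not the real dataset distribution).
toy_labels = [0] * 9 + [1]
print(class_balance(toy_labels))
```

A strongly skewed result here is the signal that plain accuracy will be misleading and that resampling is worth considering.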

Patient Features

The model analyzes the following patient characteristics:
| Feature | Type | Description | Example Values |
| --- | --- | --- | --- |
| gender | Categorical | Patient’s gender | Female, Male, Other |
| age | Numeric | Patient’s age in years | 36, 54, 80 |
| hypertension | Binary | Presence of hypertension | 0 (no), 1 (yes) |
| heart_disease | Binary | Presence of heart disease | 0 (no), 1 (yes) |
| smoking_history | Categorical | Smoking status | never, current, former, ever, not current, No Info |
| bmi | Numeric | Body Mass Index | 23.45, 27.32, 32.27 |
| HbA1c_level | Numeric | Hemoglobin A1c level (%) | 5.0, 6.2, 6.6 |
| blood_glucose_level | Numeric | Blood glucose level (mg/dL) | 140, 158, 220 |
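To make the feature list concrete, here is a minimal sketch of a single patient record as a Python dataclass. The `PatientRecord` class is hypothetical and not part of the project. Treating the example values above as the complete set of allowed categories is also an assumption, and the sample record simply combines values from the examples column.

```python
from dataclasses import asdict, dataclass

# Assumption: the example values listed above are the only valid categories.
GENDERS = {"Female", "Male", "Other"}
SMOKING_HISTORY = {"never", "current", "former", "ever", "not current", "No Info"}


@dataclass
class PatientRecord:
    """One input row with the eight features described above."""

    gender: str
    age: float
    hypertension: int
    heart_disease: int
    smoking_history: str
    bmi: float
    HbA1c_level: float
    blood_glucose_level: float

    def __post_init__(self):
        # Reject categories and flags outside the documented values.
        if self.gender not in GENDERS:
            raise ValueError(f"unknown gender: {self.gender!r}")
        if self.smoking_history not in SMOKING_HISTORY:
            raise ValueError(f"unknown smoking_history: {self.smoking_history!r}")
        for name in ("hypertension", "heart_disease"):
            if getattr(self, name) not in (0, 1):
                raise ValueError(f"{name} must be 0 or 1")


# Illustrative record built from the example values in the table.
record = PatientRecord("Female", 54, 0, 0, "never", 27.32, 6.6, 140)
print(asdict(record))
```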

Technical Stack

scikit-learn==1.4.1.post1    # Machine learning algorithms
pandas==2.2.1                 # Data manipulation
imbalanced-learn==0.12.0      # SMOTEENN for imbalanced data

Development Phases

Phase 1: Exploration

Interactive Jupyter notebook for data exploration, model training, and evaluation in Google Colab.

Best for: Understanding the dataset, experimenting with features, and initial model development.

Phase 2: CLI Tools

Docker-based command-line interface with separate train.py and predict.py scripts.

Best for: Batch predictions, automated pipelines, and local development.

Phase 3: REST API

FastAPI-based REST API with endpoints for training and real-time predictions.

Best for: Production deployments, web integrations, and microservices architecture.
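
For the Phase 3 API, a client request could be assembled roughly as shown below. This is a sketch only: the `/predict` path, the `http://localhost:8000` base URL, and the `build_predict_request` helper are assumptions made for illustration, so check the API's Swagger documentation for the actual routes and request schema.

```python
import json
from urllib.request import Request


def build_predict_request(patient, base_url="http://localhost:8000"):
    """Build (but do not send) a JSON POST request for one patient.

    The base URL and the /predict path are hypothetical placeholders.
    """
    body = json.dumps(patient).encode("utf-8")
    return Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Illustrative payload using the eight documented feature names.
patient = {
    "gender": "Female",
    "age": 54,
    "hypertension": 0,
    "heart_disease": 0,
    "smoking_history": "never",
    "bmi": 27.32,
    "HbA1c_level": 6.6,
    "blood_glucose_level": 140,
}
request = build_predict_request(patient)
# Once the API is running, send it with urllib.request.urlopen(request).
```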

Model Performance

The RandomForestClassifier is trained with the following preprocessing pipeline:
  1. Categorical Encoding: Gender and smoking history converted to numeric codes
  2. Feature Scaling: StandardScaler standardization (zero mean, unit variance) for all features
  3. Resampling: SMOTEENN (SMOTE + Edited Nearest Neighbors) to handle class imbalance
  4. Training: RandomForestClassifier with default hyperparameters
The model must be trained before making predictions. In Phases 2 and 3, run the training step first to generate the model.pkl file.
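
A prediction script can guard against a missing model file with a check like the one below. This is a minimal sketch that assumes `model.pkl` is a standard pickle file in the working directory; the project may serialize the model differently (for example with joblib), and the `load_model` helper is hypothetical.

```python
import pickle
from pathlib import Path


def load_model(path="model.pkl"):
    """Load the trained model, failing with a clear message if it is missing."""
    model_path = Path(path)
    if not model_path.exists():
        raise FileNotFoundError(
            f"{model_path} not found; run the training step first."
        )
    # Only unpickle files you created yourself: pickle can execute code on load.
    with model_path.open("rb") as handle:
        return pickle.load(handle)
```

Failing early with an explicit message is easier to diagnose than an unpickling error raised deep inside a prediction call.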

Use Cases

Healthcare professionals and researchers can use this system to:
  • Risk Assessment: Identify patients at high risk of developing diabetes
  • Early Detection: Screen large populations for diabetes indicators
  • Treatment Planning: Develop personalized prevention strategies
  • Research: Explore relationships between medical/demographic factors and diabetes likelihood

Next Steps

Quick Start

Get up and running with your first prediction in minutes

Dataset Details

Deep dive into the dataset structure and characteristics

Model Architecture

Understand the RandomForest model and preprocessing pipeline

Docker Setup

Deploy the application using Docker containers
