Features

Project Overview

The Diabetes Prediction ML project is a complete machine learning solution that predicts diabetes risk using patient medical and demographic data. The project demonstrates a full ML development lifecycle from exploration to production deployment.

Key Features

Machine Learning Model

RandomForestClassifier trained on 100,000 patient records, with SMOTEENN resampling to handle class imbalance

Three Development Phases

Progressive evolution from Jupyter notebook exploration to REST API deployment

Docker Containerization

Fully containerized deployments for both CLI and API interfaces

FastAPI REST API

Production-ready API with automatic Swagger documentation

Dataset Characteristics

The project uses the Diabetes Prediction Dataset from Kaggle, containing:

100,000 patient records
8 input features (medical and demographic)
1 target variable (diabetes status: 0 or 1)

The dataset exhibits significant class imbalance, with far fewer positive diabetes cases than negative ones. The project addresses this with the SMOTEENN resampling technique.
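Before resampling, it can help to quantify how skewed the target actually is. The following is a minimal sketch using only the standard library; the helper name `class_balance` and the toy labels are hypothetical and do not come from the project. The resampling itself (SMOTEENN from imbalanced-learn) is not reproduced here.

```python
from collections import Counter


def class_balance(labels):
    """Return each class's share of the labels as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Toy labels, made up for illustration: 90 negatives and 10 positives.
toy_labels = [0] * 90 + [1] * 10
shares = class_balance(toy_labels)
print(shares)
```

Running the same check on the real `diabetes` column before and after resampling would show how much the class shares move.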

Patient Features

The model analyzes the following patient characteristics:

Feature	Type	Description	Example Values
`gender`	Categorical	Patient’s gender	Female, Male, Other
`age`	Numeric	Patient’s age in years	36, 54, 80
`hypertension`	Binary	Presence of hypertension	0 (no), 1 (yes)
`heart_disease`	Binary	Presence of heart disease	0 (no), 1 (yes)
`smoking_history`	Categorical	Smoking status	never, current, former, ever, not current, No Info
`bmi`	Numeric	Body Mass Index	23.45, 27.32, 32.27
`HbA1c_level`	Numeric	Hemoglobin A1c level (%)	5.0, 6.2, 6.6
`blood_glucose_level`	Numeric	Blood glucose level (mg/dL)	140, 158, 220
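As an illustration of the table above, one patient row could be represented as a small validated record. This is a sketch under assumptions: the class name `PatientRecord` is hypothetical, and the allowed category sets are taken from the example values in the table, which may not be exhaustive.

```python
from dataclasses import dataclass

# Category values copied from the "Example Values" column (assumed complete).
GENDERS = {"Female", "Male", "Other"}
SMOKING = {"never", "current", "former", "ever", "not current", "No Info"}


@dataclass(frozen=True)
class PatientRecord:
    """One row of input features; field names match the dataset columns."""

    gender: str
    age: float
    hypertension: int
    heart_disease: int
    smoking_history: str
    bmi: float
    HbA1c_level: float
    blood_glucose_level: float

    def __post_init__(self):
        # Reject categories and flags outside the documented values.
        if self.gender not in GENDERS:
            raise ValueError(f"unexpected gender: {self.gender!r}")
        if self.smoking_history not in SMOKING:
            raise ValueError(f"unexpected smoking_history: {self.smoking_history!r}")
        for name in ("hypertension", "heart_disease"):
            if getattr(self, name) not in (0, 1):
                raise ValueError(f"{name} must be 0 or 1")


example = PatientRecord("Female", 54, 0, 0, "never", 27.32, 6.2, 158)
print(example)
```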

Technical Stack

Core Libraries

scikit-learn==1.4.1.post1    # Machine learning algorithms
pandas==2.2.1                 # Data manipulation
imbalanced-learn==0.12.0      # SMOTEENN for imbalanced data

API Framework

fastapi==0.111.0              # REST API framework
pydantic                      # Data validation
uvicorn                       # ASGI server

Utilities

loguru==0.7.2                 # Structured logging
argparse                      # CLI argument parsing

Development Phases

Phase 1: Exploration

Interactive Jupyter notebook for data exploration, model training, and evaluation in Google Colab. Best for: Understanding the dataset, experimenting with features, and initial model development.

Phase 2: CLI Tools

Docker-based command-line interface with separate train.py and predict.py scripts. Best for: Batch predictions, automated pipelines, and local development.
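To give a feel for what a prediction CLI built on argparse might look like, here is a minimal sketch. The flag names, defaults, and the `build_parser` helper are all hypothetical; the actual interface of predict.py is not shown in this overview and may differ.

```python
import argparse


def build_parser():
    """Build an illustrative argument parser for a single-patient prediction."""
    parser = argparse.ArgumentParser(
        description="Predict diabetes risk for one patient (illustrative sketch)."
    )
    # Flag names below are assumptions, chosen to mirror the dataset columns.
    parser.add_argument("--gender", choices=["Female", "Male", "Other"], required=True)
    parser.add_argument("--age", type=float, required=True)
    parser.add_argument("--hypertension", type=int, choices=[0, 1], default=0)
    parser.add_argument("--heart-disease", type=int, choices=[0, 1], default=0)
    parser.add_argument("--smoking-history", default="No Info")
    parser.add_argument("--bmi", type=float, required=True)
    parser.add_argument("--hba1c", type=float, required=True)
    parser.add_argument("--glucose", type=float, required=True)
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    ["--gender", "Female", "--age", "54", "--bmi", "27.32", "--hba1c", "6.2", "--glucose", "158"]
)
print(vars(args))
```

Restricting `--gender` and the binary flags with `choices` lets argparse reject invalid input before any model code runs.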

Phase 3: REST API

FastAPI-based REST API with endpoints for training and real-time predictions. Best for: Production deployments, web integrations, and microservices architecture.

Model Performance

The RandomForestClassifier is trained with the following preprocessing pipeline:

Categorical Encoding: Gender and smoking history converted to numeric codes
Feature Scaling: StandardScaler standardization (zero mean, unit variance) for all features
Resampling: SMOTEENN (SMOTE + Edited Nearest Neighbors) to handle class imbalance
Training: RandomForestClassifier with default hyperparameters
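The first two steps above can be illustrated without any third-party libraries. The sketch below is a pure-Python stand-in, not the project's code: `encode_categories` and `standardize` are hypothetical helpers, and the alphabetical code ordering is an assumption. The standardization uses the population standard deviation, which is also what scikit-learn's StandardScaler uses. Resampling and training are not shown.

```python
from statistics import fmean, pstdev


def encode_categories(values):
    """Map each distinct category to an integer code, ordered alphabetically."""
    codes = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


def standardize(values):
    """Rescale values to zero mean and unit variance (population std)."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Toy inputs built from the example values in the feature table.
genders = ["Female", "Male", "Female", "Other"]
gender_codes, mapping = encode_categories(genders)
scaled_ages = standardize([36, 54, 80])

print(mapping)
print(gender_codes)
print(scaled_ages)
```

Whatever mapping and scaling statistics are derived from the training data must be reused unchanged at prediction time; otherwise the model sees inputs on a different scale than it was trained on.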

The model must be trained before it can make predictions. In Phases 2 and 3, run the training step first to generate the model.pkl file.
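A prediction script can fail early with a clear message when the trained model is missing. The sketch below assumes that model.pkl is a standard pickle file, which the extension suggests but this overview does not confirm; the `load_model` helper is hypothetical.

```python
import pickle
from pathlib import Path


def load_model(path="model.pkl"):
    """Load a pickled model, or raise a clear error if training has not run yet."""
    model_path = Path(path)
    if not model_path.exists():
        raise FileNotFoundError(f"{model_path} not found; run the training step first.")
    with model_path.open("rb") as fh:
        # Only unpickle files you trust: pickle can execute code on load.
        return pickle.load(fh)
```

Calling `load_model()` at startup, rather than on the first prediction request, surfaces a missing model immediately.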

Use Cases

Healthcare professionals and researchers can use this system to:

Risk Assessment: Identify patients at high risk of developing diabetes
Early Detection: Screen large populations for diabetes indicators
Treatment Planning: Develop personalized prevention strategies
Research: Explore relationships between medical/demographic factors and diabetes likelihood

Next Steps

Quick Start

Get up and running with your first prediction in minutes

Dataset Details

Deep dive into the dataset structure and characteristics

Model Architecture

Understand the RandomForest model and preprocessing pipeline

Docker Setup

Deploy the application using Docker containers
