This project predicts individual medical insurance charges from a clean, structured dataset of demographic and health-related features. Four regression models are trained and compared: Linear Regression, Lasso (L1), Ridge (L2), and CatBoost. All three linear models reach an R² of 0.858, and CatBoost improves on that only marginally (R² of 0.860). The system is designed around a backend-first API architecture in which a prediction endpoint accepts structured JSON input and returns an estimated insurance charge.

Overview

Medical cost prediction is a regression problem with a well-defined set of input features. The Kaggle insurance dataset is clean (no missing values), making it ideal for learning the full regression pipeline without extensive data imputation.

Problem type: Regression
Target variable: charges (annual insurance cost, USD)
Dataset: insurance.csv (Kaggle Medical Insurance Dataset)
Key insights from EDA:
  • Smoking status is the strongest predictor of insurance charges
  • BMI and age show positive correlation with charges
  • Region and sex have lower predictive impact

Dataset

| Feature | Type | Description |
| --- | --- | --- |
| age | Numeric | Age of the primary beneficiary |
| sex | Categorical | Insurance contractor gender (male / female) |
| bmi | Numeric | Body mass index |
| children | Numeric | Number of dependents covered |
| smoker | Categorical | Whether the beneficiary smokes (yes / no) |
| region | Categorical | US residential region (northeast, southeast, southwest, northwest) |
| charges | Numeric (target) | Individual medical costs billed by health insurance |

The dataset contains no missing values. Categorical variables (sex, smoker, region) are encoded before model training.

Preprocessing pipeline

Steps:
  1. Encode sex as {male: 0, female: 1}
  2. Encode smoker as {no: 0, yes: 1}
  3. One-hot encode region (drop first to avoid multicollinearity)
  4. Apply StandardScaler (matters most for the regularized linear models, Lasso and Ridge, whose penalties depend on feature scale)
  5. Train/test split
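The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the sample rows are made up (they only mirror the columns of insurance.csv), and the `preprocess` helper, the `test_size` of 0.25, and the `random_state` are all assumptions. One deliberate difference from the listed order: the sketch splits first and fits the scaler on the training rows only, so that test-set statistics do not leak into the scaling.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up rows with the same columns as insurance.csv (illustrative values only).
df = pd.DataFrame({
    "age": [19, 33, 45, 52, 28, 61, 37, 24],
    "sex": ["female", "male", "male", "female", "male", "female", "male", "female"],
    "bmi": [27.9, 22.7, 30.1, 35.2, 33.0, 29.1, 26.4, 24.3],
    "children": [0, 0, 2, 1, 3, 0, 2, 1],
    "smoker": ["yes", "no", "no", "yes", "no", "no", "yes", "no"],
    "region": ["southwest", "northwest", "southeast", "northeast",
               "southeast", "northwest", "northeast", "southwest"],
    "charges": [16900.0, 4500.0, 8200.0, 41000.0, 4400.0, 13500.0, 21000.0, 2900.0],
})


def preprocess(frame, test_size=0.25, random_state=42):
    """Hypothetical helper: encode, split, then scale the features."""
    data = frame.copy()
    # Steps 1-2: binary encodings as listed above.
    data["sex"] = data["sex"].map({"male": 0, "female": 1})
    data["smoker"] = data["smoker"].map({"no": 0, "yes": 1})
    # Step 3: one-hot encode region, dropping the first category.
    data = pd.get_dummies(data, columns=["region"], drop_first=True, dtype=int)

    X = data.drop(columns=["charges"])
    y = data["charges"]

    # Split before scaling so the scaler never sees the test rows.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    return X_train_scaled, X_test_scaled, y_train, y_test, scaler


X_train, X_test, y_train, y_test, scaler = preprocess(df)
print(X_train.shape, X_test.shape)
```

Returning the fitted scaler alongside the arrays makes it possible to apply the same transformation to new inputs at prediction time.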

Models implemented

Linear Regression is the baseline model. It establishes a performance floor with no regularization penalty. Its coefficients are directly interpretable: each one is the change in predicted charges per one-unit change in the corresponding (scaled) feature, with the other features held fixed.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
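Lasso and Ridge follow the same fit pattern as the baseline. The sketch below fits all three on synthetic data so that it runs on its own; the data, the `alpha` values, and the `models` / `fitted` names are assumptions chosen for illustration, not settings taken from the notebook. CatBoost is left out because its configuration is not described here.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic stand-in for the scaled training data: the third feature
# has no real effect on the target, so its true coefficient is 0.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = X_train @ np.array([3.0, -2.0, 0.0]) + 5.0 + rng.normal(scale=0.1, size=200)

models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=0.5),   # L1 penalty; alpha is an arbitrary choice
    "ridge": Ridge(alpha=1.0),   # L2 penalty; alpha is an arbitrary choice
}

# fit() returns the estimator itself, so the dict holds the fitted models.
fitted = {name: model.fit(X_train, y_train) for name, model in models.items()}

for name, model in fitted.items():
    print(name, np.round(model.coef_, 3))
```

The L1 penalty can drive uninformative coefficients exactly to zero, while the L2 penalty shrinks all coefficients toward zero without eliminating them. On a small, clean dataset with few features, both penalties may change the fit very little, which is consistent with the three linear models scoring the same.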

Model performance

| Model | R² Score | MAE (USD) |
| --- | --- | --- |
| Linear Regression | 0.858 | 2,759 |
| Lasso (L1) | 0.858 | 2,759 |
| Ridge (L2) | 0.858 | 2,759 |
| CatBoost | 0.860 | ~2,700 |

CatBoost provides a marginal improvement, but linear models remain efficient and interpretable. For production use cases where explainability matters (e.g., insurance pricing), Ridge or Lasso are preferred.
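For reference, the two metrics in the table can be computed directly from their definitions: R² is one minus the ratio of the residual sum of squares to the total sum of squares, and MAE is the mean of the absolute errors. The sketch below uses made-up numbers and hypothetical function names (`r_squared`, `mae`); in practice `sklearn.metrics.r2_score` and `sklearn.metrics.mean_absolute_error` compute the same quantities.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target (USD here)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up charges and predictions, for illustration only.
y_true = [1000.0, 2000.0, 3000.0, 4000.0]
y_pred = [1100.0, 1900.0, 3200.0, 3800.0]

print(round(r_squared(y_true, y_pred), 3))  # 0.98
print(mae(y_true, y_pred))                  # 150.0
```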

API architecture

POST /predict

Accepts structured input fields and returns computed insurance charges.

Request body
  • age (integer, required): Age of the primary beneficiary (years).
  • sex (string, required): Gender of the beneficiary. Accepted values: "male", "female".
  • bmi (number, required): Body mass index. Typical range: 15.0–53.0.
  • children (integer, required): Number of dependents covered by the insurance plan.
  • smoker (string, required): Smoking status. Accepted values: "yes", "no".
  • region (string, required): US residential region. Accepted values: "northeast", "southeast", "southwest", "northwest".

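A server handling this request body would typically check it before encoding. The following is a minimal sketch of such a check using only the standard library. The `validate_payload` function and its error messages are hypothetical, and the numeric bounds (non-negative age and children, positive finite bmi) are assumptions; the field list only specifies types and accepted string values.

```python
import math

# Accepted string values, taken from the request body description.
ALLOWED_VALUES = {
    "sex": {"male", "female"},
    "smoker": {"yes", "no"},
    "region": {"northeast", "southeast", "southwest", "northwest"},
}


def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def _is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_payload(payload):
    """Return a list of problems; an empty list means the payload looks valid."""
    if not isinstance(payload, dict):
        return ["payload must be a JSON object"]

    errors = []
    for field in ("age", "children"):
        value = payload.get(field)
        if not _is_int(value) or value < 0:
            errors.append(f"{field} must be a non-negative integer")

    bmi = payload.get("bmi")
    if not _is_number(bmi) or not math.isfinite(bmi) or bmi <= 0:
        errors.append("bmi must be a positive number")

    for field, allowed in ALLOWED_VALUES.items():
        value = payload.get(field)
        if not isinstance(value, str) or value not in allowed:
            errors.append(f"{field} must be one of {sorted(allowed)}")
    return errors


example = {"age": 35, "sex": "male", "bmi": 27.5,
           "children": 1, "smoker": "no", "region": "northeast"}
print(validate_payload(example))  # []
```

Collecting all problems into a list, rather than failing on the first one, lets a client fix every invalid field in a single round trip.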
Example request
curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "age": 35,
    "sex": "male",
    "bmi": 27.5,
    "children": 1,
    "smoker": "no",
    "region": "northeast"
  }'
Example response
{"predicted_charges": 5842.31}
  • predicted_charges (number): Estimated annual medical insurance cost in USD.

Running the project

1. Install dependencies

cd ML_To_Train/04_Medical_Cost_Fliter
pip install -r requirements.txt

2. Explore the dataset and train models

Open the project notebook to run EDA, preprocessing, and model training:
jupyter notebook Medical_Cost_Filter.ipynb
This project is implemented as a self-contained Jupyter notebook. It does not include a separate Flask API layer. To serve predictions as a REST endpoint, train and serialize the model from the notebook, then wrap it in a Flask route following the pattern in the API Integration reference.
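Since the project ships no API layer, the following is only a rough sketch of what such a Flask wrapper could look like; it is not the pattern from the API Integration reference. Everything beyond the documented request and response fields is an assumption: the `StubModel` placeholder (which would be replaced by the model serialized from the notebook), the `encode` helper, the feature order, and the 400 error format. A real wrapper would also need to apply the same fitted scaler used in training.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

SEX_MAP = {"male": 0, "female": 1}
SMOKER_MAP = {"no": 0, "yes": 1}
# Alphabetical order; the first region is the dropped one-hot category.
ALL_REGIONS = ("northeast", "northwest", "southeast", "southwest")


class StubModel:
    """Placeholder with a scikit-learn style predict(); always returns 0.0.
    Replace it with the model serialized from the notebook."""

    def predict(self, rows):
        return [0.0 for _ in rows]


model = StubModel()


def encode(payload):
    """Hypothetical helper: turn the JSON fields into one feature row.
    The column order is assumed and must match the training data."""
    region = payload["region"]
    if not isinstance(region, str) or region not in ALL_REGIONS:
        raise ValueError("unknown region")
    return [
        float(payload["age"]),
        float(SEX_MAP[payload["sex"]]),
        float(payload["bmi"]),
        float(payload["children"]),
        float(SMOKER_MAP[payload["smoker"]]),
        *[1.0 if region == name else 0.0 for name in ALL_REGIONS[1:]],
    ]


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(silent=True)
    if not isinstance(payload, dict):
        return jsonify({"error": "expected a JSON object"}), 400
    try:
        row = encode(payload)
    except (KeyError, TypeError, ValueError) as exc:
        return jsonify({"error": f"invalid input: {exc}"}), 400
    prediction = model.predict([row])[0]
    return jsonify({"predicted_charges": round(float(prediction), 2)})


if __name__ == "__main__":
    # Port 5000 matches the example request.
    app.run(port=5000)
```

With the stub in place, the example request returns a `predicted_charges` of 0.0; meaningful estimates appear only once a trained model and its scaler are loaded.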

Project structure

04_Medical_Cost_Fliter/

├── Medical_Cost_Filter.ipynb   # Full EDA, preprocessing, training notebook
├── requirements.txt
└── readme.md
