Medical cost prediction from health and demographic data

This project predicts individual medical insurance charges from a clean, structured dataset of demographic and health-related features. Four regression models are trained and compared: Linear Regression, Lasso (L1), Ridge (L2), and CatBoost. All three linear models reach an R² of 0.858, and CatBoost improves on that only marginally (0.860). The intended deployment is a backend-first API in which a prediction endpoint accepts structured JSON input and returns an estimated insurance charge; the notebook itself does not ship that API layer.

Overview

Medical cost prediction is a regression problem with a well-defined set of input features. The Kaggle insurance dataset is clean (no missing values), making it ideal for learning the full regression pipeline without extensive data imputation.

Problem type: Regression
Target variable: charges (annual insurance cost, USD)
Dataset: insurance.csv (Kaggle Medical Insurance Dataset)

Key insights from EDA:

Smoking status is the strongest predictor of insurance charges
BMI and age show positive correlation with charges
Region and sex have lower predictive impact
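
One way to check these EDA observations is to compare mean charges by smoking status and to look at how each numeric feature correlates with the target. The sketch below assumes a pandas DataFrame with the column names listed in the Dataset section; the helper name `summarize_drivers` is hypothetical and not part of the project notebook.

```python
import pandas as pd


def summarize_drivers(df: pd.DataFrame):
    """Return (mean charges per smoker group, correlation of numeric features with charges)."""
    # Average charge for smokers vs non-smokers.
    mean_by_smoker = df.groupby("smoker")["charges"].mean()
    # Pearson correlation of each numeric feature with the target.
    numeric = df[["age", "bmi", "children", "charges"]]
    corr_with_charges = numeric.corr()["charges"].drop("charges")
    return mean_by_smoker, corr_with_charges


# Typical use (assumes insurance.csv is in the working directory):
# df = pd.read_csv("insurance.csv")
# mean_by_smoker, corr = summarize_drivers(df)
```

A large gap between the two group means, together with positive `age` and `bmi` correlations, would be consistent with the insights listed above.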

Dataset

Feature	Type	Description
`age`	Numeric	Age of the primary beneficiary
`sex`	Categorical	Insurance contractor gender (`male` / `female`)
`bmi`	Numeric	Body mass index
`children`	Numeric	Number of dependents covered
`smoker`	Categorical	Whether the beneficiary smokes (`yes` / `no`)
`region`	Categorical	US residential region (`northeast`, `southeast`, `southwest`, `northwest`)
`charges`	Numeric (target)	Individual medical costs billed by health insurance

The dataset contains no missing values. Categorical variables (sex, smoker, region) are encoded before model training.

Preprocessing pipeline

Steps:

Encode sex → {male: 0, female: 1}
Encode smoker → {no: 0, yes: 1}
One-hot encode region (drop first to avoid multicollinearity)
Train/test split
Apply StandardScaler, fitted on the training split only (scaling matters most for Lasso and Ridge, whose penalties depend on feature scale)
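
The steps above can be sketched as a single function. This is a minimal sketch rather than the notebook's actual code: the function name `preprocess`, the 80/20 split, and `random_state=42` are assumptions. It splits before fitting the scaler so that test rows do not influence the scaling statistics.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def preprocess(df: pd.DataFrame, test_size: float = 0.2, random_state: int = 42):
    """Encode categoricals, split, then scale using training statistics only."""
    X = df.drop(columns=["charges"]).copy()
    y = df["charges"]

    # Binary encodings as listed in the pipeline steps.
    X["sex"] = X["sex"].map({"male": 0, "female": 1})
    X["smoker"] = X["smoker"].map({"no": 0, "yes": 1})
    # One-hot encode region, dropping the first level to avoid collinearity.
    X = pd.get_dummies(X, columns=["region"], drop_first=True, dtype=int)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )

    # Fit the scaler on the training rows, then apply it to both splits.
    scaler = StandardScaler().fit(X_train)
    X_train = pd.DataFrame(scaler.transform(X_train), columns=X.columns, index=X_train.index)
    X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns, index=X_test.index)
    return X_train, X_test, y_train, y_test, scaler
```

Returning the fitted scaler lets the same transformation be reused at prediction time.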

Models implemented

Linear Regression
Lasso (L1)
Ridge (L2)
CatBoost

Linear Regression

The baseline model. It establishes a performance floor with no regularization penalty. Coefficients are directly interpretable: each coefficient is the change in predicted charges per unit change in its feature (per standard deviation when features are standardized).

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)

Lasso (L1)

Adds an L1 penalty that can drive the coefficients of less-predictive features exactly to zero, which effectively performs feature selection. In this dataset, sex and region tend to be down-weighted relative to smoker and bmi.

from sklearn.linear_model import Lasso
model = Lasso(alpha=1.0)
model.fit(X_train, y_train)

Ridge (L2)

Adds an L2 penalty that shrinks coefficients without eliminating them. Useful when all features contribute and multicollinearity is present.

from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
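
To make the difference between the L1 and L2 penalties concrete, the sketch below fits both models on synthetic data in which only the first feature affects the target. The data is generated for illustration and is unrelated to the insurance dataset; the `alpha` values are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Only column 0 influences y; column 1 is pure noise with respect to the target.
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso should zero out the irrelevant coefficient; Ridge only shrinks it.
print("lasso coefficients:", lasso.coef_)
print("ridge coefficients:", ridge.coef_)
```

With a penalty this strong, Lasso sets the noise feature's coefficient to exactly zero, while Ridge leaves it small but nonzero.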

CatBoost

A gradient boosting model included for comparison. It handles categorical features natively (no manual encoding required) and captures non-linear interactions between features. Here it provides only a small improvement over the linear models (R² 0.860 versus 0.858).

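from catboost import CatBoostRegressor

# CatBoost consumes the raw categorical columns, so X_train here must be the
# unencoded feature frame (string-valued sex, smoker, region), not the scaled matrix.
model = CatBoostRegressor(iterations=500, verbose=False)
model.fit(X_train, y_train, cat_features=['sex', 'smoker', 'region'])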

Model performance

Model	R² Score	MAE (USD)
Linear Regression	0.858	2,759
Lasso (L1)	0.858	2,759
Ridge (L2)	0.858	2,759
CatBoost	0.860	~2,700

CatBoost provides a marginal improvement, but linear models remain efficient and interpretable. For production use cases where explainability matters (e.g., insurance pricing), Ridge or Lasso are preferred.
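
The two metrics in the table can be computed with scikit-learn's `r2_score` and `mean_absolute_error`. The helper below is a minimal sketch; the name `evaluate` is hypothetical, and it assumes any fitted model exposing a `predict` method.

```python
from sklearn.metrics import mean_absolute_error, r2_score


def evaluate(model, X_test, y_test):
    """Return R² and MAE for a fitted model on held-out data."""
    preds = model.predict(X_test)
    return {
        "r2": r2_score(y_test, preds),
        "mae": mean_absolute_error(y_test, preds),
    }


# Typical use, once a model has been fitted:
# scores = evaluate(model, X_test, y_test)
```

Scoring every model on the same held-out split keeps the comparison like-for-like.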

Model selection flow

API architecture

POST /predict

Accepts structured input fields and returns computed insurance charges.

Request body

Field	Type	Required	Description
`age`	integer	yes	Age of the primary beneficiary (years).
`sex`	string	yes	Gender of the beneficiary. Accepted values: "male", "female".
`bmi`	number	yes	Body mass index. Typical range: 15.0–53.0.
`children`	integer	yes	Number of dependents covered by the insurance plan.
`smoker`	string	yes	Smoking status. Accepted values: "yes", "no".
`region`	string	yes	US residential region. Accepted values: "northeast", "southeast", "southwest", "northwest".
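
The field constraints above could be enforced before a request reaches the model. The following is a sketch using only the standard library; the function name `validate_payload` and the error message wording are assumptions, since the source does not specify how invalid input is reported.

```python
ALLOWED_VALUES = {
    "sex": {"male", "female"},
    "smoker": {"yes", "no"},
    "region": {"northeast", "southeast", "southwest", "northwest"},
}


def validate_payload(payload: dict) -> list:
    """Return a list of human-readable errors; an empty list means the payload is valid."""
    errors = []

    # age and children must be non-negative integers (bool is excluded explicitly).
    for field in ("age", "children"):
        value = payload.get(field)
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            errors.append(f"{field} must be a non-negative integer")

    # bmi must be a positive number.
    bmi = payload.get("bmi")
    if isinstance(bmi, bool) or not isinstance(bmi, (int, float)) or bmi <= 0:
        errors.append("bmi must be a positive number")

    # Categorical fields must match one of the accepted string values.
    for field, allowed in ALLOWED_VALUES.items():
        value = payload.get(field)
        if not isinstance(value, str) or value not in allowed:
            errors.append(f"{field} must be one of {sorted(allowed)}")

    return errors
```

Collecting every error rather than stopping at the first lets a client fix a request in one round trip.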

Example request

curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "age": 35,
    "sex": "male",
    "bmi": 27.5,
    "children": 1,
    "smoker": "no",
    "region": "northeast"
  }'

Example response

{"predicted_charges": 5842.31}

Field	Type	Description
`predicted_charges`	number	Estimated annual medical insurance cost in USD.

Running the project

Install dependencies

cd ML_To_Train/04_Medical_Cost_Fliter
pip install -r requirements.txt

Explore the dataset and train models

Open the project notebook to run EDA, preprocessing, and model training:

jupyter notebook Medical_Cost_Filter.ipynb

This project is implemented as a self-contained Jupyter notebook. It does not include a separate Flask API layer. To serve predictions as a REST endpoint, train and serialize the model from the notebook, then wrap it in a Flask route following the pattern in the API Integration reference.
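
As a rough illustration of that wrapping step, the sketch below builds a Flask app around an arbitrary prediction callable. It is not the pattern from the API Integration reference, which is not reproduced here. The factory name `create_app`, the `predict_fn` callable, and the 400 error response shape are all assumptions.

```python
from flask import Flask, jsonify, request

REQUIRED_FIELDS = ("age", "sex", "bmi", "children", "smoker", "region")


def create_app(predict_fn):
    """Build a Flask app whose /predict route delegates to predict_fn(payload) -> float."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        # silent=True returns None instead of raising on malformed JSON.
        payload = request.get_json(silent=True)
        if not isinstance(payload, dict):
            return jsonify({"error": "request body must be a JSON object"}), 400
        missing = [name for name in REQUIRED_FIELDS if name not in payload]
        if missing:
            return jsonify({"error": f"missing fields: {missing}"}), 400
        charges = float(predict_fn(payload))
        return jsonify({"predicted_charges": round(charges, 2)})

    return app


# In practice predict_fn would wrap a model serialized from the notebook,
# for example one loaded with joblib.load(...) and applied to a one-row frame.
# app = create_app(my_predict_fn); app.run(port=5000)
```

Passing the predictor in as a callable keeps the route independent of how the model was trained or stored, and makes it testable with a stub.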

Project structure

04_Medical_Cost_Fliter/
│
├── Medical_Cost_Filter.ipynb   # Full EDA, preprocessing, training notebook
├── requirements.txt
└── readme.md
