This project predicts individual medical insurance charges using a clean, structured dataset of demographic and health-related features. Four regression models are trained and compared: Linear Regression, Lasso (L1), Ridge (L2), and CatBoost. All three linear models achieve an R² of 0.858, and CatBoost provides a marginal improvement. The system is designed around a backend-first API architecture where a prediction endpoint handles structured JSON input and returns computed insurance charge estimates.
Overview
Medical cost prediction is a regression problem with a well-defined set of input features. The Kaggle insurance dataset is clean (no missing values), making it ideal for learning the full regression pipeline without extensive data imputation.
- Problem type: Regression
- Target variable: charges (annual insurance cost, USD)
- Dataset: insurance.csv (Kaggle Medical Insurance Dataset)
Key insights from EDA:
- Smoking status is the strongest predictor of insurance charges
- BMI and age show positive correlation with charges
- Region and sex have lower predictive impact
Dataset
| Feature | Type | Description |
|---|---|---|
| age | Numeric | Age of the primary beneficiary |
| sex | Categorical | Insurance contractor gender (male / female) |
| bmi | Numeric | Body mass index |
| children | Numeric | Number of dependents covered |
| smoker | Categorical | Whether the beneficiary smokes (yes / no) |
| region | Categorical | US residential region (northeast, southeast, southwest, northwest) |
| charges | Numeric (target) | Individual medical costs billed by health insurance |

Categorical features (sex, smoker, region) are encoded before model training.
Preprocessing pipeline
Steps:
- Encode sex → {male: 0, female: 1}
- Encode smoker → {no: 0, yes: 1}
- One-hot encode region (drop first to avoid multicollinearity)
- Apply StandardScaler (required for linear models)
- Train/test split
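
The steps above can be sketched in dependency-free Python. This is an illustration, not the notebook's code, which presumably relies on pandas and scikit-learn's StandardScaler. The function names (`encode_record`, `fit_scaler`, `transform`) and the sample rows are hypothetical, and the dropped region category is assumed to be "northeast" (the alphabetically first value, which is what `pandas.get_dummies(drop_first=True)` would drop).

```python
from statistics import mean, pstdev

# Region columns kept after dropping the first category (assumed: "northeast").
REGION_COLUMNS = ["northwest", "southeast", "southwest"]


def encode_record(record):
    """Turn one raw row into a numeric feature vector."""
    sex = {"male": 0, "female": 1}[record["sex"]]
    smoker = {"no": 0, "yes": 1}[record["smoker"]]
    region_flags = [1.0 if record["region"] == r else 0.0 for r in REGION_COLUMNS]
    return [float(record["age"]), float(sex), float(record["bmi"]),
            float(record["children"]), float(smoker), *region_flags]


def fit_scaler(rows):
    """Column-wise mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # constant columns are left unscaled
    return means, stds


def transform(rows, means, stds):
    """Standardize each value: (value - mean) / std."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Invented rows for illustration only; they are not taken from insurance.csv.
rows = [
    {"age": 25, "sex": "female", "bmi": 22.0, "children": 0, "smoker": "no", "region": "northeast"},
    {"age": 40, "sex": "male", "bmi": 30.5, "children": 2, "smoker": "yes", "region": "southwest"},
    {"age": 55, "sex": "male", "bmi": 27.3, "children": 1, "smoker": "no", "region": "southeast"},
    {"age": 33, "sex": "female", "bmi": 35.1, "children": 3, "smoker": "no", "region": "northwest"},
]
encoded = [encode_record(r) for r in rows]
train, test = encoded[:3], encoded[3:]  # toy split; real data should be shuffled first
means, stds = fit_scaler(train)         # statistics come from the training rows only
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)
```

In this sketch the scaler is fitted on the training rows and then reused for the test rows, which keeps test-set statistics out of training. Whether the notebook splits before or after scaling is not stated here.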
Models implemented
- Linear Regression
- Lasso (L1)
- Ridge (L2)
- CatBoost
Linear Regression is the baseline model. It establishes a performance floor with no regularization penalty. Coefficients are fully interpretable: each one gives the change in predicted charges for a one-unit change in its feature, with the other features held fixed.
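
For reference, the three linear models differ only in the penalty added to the least-squares objective. The standard textbook formulation is shown below. The notation is an assumption rather than something taken from the notebook: $x_i$ is the feature vector of row $i$, $y_i$ its charges, $\beta_0$ the intercept, $\beta$ the coefficient vector, and $\lambda \ge 0$ the regularization strength. Library implementations may scale the terms differently.

$$
\begin{aligned}
\text{Linear Regression:}\quad & \min_{\beta_0,\,\beta}\ \sum_{i=1}^{n} \left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 \\
\text{Lasso (L1):}\quad & \min_{\beta_0,\,\beta}\ \sum_{i=1}^{n} \left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \\
\text{Ridge (L2):}\quad & \min_{\beta_0,\,\beta}\ \sum_{i=1}^{n} \left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^{2}
\end{aligned}
$$

With $\lambda = 0$ both penalized objectives reduce to the baseline. The L1 penalty can drive individual coefficients exactly to zero, while the L2 penalty shrinks them toward zero without eliminating them.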
Model performance
| Model | R² Score | MAE (USD) |
|---|---|---|
| Linear Regression | 0.858 | 2,759 |
| Lasso (L1) | 0.858 | 2,759 |
| Ridge (L2) | 0.858 | 2,759 |
| CatBoost | 0.860 | ~2,700 |
CatBoost provides a marginal improvement, but linear models remain efficient and interpretable. For production use cases where explainability matters (e.g., insurance pricing), Ridge or Lasso are preferred.
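
As a reminder of what the two reported metrics measure, here is a minimal sketch of their standard definitions in plain Python. The function names mirror the scikit-learn metrics of the same name, but this is not the notebook's code, and the sample values are invented for illustration.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented charges in USD, for illustration only.
actual = [1000.0, 2000.0, 3000.0, 4000.0]
predicted = [1100.0, 1900.0, 3200.0, 3900.0]

print(round(r2_score(actual, predicted), 3))   # 0.986
print(mean_absolute_error(actual, predicted))  # 125.0
```

MAE is expressed in the same unit as the target, so an MAE of 2,759 means predictions are off by about 2,759 USD on average. R² is unitless and measures the share of variance in charges that the model explains.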
Model selection flow
API architecture
POST /predict
Accepts structured input fields and returns computed insurance charges.

Request body:
- age: Age of the primary beneficiary (years).
- sex: Gender of the beneficiary. Accepted values: "male", "female".
- bmi: Body mass index. Typical range: 15.0–53.0.
- children: Number of dependents covered by the insurance plan.
- smoker: Smoking status. Accepted values: "yes", "no".
- region: US residential region. Accepted values: "northeast", "southeast", "southwest", "northwest".

Response:
- Estimated annual medical insurance cost in USD.
Running the project
This project is implemented as a self-contained Jupyter notebook. It does not include a separate Flask API layer. To serve predictions as a REST endpoint, train and serialize the model from the notebook, then wrap it in a Flask route following the pattern in the API Integration reference.
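
The request handling such a route would perform can be sketched without committing to a web framework. The following is a minimal illustration under stated assumptions, not the project's implementation:
- The request field names are assumed to mirror the dataset columns.
- The response key `charges` is hypothetical.
- `validate_request`, `handle_predict` and `fake_predict` are hypothetical names.
- `fake_predict` is a placeholder standing in for a trained, deserialized model.

```python
import json

# Accepted values for the categorical fields described under POST /predict.
ALLOWED = {
    "sex": {"male", "female"},
    "smoker": {"yes", "no"},
    "region": {"northeast", "southeast", "southwest", "northwest"},
}
NUMERIC_FIELDS = ("age", "bmi", "children")


def validate_request(body):
    """Parse a JSON request body and check it against the expected fields."""
    payload = json.loads(body)
    if not isinstance(payload, dict):
        raise ValueError("request body must be a JSON object")
    for field in NUMERIC_FIELDS:
        value = payload.get(field)
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{field} must be a number")
    for field, allowed in ALLOWED.items():
        value = payload.get(field)
        if not isinstance(value, str) or value not in allowed:
            raise ValueError(f"{field} must be one of {sorted(allowed)}")
    return payload


def handle_predict(body, predict_fn):
    """Return (HTTP status, JSON response body) for a POST /predict call."""
    try:
        payload = validate_request(body)
    except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
        return 400, json.dumps({"error": str(exc)})
    return 200, json.dumps({"charges": round(float(predict_fn(payload)), 2)})


def fake_predict(payload):
    """Placeholder predictor; a real one would encode the payload and call the model."""
    return 12345.678


request_body = json.dumps({
    "age": 40, "sex": "male", "bmi": 30.5,
    "children": 2, "smoker": "yes", "region": "southwest",
})
status, response_body = handle_predict(request_body, fake_predict)
```

A Flask route, or a route in any other framework, could pass the raw request body to `handle_predict` and return the resulting status and JSON. Keeping validation in a plain function makes it testable without starting a server.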