This project predicts residential house sale prices using an ensemble of three regression models: Linear Regression, Ridge (L2), and Lasso (L1). Rather than accepting a structured feature vector directly, the API accepts a natural language description of the property. A Gemini API layer parses that prompt into structured numeric features, which are then passed through a preprocessing pipeline and scored by all three models. The final prediction is the average of the three model outputs.

Overview

The project emphasizes learning the end-to-end machine learning workflow (data cleaning, exploratory data analysis, model training, evaluation, and deployment) over maximizing raw predictive performance. All three models achieve similar R² scores (~0.604), which suggests that regularization had minimal impact at the chosen hyperparameters for this dataset.
Problem type: Regression
Target variable: SalePrice
Dataset: HousePricePrediction.csv (Kaggle)

Dataset

The dataset is a structured tabular housing dataset sourced from Kaggle. Key characteristics:
  • Target variable: SalePrice (continuous, USD)
  • Dropped columns: Id, YearRemodAdd, Exterior1st, BsmtFinSF2 (irrelevant or high missing-value rate)
  • Train/test split: 80% / 20%
  • Encoding: One-hot encoding for categorical features; boolean features converted to numeric
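The preparation steps listed above can be sketched as follows. This is a minimal illustration, assuming pandas and scikit-learn; the tiny in-memory frame stands in for HousePricePrediction.csv, and the `HasGarage` column is hypothetical, added only to show the boolean conversion. The notebook's actual code may differ.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows standing in for the real CSV (values are illustrative only).
df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "MSZoning": ["RL", "RL", "RM", "RL", "RM"],      # categorical feature
    "HasGarage": [True, False, True, True, False],   # hypothetical boolean feature
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Drop the columns the project discards (ignored if absent from this toy frame).
df = df.drop(columns=["Id", "YearRemodAdd", "Exterior1st", "BsmtFinSF2"],
             errors="ignore")

# One-hot encode categorical features, then convert booleans to 0/1.
df = pd.get_dummies(df, columns=["MSZoning"])
bool_cols = df.select_dtypes(include="bool").columns
df = df.astype({col: int for col in bool_cols})

# 80% / 20% train/test split on features and target.
X = df.drop(columns=["SalePrice"])
y = df["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

The fixed `random_state` is an assumption made here so that the split is reproducible; the source does not say whether one was used.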
Exploratory data analysis included:
  • Sale price distribution (histogram with KDE)
  • Outlier detection via box plot
  • Total basement area vs. sale price scatter plot
  • Lot area vs. sale price scatter plot

Models implemented

Three regression models were trained and compared:
| Model | Description |
| --- | --- |
| Linear Regression | Baseline model; easy to interpret |
| Ridge Regression (L2) | Penalizes large coefficients; reduces overfitting |
| Lasso Regression (L1) | Performs implicit feature selection; useful for high-dimensional data |
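A rough sketch of training and comparing the three models, assuming scikit-learn. The data here is synthetic, and the `alpha` values are placeholders, since the source does not state the hyperparameters that were used; the scores it prints are not the project's results.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data: 5 features, two of which carry no signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_coef = np.array([3.0, -2.0, 0.0, 1.5, 0.0])
y = X @ true_coef + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# alpha values are assumptions for illustration only.
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Lasso Regression": Lasso(alpha=0.1),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))
    print(f"{name}: R2 = {scores[name]:.3f}")
```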

Model performance

| Model | R² Score |
| --- | --- |
| Linear Regression | ~0.604 |
| Ridge Regression | ~0.604 |
| Lasso Regression | ~0.604 |
The similar scores indicate that regularization had minimal impact at the chosen hyperparameters. The final prediction averages all three model outputs to produce a single price estimate.
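The averaging step is a plain unweighted mean. A minimal sketch, with a hypothetical helper name and made-up per-model predictions:

```python
from statistics import fmean


def average_predictions(predictions):
    """Return the unweighted mean of the per-model price predictions."""
    return fmean(predictions.values())


# Made-up outputs from the three models for one property.
preds = {"linear": 184000.0, "ridge": 185500.0, "lasso": 185500.0}
price = int(round(average_predictions(preds)))
print(price)  # 185000
```

An unweighted mean is a reasonable choice when the models score almost identically, as they do here; with more divergent models, weighting by validation performance would be an alternative.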

Preprocessing pipeline

The preprocessing module (src/processing/preprocessing.py) handles the conversion from natural language input to a NumPy feature row:
  1. Receive the raw text prompt from the API
  2. Call the Gemini API to extract structured property features
  3. Parse the Gemini response into numeric values
  4. Apply one-hot encoding and boolean conversion
  5. Return a NumPy array row ready for model inference
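Steps 3 to 5 above might look roughly like the sketch below. It does not call Gemini; it assumes the model's reply contains a JSON object and starts from that text. The function name, the `FEATURE_ORDER` list, and the `HasGarage` feature are hypothetical. The real column order must match whatever the trained models expect.

```python
import json

import numpy as np

# Hypothetical column order; must match the columns used at training time.
FEATURE_ORDER = ["LotArea", "TotalBsmtSF", "HasGarage", "MSZoning_RL", "MSZoning_RM"]


def response_to_row(response_text):
    """Turn a JSON-bearing text reply into a 1 x n NumPy feature row."""
    # Extract the outermost JSON object, tolerating surrounding chatter.
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model response")
    raw = json.loads(response_text[start:end + 1])

    features = dict.fromkeys(FEATURE_ORDER, 0.0)
    for key, value in raw.items():
        if isinstance(value, (bool, int, float)):
            # Numeric and boolean values map directly (True -> 1.0).
            if key in features:
                features[key] = float(value)
        elif isinstance(value, str):
            # Categorical values become one-hot columns, e.g. MSZoning_RL.
            column = f"{key}_{value}"
            if column in features:
                features[column] = 1.0
    return np.array([[features[name] for name in FEATURE_ORDER]], dtype=float)


reply = 'Extracted: {"LotArea": 9600, "TotalBsmtSF": 1200, "HasGarage": true, "MSZoning": "RL"}'
row = response_to_row(reply)
```

Features missing from the reply default to 0 in this sketch; whether the project imputes missing values differently is not stated in the source.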

Deployment flow

API endpoints

GET /

Health check endpoint. Returns a confirmation that the API is running.
Response
{"message": "House Price Prediction API is running"}

POST /predict

Accepts a natural language property description and returns a predicted sale price.
Request body
prompt (string, required): A natural language description of the property (e.g., bedrooms, square footage, neighborhood quality). The Gemini layer extracts structured features from this text.
Example request
curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{"prompt": "3 bedroom house, 2000 sqft, good neighborhood"}'
Example response
{"predicted_sale_price": 185000}
Response fields
predicted_sale_price (integer): The predicted sale price in USD, computed as the average of Lasso, Linear Regression, and Ridge model predictions.
Error responses
| Status | Condition |
| --- | --- |
| 400 | Missing prompt key in request body |
| 500 | GEMINI_API_KEY environment variable not set |
| 500 | Unexpected error during preprocessing or inference |
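The first two error conditions can be expressed as a small pre-flight check. This is only a sketch of the logic, not the code in app.py: the function name, the order of the checks, and the shape of the error body (an `error` key) are all assumptions.

```python
import os


def check_request(payload, environ=os.environ):
    """Return (status, body) for the validation cases in the table above."""
    # 400: the JSON body must be an object containing a "prompt" key.
    if not isinstance(payload, dict) or "prompt" not in payload:
        return 400, {"error": "Missing 'prompt' in request body"}
    # 500: the server cannot reach Gemini without an API key.
    if not environ.get("GEMINI_API_KEY"):
        return 500, {"error": "GEMINI_API_KEY environment variable not set"}
    # Validation passed; preprocessing and inference would run next.
    return 200, {}


status, body = check_request({"text": "3 bedroom house"}, environ={})
print(status, body)
```

The third row of the table (unexpected errors) would typically be handled by wrapping preprocessing and inference in a `try`/`except` that returns a 500.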

Running the project

1. Clone and install dependencies

cd ML_To_Train/01_House_Price_Predict
pip install -r requirements.txt

2. Set the Gemini API key

export GEMINI_API_KEY="your-api-key-here"
The API will return a 500 error if this variable is not set.

3. Start the Flask server

cd src
python app.py
The server starts on http://localhost:5000 in debug mode.

4. Send a prediction request

curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{"prompt": "3 bedroom house, 2000 sqft, good neighborhood"}'
The Jupyter notebook House_Price_Prediction.ipynb walks through the full EDA and model training workflow if you want to retrain the models or adjust hyperparameters before running the API.
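The same prediction request can be sent from Python. Below is a minimal standard-library sketch; the helper names are hypothetical, and it assumes the server is reachable at http://localhost:5000 as described in step 3.

```python
import json
import urllib.request


def build_predict_request(prompt, base_url="http://localhost:5000"):
    """Build the POST /predict request without sending it."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(prompt, base_url="http://localhost:5000"):
    """Send the request and return the predicted sale price."""
    request = build_predict_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["predicted_sale_price"]


# Usage (requires the Flask server to be running):
# price = predict("3 bedroom house, 2000 sqft, good neighborhood")
```

`urllib.request.urlopen` raises `urllib.error.HTTPError` on 4xx and 5xx responses, so the error cases listed under POST /predict surface as exceptions in this client.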

Project structure

01_House_Price_Predict/
│
├── dataset/
│   └── HousePricePrediction.csv
│
├── model/
│   ├── lasso_model.pkl
│   ├── lr_model.pkl
│   └── ridge_model.pkl
│
├── src/
│   ├── app.py                        # Flask API entry point
│   ├── environment.py
│   ├── processing/
│   │   └── preprocessing.py          # Gemini + feature extraction
│   └── output/
│       └── predictor.py              # Loads models, averages predictions
│
├── House_Price_Prediction.ipynb
├── requirements.txt
└── README.md
