House price prediction with an ensemble of regressors

This project predicts residential house sale prices using an ensemble of three regression models: Linear Regression, Ridge (L2), and Lasso (L1). Rather than accepting a structured feature vector directly, the API accepts a natural language description of the property. A Gemini API layer parses that prompt into structured numeric features, which are then passed through a preprocessing pipeline and scored by all three models. The final prediction is the average of the three model outputs.

Overview

The project emphasizes learning the end-to-end machine learning workflow (data cleaning, exploratory data analysis, model training, evaluation, and deployment) over maximizing raw predictive performance. All three models achieve similar R² scores (~0.604), which suggests that regularization had minimal impact at the chosen hyperparameters for this dataset.

Problem type: Regression
Target variable: SalePrice
Dataset: HousePricePrediction.csv (Kaggle)

Dataset

The dataset is a structured tabular housing dataset sourced from Kaggle. Key characteristics:

Target variable: SalePrice (continuous, USD)
Dropped columns: Id, YearRemodAdd, Exterior1st, BsmtFinSF2 (irrelevant or high missing-value rate)
Train/test split: 80% / 20%
Encoding: One-hot encoding for categorical features; boolean features converted to numeric
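
The column dropping and the 80% / 20% split listed above can be sketched without third-party dependencies. This is a minimal illustration, not the notebook's code: the helper names `drop_columns` and `split_rows`, the fixed seed, and the toy rows are all assumptions, and the notebook most likely uses pandas and scikit-learn for these steps.

```python
import random

# Columns the project drops before training.
DROPPED_COLUMNS = {"Id", "YearRemodAdd", "Exterior1st", "BsmtFinSF2"}


def drop_columns(rows, dropped=DROPPED_COLUMNS):
    """Return copies of the rows without the dropped columns."""
    return [{k: v for k, v in row.items() if k not in dropped} for row in rows]


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows reproducibly and split them into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy rows standing in for HousePricePrediction.csv (values are made up).
rows = [
    {"Id": i, "LotArea": 8000 + 100 * i, "YearRemodAdd": 2000, "SalePrice": 150000 + 1000 * i}
    for i in range(10)
]

cleaned = drop_columns(rows)
train, test = split_rows(cleaned)
print(len(train), len(test))  # 8 2
```

Fixing the seed keeps the split reproducible, so the same rows land in the test set on every run.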

Exploratory data analysis included:

Sale price distribution (histogram with KDE)
Outlier detection via box plot
Total basement area vs. sale price scatter plot
Lot area vs. sale price scatter plot

Models implemented

Three regression models were trained and compared:

Model	Description
Linear Regression	Baseline model; easy to interpret
Ridge Regression (L2)	Penalizes large coefficients; reduces overfitting
Lasso Regression (L1)	Performs implicit feature selection; useful for high-dimensional data
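
For reference, the three models minimize closely related objectives. The notation here is introduced for illustration only: $\beta$ is the coefficient vector, $x_i$ and $y_i$ are the feature row and sale price of example $i$, and $\alpha \ge 0$ is the regularization strength. The intercept is omitted, constant scaling factors vary between libraries, and the source does not state the values of $\alpha$ that were used.

```latex
\begin{aligned}
\text{Linear:}\quad & \min_{\beta}\ \sum_{i=1}^{n} \bigl(y_i - x_i^\top \beta\bigr)^2 \\
\text{Ridge (L2):}\quad & \min_{\beta}\ \sum_{i=1}^{n} \bigl(y_i - x_i^\top \beta\bigr)^2 + \alpha \lVert \beta \rVert_2^2 \\
\text{Lasso (L1):}\quad & \min_{\beta}\ \sum_{i=1}^{n} \bigl(y_i - x_i^\top \beta\bigr)^2 + \alpha \lVert \beta \rVert_1
\end{aligned}
```

As $\alpha$ approaches zero, both penalized objectives reduce to ordinary least squares, which is one way the three models can end up with nearly the same fit.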

Model performance

Model	R² Score
Linear Regression	~0.604
Ridge Regression	~0.604
Lasso Regression	~0.604
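
The R² score in the table is the coefficient of determination, 1 - SS_res / SS_tot. The sketch below shows the calculation in plain Python; the function name `r2_score` and the toy values are illustrative and not taken from the project, which presumably relies on a library metric instead.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot


# Made-up prices, only to exercise the formula.
actual = [100_000, 150_000, 200_000, 250_000]
predicted = [110_000, 140_000, 210_000, 240_000]
print(round(r2_score(actual, predicted), 3))  # 0.968
```

Read this way, a score of about 0.604 means the models explain roughly 60% of the variance in SalePrice on the data they were evaluated on.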

The similar scores indicate that regularization had minimal impact at the chosen hyperparameters. The final prediction averages all three model outputs to produce a single price estimate.
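
The averaging step can be sketched as follows. This is a minimal illustration rather than the code in predictor.py: `ensemble_predict` and `ConstantModel` are hypothetical names, and the only assumption about the real models is that they expose a scikit-learn-style `predict` method that takes a 2D input and returns one value per row.

```python
def ensemble_predict(models, feature_row):
    """Average the predictions of every model for a single feature row."""
    predictions = [float(model.predict([feature_row])[0]) for model in models]
    return sum(predictions) / len(predictions)


class ConstantModel:
    """Stand-in for a fitted regressor; returns a fixed price for any input."""

    def __init__(self, price):
        self.price = price

    def predict(self, rows):
        return [self.price for _ in rows]


# Three stand-ins playing the roles of the Linear, Ridge, and Lasso models.
models = [ConstantModel(184_000), ConstantModel(185_000), ConstantModel(186_000)]
print(ensemble_predict(models, [3, 2000, 1]))  # 185000.0
```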

Preprocessing pipeline

The preprocessing module (src/processing/preprocessing.py) handles the conversion from natural language input to a NumPy feature row:

Receive the raw text prompt from the API
Call the Gemini API to extract structured property features
Parse the Gemini response into numeric values
Apply one-hot encoding and boolean conversion
Return a NumPy array row ready for model inference
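
The parsing and encoding steps can be sketched as below. The snippet deliberately leaves out the Gemini call and starts from a JSON string. Everything about that string is an assumption: the response format, the feature names, and the category lists are hypothetical and may differ from what preprocessing.py uses. The function returns a plain list; wrapping it in a NumPy array would be the final step.

```python
import json

# Hypothetical schema: the project's real feature names and categories may differ.
NUMERIC_FEATURES = ["LotArea", "TotalBsmtSF"]
BOOLEAN_FEATURES = ["HasGarage"]
CATEGORICAL_FEATURES = {"BldgType": ["1Fam", "Duplex", "Twnhs"]}


def features_to_row(response_text):
    """Turn a JSON string of extracted features into a flat numeric row."""
    data = json.loads(response_text)
    row = [float(data[name]) for name in NUMERIC_FEATURES]
    # Booleans become 0.0 / 1.0.
    row += [1.0 if data[name] else 0.0 for name in BOOLEAN_FEATURES]
    # One-hot encoding: one column per known category, in a fixed order.
    for name, categories in CATEGORICAL_FEATURES.items():
        row += [1.0 if data[name] == category else 0.0 for category in categories]
    return row


# Example of what an extraction response might look like (assumed format).
sample = '{"LotArea": 8450, "TotalBsmtSF": 856, "HasGarage": true, "BldgType": "Duplex"}'
print(features_to_row(sample))  # [8450.0, 856.0, 1.0, 0.0, 1.0, 0.0]
```

Whatever the real schema is, the column order produced at inference time has to match the order used when the models were trained, otherwise the coefficients are applied to the wrong features.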

Deployment flow

API endpoints

GET /

Health check endpoint. Returns a confirmation that the API is running.

Response

{"message": "House Price Prediction API is running"}

POST /predict

Accepts a natural language property description and returns a predicted sale price.

Request body

prompt (string, required)

A natural language description of the property (e.g., bedrooms, square footage, neighborhood quality). The Gemini layer extracts structured features from this text.

Example request

curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{"prompt": "3 bedroom house, 2000 sqft, good neighborhood"}'

Example response

{"predicted_sale_price": 185000}

Response fields

predicted_sale_price (integer)

The predicted sale price in USD, computed as the average of Lasso, Linear Regression, and Ridge model predictions.
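
For callers working in Python rather than curl, the same request can be built with the standard library. This is a sketch under the assumption that the server runs at the default local address shown in the example request; `build_predict_request` is a hypothetical helper name, and the snippet only constructs the request so it can be run without a live server.

```python
import json
import urllib.request


def build_predict_request(prompt, base_url="http://localhost:5000"):
    """Build (but do not send) a POST /predict request for the given prompt."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request("3 bedroom house, 2000 sqft, good neighborhood")
print(request.get_method(), request.full_url)  # POST http://localhost:5000/predict

# With the Flask server running, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["predicted_sale_price"])
```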

Error responses

Status	Condition
`400`	Missing `prompt` key in request body
`500`	`GEMINI_API_KEY` environment variable not set
`500`	Unexpected error during preprocessing or inference
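
The status codes in the table can be illustrated with a framework-free sketch. It is not the handler in app.py: `handle_predict`, the `predict_fn` callback, the shape of the error bodies, and the order of the checks are all assumptions made for the example.

```python
import os


def handle_predict(payload, predict_fn, environ=os.environ):
    """Return (response_body, status_code) following the error table."""
    if not isinstance(payload, dict) or "prompt" not in payload:
        return {"error": "Missing 'prompt' in request body"}, 400
    if not environ.get("GEMINI_API_KEY"):
        return {"error": "GEMINI_API_KEY is not set"}, 500
    try:
        price = predict_fn(payload["prompt"])
    except Exception as exc:  # preprocessing or inference failure
        return {"error": str(exc)}, 500
    return {"predicted_sale_price": int(round(price))}, 200


fake_env = {"GEMINI_API_KEY": "dummy"}
print(handle_predict({}, lambda prompt: 0, fake_env)[1])                     # 400
print(handle_predict({"prompt": "3 bed"}, lambda prompt: 0, {})[1])          # 500
print(handle_predict({"prompt": "3 bed"}, lambda prompt: 185000.4, fake_env)[1])  # 200
```

Passing the environment and the prediction function as arguments keeps the sketch testable without a real API key or trained models.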

Running the project

Clone the repository, then install dependencies from the project directory

cd ML_To_Train/01_House_Price_Predict
pip install -r requirements.txt

Set the Gemini API key

export GEMINI_API_KEY="your-api-key-here"

The API will return a 500 error if this variable is not set.

Start the Flask server

cd src
python app.py

The server starts on http://localhost:5000 in debug mode.

Send a prediction request

curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{"prompt": "3 bedroom house, 2000 sqft, good neighborhood"}'

The Jupyter notebook House_Price_Prediction.ipynb walks through the full EDA and model training workflow if you want to retrain the models or adjust hyperparameters before running the API.

Project structure

01_House_Price_Predict/
│
├── dataset/
│   └── HousePricePrediction.csv
│
├── model/
│   ├── lasso_model.pkl
│   ├── lr_model.pkl
│   └── ridge_model.pkl
│
├── src/
│   ├── app.py                        # Flask API entry point
│   ├── environment.py
│   ├── processing/
│   │   └── preprocessing.py          # Gemini + feature extraction
│   └── output/
│       └── predictor.py              # Loads models, averages predictions
│
├── House_Price_Prediction.ipynb
├── requirements.txt
└── README.md
