This project predicts residential house sale prices using an ensemble of three regression models: Linear Regression, Ridge (L2), and Lasso (L1). Rather than accepting a structured feature vector directly, the API accepts a natural language description of the property. A Gemini API layer parses that prompt into structured numeric features, which are then passed through a preprocessing pipeline and scored by all three models. The final prediction is the average of the three model outputs.
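The averaging step can be illustrated with a minimal sketch. The function name `ensemble_average`, the dictionary keys, and the prices are assumptions made for illustration; they are not taken from the project's code or from its trained models.

```python
from statistics import fmean


def ensemble_average(predictions: dict[str, float]) -> float:
    """Return the unweighted mean of the per-model price predictions."""
    if not predictions:
        raise ValueError("at least one model prediction is required")
    return fmean(predictions.values())


# Illustrative numbers only, not outputs of the project's trained models.
example = {"linear": 210_000.0, "ridge": 208_500.0, "lasso": 211_500.0}
final_price = ensemble_average(example)
print(final_price)
```

An unweighted mean treats the three models as equally reliable, which is a reasonable default when their scores are nearly identical.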
Overview
The project emphasizes learning the end-to-end machine learning workflow (data cleaning, exploratory data analysis, model training, evaluation, and deployment) over maximizing raw predictive performance. All three models achieve similar R² scores (~0.604), reflecting that regularization had minimal impact at the chosen hyperparameters for this dataset.
- Problem type: Regression
- Target variable: SalePrice
- Dataset: HousePricePrediction.csv (Kaggle)
Dataset
The dataset is a structured tabular housing dataset sourced from Kaggle. Key characteristics:
- Target variable: SalePrice (continuous, USD)
- Dropped columns: Id, YearRemodAdd, Exterior1st, BsmtFinSF2 (irrelevant or high missing-value rate)
- Train/test split: 80% / 20%
- Encoding: One-hot encoding for categorical features; boolean features converted to numeric
Exploratory data analysis visualizations:
- Sale price distribution (histogram with KDE)
- Outlier detection via box plot
- Total basement area vs. sale price scatter plot
- Lot area vs. sale price scatter plot
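The preparation steps listed above could look roughly like the following pandas sketch. The small inline frame stands in for HousePricePrediction.csv, its values are made up, and the column names LotArea, TotalBsmtSF, and MSZoning are assumptions. The shuffle-based split and its seed are one possible way to obtain an 80% / 20% split, not necessarily the project's.

```python
import pandas as pd

# Tiny stand-in for HousePricePrediction.csv; the values are made up.
df = pd.DataFrame({
    "Id": [0, 1, 2, 3, 4],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "TotalBsmtSF": [856.0, 1262.0, 920.0, 756.0, 1145.0],
    "YearRemodAdd": [2003, 1976, 2002, 1970, 2000],
    "Exterior1st": ["VinylSd", "MetalSd", "VinylSd", "Wd Sdng", "VinylSd"],
    "BsmtFinSF2": [0.0, 0.0, 0.0, 0.0, 0.0],
    "MSZoning": ["RL", "RL", "RM", "RL", "RM"],
    "SalePrice": [208500.0, 181500.0, 223500.0, 140000.0, 250000.0],
})

# Drop the columns the project excludes.
df = df.drop(columns=["Id", "YearRemodAdd", "Exterior1st", "BsmtFinSF2"])

# One-hot encode categorical columns, then cast the indicator columns to numeric.
features = pd.get_dummies(df.drop(columns=["SalePrice"])).astype(float)
target = df["SalePrice"]

# 80% / 20% split via a reproducible shuffle (the seed is an arbitrary choice).
train_X = features.sample(frac=0.8, random_state=42)
test_X = features.drop(train_X.index)
train_y = target.loc[train_X.index]
test_y = target.loc[test_X.index]
```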
Models implemented
Three regression models were trained and compared:
| Model | Description |
|---|---|
| Linear Regression | Baseline model; easy to interpret |
| Ridge Regression (L2) | Penalizes large coefficients; reduces overfitting |
| Lasso Regression (L1) | Performs implicit feature selection; useful for high-dimensional data |
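As a rough illustration of how the three models differ, the sketch below fits each one with scikit-learn on synthetic data. The data, the `alpha` values, and the dictionary keys are assumptions chosen for the example; the project's actual hyperparameters are not stated here.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data standing in for the preprocessed housing features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_coef = np.array([3.0, -2.0, 0.0, 0.0, 1.5])
y = X @ true_coef + rng.normal(scale=0.1, size=200)

# The alpha values are placeholders, not the project's tuned settings.
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, np.round(model.coef_, 2))
```

Comparing the printed coefficients shows the typical pattern: Ridge shrinks all coefficients slightly, while Lasso tends to push the coefficients of uninformative features toward zero.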
Model performance
| Model | R² Score |
|---|---|
| Linear Regression | ~0.604 |
| Ridge Regression | ~0.604 |
| Lasso Regression | ~0.604 |
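For reference, R² is defined as 1 minus the ratio of the residual sum of squares to the total sum of squares, so a score of about 0.604 means the models account for roughly 60% of the variance in SalePrice. The sketch below computes it with NumPy; the helper name and the prices are made up for illustration.

```python
import numpy as np


def r2_score(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up prices, purely to exercise the function.
actual = [200_000, 150_000, 300_000, 250_000]
perfect = r2_score(actual, actual)          # every prediction is exact
baseline = r2_score(actual, [225_000] * 4)  # always predict the mean price
```

A perfect predictor scores 1, and a model that always predicts the mean of the observed prices scores 0.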
Preprocessing pipeline
The preprocessing module (src/processing/preprocessing.py) handles the conversion from natural language input to a NumPy feature row:
- Receive the raw text prompt from the API
- Call the Gemini API to extract structured property features
- Parse the Gemini response into numeric values
- Apply one-hot encoding and boolean conversion
- Return a NumPy array row ready for model inference
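The parsing and encoding steps above can be sketched as follows. This is not the contents of src/processing/preprocessing.py: the feature schema (`NUMERIC`, `BOOLEAN`, `CATEGORICAL`), the function name `features_to_row`, and the JSON payload are all hypothetical, and the Gemini call itself is replaced by a hard-coded string so the example stays self-contained.

```python
import json

import numpy as np

# Hypothetical feature schema; the real column order lives in the project.
NUMERIC = ["LotArea", "TotalBsmtSF"]
BOOLEAN = ["HasGarage"]
CATEGORICAL = {"MSZoning": ["RL", "RM"]}


def features_to_row(llm_json: str) -> np.ndarray:
    """Turn a JSON feature payload into a single (1, n_features) float row."""
    data = json.loads(llm_json)
    values = [float(data[name]) for name in NUMERIC]
    # Boolean conversion: True -> 1.0, False -> 0.0.
    values += [1.0 if data[name] else 0.0 for name in BOOLEAN]
    # One-hot encoding: one indicator per known category.
    for name, categories in CATEGORICAL.items():
        values += [1.0 if data[name] == cat else 0.0 for cat in categories]
    return np.array([values], dtype=float)


# Stand-in for the text a Gemini call might return for one prompt.
fake_response = '{"LotArea": 8450, "TotalBsmtSF": 856, "HasGarage": true, "MSZoning": "RL"}'
row = features_to_row(fake_response)
```

Returning a two-dimensional array with a single row matches the input shape that scikit-learn style `predict` methods expect.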
Deployment flow
API endpoints
GET /
Health check endpoint. Returns a confirmation that the API is running.
POST /predict
Accepts a natural language property description and returns a predicted sale price.
Request body
- prompt: A natural language description of the property (e.g., bedrooms, square footage, neighborhood quality). The Gemini layer extracts structured features from this text.
Response
- The predicted sale price in USD, computed as the average of the Lasso, Linear Regression, and Ridge model predictions.
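A client call to this endpoint might be built as in the sketch below, which uses only the Python standard library. The base URL, the JSON content type, and the helper name are assumptions. The request is constructed but not sent, so the snippet runs without a live server.

```python
import json
import urllib.request

# Hypothetical base URL; replace with wherever the API is deployed.
BASE_URL = "http://localhost:8000"


def build_predict_request(prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST /predict request carrying the prompt."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request(
    "3 bedroom house, 1,800 sq ft, large lot, good neighborhood"
)
# To send it, pass the request to urllib.request.urlopen and decode the JSON reply.
```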
| Status | Condition |
|---|---|
| 400 | Missing prompt key in request body |
| 500 | GEMINI_API_KEY environment variable not set |
| 500 | Unexpected error during preprocessing or inference |
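The status codes in the table can be expressed as a small framework-agnostic handler. This is a sketch under assumptions, not the project's server code: the function name, the order of the checks, and the `error` and `predicted_price` response keys are hypothetical, and `predict` stands in for the preprocessing and model inference steps.

```python
from typing import Callable, Mapping


def handle_predict(
    body: Mapping[str, object],
    env: Mapping[str, str],
    predict: Callable[[str], float],
) -> tuple[int, dict]:
    """Map the documented conditions to (status code, response payload)."""
    if "prompt" not in body:
        return 400, {"error": "Missing 'prompt' key in request body"}
    if not env.get("GEMINI_API_KEY"):
        return 500, {"error": "GEMINI_API_KEY environment variable not set"}
    try:
        price = predict(str(body["prompt"]))
    except Exception as exc:  # any preprocessing or inference failure
        return 500, {"error": str(exc)}
    return 200, {"predicted_price": price}


# Example call with a stub predictor and a placeholder key.
status, payload = handle_predict(
    {"prompt": "two bedroom bungalow"},
    {"GEMINI_API_KEY": "placeholder"},
    lambda prompt: 150_000.0,
)
```

Passing the environment and the predictor in as arguments keeps the function easy to test without a web framework or a real API key.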