
Every project in the 100-ML-AI-Project repository is built around the same end-to-end machine learning pipeline. This shared architecture makes each project self-contained and reproducible: data flows in one direction through well-defined stages, and each stage has a corresponding module in the project’s src/ directory. Whether the project is a regression model for house prices or an NLP classifier, the same structural contract applies.

Pipeline overview

The standard ML pipeline progresses through seven sequential stages. Each stage consumes the output of the previous one and produces a well-typed artifact for the next.

Pipeline stages

| Stage | Module | Output artifact |
| --- | --- | --- |
| Input Data | Dataset/ | Raw CSV / image files |
| Preprocessing | src/Processing/preprocessing.py | Cleaned DataFrame |
| Feature Engineering | src/Processing/preprocessing.py | One-hot encoded NumPy array |
| Model Training | Project notebook | Serialized .pkl / .joblib model |
| Evaluation | Project notebook | R² score, accuracy, F1 metrics |
| Deployment API | src/App.py | Running Flask server on port 5000 |
| Prediction Output | src/Output/predictor.py | JSON response with prediction |

Input Data

Raw datasets are stored under Dataset/ as CSV files, image folders, or structured tabular data. Each project ships with the dataset it was trained on so the full pipeline can be reproduced without external downloads.
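As a minimal sketch of this convention, a project might load its shipped CSV like so (the helper name and sample filename are hypothetical; actual dataset filenames vary per project):

```python
from pathlib import Path

import pandas as pd

# Hypothetical helper; each project's Dataset/ contents differ.
DATASET_DIR = Path("Dataset")

def load_raw(name: str) -> pd.DataFrame:
    """Load a raw CSV shipped under Dataset/."""
    return pd.read_csv(DATASET_DIR / name)
```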

Preprocessing

The src/Processing/preprocessing.py module handles all data cleaning: dropping high-missing-value columns, filling gaps with statistical defaults, stripping irrelevant identifiers, and normalizing types. This stage converts raw data into a format that the feature engineering step can consume directly.
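A sketch of such a cleaning pass is below; the function name and the 50% missing-value threshold are assumptions for illustration, not the module's actual API:

```python
import pandas as pd

def clean(df: pd.DataFrame, missing_threshold: float = 0.5) -> pd.DataFrame:
    """Illustrative cleaning pass: drop sparse columns, fill gaps with defaults."""
    df = df.copy()
    # Drop columns where more than `missing_threshold` of values are missing.
    keep = df.columns[df.isna().mean() <= missing_threshold]
    df = df[keep]
    # Fill remaining gaps: median for numeric columns, mode for categoricals.
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df
```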

Feature Engineering

Categorical variables are one-hot encoded into binary columns. Boolean fields are cast to integers. The result is a fixed-width numeric vector whose column order matches the feature order seen during training. For inference, this transformation is applied to each incoming request.
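The column-alignment trick is the key detail: an incoming request may not contain every category seen in training, so the encoded row is reindexed against the training column order. A hedged sketch (function name assumed):

```python
import numpy as np
import pandas as pd

def encode(df: pd.DataFrame, training_columns: list) -> np.ndarray:
    """One-hot encode a frame and align it to the training column order."""
    encoded = pd.get_dummies(df, dtype=int)  # booleans/categories -> 0/1 ints
    # Reindex so column order matches training; categories unseen in this
    # request become all-zero columns.
    encoded = encoded.reindex(columns=training_columns, fill_value=0)
    return encoded.to_numpy(dtype=float)
```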

Model Training

Training runs inside the project’s Jupyter notebook (Project_Notebook.ipynb). Multiple algorithm variants are often trained and compared — for example, Linear Regression, Ridge, and Lasso in the House Price project — so that the best-performing or ensemble approach can be serialized to Models/.
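The House Price variant comparison could be sketched as below; the output filenames are assumptions, and the real training happens in the notebook rather than a standalone function:

```python
import os

import joblib
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

def train_and_save(X: np.ndarray, y: np.ndarray, out_dir: str = "Models") -> None:
    """Fit the three variants and serialize each to out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    variants = {
        "linear_regression": LinearRegression(),
        "ridge": Ridge(),
        "lasso": Lasso(),
    }
    for name, model in variants.items():
        model.fit(X, y)
        joblib.dump(model, os.path.join(out_dir, f"{name}.joblib"))
```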

Evaluation

Models are scored using task-appropriate metrics (R² for regression, accuracy/F1 for classification). Results are captured in the notebook and summarized in the project README, providing a reproducible audit trail.
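With scikit-learn, the scoring step looks roughly like this (the numbers are toy values for illustration, not results from any project):

```python
from sklearn.metrics import accuracy_score, f1_score, r2_score

# Regression: R² on toy hold-out predictions.
y_true = [200_000, 150_000, 310_000]
y_pred = [195_000, 160_000, 300_000]
r2 = r2_score(y_true, y_pred)

# Classification: accuracy and F1 on toy labels.
labels = [1, 0, 1, 1]
preds = [1, 0, 0, 1]
acc = accuracy_score(labels, preds)
f1 = f1_score(labels, preds)
print(f"R²={r2:.3f}, accuracy={acc:.2f}, F1={f1:.2f}")
```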

Deployment API

src/App.py wraps the trained model in a Flask REST API. It validates incoming requests, delegates to the preprocessing pipeline, runs inference through predictor.py, and returns a JSON response. CORS is enabled by default so web platforms can call the API directly.
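The shape of that wrapper can be sketched as follows. The `fake_preprocess` and `fake_predict` stand-ins are hypothetical placeholders for the real preprocessing and predictor modules, and the real App.py additionally enables CORS via flask-cors:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def fake_preprocess(prompt: str) -> list:
    # Stand-in for the real preprocessing pipeline.
    return [float(len(prompt))]

def fake_predict(features: list) -> float:
    # Stand-in for the real model-averaging inference.
    return features[0] * 1000.0

@app.route("/predict", methods=["POST"])
def predict():
    body = request.get_json(silent=True) or {}
    if "prompt" not in body:
        return jsonify({"error": "missing 'prompt' key"}), 400
    features = fake_preprocess(body["prompt"])
    return jsonify({"predicted_sale_price": fake_predict(features)})
```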

Prediction Output

src/Output/predictor.py loads all serialized models from Models/ at startup and exposes a single function that returns the final prediction — often an average across multiple model predictions.
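A minimal sketch of that load-once, average-on-predict pattern (function names and the .joblib glob are assumptions):

```python
from pathlib import Path

import joblib
import numpy as np

def load_models(model_dir: str = "Models") -> list:
    """Load every serialized model found in model_dir (done once at startup)."""
    return [joblib.load(p) for p in sorted(Path(model_dir).glob("*.joblib"))]

def predict_price(models: list, features: np.ndarray) -> float:
    """Return the mean of all model predictions for one feature vector."""
    preds = [m.predict(features.reshape(1, -1))[0] for m in models]
    return float(np.mean(preds))
```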

API flow

Once a project is deployed, client interactions follow a second pipeline that runs inside the server on every request.

Walking through a complete pipeline execution

1. Start the Flask server

Run the application entry point. Flask starts on port 5000 by default.

```shell
python src/App.py
```
2. Send a prediction request

POST a natural-language prompt describing the input to the /predict endpoint.

```shell
curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{"prompt": "3-bedroom house built in 1990, lot area 8500 sq ft, RL zoning"}'
```
3. Input validation

The API checks that a prompt key is present in the request body and that GEMINI_API_KEY is set in the environment. Missing either returns an error before any processing occurs.
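Those two guard clauses can be sketched as a pure function (the function name and error strings are hypothetical; the real checks live inside the Flask route):

```python
import os

def validate(body, env=None):
    """Return an error message, or None if the request may proceed."""
    env = os.environ if env is None else env
    # Reject before any preprocessing or model work happens.
    if "prompt" not in body:
        return "missing 'prompt' key in request body"
    if not env.get("GEMINI_API_KEY"):
        return "GEMINI_API_KEY is not set"
    return None
```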
4. Preprocessing and feature extraction

preprocess_prompt() in src/Processing/preprocessing.py sends the natural-language prompt to the Gemini API, which returns a structured JSON object containing raw feature values. Missing values are filled with dataset defaults, categories are validated, and the result is one-hot encoded into a NumPy row vector.
5. Model inference

predict_price() in src/Output/predictor.py passes the feature vector to all three loaded models (Linear Regression, Ridge, Lasso) and returns the average of their predictions as a single float.
6. Response

The Flask route serializes the prediction to JSON and returns it to the caller.
```json
{
  "predicted_sale_price": 185000
}
```
