Every project in the 100-ML-AI-Project repository is built around the same end-to-end machine learning pipeline. This shared architecture makes each project self-contained and reproducible: data flows in one direction through well-defined stages, and each stage has a corresponding module in the project's src/ directory. Whether the project is a regression model for house prices or an NLP classifier, the same structural contract applies.
Pipeline overview
The standard ML pipeline progresses through seven sequential stages. Each stage consumes the output of the previous one and produces a well-typed artifact for the next.

Pipeline stages
| Stage | Module | Output artifact |
|---|---|---|
| Input Data | Dataset/ | Raw CSV / image files |
| Preprocessing | src/Processing/preprocessing.py | Cleaned DataFrame |
| Feature Engineering | src/Processing/preprocessing.py | One-hot encoded NumPy array |
| Model Training | Project notebook | Serialized .pkl / .joblib model |
| Evaluation | Project notebook | R² score, accuracy, F1 metrics |
| Deployment API | src/App.py | Running Flask server on port 5000 |
| Prediction Output | src/Output/predictor.py | JSON response with prediction |
Input Data
Raw datasets are stored under Dataset/ as CSV files, image folders, or structured tabular data. Each project ships with the dataset it was trained on so the full pipeline can be reproduced without external downloads.
Preprocessing
The src/Processing/preprocessing.py module handles all data cleaning: dropping high-missing-value columns, filling gaps with statistical defaults, stripping irrelevant identifiers, and normalizing types. This stage converts raw data into a format that the feature engineering step can consume directly.
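A minimal sketch of such a cleaning pass, assuming pandas; the missing-value threshold, fill strategy, and identifier names here are illustrative, not the module's exact logic:

```python
import pandas as pd


def clean_dataframe(df: pd.DataFrame, missing_threshold: float = 0.5) -> pd.DataFrame:
    """Illustrative cleaning pass mirroring the steps described above."""
    # Drop columns where more than half the values are missing.
    df = df.loc[:, df.isna().mean() <= missing_threshold]

    # Fill remaining gaps with statistical defaults:
    # medians for numeric columns, modes for everything else.
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])

    # Strip identifier-like columns that carry no predictive signal.
    drop_cols = [c for c in df.columns if c.lower() in {"id", "index"}]
    return df.drop(columns=drop_cols)
```

The output is a DataFrame with no missing values and no identifier columns, which the one-hot encoding step can consume directly.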
Feature Engineering
Categorical variables are one-hot encoded into binary columns. Boolean fields are cast to integers. The result is a fixed-width numeric vector whose column order matches the feature order seen during training. For inference, this transformation is applied to each incoming request.

Model Training
Training runs inside the project’s Jupyter notebook (Project_Notebook.ipynb). Multiple algorithm variants are often trained and compared — for example, Linear Regression, Ridge, and Lasso in the House Price project — so that the best-performing or ensemble approach can be serialized to Models/.
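The train-compare-serialize loop described above might be sketched as follows, assuming scikit-learn and joblib; the hyperparameters, split ratio, and output filenames are illustrative:

```python
import joblib
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def train_and_serialize(X, y, models_dir="Models"):
    """Fit each algorithm variant, score it on a held-out split,
    and serialize every fitted model so predictor.py can reload it."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    variants = {
        "linear_regression": LinearRegression(),
        "ridge": Ridge(alpha=1.0),
        "lasso": Lasso(alpha=0.1),
    }
    scores = {}
    for name, model in variants.items():
        model.fit(X_train, y_train)
        scores[name] = r2_score(y_test, model.predict(X_test))
        joblib.dump(model, f"{models_dir}/{name}.pkl")
    return scores
```

Returning the per-model R² scores alongside the serialized artifacts gives the notebook the comparison table it needs to pick the best variant or keep all three for ensembling.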
Evaluation
Models are scored using task-appropriate metrics (R² for regression, accuracy/F1 for classification). Results are captured in the notebook and summarized in the project README, providing a reproducible audit trail.

Deployment API
src/App.py wraps the trained model in a Flask REST API. It validates incoming requests, delegates to the preprocessing pipeline, runs inference through predictor.py, and returns a JSON response. CORS is enabled by default so web platforms can call the API directly.
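A minimal sketch of such an App.py, using the /predict route, prompt key, and GEMINI_API_KEY check described in the API flow below. The preprocess_prompt and predict_price bodies here are stand-ins so the sketch runs on its own (in the real App.py they are imported from the processing and output modules), and CORS is handled with a plain after_request header rather than the flask-cors extension:

```python
import os

import numpy as np
from flask import Flask, jsonify, request


# Stand-ins for the project's real modules; App.py would import these
# from processing.preprocessing and output.predictor instead.
def preprocess_prompt(prompt: str) -> np.ndarray:
    return np.zeros((1, 4))  # placeholder feature row


def predict_price(features: np.ndarray) -> float:
    return 0.0  # placeholder prediction


app = Flask(__name__)


@app.after_request
def enable_cors(response):
    # Allow web platforms to call the API directly.
    response.headers["Access-Control-Allow-Origin"] = "*"
    return response


@app.route("/predict", methods=["POST"])
def predict():
    body = request.get_json(silent=True) or {}
    # Validate before doing any work: both checks described in the API flow.
    if "prompt" not in body:
        return jsonify({"error": "missing 'prompt' in request body"}), 400
    if not os.environ.get("GEMINI_API_KEY"):
        return jsonify({"error": "GEMINI_API_KEY is not set"}), 500

    features = preprocess_prompt(body["prompt"])  # NumPy row vector
    return jsonify({"prediction": predict_price(features)})


if __name__ == "__main__":
    app.run(port=5000)
```

A client would then call it with something like `curl -X POST localhost:5000/predict -H 'Content-Type: application/json' -d '{"prompt": "..."}'`.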
Prediction Output
src/Output/predictor.py loads all serialized models from Models/ at startup and exposes a single function that returns the final prediction — often an average across multiple model predictions.
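The load-once, average-across-models behavior might look like the sketch below, assuming joblib-serialized models under a Models/ directory (the .pkl glob and function split are illustrative):

```python
from pathlib import Path

import joblib
import numpy as np


def load_models(models_dir="Models"):
    """Load every serialized model once, at startup."""
    return [joblib.load(p) for p in sorted(Path(models_dir).glob("*.pkl"))]


def predict_price(features: np.ndarray, models) -> float:
    """Average the predictions of all loaded models for one feature row."""
    preds = [float(m.predict(features)[0]) for m in models]
    return sum(preds) / len(preds)
```

Averaging across the serialized variants gives a simple ensemble without any extra training step.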
API flow
Once a project is deployed, client interactions follow a second pipeline that runs inside the server on every request:

Walking through a complete pipeline execution
Send a prediction request
POST a natural-language prompt describing the input to the /predict endpoint.

Input validation
The API checks that a prompt key is present in the request body and that GEMINI_API_KEY is set in the environment. If either is missing, the API returns an error before any processing occurs.

Preprocessing and feature extraction
preprocess_prompt() in processing/preprocessing.py sends the natural-language prompt to the Gemini API, which returns a structured JSON object containing raw feature values. Missing values are filled with dataset defaults, categories are validated, and the result is one-hot encoded into a NumPy row vector.

Model inference
predict_price() in output/predictor.py passes the feature vector to all three loaded models (Linear Regression, Ridge, Lasso) and returns the average of their predictions as a single float.
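The encoding step feeding this inference call — mapping one validated request onto the fixed column order seen during training — might look like the sketch below. The feature names here are purely illustrative; the real column list comes from the one-hot encoding performed at training time:

```python
import numpy as np

# Training-time feature columns (illustrative). The real list is whatever
# column order the one-hot encoding produced during training.
FEATURE_COLUMNS = ["area_sqft", "bedrooms", "city_Mumbai", "city_Pune", "has_garden"]


def encode_request(raw: dict) -> np.ndarray:
    """Map one validated request onto the fixed training column order."""
    row = np.zeros((1, len(FEATURE_COLUMNS)))
    for i, col in enumerate(FEATURE_COLUMNS):
        if col in raw:
            # Numeric or boolean feature present directly in the request.
            row[0, i] = float(raw[col])
        elif "_" in col:
            # One-hot column like "city_Pune": set 1.0 when the category matches.
            field, _, value = col.partition("_")
            if str(raw.get(field)) == value:
                row[0, i] = 1.0
    return row
```

Because the column order is fixed, the resulting row vector can be passed straight to each loaded model and the predictions averaged into the final float.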