The 100-ML-AI-Project repository spans supervised regression, classification, NLP, computer vision, reinforcement learning, and generative AI. Each project ships with the dataset it was trained on, stored in a Dataset/ subdirectory alongside the model and source code. Datasets are sourced from three platforms, each suited to a different problem domain. Understanding where datasets come from and how to access them is the first step toward reproducing any project from scratch.
| Platform | Use Case | Example Datasets | Access Method |
| --- | --- | --- | --- |
| Kaggle | Tabular regression and classification (structured data) | Housing prices, employee churn, Titanic survival, medical costs | Kaggle API (kaggle datasets download) or browser download |
| Hugging Face Datasets | NLP tasks: text classification, sentiment, emotion detection | dair-ai/emotion, sentiment analysis corpora | datasets Python library (load_dataset) |
| Public ML Repositories | Image benchmarks and vision tasks | CIFAR-10, MNIST, food image collections | Direct download URLs, torchvision.datasets, tensorflow.keras.datasets |
Most Kaggle datasets require a free Kaggle account and acceptance of the dataset’s individual license terms before download. Some competition datasets additionally require joining the specific competition. Check each dataset’s license before using it in any published work. Many Hugging Face datasets are published under open licenses such as Apache 2.0 or CC BY 4.0, but terms vary by dataset, so always review the dataset card.
Datasets by project category
Supervised Learning (projects 01–29)
Supervised projects use tabular CSV datasets from Kaggle. Each file contains labeled rows suitable for regression or binary/multi-class classification.
| Project | Dataset | Problem type |
| --- | --- | --- |
| 01 House Price Prediction | HousePricePrediction.csv (Kaggle) | Regression (SalePrice) |
| 02 Employee Retention Prediction | Employee churn dataset (Kaggle) | Classification |
| 04 Medical Cost Prediction | Medical insurance costs (Kaggle) | Regression |
| 05 Titanic Survival Prediction | Titanic passenger manifest (Kaggle) | Binary classification |
| 06 Email Spam Classification | Email corpus with spam labels (Kaggle) | Binary classification |
| 10 Used Car Price Prediction | Used car listings (Kaggle) | Regression |
| 11 Mobile Price Range Prediction | Mobile phone specs (Kaggle) | Multi-class classification |
| 13 Hotel Booking Cancellation | Hotel booking records (Kaggle) | Binary classification |
| 14 Crop Yield Prediction | Agricultural yield data (Kaggle) | Regression |
| 26 Credit Loan Approval | Loan application records (Kaggle) | Binary classification |
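After downloading one of these CSV files, a quick sanity check can confirm that the expected target column is present before any training code runs. The following is a minimal sketch using only the Python standard library. The helper name `summarize_csv` is hypothetical and not part of the repository; the target name `SalePrice` comes from the table above.

```python
import csv
from pathlib import Path


def summarize_csv(path: Path, target: str) -> dict:
    """Count rows and missing target values in a labeled CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        columns = reader.fieldnames or []
        if target not in columns:
            raise KeyError(f"target column {target!r} not found in {path}")
        n_rows = 0
        n_missing = 0
        for row in reader:
            n_rows += 1
            # Treat empty or absent values as missing labels.
            if not (row[target] or "").strip():
                n_missing += 1
    return {"columns": list(columns), "rows": n_rows, "missing_target": n_missing}
```

A call such as `summarize_csv(Path("Dataset/HousePricePrediction.csv"), "SalePrice")` would report the column names, the row count, and how many rows lack a label. The path shown here is an assumption; adjust it to the project's actual layout.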
Natural Language Processing (projects 16, 40–41)
NLP projects load datasets from Hugging Face, which provides versioned, pre-split datasets through the datasets library. Public datasets such as dair-ai/emotion can be loaded without an account; gated datasets require signing in and accepting their terms.
| Project | Dataset | Source |
| --- | --- | --- |
| 16 Text Emotion Detection | dair-ai/emotion (6-class emotion labels) | Hugging Face |
| 40 Resume Keyword Extractor | Resume text corpus | Hugging Face / custom |
| 41 Sentiment Analysis | Sentiment-labeled review dataset | Hugging Face |
Computer Vision & Deep Learning (projects 12, 30–34)
Vision projects use standard image benchmark datasets. Many are available directly through deep learning framework dataset utilities without a manual download step.
| Project | Dataset | Source |
| --- | --- | --- |
| 30 Binary Image Classification | Custom binary image dataset | Public repository |
| 31 Food Image Classification | Food-101 or similar food image dataset | Public repository |
| 32 CIFAR-10 Classification | CIFAR-10 (60,000 32×32 colour images, 10 classes) | tensorflow.keras.datasets / torchvision |
| 33 MNIST Digit Classification | MNIST (70,000 28×28 greyscale digit images) | tensorflow.keras.datasets / torchvision |
| 12 Date Fruit Classification | Date fruit image dataset | Public repository |
Downloading with the Kaggle API
Install the Kaggle CLI and place your kaggle.json API token in ~/.kaggle/ before running any download command.
# Install the Kaggle CLI
pip install kaggle
# Place your API token (download from kaggle.com → Account → API)
mkdir -p ~/.kaggle
cp kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json
# Download the House Prices competition data (join the competition on kaggle.com first)
kaggle competitions download -c house-prices-advanced-regression-techniques
# Unzip into the project's Dataset directory
unzip house-prices-advanced-regression-techniques.zip \
    -d ML_To_Train/01_House_Price_Predict/dataset/
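The token placement and unzip steps above can also be scripted, which helps when setting up several projects at once. The following is a minimal sketch using only the Python standard library. The helper names `token_is_private` and `extract_archive` are hypothetical and are not part of the Kaggle CLI or this repository. The permission check assumes a POSIX system.

```python
import stat
import zipfile
from pathlib import Path


def token_is_private(token_path: Path) -> bool:
    """Return True if the token file exists and only its owner can access it."""
    if not token_path.is_file():
        return False
    mode = stat.S_IMODE(token_path.stat().st_mode)
    # Any group/other permission bit means the file is more open than chmod 600.
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


def extract_archive(zip_path: Path, dest_dir: Path) -> list[str]:
    """Unzip an archive into dest_dir and return the member names it contained."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()
```

For example, `token_is_private(Path.home() / ".kaggle" / "kaggle.json")` checks the token before a download, and `extract_archive` can replace the `unzip` call once the archive is on disk.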
Loading from Hugging Face
from datasets import load_dataset
# Load the emotion dataset used in Project 16
dataset = load_dataset("dair-ai/emotion")
# Access splits
train_data = dataset["train"]
test_data = dataset["test"]
print(train_data.features)
# {'text': Value('string'), 'label': ClassLabel(names=['sadness', 'joy', ...])}
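Because the `label` column holds integer class ids, it can be useful to see how the emotion classes are distributed before training. The following is a minimal sketch, not code from the repository: `label_distribution` is a hypothetical helper, and the hard-coded class order is an assumption that should be replaced with the names read from the dataset's own `ClassLabel` feature.

```python
from collections import Counter

# Assumed class order for dair-ai/emotion; prefer reading it from
# train_data.features["label"].names instead of hard-coding it.
EMOTION_NAMES = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def label_distribution(labels, names):
    """Map integer label ids to class names and count each class."""
    counts = Counter(names[i] for i in labels)
    # Include classes with zero examples so gaps are visible.
    return {name: counts.get(name, 0) for name in names}
```

With the dataset loaded as above, a call like `label_distribution(train_data["label"], train_data.features["label"].names)` would return one count per class, which makes class imbalance easy to spot.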
Loading vision datasets via framework utilities
# MNIST via TensorFlow / Keras
import tensorflow as tf
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
# CIFAR-10 via PyTorch
import torchvision
import torchvision.transforms as transforms
trainset = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True,
    transform=transforms.ToTensor()
)
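The Keras MNIST loader returns raw `uint8` arrays of shape `(N, 28, 28)`, which are usually scaled before training. The following is a minimal sketch of that step, assuming NumPy is installed and the input is a batch of greyscale images. `prepare_images` is a hypothetical helper, not a function from the repository or from Keras.

```python
import numpy as np


def prepare_images(images: np.ndarray) -> np.ndarray:
    """Scale uint8 pixels to [0, 1] and add a trailing channel axis."""
    scaled = images.astype("float32") / 255.0
    # (N, H, W) -> (N, H, W, 1), the channels-last layout Keras conv layers expect.
    return scaled[..., np.newaxis]
```

Applied to the arrays loaded above, `prepare_images(X_train)` would produce a `float32` array of shape `(60000, 28, 28, 1)`. The PyTorch example does not need this step, because `transforms.ToTensor()` already scales pixels to the `[0, 1]` range.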