

Unsupervised learning finds structure in data without labelled examples. This section brings together four projects that cover the most practical unsupervised techniques: clustering customers by behaviour, detecting anomalies in telemetry streams, extracting hidden topics from document corpora, and recommending movies through collaborative filtering. Each project follows the repository’s standard modular pipeline and can be run end-to-end from the provided Jupyter notebook.

Project comparison

| Project | Category | Algorithm | Dataset | Problem Type |
| --- | --- | --- | --- | --- |
| SmartCart Clustering (09) | Clustering | K-Means + Agglomerative | SmartCart customer data (2 240 rows, 22 features) | Customer segmentation |
| Anomaly Detection (24) | Anomaly detection | Isolation Forest / statistical methods | Tabular sensor / transaction data | Outlier identification |
| Document Topic Modelling (25) | Topic modelling | LDA / NMF | Text corpus | Latent topic extraction |
| Movie Recommendation (22) | Recommendation | Collaborative filtering | MovieLens-style ratings | Personalised suggestions |

SmartCart Clustering (09)

What the project does

SmartCart Clustering segments a retail customer base into distinct behavioural groups so that marketing campaigns can be personalised per segment. Starting from a raw CRM export of 2 240 customers, the notebook engineers meaningful features, reduces dimensionality with PCA, and applies two clustering algorithms to discover natural groupings.

Algorithm used

K-Means (primary) with the optimal number of clusters selected automatically via the Elbow method and validated by the Silhouette score. A second pass uses Agglomerative Clustering (Ward linkage) on the same PCA-projected features, and the two label sets are compared visually in 3-D scatter plots.
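
A minimal sketch of this selection loop, using synthetic blobs in place of the PCA-projected customer features:

# Synthetic stand-in for the PCA-projected features (not the real dataset).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from kneed import KneeLocator

X, _ = make_blobs(n_samples=500, centers=4, n_features=3, random_state=42)

# Elbow method: compute inertia for each candidate k, locate the knee with kneed.
ks = range(2, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]
best_k = KneeLocator(list(ks), inertias, curve="convex", direction="decreasing").elbow or 4  # fall back if no clear knee

# Fit both algorithms at the chosen k and validate with silhouette scores.
kmeans_labels = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit_predict(X)
ward_labels = AgglomerativeClustering(n_clusters=best_k, linkage="ward").fit_predict(X)
print("k =", best_k)
print("K-Means silhouette:", silhouette_score(X, kmeans_labels))
print("Ward silhouette:", silhouette_score(X, ward_labels))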

Dataset / domain

dataset/SmartCartCustomers.csv – 2 240 rows × 22 columns including Income, Recency, spending columns (MntWines, MntFruits, MntMeatProducts, …), purchase-channel counts, Education, Marital_Status, and a campaign Response flag.

Key techniques

  • Feature engineering – derived Age from Year_Birth, Customer_Tenure_Days from Dt_Customer, Total_Spending by summing product categories, and Total_Children = Kidhome + Teenhome.
  • Preprocessing – median imputation for the 24 missing Income values; label consolidation for Education (Basic/2n Cycle → Undergraduate, Graduation → Graduate, Master/PhD → Postgraduate) and Marital_Status (Married/Together → Partner, others → Alone).
  • Outlier removal – customers with Age > 90 or Income > 600 000 are dropped (2 240 → 2 236 rows).
  • Encoding & scaling – OneHotEncoder for categorical columns, then StandardScaler applied to the full feature matrix.
  • Dimensionality reduction – PCA(n_components=3) retaining the first three principal components for clustering and visualisation (the encode → scale → PCA chain is sketched after this list).
  • Cluster validation – KneeLocator from the kneed library pinpoints the elbow at k = 4; silhouette scores confirm this choice.
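
A minimal sketch of the encode → scale → PCA chain; the toy DataFrame and its values are illustrative, and only the column names follow the dataset description above:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.decomposition import PCA

# Hypothetical slice of the engineered feature table.
df = pd.DataFrame({
    "Income": [58000.0, 46000.0, 71000.0, 26000.0],
    "Recency": [58, 38, 26, 26],
    "Total_Spending": [1617, 27, 776, 53],
    "Education": ["Graduate", "Graduate", "Postgraduate", "Undergraduate"],
    "Marital_Status": ["Alone", "Alone", "Partner", "Partner"],
})

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(sparse_output=False), ["Education", "Marital_Status"]),  # sparse_output needs sklearn >= 1.2
    ("num", "passthrough", ["Income", "Recency", "Total_Spending"]),
])
pipeline = Pipeline([
    ("encode", preprocess),
    ("scale", StandardScaler()),   # scale the full encoded matrix
    ("pca", PCA(n_components=3)),  # keep three components, as in the notebook
])
X_pca = pipeline.fit_transform(df)
print(X_pca.shape)  # (4, 3)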

How to run

# 1. Install dependencies
pip install pandas matplotlib seaborn scikit-learn kneed

# 2. Launch the notebook
jupyter notebook SmartCartClusteringSystem.ipynb

The notebook is self-contained. The dataset must be placed at dataset/SmartCartCustomers.csv relative to the notebook file.

Anomaly Detection (24)

What the project does

This project identifies anomalous records — transactions, sensor readings, or log events that deviate significantly from the norm — without requiring labelled anomaly examples. It is designed for domains where anomalies are rare and costly (fraud, hardware failure, intrusion detection).

Algorithm used

Isolation Forest (primary unsupervised anomaly detector) alongside statistical boundary methods such as Z-score and IQR-based filtering. Isolation Forest isolates observations by randomly partitioning the feature space; anomalies require fewer splits and therefore receive lower anomaly scores.
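
A minimal sketch of the scoring step on synthetic points (the project's actual dataset is not assumed):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(300, 2)),  # dense "normal" cluster
    rng.uniform(-6, 6, size=(10, 2)),     # sparse outliers
])

model = IsolationForest(contamination=0.03, random_state=42).fit(X)
labels = model.predict(X)            # +1 = inlier, -1 = anomaly
scores = model.decision_function(X)  # lower = more anomalous
print("flagged anomalies:", int((labels == -1).sum()))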

Dataset / domain

Tabular data sourced from Kaggle (sensor telemetry or transactional records). The exact CSV is loaded via the standard dataset/ path in the project folder.

Key techniques

  • Unsupervised scoring – no labels needed; the model assigns an anomaly score to every row, and the contamination setting turns those scores into inlier/outlier labels.
  • Threshold tuning – the contamination hyperparameter controls the expected anomaly fraction and is adjusted based on domain knowledge; the classical Z-score and IQR cut-offs sketched after this list serve the same thresholding purpose.
  • Visualisation – anomalous points are highlighted in scatter and time-series plots to aid interpretability.
  • Pipeline integration – preprocessing (scaling, encoding) feeds directly into the Isolation Forest estimator following the repository’s standard SRC/Processing → model flow.
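
A sketch of the classical statistical baselines on a single synthetic column; the 3-sigma and 1.5 × IQR cut-offs are conventional defaults, not values taken from the notebook:

import numpy as np
import pandas as pd

values = pd.Series(np.random.default_rng(0).normal(100, 15, size=1000))

# Z-score rule: flag points more than 3 standard deviations from the mean.
z = (values - values.mean()) / values.std()
z_outliers = values[z.abs() > 3]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(len(z_outliers), "z-score outliers,", len(iqr_outliers), "IQR outliers")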

How to run

pip install pandas scikit-learn matplotlib seaborn

jupyter notebook  # open the notebook inside 24_Anomaly_Detection/

Place your dataset CSV in the dataset/ subdirectory and update the filename reference in the data-loading cell.

Document Topic Modelling (25)

What the project does

Topic modelling discovers latent thematic structure in a collection of text documents. Given a corpus, the model automatically groups vocabulary into coherent topics and assigns each document a topic mixture, enabling downstream tasks such as document search, summarisation, and content recommendation.

Algorithm used

Latent Dirichlet Allocation (LDA) (probabilistic generative model, implemented via sklearn.decomposition.LatentDirichletAllocation) and Non-negative Matrix Factorisation (NMF) (linear-algebraic decomposition via sklearn.decomposition.NMF). Both algorithms operate on a bag-of-words or TF-IDF representation of the corpus.
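
A minimal sketch of both decompositions on a toy corpus; the documents and topic count are illustrative:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

docs = [
    "the team won the football match",
    "the election results were announced today",
    "the striker scored a late goal",
    "voters went to the polls this morning",
]

# LDA on raw counts; perplexity guides the choice of n_components.
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=42).fit(counts)
print("LDA perplexity:", lda.perplexity(counts))

# NMF on TF-IDF; reconstruction error plays the same role.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
nmf = NMF(n_components=2, random_state=42).fit(tfidf)
print("NMF reconstruction error:", nmf.reconstruction_err_)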

Dataset / domain

A text corpus sourced from Kaggle or Hugging Face Datasets (news articles, research abstracts, or product reviews). Documents are stored as raw text or CSV and loaded from the dataset/ directory.

Key techniques

  • Text preprocessing – tokenisation, stop-word removal, lemmatisation with NLTK or spaCy.
  • Vectorisation – CountVectorizer (for LDA) and TfidfVectorizer (for NMF) from scikit-learn.
  • Hyperparameter search – number of topics (n_components) evaluated by perplexity (LDA) or reconstruction error (NMF).
  • Topic coherence – top-N words per topic displayed and optionally scored with gensim’s coherence metrics.
  • Document–topic matrix – each document is represented as a distribution over discovered topics (both the topic words and the document mixtures are sketched after this list).
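
A sketch of both views, repeating the toy corpus so the block runs on its own:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the team won the football match",
    "the election results were announced today",
    "the striker scored a late goal",
    "voters went to the polls this morning",
]
vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=42).fit(counts)

# Top-N words per topic, read off the components_ matrix.
vocab = vec.get_feature_names_out()
for idx, weights in enumerate(lda.components_):
    top = vocab[np.argsort(weights)[::-1][:5]]
    print(f"Topic {idx}: {', '.join(top)}")

# Document–topic matrix: one row per document, one column per topic.
print(lda.transform(counts).round(2))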

How to run

pip install pandas scikit-learn nltk matplotlib

# Download NLTK assets
python -c "import nltk; nltk.download('stopwords'); nltk.download('wordnet')"

jupyter notebook  # open the notebook inside 25_Document_Topic_Modelling/

Movie Recommendation (22)

What the project does

This project builds a personalised movie recommender that predicts the rating a user would give to an unseen film and ranks candidates accordingly. It mirrors the core engine behind streaming-platform suggestion carousels.

Algorithm used

Collaborative Filtering — specifically user-based or item-based similarity computed via cosine distance or Pearson correlation on the user–item rating matrix. An optional matrix-factorisation variant (SVD via surprise or scipy) decomposes the sparse rating matrix into latent user and item embeddings.
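
The two similarity choices can be sketched on a tiny dense matrix (zeros mark unrated films; the notebook works on a much larger, sparser matrix):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

ratings = np.array([
    [5.0, 4.0, 1.0, 0.0],  # user 0
    [4.0, 5.0, 2.0, 0.0],  # user 1, similar tastes to user 0
    [1.0, 2.0, 5.0, 4.0],  # user 2
])
print(cosine_similarity(ratings).round(2))  # cosine between user rating vectors
print(np.corrcoef(ratings).round(2))        # Pearson correlation between users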

Dataset / domain

A MovieLens-style ratings dataset (userId, movieId, rating, timestamp) sourced from Kaggle. The dataset is stored in the dataset/ subdirectory.

Key techniques

  • Sparse matrix construction – pivot the ratings DataFrame into a user × movie matrix; missing values represent unrated films.
  • Similarity computation – cosine similarity between user vectors (user-based CF) or between item vectors (item-based CF).
  • Top-N generation – for a target user, retrieve the K most similar users, aggregate their ratings of films the target has not yet seen, and return the highest-predicted movies (see the sketch after this list).
  • Evaluation – RMSE and MAE on a held-out test split.
  • Cold-start handling – popularity-based fallback for new users with no rating history.
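
A minimal sketch of the pivot → similarity → top-N flow; the ratings frame, target user, and neighbourhood size are all illustrative:

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3],
    "movieId": [10, 20, 30, 10, 20, 30, 40],
    "rating":  [5.0, 4.0, 1.0, 5.0, 5.0, 4.0, 5.0],
})
matrix = ratings.pivot_table(index="userId", columns="movieId", values="rating").fillna(0)

sim = cosine_similarity(matrix)                # user × user similarity
target = 0                                     # row index of userId 1
neighbours = sim[target].argsort()[::-1][1:3]  # the two most similar users

# Score films the target has not rated, weighting neighbour ratings by similarity.
unseen = matrix.columns[matrix.iloc[target] == 0]
scores = {m: float((sim[target, neighbours] * matrix.iloc[neighbours][m]).sum()) for m in unseen}
print(sorted(scores.items(), key=lambda kv: -kv[1]))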

How to run

pip install pandas numpy scikit-learn scipy matplotlib

jupyter notebook  # open the notebook inside 22_Movies_Recommendation_System/

All four projects follow the same repository structure. Place data in dataset/, run preprocessing through SRC/Processing/, and inspect trained artefacts in Models/. Refer to the Project Structure page for the canonical folder layout.
