

Unsupervised learning finds structure in data without labelled examples. This section brings together four projects that cover the most practical unsupervised techniques: clustering customers by behaviour, detecting anomalies in telemetry streams, extracting hidden topics from document corpora, and recommending movies through collaborative filtering. Each project follows the repository’s standard modular pipeline and can be run end-to-end from the provided Jupyter notebook.

Project comparison

| Project | Category | Algorithm | Dataset | Problem Type |
| --- | --- | --- | --- | --- |
| SmartCart Clustering (09) | Clustering | K-Means + Agglomerative | SmartCart customer data (2 240 rows, 22 features) | Customer segmentation |
| Anomaly Detection (24) | Anomaly detection | Isolation Forest / statistical methods | Tabular sensor / transaction data | Outlier identification |
| Document Topic Modelling (25) | Topic modelling | LDA / NMF | Text corpus | Latent topic extraction |
| Movie Recommendation (22) | Recommendation | Collaborative filtering | MovieLens-style ratings | Personalised suggestions |

SmartCart Clustering (09)

What the project does

SmartCart Clustering segments a retail customer base into distinct behavioural groups so that marketing campaigns can be personalised per segment. Starting from a raw CRM export of 2 240 customers, the notebook engineers meaningful features, reduces dimensionality with PCA, and applies two clustering algorithms to discover natural groupings.

Algorithm used

K-Means (primary) with the optimal number of clusters selected automatically via the Elbow method and validated by the Silhouette score. A second pass uses Agglomerative Clustering (Ward linkage) on the same PCA-projected features, and the two label sets are compared visually in 3-D scatter plots.
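
A minimal sketch of this selection loop, using synthetic blobs in place of the PCA-projected customer features:

# Synthetic stand-in for the PCA-projected features (not the real dataset).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from kneed import KneeLocator

X, _ = make_blobs(n_samples=500, centers=4, n_features=3, random_state=42)

# Elbow method: compute inertia for each candidate k, locate the knee with kneed.
ks = range(2, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]
best_k = KneeLocator(list(ks), inertias, curve="convex", direction="decreasing").elbow or 4  # fall back if no clear knee

# Fit both algorithms at the chosen k and validate with silhouette scores.
kmeans_labels = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit_predict(X)
ward_labels = AgglomerativeClustering(n_clusters=best_k, linkage="ward").fit_predict(X)
print("k =", best_k)
print("K-Means silhouette:", silhouette_score(X, kmeans_labels))
print("Ward silhouette:", silhouette_score(X, ward_labels))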

Dataset / domain

dataset/SmartCartCustomers.csv – 2 240 rows × 22 columns including Income, Recency, spending columns (MntWines, MntFruits, MntMeatProducts, …), purchase-channel counts, Education, Marital_Status, and a campaign Response flag.

Key techniques

  • Feature engineering – derived Age from Year_Birth, Customer_Tenure_Days from Dt_Customer, Total_Spending by summing product categories, and Total_Children = Kidhome + Teenhome.
  • Preprocessing – median imputation for the 24 missing Income values; label consolidation for Education (Basic/2n Cycle → Undergraduate, Graduation → Graduate, Master/PhD → Postgraduate) and Marital_Status (Married/Together → Partner, others → Alone).
  • Outlier removal – customers with Age > 90 or Income > 600 000 are dropped (2 240 → 2 236 rows).
  • Encoding & scaling – OneHotEncoder for categorical columns, then StandardScaler applied to the full feature matrix.
  • Dimensionality reduction – PCA(n_components=3) retaining the first three principal components for clustering and visualisation (the encode → scale → PCA chain is sketched after this list).
  • Cluster validation – KneeLocator from the kneed library pinpoints the elbow at k = 4; silhouette scores confirm this choice.
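
A minimal sketch of the encode → scale → PCA chain; the toy DataFrame and its values are illustrative, and only the column names follow the dataset description above:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.decomposition import PCA

# Hypothetical slice of the engineered feature table.
df = pd.DataFrame({
    "Income": [58000.0, 46000.0, 71000.0, 26000.0],
    "Recency": [58, 38, 26, 26],
    "Total_Spending": [1617, 27, 776, 53],
    "Education": ["Graduate", "Graduate", "Postgraduate", "Undergraduate"],
    "Marital_Status": ["Alone", "Alone", "Partner", "Partner"],
})

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(sparse_output=False), ["Education", "Marital_Status"]),  # sparse_output needs sklearn >= 1.2
    ("num", "passthrough", ["Income", "Recency", "Total_Spending"]),
])
pipeline = Pipeline([
    ("encode", preprocess),
    ("scale", StandardScaler()),   # scale the full encoded matrix
    ("pca", PCA(n_components=3)),  # keep three components, as in the notebook
])
X_pca = pipeline.fit_transform(df)
print(X_pca.shape)  # (4, 3)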

How to run

# 1. Install dependencies
pip install pandas matplotlib seaborn scikit-learn kneed

# 2. Launch the notebook
jupyter notebook SmartCartClusteringSystem.ipynb

The notebook is self-contained. The dataset must be placed at dataset/SmartCartCustomers.csv relative to the notebook file.

Anomaly Detection (24)

What the project does

This project identifies anomalous records — transactions, sensor readings, or log events that deviate significantly from the norm — without requiring labelled anomaly examples. It is designed for domains where anomalies are rare and costly (fraud, hardware failure, intrusion detection).

Algorithm used

Isolation Forest (primary unsupervised anomaly detector) alongside statistical boundary methods such as Z-score and IQR-based filtering. Isolation Forest isolates observations by randomly partitioning the feature space; anomalies require fewer splits and therefore receive lower anomaly scores.
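
A minimal sketch of the scoring step on synthetic points (the project's actual dataset is not assumed):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(300, 2)),  # dense "normal" cluster
    rng.uniform(-6, 6, size=(10, 2)),     # sparse outliers
])

model = IsolationForest(contamination=0.03, random_state=42).fit(X)
labels = model.predict(X)            # +1 = inlier, -1 = anomaly
scores = model.decision_function(X)  # lower = more anomalous
print("flagged anomalies:", int((labels == -1).sum()))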

Dataset / domain

Tabular data sourced from Kaggle (sensor telemetry or transactional records). The exact CSV is loaded via the standard dataset/ path in the project folder.

Key techniques

  • Unsupervised scoring – no labels needed; the model assigns an anomaly score to every row, and the contamination setting turns those scores into inlier/outlier labels.
  • Threshold tuning – the contamination hyperparameter controls the expected anomaly fraction and is adjusted based on domain knowledge; the classical Z-score and IQR cut-offs sketched after this list serve the same thresholding purpose.
  • Visualisation – anomalous points are highlighted in scatter and time-series plots to aid interpretability.
  • Pipeline integration – preprocessing (scaling, encoding) feeds directly into the Isolation Forest estimator following the repository’s standard SRC/Processing → model flow.
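
A sketch of the classical statistical baselines on a single synthetic column; the 3-sigma and 1.5 × IQR cut-offs are conventional defaults, not values taken from the notebook:

import numpy as np
import pandas as pd

values = pd.Series(np.random.default_rng(0).normal(100, 15, size=1000))

# Z-score rule: flag points more than 3 standard deviations from the mean.
z = (values - values.mean()) / values.std()
z_outliers = values[z.abs() > 3]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(len(z_outliers), "z-score outliers,", len(iqr_outliers), "IQR outliers")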

How to run

pip install pandas scikit-learn matplotlib seaborn

jupyter notebook  # open the notebook inside 24_Anomaly_Detection/

Place your dataset CSV in the dataset/ subdirectory and update the filename reference in the data-loading cell.

Document Topic Modelling (25)

What the project does

Topic modelling discovers latent thematic structure in a collection of text documents. Given a corpus, the model automatically groups vocabulary into coherent topics and assigns each document a topic mixture, enabling downstream tasks such as document search, summarisation, and content recommendation.

Algorithm used

Latent Dirichlet Allocation (LDA) (probabilistic generative model, implemented via sklearn.decomposition.LatentDirichletAllocation) and Non-negative Matrix Factorisation (NMF) (linear-algebraic decomposition via sklearn.decomposition.NMF). Both algorithms operate on a bag-of-words or TF-IDF representation of the corpus.
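
A minimal sketch of both decompositions on a toy corpus; the documents and topic count are illustrative:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

docs = [
    "the team won the football match",
    "the election results were announced today",
    "the striker scored a late goal",
    "voters went to the polls this morning",
]

# LDA on raw counts; perplexity guides the choice of n_components.
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=42).fit(counts)
print("LDA perplexity:", lda.perplexity(counts))

# NMF on TF-IDF; reconstruction error plays the same role.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
nmf = NMF(n_components=2, random_state=42).fit(tfidf)
print("NMF reconstruction error:", nmf.reconstruction_err_)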

Dataset / domain

A text corpus sourced from Kaggle or Hugging Face Datasets (news articles, research abstracts, or product reviews). Documents are stored as raw text or CSV and loaded from the dataset/ directory.

Key techniques

  • Text preprocessing – tokenisation, stop-word removal, lemmatisation with NLTK or spaCy.
  • Vectorisation – CountVectorizer (for LDA) and TfidfVectorizer (for NMF) from scikit-learn.
  • Hyperparameter search – number of topics (n_components) evaluated by perplexity (LDA) or reconstruction error (NMF).
  • Topic coherence – top-N words per topic displayed and optionally scored with gensim’s coherence metrics.
  • Document–topic matrix – each document is represented as a distribution over discovered topics (both the topic words and the document mixtures are sketched after this list).
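
A sketch of both views, repeating the toy corpus so the block runs on its own:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the team won the football match",
    "the election results were announced today",
    "the striker scored a late goal",
    "voters went to the polls this morning",
]
vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=42).fit(counts)

# Top-N words per topic, read off the components_ matrix.
vocab = vec.get_feature_names_out()
for idx, weights in enumerate(lda.components_):
    top = vocab[np.argsort(weights)[::-1][:5]]
    print(f"Topic {idx}: {', '.join(top)}")

# Document–topic matrix: one row per document, one column per topic.
print(lda.transform(counts).round(2))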

How to run

pip install pandas scikit-learn nltk matplotlib

# Download NLTK assets
python -c "import nltk; nltk.download('stopwords'); nltk.download('wordnet')"

jupyter notebook  # open the notebook inside 25_Document_Topic_Modelling/

Movie Recommendation (22)

What the project does

This project builds a personalised movie recommender that predicts the rating a user would give to an unseen film and ranks candidates accordingly. It mirrors the core engine behind streaming-platform suggestion carousels.

Algorithm used

Collaborative Filtering — specifically user-based or item-based similarity computed via cosine distance or Pearson correlation on the user–item rating matrix. An optional matrix-factorisation variant (SVD via surprise or scipy) decomposes the sparse rating matrix into latent user and item embeddings.
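
The two similarity choices can be sketched on a tiny dense matrix (zeros mark unrated films; the notebook works on a much larger, sparser matrix):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

ratings = np.array([
    [5.0, 4.0, 1.0, 0.0],  # user 0
    [4.0, 5.0, 2.0, 0.0],  # user 1, similar tastes to user 0
    [1.0, 2.0, 5.0, 4.0],  # user 2
])
print(cosine_similarity(ratings).round(2))  # cosine between user rating vectors
print(np.corrcoef(ratings).round(2))        # Pearson correlation between users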

Dataset / domain

A MovieLens-style ratings dataset (userId, movieId, rating, timestamp) sourced from Kaggle. The dataset is stored in the dataset/ subdirectory.

Key techniques

  • Sparse matrix construction – pivot the ratings DataFrame into a user × movie matrix; missing values represent unrated films.
  • Similarity computation – cosine similarity between user vectors (user-based CF) or between item vectors (item-based CF).
  • Top-N generation – for a target user, retrieve the K most similar users, aggregate their ratings of films the target has not yet seen, and return the highest-predicted movies (see the sketch after this list).
  • Evaluation – RMSE and MAE on a held-out test split.
  • Cold-start handling – popularity-based fallback for new users with no rating history.
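
A minimal sketch of the pivot → similarity → top-N flow; the ratings frame, target user, and neighbourhood size are all illustrative:

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3],
    "movieId": [10, 20, 30, 10, 20, 30, 40],
    "rating":  [5.0, 4.0, 1.0, 5.0, 5.0, 4.0, 5.0],
})
matrix = ratings.pivot_table(index="userId", columns="movieId", values="rating").fillna(0)

sim = cosine_similarity(matrix)                # user × user similarity
target = 0                                     # row index of userId 1
neighbours = sim[target].argsort()[::-1][1:3]  # the two most similar users

# Score films the target has not rated, weighting neighbour ratings by similarity.
unseen = matrix.columns[matrix.iloc[target] == 0]
scores = {m: float((sim[target, neighbours] * matrix.iloc[neighbours][m]).sum()) for m in unseen}
print(sorted(scores.items(), key=lambda kv: -kv[1]))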

How to run

pip install pandas numpy scikit-learn scipy matplotlib

jupyter notebook  # open the notebook inside 22_Movies_Recommendation_System/

All four projects follow the same repository structure. Place data in dataset/, run preprocessing through SRC/Processing/, and inspect trained artefacts in Models/. Refer to the Project Structure page for the canonical folder layout.
