Unsupervised learning finds structure in data without labelled examples. This section brings together four projects that cover the most practical unsupervised techniques: clustering customers by behaviour, detecting anomalies in telemetry streams, extracting hidden topics from document corpora, and recommending movies through collaborative filtering. Each project follows the repository’s standard modular pipeline and can be run end-to-end from the provided Jupyter notebook.
Project comparison
| Project | Category | Algorithm | Dataset | Problem Type |
|---|---|---|---|---|
| SmartCart Clustering (09) | Clustering | K-Means + Agglomerative | SmartCart customer data (2 240 rows, 22 features) | Customer segmentation |
| Anomaly Detection (24) | Anomaly detection | Isolation Forest / statistical methods | Tabular sensor / transaction data | Outlier identification |
| Document Topic Modelling (25) | Topic modelling | LDA / NMF | Text corpus | Latent topic extraction |
| Movie Recommendation (22) | Recommendation | Collaborative filtering | MovieLens-style ratings | Personalised suggestions |
09 – SmartCart Clustering System
What the project does
SmartCart Clustering segments a retail customer base into distinct behavioural groups so that marketing campaigns can be personalised per segment. Starting from a raw CRM export of 2 240 customers, the notebook engineers meaningful features, reduces dimensionality with PCA, and applies two clustering algorithms to discover natural groupings.
Algorithm used
K-Means (primary) with the optimal number of clusters selected automatically via the Elbow method and validated by the Silhouette score. A second pass uses Agglomerative Clustering (Ward linkage) on the same PCA-projected features, and the two label sets are compared visually in 3-D scatter plots.
Dataset / domain
`dataset/SmartCartCustomers.csv` – 2 240 rows × 22 columns including `Income`, `Recency`, spending columns (`MntWines`, `MntFruits`, `MntMeatProducts`, …), purchase-channel counts, `Education`, `Marital_Status`, and a campaign `Response` flag.
Key techniques
- Feature engineering – derived `Age` from `Year_Birth`, `Customer_Tenure_Days` from `Dt_Customer`, `Total_Spending` by summing product categories, and `Total_Children` = `Kidhome` + `Teenhome`.
- Preprocessing – median imputation for the 24 missing `Income` values; label consolidation for `Education` (Basic/2n Cycle → Undergraduate, Graduation → Graduate, Master/PhD → Postgraduate) and `Marital_Status` (Married/Together → Partner, others → Alone).
- Outlier removal – customers with `Age > 90` or `Income > 600 000` are dropped (2 240 → 2 236 rows).
- Encoding & scaling – `OneHotEncoder` for categorical columns, then `StandardScaler` applied to the full feature matrix.
- Dimensionality reduction – `PCA(n_components=3)` retaining the first three principal components for clustering and visualisation.
- Cluster validation – `KneeLocator` from the `kneed` library pinpoints the elbow at k = 4; silhouette scores confirm this choice. A code sketch of this stage follows the list.
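The snippet below is a minimal sketch of the clustering stage under simplifying assumptions: the engineered, one-hot-encoded feature matrix is approximated by the raw numeric columns of the CSV, and rows with missing values are dropped instead of median-imputed. It is not the notebook’s exact code.

```python
# Sketch of the PCA + K-Means / Agglomerative stage (simplified preprocessing).
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from kneed import KneeLocator

df = pd.read_csv("dataset/SmartCartCustomers.csv")
X = df.select_dtypes("number").dropna()              # simplification: numeric columns only, drop NaN rows

X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=3).fit_transform(X_scaled)  # first three principal components

# Elbow method: inertia for k = 2..10, elbow located with kneed's KneeLocator.
ks = list(range(2, 11))
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_pca).inertia_ for k in ks]
best_k = KneeLocator(ks, inertias, curve="convex", direction="decreasing").elbow or 4  # doc reports k = 4

kmeans_labels = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit_predict(X_pca)
print("k =", best_k, "silhouette =", silhouette_score(X_pca, kmeans_labels))

# Second pass: Agglomerative Clustering (Ward linkage) on the same PCA projection.
agglo_labels = AgglomerativeClustering(n_clusters=best_k, linkage="ward").fit_predict(X_pca)
```

The two label sets (`kmeans_labels`, `agglo_labels`) can then be compared in the 3-D scatter plots mentioned above.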
How to run
Open the project notebook and run all cells; the data-loading cell expects `dataset/SmartCartCustomers.csv` relative to the notebook file.
24 – Anomaly Detection
What the project does
This project identifies anomalous records — transactions, sensor readings, or log events that deviate significantly from the norm — without requiring labelled anomaly examples. It is designed for domains where anomalies are rare and costly (fraud, hardware failure, intrusion detection).
Algorithm used
Isolation Forest (primary unsupervised anomaly detector) alongside statistical boundary methods such as Z-score and IQR-based filtering. Isolation Forest isolates observations by randomly partitioning the feature space; anomalies are isolated in fewer splits and therefore receive lower scores under scikit-learn’s convention (lower = more anomalous).
Dataset / domain
Tabular data sourced from Kaggle (sensor telemetry or transactional records). The exact CSV is loaded via the standard `dataset/` path in the project folder.
Key techniques
- Unsupervised scoring – no labels needed; the model assigns a contamination-based anomaly score to every row.
- Threshold tuning – the `contamination` hyperparameter controls the expected anomaly fraction and is adjusted based on domain knowledge.
- Visualisation – anomalous points are highlighted in scatter and time-series plots to aid interpretability.
- Pipeline integration – preprocessing (scaling, encoding) feeds directly into the Isolation Forest estimator following the repository’s standard `SRC/Processing` → model flow; a minimal sketch follows this list.
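As an illustration of the scoring step only, here is a hedged sketch; the file name `dataset/sensor_data.csv` is a placeholder for whichever Kaggle CSV the project uses, and only numeric columns are considered.

```python
# Illustrative Isolation Forest scoring on a generic numeric table.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest

raw = pd.read_csv("dataset/sensor_data.csv")         # placeholder file name
num = raw.select_dtypes("number").dropna()           # numeric features only
X = StandardScaler().fit_transform(num)

# contamination is the expected anomaly fraction; tune it from domain knowledge.
iso = IsolationForest(contamination=0.01, random_state=42).fit(X)
scored = num.assign(
    anomaly_score=iso.decision_function(X),          # lower (more negative) = more anomalous
    is_anomaly=iso.predict(X) == -1,                 # -1 marks outliers
)
print(scored["is_anomaly"].sum(), "rows flagged as anomalies")
```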
How to run
Place the chosen CSV in the `dataset/` subdirectory and update the filename reference in the data-loading cell.
25 – Document Topic Modelling
What the project does
Topic modelling discovers latent thematic structure in a collection of text documents. Given a corpus, the model automatically groups vocabulary into coherent topics and assigns each document a topic mixture, enabling downstream tasks such as document search, summarisation, and content recommendation.
Algorithm used
Latent Dirichlet Allocation (LDA) (a probabilistic generative model, implemented via `sklearn.decomposition.LatentDirichletAllocation`) and Non-negative Matrix Factorisation (NMF) (a linear-algebraic decomposition via `sklearn.decomposition.NMF`). Both algorithms operate on a bag-of-words or TF-IDF representation of the corpus.
Dataset / domain
A text corpus sourced from Kaggle or Hugging Face Datasets (news articles, research abstracts, or product reviews). Documents are stored as raw text or CSV and loaded from the `dataset/` directory.
Key techniques
- Text preprocessing – tokenisation, stop-word removal, lemmatisation with `NLTK` or `spaCy`.
- Vectorisation – `CountVectorizer` (for LDA) and `TfidfVectorizer` (for NMF) from scikit-learn.
- Hyperparameter search – number of topics (`n_components`) evaluated by perplexity (LDA) or reconstruction error (NMF).
- Topic coherence – top-N words per topic displayed and optionally scored with `gensim`’s coherence metrics.
- Document–topic matrix – each document is represented as a distribution over discovered topics; a short sketch follows this list.
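The sketch below runs both LDA and NMF on a tiny stand-in corpus to show the shape of the pipeline; in the project itself the documents come from `dataset/`, preprocessing includes lemmatisation, and the number of topics is tuned as described above.

```python
# Sketch of the LDA / NMF passes on a toy corpus (stand-in for the real dataset).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

documents = [
    "the spacecraft entered orbit around the red planet",
    "nasa launched a new probe toward the outer planets",
    "the striker scored twice in the championship final",
    "the league postponed the match after heavy rain",
    "astronomers detected water ice on the lunar surface",
    "the goalkeeper saved a penalty in extra time",
]
n_topics = 2

# LDA on raw term counts
count_vec = CountVectorizer(stop_words="english")
X_counts = count_vec.fit_transform(documents)
lda = LatentDirichletAllocation(n_components=n_topics, random_state=42).fit(X_counts)

# NMF on TF-IDF weights
tfidf_vec = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf_vec.fit_transform(documents)
nmf = NMF(n_components=n_topics, init="nndsvd", random_state=42).fit(X_tfidf)

def show_topics(model, feature_names, top_n=5):
    """Print the top_n highest-weighted words for each topic."""
    for idx, weights in enumerate(model.components_):
        top = [feature_names[i] for i in weights.argsort()[::-1][:top_n]]
        print(f"Topic {idx}: {', '.join(top)}")

show_topics(lda, count_vec.get_feature_names_out())
show_topics(nmf, tfidf_vec.get_feature_names_out())

# Document–topic matrix: one topic distribution per document
doc_topic = lda.transform(X_counts)
```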
How to run
Keep the corpus file in the `dataset/` directory (as raw text or CSV) and run the notebook cells in order.
22 – Movie Recommendation System
What the project does
This project builds a personalised movie recommender that predicts the rating a user would give to an unseen film and ranks candidates accordingly. It mirrors the core engine behind streaming-platform suggestion carousels.
Algorithm used
Collaborative Filtering — specifically user-based or item-based similarity computed via cosine distance or Pearson correlation on the user–item rating matrix. An optional matrix-factorisation variant (SVD via `surprise` or `scipy`) decomposes the sparse rating matrix into latent user and item embeddings.
Dataset / domain
A MovieLens-style ratings dataset (`userId`, `movieId`, `rating`, `timestamp`) sourced from Kaggle. The dataset is stored in the `dataset/` subdirectory.
Key techniques
- Sparse matrix construction – pivot the ratings DataFrame into a user × movie matrix; missing values represent unrated films.
- Similarity computation – cosine similarity between user vectors (user-based CF) or between item vectors (item-based CF).
- Top-N generation – for a target user, retrieve the K most similar users, aggregate their unseen ratings, and return the highest-predicted movies (see the sketch after this list).
- Evaluation – RMSE and MAE on a held-out test split.
- Cold-start handling – popularity-based fallback for new users with no rating history.
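Below is a minimal user-based collaborative-filtering sketch. The file name `dataset/ratings.csv` and the presence of a user with `userId` 1 are assumptions, and the RMSE/MAE evaluation and cold-start fallback are omitted for brevity.

```python
# Sketch of user-based CF: pivot ratings, cosine similarity, top-N recommendations.
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

ratings = pd.read_csv("dataset/ratings.csv")          # assumed columns: userId, movieId, rating, timestamp

# User x movie matrix; unrated films are filled with 0 for similarity purposes.
matrix = ratings.pivot_table(index="userId", columns="movieId", values="rating").fillna(0)
user_sim = pd.DataFrame(cosine_similarity(matrix), index=matrix.index, columns=matrix.index)

def recommend(user_id, k=10, top_n=5):
    """Return the top_n movies for user_id from the k most similar users' ratings."""
    neighbours = user_sim[user_id].drop(user_id).nlargest(k)      # k nearest users by cosine similarity
    scores = matrix.loc[neighbours.index].T @ neighbours          # similarity-weighted rating sums per movie
    seen = matrix.loc[user_id] > 0                                # films the target user has already rated
    return scores[~seen].nlargest(top_n)

print(recommend(user_id=1))
```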