This project classifies emails as spam or legitimate (ham) using natural language processing techniques. Raw email text is transformed into numeric feature vectors using TF-IDF (Term Frequency–Inverse Document Frequency) vectorization, and a classifier is trained to distinguish spam patterns from normal communication. The project covers the full NLP classification pipeline from text cleaning through model evaluation.
Overview
Spam detection is a classic binary text classification task. The challenge is not just accuracy: a high false positive rate (legitimate emails classified as spam) has real consequences, so precision on the spam class is a key evaluation concern alongside overall accuracy.

- Problem type: Binary text classification
- Target variable: label (0 = ham / legitimate, 1 = spam)
- Dataset: Labeled SMS/email message corpus
- Primary technique: TF-IDF vectorization + classification
Dataset
The dataset consists of labeled text messages, where each record contains the message text and a binary label indicating whether it is spam or legitimate.

| Column | Type | Description |
|---|---|---|
| label | Binary (target) | 0 = ham (legitimate), 1 = spam |
| message | Text | Raw email or SMS message content |
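As a sketch of the target encoding above, assuming pandas is available (the two sample messages and the inline DataFrame are invented; the real corpus would be loaded from a file):

```python
import pandas as pd

# Tiny stand-in for the labeled corpus; real data would be read from disk.
df = pd.DataFrame({
    "label": ["ham", "spam"],
    "message": ["See you at lunch?", "WINNER!! Claim your FREE prize now"],
})

# Encode the target as described: 0 = ham, 1 = spam.
df["label"] = df["label"].map({"ham": 0, "spam": 1})
print(df["label"].value_counts())
```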
Text preprocessing pipeline
Before vectorization, raw message text goes through a cleaning pipeline (typically lowercasing, stripping punctuation, and removing stop words).

TF-IDF vectorization
TF-IDF converts each message into a sparse numeric vector where each dimension represents a term in the vocabulary. Terms that appear frequently in a specific message but rarely across all messages get high weights; this is what makes spam-specific vocabulary (e.g., “free”, “winner”, “click”) stand out.
Models

Multiple classifiers are trained on the TF-IDF feature matrix and compared:

- Naive Bayes
- Logistic Regression
- Support Vector Machine
Multinomial Naive Bayes is well suited to text classification. It operates on word frequency counts and naturally handles the high-dimensional sparse feature space produced by TF-IDF.

Strengths: fast training, works well on small datasets, interpretable.
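A hedged sketch of the three-model comparison, assuming scikit-learn pipelines; the toy corpus and query message below are invented (0 = ham, 1 = spam):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data standing in for the labeled corpus (0 = ham, 1 = spam).
texts = [
    "free prize winner click now",
    "win cash guaranteed no risk",
    "lunch at noon?",
    "see you at the meeting tomorrow",
]
labels = [1, 1, 0, 0]

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(),
    "Linear SVM": LinearSVC(),
}

preds = {}
for name, clf in models.items():
    # Vectorizer + classifier as one pipeline, so cleaning/weighting
    # is applied identically at train and predict time.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(texts, labels)
    preds[name] = pipe.predict(["claim your free cash prize"])[0]
    print(name, "->", preds[name])
```

Bundling the vectorizer into the pipeline also makes the later serialization step a single `joblib` dump.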
Model evaluation
Text classification models for spam detection are evaluated with an emphasis on precision and recall for the spam class:

| Metric | Why it matters |
|---|---|
| Accuracy | Overall correctness; can be misleading on imbalanced data |
| Precision (spam) | Fraction of predicted spam that is actually spam — high precision means fewer false positives |
| Recall (spam) | Fraction of actual spam that was caught — high recall means fewer missed spam messages |
| F1 Score | Harmonic mean of precision and recall |
| ROC-AUC | Classifier performance across all probability thresholds |

Representative results for the three models:

| Model | Accuracy | Precision (spam) | Recall (spam) | F1 |
|---|---|---|---|---|
| Naive Bayes | ~97% | ~95% | ~93% | ~94% |
| Logistic Regression | ~98% | ~97% | ~94% | ~95% |
| Linear SVM | ~98% | ~98% | ~95% | ~96% |
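The metrics in the tables above can be computed with scikit-learn; the label vectors below are invented purely to illustrate the calculations (1 = spam, 0 = ham):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical ground truth and predictions for ten messages.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)        # overall correctness
prec = precision_score(y_true, y_pred)      # of predicted spam, how much is spam
rec = recall_score(y_true, y_pred)          # of actual spam, how much was caught
f1 = f1_score(y_true, y_pred)               # harmonic mean of precision/recall

print(acc, prec, rec, f1)
print(confusion_matrix(y_true, y_pred))     # rows: true class, cols: predicted
```

Here one ham message was flagged (a false positive, hurting precision) and one spam message slipped through (a false negative, hurting recall).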
Spam indicators
The top TF-IDF features with the highest weight for the spam class typically include:

- High-frequency spam terms: “free”, “win”, “winner”, “prize”, “cash”, “click”, “call now”
- Urgency signals: “limited time”, “act now”, “expires”
- Financial lures: “guaranteed”, “no risk”, “earn money”
Running the project
Run the full pipeline
Execute all cells in order. The notebook covers:
- Data loading and exploration
- Text cleaning and preprocessing
- TF-IDF vectorization
- Model training and comparison
- Confusion matrix and classification report
Project structure
This project is notebook-only and does not include a pre-built Flask API or serialized model files. Train the models in the notebook and serialize the pipeline with joblib to use it for inference outside the notebook environment.
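A minimal sketch of that serialization round-trip, assuming scikit-learn and joblib; the file name and the two training messages are placeholders:

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Train a vectorizer + classifier pipeline, then persist it as one artifact.
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipe.fit(["free prize now", "lunch tomorrow?"], [1, 0])

joblib.dump(pipe, "spam_pipeline.joblib")   # hypothetical file name

# Later, outside the notebook:
loaded = joblib.load("spam_pipeline.joblib")
print(loaded.predict(["claim your free prize"]))
```

Serializing the whole pipeline (rather than the model alone) keeps the fitted vocabulary and IDF weights with the classifier, so inference code only needs raw text.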