

This project classifies emails as spam or legitimate (ham) using natural language processing techniques. Raw email text is transformed into numeric feature vectors using TF-IDF (Term Frequency–Inverse Document Frequency) vectorization, and a classifier is trained to distinguish spam patterns from normal communication. The project covers the full NLP classification pipeline from text cleaning through model evaluation.

Overview

Spam detection is a classic binary text classification task. The challenge is not accuracy alone: a high false positive rate (legitimate emails classified as spam) has real consequences, so precision on the spam class is a key evaluation concern alongside overall accuracy.

  • Problem type: Binary text classification
  • Target variable: label (0 = ham / legitimate, 1 = spam)
  • Dataset: Labeled SMS/email message corpus
  • Primary technique: TF-IDF vectorization + classification

Dataset

The dataset consists of labeled text messages, where each record contains the message text and a binary label indicating whether it is spam or legitimate.
| Column  | Type            | Description                      |
| ------- | --------------- | -------------------------------- |
| label   | Binary (target) | 0 = ham (legitimate), 1 = spam   |
| message | Text            | Raw email or SMS message content |
Class distribution is imbalanced — legitimate messages typically make up 85–87% of the dataset. This imbalance influences model selection and threshold tuning.
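A quick way to confirm the imbalance before modeling. The column names follow the table above; the tiny frame here is illustrative, the notebook loads the real labeled corpus:

```python
import pandas as pd

# Toy frame for illustration only
df = pd.DataFrame({
    "label":   [0, 0, 0, 0, 0, 0, 1],
    "message": ["hi", "lunch?", "ok", "see you", "thanks", "agenda", "WIN cash now"],
})

# Normalized class frequencies; real spam corpora skew roughly 85-87% ham
dist = df["label"].value_counts(normalize=True)
print(dist)
```

Passing `stratify=df["label"]` to `train_test_split` preserves this ratio in both the train and test splits.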

Text preprocessing pipeline

Before vectorization, raw message text goes through a cleaning pipeline:
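The exact steps are defined in the notebook; a representative cleaning function for this kind of task (an assumption for illustration, not the notebook's exact code) might look like:

```python
import re

def clean_text(text: str) -> str:
    """Lowercase, drop URLs, keep letters only, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # strip URLs
    text = re.sub(r"[^a-z\s]", " ", text)          # strip digits and punctuation
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("FREE entry!! Visit http://spam.example NOW, win $1000"))
```

Stop-word removal is handled later by `TfidfVectorizer(stop_words='english')`; stemming or lemmatization (e.g. via NLTK) can be added here if the notebook uses it.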

TF-IDF vectorization

TF-IDF converts each message into a sparse numeric vector where each dimension represents a term in the vocabulary. Terms that appear frequently in a specific message but rarely across all messages get high weights — this is what makes spam-specific vocabulary (e.g., “free”, “winner”, “click”) stand out.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 2),   # unigrams and bigrams
    stop_words='english'
)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

Models

Multiple classifiers are trained on the TF-IDF feature matrix and compared.

Multinomial Naive Bayes

Multinomial Naive Bayes is well-suited to text classification. It operates on word frequency counts and naturally handles the high-dimensional sparse feature space produced by TF-IDF.
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)
Strengths: Fast training, works well on small datasets, interpretable.
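The comparison in the notebook also covers the linear models listed in the results table below. A minimal sketch of the comparison loop, using a tiny illustrative corpus in place of the real data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Illustrative corpus; the notebook uses the real labeled dataset
texts = ["win free cash now", "free prize click now", "claim your winner prize",
         "lunch at noon?", "see you at the meeting", "project update attached"]
labels = [1, 1, 1, 0, 0, 0]

X = TfidfVectorizer().fit_transform(texts)

scores = {}
for model in (MultinomialNB(), LogisticRegression(max_iter=1000), LinearSVC()):
    model.fit(X, labels)
    scores[type(model).__name__] = model.score(X, labels)
print(scores)
```

In practice each model is scored on a held-out test split, not on its own training data as in this sketch.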

Model evaluation

Text classification models for spam detection are evaluated with an emphasis on precision and recall for the spam class:
| Metric           | Why it matters                                                                               |
| ---------------- | -------------------------------------------------------------------------------------------- |
| Accuracy         | Overall correctness; can be misleading on imbalanced data                                     |
| Precision (spam) | Fraction of predicted spam that is actually spam; high precision means fewer false positives  |
| Recall (spam)    | Fraction of actual spam that was caught; high recall means fewer missed spam messages         |
| F1 Score         | Harmonic mean of precision and recall                                                         |
| ROC-AUC          | Classifier performance across all probability thresholds                                      |
Typical results on standard spam datasets:
| Model               | Accuracy | Precision (spam) | Recall (spam) | F1   |
| ------------------- | -------- | ---------------- | ------------- | ---- |
| Naive Bayes         | ~97%     | ~95%             | ~93%          | ~94% |
| Logistic Regression | ~98%     | ~97%             | ~94%          | ~95% |
| Linear SVM          | ~98%     | ~98%             | ~95%          | ~96% |
Do not optimize solely for accuracy. On a dataset where 87% of messages are ham, a model that always predicts “ham” achieves 87% accuracy while catching zero spam. Evaluate using precision, recall, and F1.
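The accuracy trap is easy to demonstrate with a degenerate all-"ham" predictor (synthetic labels, ~87% ham as in the text above):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_true = [0] * 13 + [1] * 2   # 13 ham, 2 spam (~87% ham)
y_pred = [0] * 15             # a model that always predicts ham

print(f"accuracy = {accuracy_score(y_true, y_pred):.2f}")   # looks respectable...
print(confusion_matrix(y_true, y_pred))                     # ...but the spam row shows 0 caught
print(classification_report(y_true, y_pred, zero_division=0))
```

The classification report makes the failure explicit: recall for the spam class is 0 despite ~87% accuracy.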

Spam indicators

The top TF-IDF features with highest weight for the spam class typically include:
  • High-frequency spam terms: “free”, “win”, “winner”, “prize”, “cash”, “click”, “call now”
  • Urgency signals: “limited time”, “act now”, “expires”
  • Financial lures: “guaranteed”, “no risk”, “earn money”
These can be inspected directly from the trained vectorizer and model coefficients.

Running the project

1. Install dependencies

cd ML_To_Train/06_Email_Spam_Classification
pip install -r requirements.txt
2. Open the notebook

jupyter notebook Email_Spam_Class.ipynb
3. Run the full pipeline

Execute all cells in order. The notebook covers:
  • Data loading and exploration
  • Text cleaning and preprocessing
  • TF-IDF vectorization
  • Model training and comparison
  • Confusion matrix and classification report
4. Classify a new message

After training in the notebook, wrap the fitted vectorizer and classifier in a single scikit-learn Pipeline and serialize it for reuse. In the notebook, add:
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Bundle vectorization and classification into one reusable object
pipe = make_pipeline(
    TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words='english'),
    MultinomialNB()
)
pipe.fit(X_train, y_train)
joblib.dump(pipe, "spam_pipeline.pkl")

# Later, load and classify:
pipe = joblib.load("spam_pipeline.pkl")
message = ["Congratulations! You've won a free iPhone. Click now to claim."]
prediction = pipe.predict(message)
print("Spam" if prediction[0] == 1 else "Ham")

Project structure

06_Email_Spam_Classification/

├── resources/
│   └── models.png            # Model comparison visualization

├── Email_Spam_Class.ipynb    # Full pipeline notebook
├── requirements.txt
└── readme.md
This project is notebook-only and does not include a pre-built Flask API or serialized model files. Train the models in the notebook and serialize the pipeline with joblib to use it for inference outside the notebook environment.
