This project classifies emails as spam or legitimate (ham) using natural language processing techniques. Raw email text is transformed into numeric feature vectors using TF-IDF (Term Frequency–Inverse Document Frequency) vectorization, and a classifier is trained to distinguish spam patterns from normal communication. The project covers the full NLP classification pipeline from text cleaning through model evaluation.
Overview
Spam detection is a classic binary text classification task. The challenge is not just accuracy: a high false positive rate (legitimate emails classified as spam) has real consequences, so precision on the spam class is a key evaluation concern alongside overall accuracy.

- Problem type: Binary text classification
- Target variable: label (0 = ham / legitimate, 1 = spam)
- Dataset: Labeled SMS/email message corpus
- Primary technique: TF-IDF vectorization + classification
Dataset
The dataset consists of labeled text messages, where each record contains the message text and a binary label indicating whether it is spam or legitimate.

| Column | Type | Description |
|---|---|---|
| label | Binary (target) | 0 = ham (legitimate), 1 = spam |
| message | Text | Raw email or SMS message content |
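As a sketch of the target encoding above, assuming pandas is available (the two sample messages and the inline DataFrame are invented; the real corpus would be loaded from a file):

```python
import pandas as pd

# Tiny stand-in for the labeled corpus; real data would be read from disk.
df = pd.DataFrame({
    "label": ["ham", "spam"],
    "message": ["See you at lunch?", "WINNER!! Claim your FREE prize now"],
})

# Encode the target as described: 0 = ham, 1 = spam.
df["label"] = df["label"].map({"ham": 0, "spam": 1})
print(df["label"].value_counts())
```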
Text preprocessing pipeline
Before vectorization, raw message text goes through a cleaning pipeline (typically lowercasing, stripping punctuation, and removing stop words).

TF-IDF vectorization
TF-IDF converts each message into a sparse numeric vector where each dimension represents a term in the vocabulary. Terms that appear frequently in a specific message but rarely across all messages get high weights; this is what makes spam-specific vocabulary (e.g., “free”, “winner”, “click”) stand out.
Models

Multiple classifiers are trained on the TF-IDF feature matrix and compared:

- Naive Bayes
- Logistic Regression
- Support Vector Machine
Multinomial Naive Bayes is well suited to text classification. It operates on word frequency counts and naturally handles the high-dimensional sparse feature space produced by TF-IDF.

Strengths: fast training, works well on small datasets, interpretable.
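A hedged sketch of the three-model comparison, assuming scikit-learn pipelines; the toy corpus and query message below are invented (0 = ham, 1 = spam):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data standing in for the labeled corpus (0 = ham, 1 = spam).
texts = [
    "free prize winner click now",
    "win cash guaranteed no risk",
    "lunch at noon?",
    "see you at the meeting tomorrow",
]
labels = [1, 1, 0, 0]

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(),
    "Linear SVM": LinearSVC(),
}

preds = {}
for name, clf in models.items():
    # Vectorizer + classifier as one pipeline, so cleaning/weighting
    # is applied identically at train and predict time.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(texts, labels)
    preds[name] = pipe.predict(["claim your free cash prize"])[0]
    print(name, "->", preds[name])
```

Bundling the vectorizer into the pipeline also makes the later serialization step a single `joblib` dump.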
Model evaluation
Text classification models for spam detection are evaluated with an emphasis on precision and recall for the spam class:

| Metric | Why it matters |
|---|---|
| Accuracy | Overall correctness; can be misleading on imbalanced data |
| Precision (spam) | Fraction of predicted spam that is actually spam — high precision means fewer false positives |
| Recall (spam) | Fraction of actual spam that was caught — high recall means fewer missed spam messages |
| F1 Score | Harmonic mean of precision and recall |
| ROC-AUC | Classifier performance across all probability thresholds |

Representative results for the three models:

| Model | Accuracy | Precision (spam) | Recall (spam) | F1 |
|---|---|---|---|---|
| Naive Bayes | ~97% | ~95% | ~93% | ~94% |
| Logistic Regression | ~98% | ~97% | ~94% | ~95% |
| Linear SVM | ~98% | ~98% | ~95% | ~96% |
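The metrics in the tables above can be computed with scikit-learn; the label vectors below are invented purely to illustrate the calculations (1 = spam, 0 = ham):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical ground truth and predictions for ten messages.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)        # overall correctness
prec = precision_score(y_true, y_pred)      # of predicted spam, how much is spam
rec = recall_score(y_true, y_pred)          # of actual spam, how much was caught
f1 = f1_score(y_true, y_pred)               # harmonic mean of precision/recall

print(acc, prec, rec, f1)
print(confusion_matrix(y_true, y_pred))     # rows: true class, cols: predicted
```

Here one ham message was flagged (a false positive, hurting precision) and one spam message slipped through (a false negative, hurting recall).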
Spam indicators
The top TF-IDF features with the highest weight for the spam class typically include:

- High-frequency spam terms: “free”, “win”, “winner”, “prize”, “cash”, “click”, “call now”
- Urgency signals: “limited time”, “act now”, “expires”
- Financial lures: “guaranteed”, “no risk”, “earn money”
Running the project
Run the full pipeline
Execute all cells in order. The notebook covers:
- Data loading and exploration
- Text cleaning and preprocessing
- TF-IDF vectorization
- Model training and comparison
- Confusion matrix and classification report
Project structure
This project is notebook-only and does not include a pre-built Flask API or serialized model files. Train the models in the notebook and serialize the pipeline with joblib to use it for inference outside the notebook environment.
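A minimal sketch of that serialization round-trip, assuming scikit-learn and joblib; the file name and the two training messages are placeholders:

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Train a vectorizer + classifier pipeline, then persist it as one artifact.
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipe.fit(["free prize now", "lunch tomorrow?"], [1, 0])

joblib.dump(pipe, "spam_pipeline.joblib")   # hypothetical file name

# Later, outside the notebook:
loaded = joblib.load("spam_pipeline.joblib")
print(loaded.predict(["claim your free prize"]))
```

Serializing the whole pipeline (rather than the model alone) keeps the fitted vocabulary and IDF weights with the classifier, so inference code only needs raw text.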