

These natural language processing (NLP) text-analysis projects cover the full spectrum from lightweight keyword extraction to multi-class deep-learning classifiers. The four projects in this section rest on a shared foundation: raw text is cleaned and vectorized, then fed into models ranging from logistic regression and TF-IDF pipelines to transformer-based classifiers. Working through them in order gives you a practical understanding of how preprocessing choices, feature representations, and model architectures interact with real-world text data.

Projects at a glance

| Project | Technique | Output | Dataset |
| --- | --- | --- | --- |
| Text Emotion Detection (15) | Logistic Regression / SVM + TF-IDF | Emotion label (joy, anger, fear, …) | Emotion-labeled tweet corpora |
| Sentiment Analysis (41) | VADER / BERT fine-tune | Positive / Negative / Neutral score | IMDb, SST-2, or custom reviews |
| Toxic Comment Filter (21) | Multi-label classifier + TF-IDF / BiLSTM | Toxicity flags (toxic, severe, obscene, …) | Jigsaw Toxic Comment dataset |
| Resume Keyword Extractor (40) | TF-IDF, YAKE, spaCy NER | Ranked keyword list | Custom resume corpus |

Text preprocessing pipeline

All four projects share a common preprocessing backbone. The snippet below covers cleaning, tokenization, and TF-IDF vectorization, the steps you will reuse across every project.
import re
import string
from sklearn.feature_extraction.text import TfidfVectorizer

# --- Step 1: clean raw text ---
def clean_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)          # strip URLs
    text = re.sub(r"[^a-z0-9\s]", "", text)              # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()
    return text

# --- Step 2: tokenize ---
def tokenize(text: str) -> list[str]:
    return text.split()                                   # swap for NLTK/spaCy as needed

# --- Step 3: vectorize with TF-IDF ---
# raw_documents is assumed to be a list[str] of input texts loaded elsewhere
corpus = [clean_text(doc) for doc in raw_documents]

vectorizer = TfidfVectorizer(
    max_features=10_000,
    ngram_range=(1, 2),   # unigrams + bigrams
    stop_words="english",
    sublinear_tf=True,    # apply log normalization
)
X = vectorizer.fit_transform(corpus)   # sparse matrix (n_samples, n_features)
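Once fitted, the same vectorizer is reused as-is on unseen text at prediction time. A minimal illustration (the sample sentence and URL below are placeholders):

# Reuse the fitted vocabulary to transform unseen text at prediction time
new_doc = clean_text("Loved the fast delivery, see https://example.com for details!")
X_new = vectorizer.transform([new_doc])   # shape (1, n_features), same columns as X
print(X_new.shape, X_new.nnz)             # nnz = count of non-zero TF-IDF entries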
Text Emotion Detection

Goal: Classify a piece of text into one of several discrete emotion categories: joy, sadness, anger, fear, surprise, or disgust.

How it works: Text is cleaned and vectorized with TF-IDF. A multi-class classifier (logistic regression or SVM with a one-vs-rest strategy) is trained on an emotion-labeled tweet corpus. Because emotions are expressed through specific lexical patterns (“I can’t believe”, “this is amazing”), bag-of-words representations capture them well without requiring sequence modeling.

Key steps:
  1. Download and load an emotion-labeled dataset (e.g., the dair-ai/emotion dataset on Hugging Face); a loading sketch follows this list.
  2. Apply the preprocessing pipeline above.
  3. Train a LogisticRegression or LinearSVC with class_weight="balanced" to handle skewed emotion distributions.
  4. Evaluate with macro-averaged F1 score since classes are imbalanced.
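A minimal sketch of step 1, assuming the Hugging Face datasets library and the dair-ai/emotion dataset's standard splits and column names:

from datasets import load_dataset

ds = load_dataset("dair-ai/emotion")                # train / validation / test splits
label_names = ds["train"].features["label"].names   # map integer labels to emotion names

X_train = [clean_text(t) for t in ds["train"]["text"]]
y_train = [label_names[i] for i in ds["train"]["label"]]
X_test  = [clean_text(t) for t in ds["test"]["text"]]
y_test  = [label_names[i] for i in ds["test"]["label"]]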
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=15000, ngram_range=(1, 2))),
    ("clf",   LogisticRegression(max_iter=1000, class_weight="balanced")),
])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(classification_report(y_test, y_pred))
Expected output: A classification report with per-emotion precision, recall, and F1. Macro F1 in the range of 0.70–0.85 depending on the dataset size and label balance.
Sentiment Analysis

Goal: Assign a sentiment polarity (positive, negative, or neutral) to user-generated text such as reviews or social media posts.

How it works: Two complementary approaches are explored. The lexicon-based VADER scorer works without any training data and is fast enough for production pipelines. The fine-tuned BERT approach yields higher accuracy on domain-specific text but requires a GPU for comfortable training.

Approach 1 (VADER, rule-based, no training required):
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def get_sentiment(text: str) -> str:
    scores = analyzer.polarity_scores(text)
    compound = scores["compound"]
    if compound >= 0.05:
        return "positive"
    elif compound <= -0.05:
        return "negative"
    return "neutral"

print(get_sentiment("I absolutely loved the product!"))  # positive
Approach 2 (fine-tuned BERT, here a DistilBERT checkpoint already fine-tuned on SST-2):
from transformers import pipeline

sentiment_pipe = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

result = sentiment_pipe("The delivery was shockingly fast.")
print(result)  # [{'label': 'POSITIVE', 'score': 0.9998}]
When to use which: VADER for speed and zero-shot coverage; BERT fine-tuning when you have domain-specific labeled data and need higher F1.
Toxic Comment Filter

Goal: Detect and flag harmful, abusive, or toxic content in user comments. This is a multi-label problem: a comment can be simultaneously toxic and obscene, for example.

How it works: The Jigsaw Toxic Comment dataset provides six binary labels per comment: toxic, severe_toxic, obscene, threat, insult, identity_hate. A MultiOutputClassifier wrapping a LogisticRegression handles all six labels jointly. For higher accuracy, a BiLSTM or a fine-tuned DistilBERT can be substituted; a short DistilBERT sketch follows the evaluation note below.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import roc_auc_score

LABEL_COLS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

df = pd.read_csv("train.csv")
X_text = df["comment_text"].apply(clean_text)   # clean_text from the preprocessing pipeline above
y = df[LABEL_COLS].values

vec = TfidfVectorizer(max_features=20000, ngram_range=(1, 2), sublinear_tf=True)
X = vec.fit_transform(X_text)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = MultiOutputClassifier(LogisticRegression(C=1.0, solver="lbfgs", max_iter=500))
clf.fit(X_train, y_train)

# predict_proba returns one (n_samples, 2) array per label; keep the positive-class column
y_pred_proba = clf.predict_proba(X_test)
proba_matrix = np.column_stack([p[:, 1] for p in y_pred_proba])   # (n_samples, n_labels)
print("Mean ROC-AUC:", roc_auc_score(y_test, proba_matrix, average="macro"))
Evaluation metric: ROC-AUC per label (the Kaggle competition metric). Macro-averaged AUC above 0.95 is achievable with TF-IDF + LR on this dataset.
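The DistilBERT substitution mentioned above can be sketched with the transformers library. This shows only the multi-label head setup and a post-fine-tuning inference call; the base checkpoint name and the example comment string are illustrative, and the model must still be fine-tuned on the Jigsaw labels before its probabilities are meaningful:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# problem_type="multi_label_classification" switches the training loss to
# BCEWithLogitsLoss, i.e. six independent sigmoid outputs instead of one softmax.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(LABEL_COLS),
    problem_type="multi_label_classification",
)

# After fine-tuning, per-label probabilities come from a sigmoid over the logits
inputs = tokenizer("example comment to score", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.sigmoid(logits).squeeze().tolist()
print(dict(zip(LABEL_COLS, probs)))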
Resume Keyword Extractor

Goal: Automatically pull the most relevant technical and domain-specific keywords from a resume PDF or plain text, ranked by relevance.

How it works: Three complementary extraction strategies are combined:
  • TF-IDF ranks terms that appear frequently in the target resume but rarely in a background corpus of general text.
  • YAKE (Yet Another Keyword Extractor) is an unsupervised statistical method that works on a single document with no corpus needed.
  • spaCy NER identifies named entities — companies, technologies, programming languages — that keyword scorers can miss.
import yake
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_sm")

def extract_keywords_yake(text: str, n: int = 20) -> list[tuple[str, float]]:
    extractor = yake.KeywordExtractor(
        lan="en", n=3, dedupLim=0.8, top=n, features=None
    )
    keywords = extractor.extract_keywords(text)
    # YAKE scores are inverse — lower is better
    return sorted(keywords, key=lambda x: x[1])

def extract_entities(text: str) -> list[str]:
    doc = nlp(text)
    return list({ent.text for ent in doc.ents if ent.label_ in ("ORG", "PRODUCT", "GPE")})

with open("resume.txt", encoding="utf-8") as fh:
    resume_text = fh.read()
keywords = extract_keywords_yake(resume_text)
entities = extract_entities(resume_text)

print("Top keywords:", [kw for kw, _ in keywords[:10]])
print("Named entities:", entities[:10])
Practical tip: Combine YAKE keywords with spaCy entities, deduplicate, and rank by TF-IDF weight for the best coverage across both general and domain-specific terms.
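One way to implement that tip, as a rough sketch: the combine_keywords helper and the background_corpus argument below are illustrative, not part of the project code.

def combine_keywords(resume_text: str, background_corpus: list[str]) -> list[str]:
    # Candidate pool: YAKE phrases plus spaCy entities, lower-cased and deduplicated
    candidates = {kw.lower() for kw, _ in extract_keywords_yake(resume_text)}
    candidates |= {ent.lower() for ent in extract_entities(resume_text)}

    # Rank candidates by their TF-IDF weight in the resume against a background corpus
    vec = TfidfVectorizer(vocabulary=sorted(candidates), ngram_range=(1, 3))
    weights = vec.fit_transform(background_corpus + [resume_text]).toarray()[-1]
    ranked = sorted(zip(vec.get_feature_names_out(), weights), key=lambda kv: -kv[1])
    return [term for term, weight in ranked if weight > 0]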
