

These natural language processing (NLP) text-analysis projects cover the full spectrum from lightweight keyword extraction to multi-class deep-learning classifiers. The four projects in this section rest on a shared foundation: raw text is cleaned and vectorized, then fed into models ranging from logistic regression and TF-IDF pipelines to transformer-based classifiers. Working through them in order gives you a practical understanding of how preprocessing choices, feature representations, and model architectures interact with real-world text data.

Projects at a glance

| Project | Technique | Output | Dataset |
| --- | --- | --- | --- |
| Text Emotion Detection (15) | Logistic Regression / SVM + TF-IDF | Emotion label (joy, anger, fear, …) | Emotion-labeled tweet corpora |
| Sentiment Analysis (41) | VADER / BERT fine-tune | Positive / Negative / Neutral score | IMDb, SST-2, or custom reviews |
| Toxic Comment Filter (21) | Multi-label classifier + TF-IDF / BiLSTM | Toxicity flags (toxic, severe, obscene, …) | Jigsaw Toxic Comment dataset |
| Resume Keyword Extractor (40) | TF-IDF, YAKE, spaCy NER | Ranked keyword list | Custom resume corpus |

Text preprocessing pipeline

All four projects share a common preprocessing backbone. The snippet below covers cleaning, tokenization, and TF-IDF vectorization, the steps you will reuse across every project.
import re
import string
from sklearn.feature_extraction.text import TfidfVectorizer

# --- Step 1: clean raw text ---
def clean_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)          # strip URLs
    text = re.sub(r"[^a-z0-9\s]", "", text)              # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()
    return text

# --- Step 2: tokenize ---
def tokenize(text: str) -> list[str]:
    return text.split()                                   # swap for NLTK/spaCy as needed

# --- Step 3: vectorize with TF-IDF ---
# raw_documents is assumed to be a list[str] of input texts loaded elsewhere
corpus = [clean_text(doc) for doc in raw_documents]

vectorizer = TfidfVectorizer(
    max_features=10_000,
    ngram_range=(1, 2),   # unigrams + bigrams
    stop_words="english",
    sublinear_tf=True,    # apply log normalization
)
X = vectorizer.fit_transform(corpus)   # sparse matrix (n_samples, n_features)
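Once fitted, the same vectorizer is reused as-is on unseen text at prediction time. A minimal illustration (the sample sentence and URL below are placeholders):

# Reuse the fitted vocabulary to transform unseen text at prediction time
new_doc = clean_text("Loved the fast delivery, see https://example.com for details!")
X_new = vectorizer.transform([new_doc])   # shape (1, n_features), same columns as X
print(X_new.shape, X_new.nnz)             # nnz = count of non-zero TF-IDF entries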
Text Emotion Detection

Goal: Classify a piece of text into one of several discrete emotion categories: joy, sadness, anger, fear, surprise, or disgust.

How it works: Text is cleaned and vectorized with TF-IDF. A multi-class classifier (logistic regression or SVM with a one-vs-rest strategy) is trained on an emotion-labeled tweet corpus. Because emotions are expressed through specific lexical patterns (“I can’t believe”, “this is amazing”), bag-of-words representations capture them well without requiring sequence modeling.

Key steps:
  1. Download and load an emotion-labeled dataset (e.g., the dair-ai/emotion dataset on Hugging Face); a loading sketch follows this list.
  2. Apply the preprocessing pipeline above.
  3. Train a LogisticRegression or LinearSVC with class_weight="balanced" to handle skewed emotion distributions.
  4. Evaluate with macro-averaged F1 score since classes are imbalanced.
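A minimal sketch of step 1, assuming the Hugging Face datasets library and the dair-ai/emotion dataset's standard splits and column names:

from datasets import load_dataset

ds = load_dataset("dair-ai/emotion")                # train / validation / test splits
label_names = ds["train"].features["label"].names   # map integer labels to emotion names

X_train = [clean_text(t) for t in ds["train"]["text"]]
y_train = [label_names[i] for i in ds["train"]["label"]]
X_test  = [clean_text(t) for t in ds["test"]["text"]]
y_test  = [label_names[i] for i in ds["test"]["label"]]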
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=15000, ngram_range=(1, 2))),
    ("clf",   LogisticRegression(max_iter=1000, class_weight="balanced")),
])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(classification_report(y_test, y_pred))
Expected output: A classification report with per-emotion precision, recall, and F1. Macro F1 in the range of 0.70–0.85 depending on the dataset size and label balance.
Sentiment Analysis

Goal: Assign a sentiment polarity (positive, negative, or neutral) to user-generated text such as reviews or social media posts.

How it works: Two complementary approaches are explored. The lexicon-based VADER scorer works without any training data and is fast enough for production pipelines. The fine-tuned BERT approach yields higher accuracy on domain-specific text but requires a GPU for comfortable training.

Approach 1 (VADER, rule-based, no training required):
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def get_sentiment(text: str) -> str:
    scores = analyzer.polarity_scores(text)
    compound = scores["compound"]
    if compound >= 0.05:
        return "positive"
    elif compound <= -0.05:
        return "negative"
    return "neutral"

print(get_sentiment("I absolutely loved the product!"))  # positive
Approach 2 (fine-tuned BERT, here a DistilBERT checkpoint already fine-tuned on SST-2):
from transformers import pipeline

sentiment_pipe = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

result = sentiment_pipe("The delivery was shockingly fast.")
print(result)  # [{'label': 'POSITIVE', 'score': 0.9998}]
When to use which: VADER for speed and zero-shot coverage; BERT fine-tuning when you have domain-specific labeled data and need higher F1.
Toxic Comment Filter

Goal: Detect and flag harmful, abusive, or toxic content in user comments. This is a multi-label problem: a comment can be simultaneously toxic and obscene, for example.

How it works: The Jigsaw Toxic Comment dataset provides six binary labels per comment: toxic, severe_toxic, obscene, threat, insult, identity_hate. A MultiOutputClassifier wrapping a LogisticRegression handles all six labels jointly. For higher accuracy, a BiLSTM or a fine-tuned DistilBERT can be substituted; a short DistilBERT sketch follows the evaluation note below.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import roc_auc_score

LABEL_COLS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

df = pd.read_csv("train.csv")
X_text = df["comment_text"].apply(clean_text)   # clean_text from the preprocessing pipeline above
y = df[LABEL_COLS].values

vec = TfidfVectorizer(max_features=20000, ngram_range=(1, 2), sublinear_tf=True)
X = vec.fit_transform(X_text)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = MultiOutputClassifier(LogisticRegression(C=1.0, solver="lbfgs", max_iter=500))
clf.fit(X_train, y_train)

# predict_proba returns one (n_samples, 2) array per label; keep the positive-class column
y_pred_proba = clf.predict_proba(X_test)
proba_matrix = np.column_stack([p[:, 1] for p in y_pred_proba])   # (n_samples, n_labels)
print("Mean ROC-AUC:", roc_auc_score(y_test, proba_matrix, average="macro"))
Evaluation metric: ROC-AUC per label (the Kaggle competition metric). Macro-averaged AUC above 0.95 is achievable with TF-IDF + LR on this dataset.
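The DistilBERT substitution mentioned above can be sketched with the transformers library. This shows only the multi-label head setup and a post-fine-tuning inference call; the base checkpoint name and the example comment string are illustrative, and the model must still be fine-tuned on the Jigsaw labels before its probabilities are meaningful:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# problem_type="multi_label_classification" switches the training loss to
# BCEWithLogitsLoss, i.e. six independent sigmoid outputs instead of one softmax.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(LABEL_COLS),
    problem_type="multi_label_classification",
)

# After fine-tuning, per-label probabilities come from a sigmoid over the logits
inputs = tokenizer("example comment to score", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.sigmoid(logits).squeeze().tolist()
print(dict(zip(LABEL_COLS, probs)))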
Resume Keyword Extractor

Goal: Automatically pull the most relevant technical and domain-specific keywords from a resume PDF or plain text, ranked by relevance.

How it works: Three complementary extraction strategies are combined:
  • TF-IDF ranks terms that appear frequently in the target resume but rarely in a background corpus of general text.
  • YAKE (Yet Another Keyword Extractor) is an unsupervised statistical method that works on a single document with no corpus needed.
  • spaCy NER identifies named entities — companies, technologies, programming languages — that keyword scorers can miss.
import yake
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_sm")

def extract_keywords_yake(text: str, n: int = 20) -> list[tuple[str, float]]:
    extractor = yake.KeywordExtractor(
        lan="en", n=3, dedupLim=0.8, top=n, features=None
    )
    keywords = extractor.extract_keywords(text)
    # YAKE scores are inverse — lower is better
    return sorted(keywords, key=lambda x: x[1])

def extract_entities(text: str) -> list[str]:
    doc = nlp(text)
    return list({ent.text for ent in doc.ents if ent.label_ in ("ORG", "PRODUCT", "GPE")})

with open("resume.txt", encoding="utf-8") as fh:
    resume_text = fh.read()
keywords = extract_keywords_yake(resume_text)
entities = extract_entities(resume_text)

print("Top keywords:", [kw for kw, _ in keywords[:10]])
print("Named entities:", entities[:10])
Practical tip: Combine YAKE keywords with spaCy entities, deduplicate, and rank by TF-IDF weight for the best coverage across both general and domain-specific terms.
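One way to implement that tip, as a rough sketch: the combine_keywords helper and the background_corpus argument below are illustrative, not part of the project code.

def combine_keywords(resume_text: str, background_corpus: list[str]) -> list[str]:
    # Candidate pool: YAKE phrases plus spaCy entities, lower-cased and deduplicated
    candidates = {kw.lower() for kw, _ in extract_keywords_yake(resume_text)}
    candidates |= {ent.lower() for ent in extract_entities(resume_text)}

    # Rank candidates by their TF-IDF weight in the resume against a background corpus
    vec = TfidfVectorizer(vocabulary=sorted(candidates), ngram_range=(1, 3))
    weights = vec.fit_transform(background_corpus + [resume_text]).toarray()[-1]
    ranked = sorted(zip(vec.get_feature_names_out(), weights), key=lambda kv: -kv[1])
    return [term for term, weight in ranked if weight > 0]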
