
Overview

Text classification is one of the most fundamental tasks in Natural Language Processing. This chapter explores how to classify text using both representation models (encoder-based) and generative models (decoder and encoder-decoder models). You’ll learn multiple approaches ranging from zero-shot classification to fine-tuned models, and understand when to use each technique.

What You’ll Learn

  1. Representation Models for Classification: Learn how to use pre-trained models like RoBERTa and sentence embeddings for classification tasks
  2. Embedding-based Approaches: Discover how to leverage embeddings with traditional ML classifiers and zero-shot techniques
  3. Generative Models for Classification: Use encoder-decoder models like FLAN-T5 and ChatGPT for text classification
  4. Performance Comparison: Compare different approaches and understand their trade-offs

Use Cases

Text classification powers numerous real-world applications:
  • Sentiment Analysis: Classify movie reviews, product feedback, or social media posts as positive or negative
  • Content Moderation: Automatically detect toxic, spam, or inappropriate content
  • Customer Support: Route support tickets to the appropriate department
  • Document Organization: Categorize emails, news articles, or research papers
  • Intent Detection: Classify user queries in chatbots and virtual assistants

Dataset: Rotten Tomatoes Movie Reviews

Throughout this chapter, we use the Rotten Tomatoes dataset, which contains movie reviews labeled as positive (1) or negative (0).
from datasets import load_dataset

# Load our data
data = load_dataset("rotten_tomatoes")
data
Output:
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})
Example reviews:
data["train"][0, -1]
{
  'text': [
    "the rock is destined to be the 21st century's new 'conan' and that he's going to make a splash even greater than arnold schwarzenegger, jean-claud van damme or steven segal.",
    "things really get weird, though not particularly scary: the movie is all portent and no content."
  ],
  'label': [1, 0]
}

Text Classification with Representation Models

Representation models (encoder-based models like BERT, RoBERTa) excel at understanding and encoding the meaning of text into numerical vectors.

Approach 1: Using a Task-Specific Model

The simplest approach is to use a model that’s already fine-tuned for sentiment analysis.
from transformers import pipeline

# Path to our HF model
model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# Load model into pipeline
pipe = pipeline(
    model=model_path,
    tokenizer=model_path,
    return_all_scores=True,
    device="cuda:0"
)
This model is based on RoBERTa and has been fine-tuned specifically for sentiment analysis on Twitter data. It can classify text into negative, neutral, and positive categories.
Run inference on the test set:
import numpy as np
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

# Run inference
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "text")), total=len(data["test"])):
    negative_score = output[0]["score"]
    positive_score = output[2]["score"]
    assignment = np.argmax([negative_score, positive_score])
    y_pred.append(assignment)
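The model scores three classes (negative, neutral, positive), but the loop above keeps only the outer two, so the neutral score is effectively dropped. A minimal sketch of that mapping, using a hypothetical score list in the same shape the pipeline returns:

```python
import numpy as np

def to_binary_label(scores):
    """Map a 3-class sentiment output (negative, neutral, positive)
    to a binary label by comparing only the outer classes."""
    negative_score = scores[0]["score"]  # index 1 (neutral) is skipped
    positive_score = scores[2]["score"]
    return int(np.argmax([negative_score, positive_score]))

# Hypothetical pipeline output for a single document
example = [
    {"label": "negative", "score": 0.10},
    {"label": "neutral", "score": 0.25},
    {"label": "positive", "score": 0.65},
]
print(to_binary_label(example))  # 1 (positive)
```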
Evaluate performance:
from sklearn.metrics import classification_report

def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    performance = classification_report(
        y_true, y_pred,
        target_names=["Negative Review", "Positive Review"]
    )
    print(performance)

evaluate_performance(data["test"]["label"], y_pred)
Results:
                 precision    recall  f1-score   support

Negative Review       0.76      0.88      0.81       533
Positive Review       0.86      0.72      0.78       533

       accuracy                           0.80      1066
      macro avg       0.81      0.80      0.80      1066
   weighted avg       0.81      0.80      0.80      1066
The model achieves 80% accuracy on movie reviews, despite being trained on Twitter data!

Approach 2: Supervised Classification with Embeddings

Instead of using a task-specific classifier, we can:
  1. Convert text to embeddings using a sentence transformer
  2. Train a traditional ML classifier on these embeddings
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Convert text to embeddings
train_embeddings = model.encode(data["train"]["text"], show_progress_bar=True)
test_embeddings = model.encode(data["test"]["text"], show_progress_bar=True)
The embeddings have shape (8530, 768) - each text is represented as a 768-dimensional vector. Train a logistic regression classifier:
from sklearn.linear_model import LogisticRegression

# Train a Logistic Regression on our train embeddings
clf = LogisticRegression(random_state=42)
clf.fit(train_embeddings, data["train"]["label"])

# Predict previously unseen instances
y_pred = clf.predict(test_embeddings)
evaluate_performance(data["test"]["label"], y_pred)
Results:
                 precision    recall  f1-score   support

Negative Review       0.85      0.86      0.85       533
Positive Review       0.86      0.85      0.85       533

       accuracy                           0.85      1066
      macro avg       0.85      0.85      0.85      1066
   weighted avg       0.85      0.85      0.85      1066
This achieves 85% accuracy - better than the task-specific model!
Alternative Approach: Instead of using a classifier, you can average the embeddings per class and use cosine similarity:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Stack embeddings with labels; after hstack, column 768 holds the label
df = pd.DataFrame(np.hstack([train_embeddings, np.array(data["train"]["label"]).reshape(-1, 1)]))

# Average the embeddings of all documents in each target label (group by the label column, 768)
averaged_target_embeddings = df.groupby(768).mean().values

# Find the best matching embeddings between evaluation documents and target embeddings
sim_matrix = cosine_similarity(test_embeddings, averaged_target_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

# Evaluate the model
evaluate_performance(data["test"]["label"], y_pred)
This achieves 84% accuracy without training any classifier!
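The same per-class averaging can be written with plain NumPy instead of the pandas groupby. A small sketch on toy vectors (the hand-crafted arrays here stand in for real sentence embeddings):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for sentence embeddings and their binary labels
train_emb = np.array([
    [1.0, 0.1], [0.9, 0.0], [1.0, -0.1],   # class 0 cluster
    [0.0, 1.0], [0.1, 0.9], [-0.1, 1.0],   # class 1 cluster
])
train_labels = np.array([0, 0, 0, 1, 1, 1])
test_emb = np.array([[0.8, 0.05], [0.05, 0.8]])

# Average the embeddings of all documents sharing each label
centroids = np.stack([train_emb[train_labels == c].mean(axis=0) for c in (0, 1)])

# Assign each test document to its most similar class centroid
y_pred = cosine_similarity(test_emb, centroids).argmax(axis=1)
print(y_pred)  # [0 1]
```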

Approach 3: Zero-Shot Classification

Zero-shot classification doesn’t require any training data - you just provide label descriptions!
# Create embeddings for our labels
label_embeddings = model.encode(["A negative review", "A positive review"])

# Find the best matching label for each document
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

evaluate_performance(data["test"]["label"], y_pred)
Results:
                 precision    recall  f1-score   support

Negative Review       0.78      0.77      0.78       533
Positive Review       0.77      0.79      0.78       533

       accuracy                           0.78      1066
      macro avg       0.78      0.78      0.78      1066
   weighted avg       0.78      0.78      0.78      1066
Achieves 78% accuracy with zero training! The label descriptions you choose matter significantly.
Experiment with different descriptions: Try using "A very negative movie review" and "A very positive movie review" to see how results change!
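Whichever descriptions you choose, the prediction step itself stays the same, so it is easy to wrap in a helper and swap label texts in and out. A hypothetical sketch, with toy arrays standing in for real sentence-transformer embeddings:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def zero_shot_predict(doc_embeddings, label_embeddings):
    """Assign each document to the label whose embedding it is most similar to."""
    sim_matrix = cosine_similarity(doc_embeddings, label_embeddings)
    return sim_matrix.argmax(axis=1)

# Toy embeddings: two documents, two label descriptions
docs = np.array([[0.9, 0.1], [0.1, 0.9]])
labels = np.array([[1.0, 0.0], [0.0, 1.0]])
print(zero_shot_predict(docs, labels))  # [0 1]
```

With real data, you would pass `model.encode(...)` outputs for both the documents and each candidate set of label descriptions.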

Classification with Generative Models

Generative models can also perform classification by generating the class label as text.

Encoder-Decoder Models (FLAN-T5)

FLAN-T5 is a text-to-text model that can follow instructions and generate responses.
from transformers import pipeline

# Load our model
pipe = pipeline(
    "text2text-generation",
    model="google/flan-t5-small",
    device="cuda:0"
)
Prepare the data with a prompt:
# Prepare our data
prompt = "Is the following sentence positive or negative? "
data = data.map(lambda example: {"t5": prompt + example['text']})
Run inference:
# Run inference
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "t5")), total=len(data["test"])):
    text = output[0]["generated_text"]
    y_pred.append(0 if text == "negative" else 1)

evaluate_performance(data["test"]["label"], y_pred)
Results:
                 precision    recall  f1-score   support

Negative Review       0.83      0.85      0.84       533
Positive Review       0.85      0.83      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066
FLAN-T5-small achieves 84% accuracy with simple prompting!
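Note that the `text == "negative"` check in the inference loop silently treats any other generation as positive. A slightly more defensive parser (a hypothetical helper, not part of the transformers API; the fallback of 1 mirrors the original loop's behavior):

```python
def parse_sentiment(generated_text, fallback=1):
    """Map generated text such as 'negative' or 'Positive.' to a 0/1 label."""
    text = generated_text.strip().lower().rstrip(".")
    if text == "negative":
        return 0
    if text == "positive":
        return 1
    return fallback  # unexpected generation: fall back to the default label

print(parse_sentiment("Negative."))  # 0
print(parse_sentiment("positive"))   # 1
```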

ChatGPT for Classification

Large language models like ChatGPT can perform classification through conversational prompting.
import openai

# Create client
client = openai.OpenAI(api_key="YOUR_KEY_HERE")

def chatgpt_generation(prompt, document, model="gpt-3.5-turbo-0125"):
    """Generate an output based on a prompt and an input document."""
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": prompt.replace("[DOCUMENT]", document)
        }
    ]
    chat_completion = client.chat.completions.create(
        messages=messages,
        model=model,
        temperature=0
    )
    return chat_completion.choices[0].message.content
Create a structured prompt:
# Define a prompt template as a base
prompt = """Predict whether the following document is a positive or negative movie review:

[DOCUMENT]

If it is positive return 1 and if it is negative return 0. Do not give any other answers.
"""

# Predict the target using GPT
document = "unpretentious, charming, quirky, original"
chatgpt_generation(prompt, document)  # Returns: '1'
Run on the entire test set (requires API credits):
predictions = [chatgpt_generation(prompt, doc) for doc in tqdm(data["test"]["text"])]

# Extract predictions
y_pred = [int(pred) for pred in predictions]

# Evaluate performance
evaluate_performance(data["test"]["label"], y_pred)
Results:
                 precision    recall  f1-score   support

Negative Review       0.87      0.97      0.92       533
Positive Review       0.96      0.86      0.91       533

       accuracy                           0.91      1066
      macro avg       0.92      0.91      0.91      1066
   weighted avg       0.92      0.91      0.91      1066
ChatGPT achieves 91% accuracy - the best performance of all methods!
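One caveat: the `int(pred)` cast above raises a `ValueError` if the model ever returns anything besides a bare digit, despite the prompt's instructions. A defensive variant (a hypothetical helper, not part of the OpenAI client):

```python
def parse_prediction(raw, default=0):
    """Extract a 0/1 label from a model response, tolerating extra text."""
    for ch in raw.strip():
        if ch in "01":
            return int(ch)
    return default  # no digit found: fall back to the default label

print(parse_prediction("1"))         # 1
print(parse_prediction("Label: 0"))  # 0
```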

Performance Comparison

  • Task-specific model (Twitter-RoBERTa): 80% accuracy
  • Embeddings + logistic regression: 85% accuracy
  • Embeddings + cosine similarity to class centroids: 84% accuracy
  • Zero-shot with label embeddings: 78% accuracy (no training required and very flexible, but accuracy depends on the label descriptions)
  • FLAN-T5-small with prompting: 84% accuracy
  • ChatGPT (gpt-3.5-turbo): 91% accuracy

Practical Applications

Zero-Shot Classification
  • Limited or no training data available
  • Need quick prototyping
  • Working with many different categories
  • Label definitions are clear and distinguishable
Task-Specific Models
  • Domain matches your use case
  • Need fast, consistent performance
  • Have computational constraints
  • Don’t have resources to train custom models
Embeddings + Classifier
  • Have sufficient labeled training data (hundreds to thousands of examples)
  • Need good balance of accuracy and speed
  • Want to use lightweight traditional ML models
  • Need model interpretability
Generative Models (FLAN-T5, ChatGPT)
  • Need highest possible accuracy
  • Have complex classification tasks with nuanced categories
  • Can afford API costs or computation time
  • Working with evolving categories or requirements

Key Takeaways

  1. Multiple paths to classification: Representation models, embeddings, and generative models all offer viable approaches
  2. Zero-shot is powerful: Modern embeddings enable decent classification without any training
  3. Trade-offs matter: Balance accuracy, speed, cost, and training requirements for your use case
  4. Prompt engineering helps: For generative models, well-crafted prompts significantly impact performance
  5. Embeddings are versatile: Sentence embeddings can power multiple approaches (supervised, zero-shot, similarity-based)

Next Steps

In Chapter 5, we’ll explore Text Clustering and Topic Modeling, where you’ll learn to discover patterns and topics in unlabeled text collections.
