NLP with RNNs, Attention, and Transformers (Ch. 16)

Chapter 16 covers natural language processing from first principles, building up from a character-level Shakespeare text generator through attention-augmented encoder-decoder translation to a from-scratch Transformer. The chapter also shows how to leverage Hugging Face Transformers for large pretrained models like DistilBERT (sentiment analysis) and T5 (text-to-text tasks), and briefly introduces vision transformers (ViT).

What you’ll learn

Building character-level datasets with TextVectorization and tf.data
Training a character-level RNN (GRU) text generator on Shakespeare
Stateful RNNs and how they carry state across batches
Sentiment analysis with word-level embeddings and masking
Encoder-decoder architecture for English-to-Spanish translation
The attention mechanism: how query, key, and value work
Multi-head attention and the full Transformer architecture
Positional encoding
Using Hugging Face Transformers: pipeline, DistilBERT, T5
Introduction to Vision Transformers (ViT)

Key concepts

Character-level text generation

The notebook encodes the complete Shakespeare corpus (about 1.1 million characters) as integers using TextVectorization. A sliding-window dataset is then created in which each sample is a sequence of length characters (length is the window size, 100 in the code below) and the target is the same sequence shifted by one position, so the model learns to predict the next character given the preceding context. After training, you can generate new text by sampling one character at a time and feeding each sampled character back in as input.
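The shift between input and target is easiest to see on a toy string. The following is a pure-Python sketch for illustration only: the string and the window length of 4 are arbitrary choices, and the notebook itself builds the windows with tf.data rather than list comprehensions.

```python
# Toy illustration of the input/target shift used for next-character prediction.
text = "to be or not"
length = 4  # window length (arbitrary here)

# Each window holds length + 1 characters; consecutive windows overlap (shift of 1).
windows = [text[i:i + length + 1] for i in range(len(text) - length)]

# The input drops the last character, the target drops the first.
pairs = [(w[:-1], w[1:]) for w in windows]

for inputs, targets in pairs[:3]:
    print(repr(inputs), "->", repr(targets))
# 'to b' -> 'o be'
# 'o be' -> ' be '
# ' be ' -> 'be o'
```

Every target character is the character that follows the corresponding input character in the original text, which is exactly what the model is trained to predict.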

Encoder-decoder and attention

The classic encoder-decoder RNN compresses the entire source sentence into a single context vector, which the decoder uses to produce each output token. This bottleneck limits performance on long sentences. Attention allows the decoder to look back at all encoder hidden states and compute a weighted combination of them based on their relevance to the current decoding step. The attention weights are a softmax over alignment scores, and the layers that produce those scores are trained end-to-end with the rest of the model. The weights are also interpretable: they show which source tokens the model attends to when generating each target word.
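The weighted combination described above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes plain dot-product scoring between the decoder state and each encoder state; other scoring functions exist, and the function name, shapes, and random inputs here are made up for the example.

```python
import numpy as np

def dot_product_attention(decoder_state, encoder_states):
    """Return (context, weights) for a single decoding step.

    decoder_state:  shape (d,),   acts as the query
    encoder_states: shape (T, d), one vector per source token (keys and values)
    """
    scores = encoder_states @ decoder_state          # (T,) one score per source token
    scores = scores - scores.max()                   # subtract max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax: non-negative, sums to 1
    context = weights @ encoder_states               # (d,) weighted sum of encoder states
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 8))  # 5 source tokens, 8-dimensional states
decoder_state = rng.normal(size=8)
context, weights = dot_product_attention(decoder_state, encoder_states)
print(weights.round(3), weights.sum())
```

The weights vector has one entry per source token, so plotting it for each decoding step gives the alignment picture mentioned above.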

The Transformer

Transformers replace recurrence entirely with multi-head self-attention. Every position in the sequence attends to every other position simultaneously, making the architecture highly parallelisable. The key components are:

Multi-head attention — multiple attention heads capture different types of relationships.
Positional encoding — sinusoidal signals added to embeddings to inject sequence order.
Feed-forward sublayers — position-wise two-layer MLPs applied identically to each token.
Layer normalisation and residual connections — for training stability.
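As an illustration of the positional encoding component, here is a NumPy sketch of the fixed sinusoidal scheme, in which even dimensions use a sine and odd dimensions a cosine of the position scaled by a geometric progression of wavelengths. The function name and the sizes are assumptions chosen for the example, and the notebook's own implementation may differ in detail.

```python
import numpy as np

def sinusoidal_positional_encoding(max_length, embed_size):
    """Return an array of shape (max_length, embed_size); embed_size must be even."""
    assert embed_size % 2 == 0, "embed_size must be even"
    positions = np.arange(max_length)[:, np.newaxis]        # (max_length, 1)
    even_dims = np.arange(0, embed_size, 2)[np.newaxis, :]  # (1, embed_size // 2)
    angles = positions / 10_000 ** (even_dims / embed_size)
    encoding = np.zeros((max_length, embed_size))
    encoding[:, 0::2] = np.sin(angles)  # even dimensions
    encoding[:, 1::2] = np.cos(angles)  # odd dimensions
    return encoding

pos_enc = sinusoidal_positional_encoding(max_length=50, embed_size=16)

# The encoding is simply added to the token embeddings (zeros stand in for them here).
embeddings = np.zeros((50, 16))
encoded_inputs = embeddings + pos_enc
```

Because the encoding depends only on the position and the dimension index, it has no trainable parameters and can be precomputed once.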

Transformers have become the dominant architecture for NLP and are increasingly used in vision and audio.

Hugging Face Transformers

The transformers library provides a unified API for hundreds of pretrained models. pipeline("sentiment-analysis") automatically downloads a model and tokeniser; you call it like a function. For fine-tuning, you wrap a pretrained encoder (e.g. DistilBERT) with a custom classification head and train as usual in Keras.

Code examples

Building the Shakespeare character-level dataset

import os

# Must be set before TensorFlow is imported so that tf.keras resolves to Keras 2
os.environ["TF_USE_LEGACY_KERAS"] = "1"

import tensorflow as tf
import tf_keras  # the Keras 2 package that tf.keras is redirected to

shakespeare_url = "https://homl.info/shakespeare"
filepath = tf.keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

text_vec_layer = tf.keras.layers.TextVectorization(split="character",
                                                   standardize="lower")
text_vec_layer.adapt([shakespeare_text])
encoded = text_vec_layer([shakespeare_text])[0]
encoded -= 2  # drop pad and unknown tokens
n_tokens = text_vec_layer.vocabulary_size() - 2  # 39 distinct characters

Creating sliding-window training set

def to_dataset(sequence, length, shuffle=False, seed=None, batch_size=32):
    ds = tf.data.Dataset.from_tensor_slices(sequence)
    ds = ds.window(length + 1, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda window_ds: window_ds.batch(length + 1))
    if shuffle:
        ds = ds.shuffle(100_000, seed=seed)
    ds = ds.batch(batch_size)
    return ds.map(lambda window: (window[:, :-1], window[:, 1:])).prefetch(1)

length = 100
tf.random.set_seed(42)
train_set = to_dataset(encoded[:1_000_000], length=length, shuffle=True, seed=42)
valid_set = to_dataset(encoded[1_000_000:1_060_000], length=length)

Hugging Face sentiment pipeline

from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default model (a DistilBERT variant) and its tokeniser
result = classifier(["I love this book!", "This was a waste of time."])
# Example output (exact scores depend on the model version):
# [{'label': 'POSITIVE', 'score': 0.9998}, {'label': 'NEGATIVE', 'score': 0.9997}]

Transformer multi-head attention layer

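import tensorflow as tf

# Keras has a built-in MultiHeadAttention layer
attention_layer = tf.keras.layers.MultiHeadAttention(
    num_heads=8, key_dim=64, dropout=0.1)

# Placeholder batch for illustration: 2 sequences, 10 tokens, 512-dim embeddings
inputs = tf.random.normal([2, 10, 512])

# In a Transformer encoder block, query, key, and value are all the same
# tensor (self-attention), and no causal mask is applied:
attn_output = attention_layer(query=inputs, key=inputs, value=inputs,
                              use_causal_mask=False)
# attn_output has the same shape as inputs: (2, 10, 512)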
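
In a decoder's self-attention, the same layer would typically be called with use_causal_mask=True so that each position can only attend to itself and earlier positions. The NumPy sketch below shows what such a mask does to the attention weights. It illustrates the idea only and is not how Keras implements it internally; the sequence length and the all-zero scores are placeholders.

```python
import numpy as np

seq_len = 4
# True where attention is allowed: position i may attend to positions j <= i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

scores = np.zeros((seq_len, seq_len))            # placeholder attention scores
masked = np.where(causal_mask, scores, -np.inf)  # future positions get -inf
weights = np.exp(masked) / np.exp(masked).sum(axis=-1, keepdims=True)
print(weights)
```

With equal scores, row i spreads its weight evenly over positions 0 to i and assigns exactly zero to every later position.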

Running this notebook

Enable a GPU

Training the character-level GRU can take over 24 hours on CPU. A GPU reduces this to roughly 1–2 hours. In Colab: Runtime → Change runtime type → GPU.

Install dependencies

pip install -r requirements.txt

The Hugging Face section requires transformers~=4.35.0.

Keras 2 compatibility

This chapter sets TF_USE_LEGACY_KERAS=1 and imports tf_keras to use Keras 2, because stateful RNNs and ragged tensors work differently in Keras 3. Set the environment variable before importing TensorFlow.

Exercises

Exercises include training an English-to-French translation model, implementing positional encoding from scratch, and fine-tuning a pretrained transformer on a custom classification task. Solutions are in the notebook.
