Chapter 16 covers natural language processing from first principles, building up from a character-level Shakespeare text generator through attention-augmented encoder-decoder translation to a from-scratch Transformer. The chapter also shows how to leverage Hugging Face Transformers for large pretrained models like DistilBERT (sentiment analysis) and T5 (text-to-text tasks), and briefly introduces vision transformers (ViT).
What you’ll learn
- Building character-level datasets with TextVectorization and tf.data
- Training a character-level RNN (GRU) text generator on Shakespeare
- Stateful RNNs and how they carry state across batches
- Sentiment analysis with word-level embeddings and masking
- Encoder-decoder architecture for English-to-Spanish translation
- The attention mechanism: how query, key, and value work
- Multi-head attention and the full Transformer architecture
- Positional encoding
- Using Hugging Face Transformers: pipeline, DistilBERT, T5
- Introduction to Vision Transformers (ViT)
Key concepts
Character-level text generation
The notebook encodes the complete Shakespeare corpus (about 1.1 million characters) as integers using TextVectorization. A sliding-window dataset is then built in which each sample is a sequence of length characters (length being the window size) and the target is the same sequence shifted by one position, so the model learns to predict the next character from the preceding context. After training, you can generate new text by sampling from the model iteratively: predict the next character, append it to the input, and repeat.
Encoder-decoder and attention
The classic encoder-decoder RNN compresses the entire source sentence into a single context vector, which the decoder uses to produce each output token. This bottleneck limits performance on long sentences. Attention allows the decoder to look back at all encoder hidden states, computing a weighted combination based on relevance to the current decoding step. The weights (alignment scores) are learned end-to-end and are interpretable: they show which source tokens the model attends to when generating each target word.
The Transformer
Transformers replace recurrence entirely with multi-head self-attention. Every position in the sequence attends to every other position simultaneously, making the architecture highly parallelisable. The key components are:
- Multi-head attention: multiple attention heads capture different types of relationships.
- Positional encoding: sinusoidal signals added to embeddings to inject sequence order.
- Feed-forward sublayers: position-wise two-layer MLPs applied identically to each token.
- Layer normalisation and residual connections: for training stability.
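The positional encoding component can be illustrated with a minimal NumPy sketch. It assumes the sine/cosine formulation with a base of 10,000 from the original Transformer design; the function name, the array sizes, and the random stand-in embeddings are illustrative only.

```python
import numpy as np

def sinusoidal_positional_encoding(max_length, embed_size):
    """Return a (max_length, embed_size) array of sinusoidal position signals."""
    assert embed_size % 2 == 0, "embed_size must be even"
    positions = np.arange(max_length)[:, np.newaxis]   # shape (max_length, 1)
    dims = np.arange(embed_size // 2)[np.newaxis, :]   # shape (1, embed_size // 2)
    # Each pair of dimensions uses a different wavelength.
    angles = positions / 10_000 ** (2 * dims / embed_size)
    encoding = np.empty((max_length, embed_size))
    encoding[:, 0::2] = np.sin(angles)  # even dimensions
    encoding[:, 1::2] = np.cos(angles)  # odd dimensions
    return encoding

# Add the encoding to (random stand-in) token embeddings of the same shape.
embeddings = np.random.default_rng(0).normal(size=(10, 8))
encoded = embeddings + sinusoidal_positional_encoding(10, 8)
```

Because the signal depends only on position and dimension, it can be computed once and reused for every batch.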
Hugging Face Transformers
The transformers library provides a unified API for hundreds of pretrained models. pipeline("sentiment-analysis") automatically downloads a model and tokeniser; you call it like a function. For fine-tuning, you wrap a pretrained encoder (e.g. DistilBERT) with a custom classification head and train as usual in Keras.
Code examples
Building the Shakespeare character-level dataset
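A minimal sketch of the encoding step, assuming a short in-memory excerpt in place of the full corpus. The `sample_text` string and the variable names are illustrative, and shifting the IDs by 2 to skip the reserved padding and unknown tokens is one possible choice rather than a requirement.

```python
import tensorflow as tf

# Stand-in for the full Shakespeare corpus (assumption: the real text,
# about 1.1 million characters, would be loaded from a file instead).
sample_text = "First Citizen:\nBefore we proceed any further, hear me speak."

# Split on characters rather than words, and lowercase to shrink the vocabulary.
text_vec_layer = tf.keras.layers.TextVectorization(
    split="character", standardize="lower")
text_vec_layer.adapt([sample_text])

# Encode the whole text as a single sequence of integer token IDs.
encoded = text_vec_layer([sample_text])[0]

# IDs 0 (padding) and 1 (unknown) are reserved by the layer; subtracting 2
# makes the character IDs start at 0 (an illustrative choice).
encoded = encoded - 2
n_tokens = text_vec_layer.vocabulary_size() - 2  # number of distinct characters
dataset_size = len(encoded)                      # total number of characters
```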
Creating a sliding-window training set
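A sketch of how the windowing could be done with tf.data, assuming the corpus is already encoded as a 1-D integer tensor. The helper name `to_dataset`, its parameters, and the toy sequence are illustrative.

```python
import tensorflow as tf

def to_dataset(sequence, length, shuffle=False, seed=None, batch_size=32):
    """Turn a 1-D token sequence into batches of (input, target) windows."""
    ds = tf.data.Dataset.from_tensor_slices(sequence)
    # Each window holds length + 1 tokens so it can be split into input and target.
    ds = ds.window(length + 1, shift=1, drop_remainder=True)
    # window() yields nested datasets; flatten each one into a single tensor.
    ds = ds.flat_map(lambda window_ds: window_ds.batch(length + 1))
    if shuffle:
        ds = ds.shuffle(buffer_size=100_000, seed=seed)
    ds = ds.batch(batch_size)
    # Input is every token but the last; target is every token but the first.
    return ds.map(lambda window: (window[:, :-1], window[:, 1:])).prefetch(1)

# Toy sequence standing in for the encoded corpus.
toy_sequence = tf.range(6)
inputs, targets = next(iter(to_dataset(toy_sequence, length=3)))
```

With the toy sequence 0 to 5 and a length of 3, the first input row is `[0, 1, 2]` and its target is `[1, 2, 3]`, which is the one-position shift described under Key concepts.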
Hugging Face sentiment pipeline
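A minimal sketch of the pipeline call, assuming the transformers package is installed and a network connection is available for the first download. The helper name `classify_sentiment` and the lazy import are illustrative choices, not something the chapter prescribes.

```python
def classify_sentiment(texts):
    """Classify a list of strings with the default sentiment-analysis pipeline."""
    # Imported lazily so that defining this helper does not require the
    # library to be installed (an illustrative choice).
    from transformers import pipeline

    # With no model specified, the library picks a default checkpoint and
    # downloads the model and tokeniser on first use.
    classifier = pipeline("sentiment-analysis")
    return classifier(texts)
```

Calling `classify_sentiment(["The actors were very convincing."])` returns a list with one dictionary per input, each holding a `label` and a `score`.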
Transformer multi-head attention layer
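A short sketch using the built-in Keras `MultiHeadAttention` layer for self-attention. The tensor sizes, the number of heads, and the random input are assumptions made for illustration; in a real model the input would be the position-encoded token embeddings.

```python
import tensorflow as tf

batch_size, seq_len, d_model = 2, 5, 16  # illustrative sizes

# Random stand-in for a batch of embedded tokens.
x = tf.random.uniform((batch_size, seq_len, d_model))

# 4 heads, each projecting queries and keys down to 4 dimensions.
attn_layer = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=4)

# Self-attention: the same tensor is passed as query and value
# (keys default to the value tensor).
output, scores = attn_layer(x, x, return_attention_scores=True)

# output has the same shape as x; scores has shape
# (batch_size, num_heads, seq_len, seq_len).
```

Each row of `scores` is a softmax over the positions being attended to, so inspecting it shows how much weight each head gives to every other token.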
Running this notebook
Enable a GPU
Training the character-level GRU can take over 24 hours on CPU. A GPU reduces this to roughly 1–2 hours. In Colab: Runtime → Change runtime type → GPU.
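To confirm that the runtime actually exposes a GPU before starting a long training run, a quick check along these lines can help (a sketch, assuming TensorFlow is the framework in use):

```python
import tensorflow as tf

# List the GPUs TensorFlow can see; an empty list means CPU-only execution.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    print(f"TensorFlow sees {len(gpus)} GPU(s).")
else:
    print("No GPU detected; training will run on the CPU.")
```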