Feeding data to a neural network efficiently is just as important as the network architecture itself. Chapter 13 covers TensorFlow’s complete data-loading ecosystem: the tf.data API for constructing flexible, high-performance input pipelines; TFRecord files for storing large datasets in a binary format optimised for sequential reads; the TensorFlow Datasets (tfds) library for accessing hundreds of ready-to-use datasets; and Keras preprocessing layers that let you embed data normalisation and encoding directly into the model graph.
What you’ll learn
- Creating datasets with tf.data.Dataset.from_tensor_slices and from files
- Chaining transformations: map, filter, batch, repeat, shuffle, prefetch, cache
- Reading and writing TFRecord files with tf.io
- Protocol Buffers and the Example / SequenceExample protobuf formats
- Loading standard datasets with TensorFlow Datasets (tfds)
- Keras preprocessing layers: Normalization, Discretization, CategoryEncoding, Hashing, StringLookup, TextVectorization, RandomFlip, RandomRotation
- Building a complete CSV input pipeline with interleave and parallel reads
Key concepts
The tf.data pipeline model
tf.data.Dataset represents a potentially infinite sequence of elements (batches, images, sentences, etc.). You compose transformations by chaining method calls — each call returns a new dataset without modifying the original. The key performance primitives are:
- prefetch(n): starts preparing the next n batches while the model processes the current one, hiding I/O latency behind compute.
- cache(): stores the dataset in memory (or on disk) after the first epoch, eliminating repeated file reads.
- interleave(): reads from multiple files in parallel, masking the latency of individual disk seeks.
- map(fn, num_parallel_calls=tf.data.AUTOTUNE): applies a preprocessing function to each element concurrently across multiple CPU threads.
TFRecord files
TFRecord is TensorFlow’s native binary file format. Each record is typically a serialised tf.train.Example protocol buffer containing named features. Writing your dataset to TFRecord files and reading it back is the most I/O-efficient approach for very large datasets, particularly when stored on network-attached storage or cloud object stores like GCS.
Keras preprocessing layers
Preprocessing layers perform feature engineering inside the model graph, which means:
- The same transformation is applied automatically at inference time.
- You can export the entire preprocessing+model as a single SavedModel.
- You get GPU acceleration for certain transforms (e.g. random augmentation).
Normalization computes the mean and variance of the training set via adapt() and applies standardisation. TextVectorization tokenises and vocabulary-maps text. CategoryEncoding one-hot or multi-hot encodes integer features.
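As a rough illustration of the encoding layers, the sketch below maps a string feature to integer indices with StringLookup and then one-hot encodes those indices with CategoryEncoding. The city names are made-up sample data, and reserving a single out-of-vocabulary index is an assumption chosen for the example.

```python
import tensorflow as tf

# Hypothetical categorical feature (illustrative values only).
cities = tf.constant(["Paris", "Auckland", "Paris", "San Francisco"])

# Learn the vocabulary from the data; one index is reserved for unknown strings.
lookup = tf.keras.layers.StringLookup(num_oov_indices=1)
lookup.adapt(cities)

# Map each string to its integer index.
ids = lookup(cities)

# One-hot encode the indices; num_tokens must cover the OOV slot as well.
encoder = tf.keras.layers.CategoryEncoding(
    num_tokens=lookup.vocabulary_size(), output_mode="one_hot"
)
one_hot = encoder(ids)
```

Here one_hot has one row per input string and one column per vocabulary entry, including the out-of-vocabulary slot.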
Code examples
Basic tf.data pipeline
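The following is a minimal sketch of a chained pipeline, not the notebook's own code. It assumes a toy dataset of ten integers, and the buffer size, batch size, and seed are arbitrary values picked for illustration.

```python
import tensorflow as tf

# Toy dataset: the integers 0..9, one element per tensor slice.
dataset = tf.data.Dataset.from_tensor_slices(tf.range(10))

# Each method call returns a new dataset; `dataset` itself is left unchanged.
pipeline = (
    dataset
    .map(lambda x: x * 2, num_parallel_calls=tf.data.AUTOTUNE)  # double each element
    .filter(lambda x: x < 16)         # keep only values below 16
    .cache()                          # keep processed elements in memory after the first pass
    .shuffle(buffer_size=8, seed=42)  # shuffle within an 8-element buffer
    .batch(4)                         # group elements into batches of 4
    .prefetch(1)                      # prepare one batch ahead of the consumer
)

# Materialise the batches as NumPy arrays for inspection.
batches = [batch.numpy() for batch in pipeline]
```

Placing cache() before shuffle() means the cached elements are reshuffled on every epoch rather than frozen in one order.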
Production pipeline from CSV files
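One way such a pipeline might look is sketched below. To stay self-contained it first writes three tiny CSV shards to a temporary directory; the file names, column layout (two float features plus one float label), and the helper names parse_csv_line and csv_reader_dataset are all assumptions made for this example.

```python
import os
import tempfile
import tensorflow as tf

# Write a few small CSV shards so the example runs on its own (made-up data).
tmp_dir = tempfile.mkdtemp()
for shard in range(3):
    with open(os.path.join(tmp_dir, f"part_{shard}.csv"), "w") as f:
        f.write("x1,x2,y\n")
        for row in range(4):
            f.write(f"{shard}.0,{row}.0,{shard + row}.0\n")

def parse_csv_line(line):
    # Decode three float columns; the last one is treated as the label.
    fields = tf.io.decode_csv(line, record_defaults=[0.0, 0.0, 0.0])
    return tf.stack(fields[:-1]), tf.stack(fields[-1:])

def csv_reader_dataset(file_pattern, batch_size=4, n_readers=3):
    files = tf.data.Dataset.list_files(file_pattern, shuffle=True, seed=42)
    # Pull lines from several files at once, skipping each header row.
    lines = files.interleave(
        lambda path: tf.data.TextLineDataset(path).skip(1),
        cycle_length=n_readers,
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    return (
        lines.map(parse_csv_line, num_parallel_calls=tf.data.AUTOTUNE)
        .shuffle(buffer_size=100, seed=42)
        .batch(batch_size)
        .prefetch(1)
    )

train_set = csv_reader_dataset(os.path.join(tmp_dir, "part_*.csv"))
first_X, first_y = next(iter(train_set))
n_rows = sum(int(X.shape[0]) for X, y in train_set)
```

The resulting dataset yields (features, label) batches and can be passed directly to a Keras model's fit() method.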
Keras Normalization preprocessing layer
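A minimal sketch of the idea, assuming a tiny made-up training set with two features: the layer learns per-feature statistics through adapt() and is then placed as the first layer of a model, so raw inputs are standardised both during training and at inference time.

```python
import numpy as np
import tensorflow as tf

# Made-up training data: three samples, two features on different scales.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]], dtype=np.float32)
y_train = np.array([[1.0], [2.0], [3.0]], dtype=np.float32)

# Learn the per-feature mean and variance from the training set.
norm_layer = tf.keras.layers.Normalization()
norm_layer.adapt(X_train)

# Put the layer inside the model so raw features can be fed in directly.
model = tf.keras.Sequential([norm_layer, tf.keras.layers.Dense(1)])
model.compile(loss="mse", optimizer="sgd")
model.fit(X_train, y_train, epochs=2, verbose=0)

# Applying the layer on its own shows the standardised features.
X_scaled = norm_layer(X_train).numpy()
```

After adapt(), each column of X_scaled has approximately zero mean and unit variance.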
Writing and reading TFRecord files
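The sketch below shows one possible round trip: building tf.train.Example protobufs, writing them with tf.io.TFRecordWriter, and parsing them back with tf.io.parse_single_example. The feature names ("name", "age"), the sample records, and the helper names make_example and parse are hypothetical, chosen only for illustration.

```python
import os
import tempfile
import tensorflow as tf

path = os.path.join(tempfile.mkdtemp(), "people.tfrecord")

def make_example(name, age):
    # Wrap each value in the Feature type matching its data type.
    return tf.train.Example(features=tf.train.Features(feature={
        "name": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[name.encode("utf-8")])),
        "age": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[age])),
    }))

# Write one serialised Example per record.
with tf.io.TFRecordWriter(path) as writer:
    for name, age in [("Alice", 30), ("Bob", 41)]:
        writer.write(make_example(name, age).SerializeToString())

# Describe the expected features so the parser knows their shapes and types.
feature_description = {
    "name": tf.io.FixedLenFeature([], tf.string),
    "age": tf.io.FixedLenFeature([], tf.int64),
}

def parse(serialized):
    return tf.io.parse_single_example(serialized, feature_description)

# Read the file back and decode each record into a dict of tensors.
dataset = tf.data.TFRecordDataset([path]).map(parse)
records = [
    (r["name"].numpy().decode("utf-8"), int(r["age"].numpy())) for r in dataset
]
```

The parsed dataset can then be shuffled, batched, and prefetched like any other tf.data pipeline.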
Exercises
Exercises cover building a TFRecord pipeline for the Fashion MNIST dataset, using tf.io.parse_single_example to decode serialised protobuf records, and embedding preprocessing layers inside a Keras model. Solutions appear at the bottom of the notebook.