Overview
Tokenization is the critical first step in how LLMs process text. This chapter explores how different models break down text into tokens, compares tokenization strategies across popular models, and explains how tokens are converted into numerical embeddings that neural networks can process. We recommend using a GPU for running the examples in this chapter. In Google Colab, go to Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4.
Learning Objectives
By the end of this chapter, you will:
- Understand what tokens are and why they matter
- Compare different tokenization strategies across models
- Learn how to inspect and visualize tokenization
- Understand the relationship between tokens and embeddings
- Recognize common tokenization patterns and edge cases
Setting Up
First, install the required dependencies: the examples in this chapter use the Hugging Face transformers library and PyTorch (e.g. pip install transformers torch).
Understanding Tokenization
Before an LLM can process text, it must convert words and characters into numbers. This process is called tokenization.
Loading a Model and Tokenizer
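A matched model and tokenizer can each be loaded in one line with the transformers library. The model name below (gpt2) is an illustrative assumption; any causal LM on the Hugging Face Hub, such as microsoft/Phi-3-mini-4k-instruct, works the same way:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; swap in any causal LM from the Hugging Face Hub.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

print(f"Vocabulary size: {tokenizer.vocab_size}")
```

Using the Auto classes guarantees the tokenizer matches the checkpoint, which matters because token IDs are meaningless across different vocabularies.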
Tokenizing Text
Let’s see how text gets tokenized:
Decoding Individual Tokens
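A sketch that tokenizes a sentence into IDs and then decodes each ID individually (the tokenizer choice, gpt2, is illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative tokenizer

text = "Hello world!"
token_ids = tokenizer(text).input_ids
print(token_ids)

# Decode each ID on its own to see the exact text fragment it represents.
for token_id in token_ids:
    print(token_id, repr(tokenizer.decode([token_id])))
```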
Decoding each ID on its own shows exactly what text each token represents.
Comparing Tokenizers Across Models
Different models use different tokenization strategies. Let’s create a visualization tool to compare them:
Test Text with Edge Cases
Let’s create a challenging test string that includes various edge cases:
- Normal English words
- All-caps text
- Emojis and Unicode characters
- Python keywords and operators
- Numbers and arithmetic
- Whitespace variations
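A sketch of such a comparison helper plus a test string covering these cases. The helper name show_tokens is an assumption taken from the test string in the comparison table later in the chapter, and the exact test text is reconstructed from the list above:

```python
from transformers import AutoTokenizer

def show_tokens(sentence: str, tokenizer_name: str) -> list[str]:
    """Tokenize a sentence and return each token's decoded text."""
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    ids = tokenizer(sentence).input_ids
    pieces = [tokenizer.decode([i]) for i in ids]
    print(f"{tokenizer_name}: " + " | ".join(pieces))
    return pieces

# Reconstructed edge-case string: caps, emoji, Python keywords, math, whitespace.
test_text = """English and CAPITALIZATION
🎵 show_tokens False None elif == >=
12.0*50=600"""

show_tokens(test_text, "bert-base-uncased")
show_tokens(test_text, "gpt2")
```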
BERT (Uncased)
- All text is lowercased
- Unknown tokens (emojis) become [UNK]
- Subwords get a ## prefix
BERT (Cased)
- Case is preserved, so all-caps words fragment into many subword pieces
GPT-2
- Spaces as explicit tokens (Ġ represents a space)
- Better Unicode support
- More aggressive subword splitting
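The Ġ convention is easy to see directly (a small sketch):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.tokenize("Hello world"))  # ['Hello', 'Ġworld'] -- the Ġ marks the preceding space
```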
GPT-4
- Handles whitespace better
- Smarter about Python code
- More compact representation overall
Comparing Token Efficiency
BERT Models
- Vocabulary: ~30K tokens
- Good for: English text classification
- Struggles with: Code, emojis, non-English
GPT Models
- Vocabulary: 50K+ tokens
- Good for: Multilingual, code, generation
- Better Unicode handling
T5 Models
- Vocabulary: 32K tokens
- SentencePiece tokenization
- Good for: Sequence-to-sequence tasks
Code Models
- Vocabulary: 50K+ tokens
- Optimized for programming languages
- Better at individual digits and operators
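These efficiency differences can be measured directly by counting how many tokens each tokenizer produces for the same input (model choices are illustrative):

```python
from transformers import AutoTokenizer

text = "def show_tokens(text): return 12.0 * 50  # 🎵"
counts = {}
for name in ["bert-base-uncased", "gpt2"]:
    tok = AutoTokenizer.from_pretrained(name)
    # Exclude special tokens so the counts compare only the text itself.
    counts[name] = len(tok(text, add_special_tokens=False).input_ids)
print(counts)
```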
Understanding Subword Tokenization
Why do models split words into pieces? A fixed-size vocabulary cannot include every possible word, so common words get their own token while rare words are assembled from smaller, reusable pieces; this way no input is ever truly out of vocabulary.
Token IDs to Text
Let’s explore how individual token IDs map to text:
Understanding Special Tokens
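A sketch of the round trip, using gpt2 as the illustrative tokenizer (specific ID values are tokenizer-dependent, so the numbers quoted in the text will differ; depending on the tokenizer, a word may come back as one ID or several):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

ids = tokenizer.encode("Subject")
for i in ids:
    print(i, repr(tokenizer.decode([i])))  # each ID's own text fragment
print(repr(tokenizer.decode(ids)))         # the fragments combine back into the word
```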
Some tokens only make sense in combination! Token 3323 is “Sub” and 622 is “ject”, but together they form “Subject”.
Tokenization Patterns
Common Patterns Across Models
Whitespace Handling
Different tokenizers handle spaces differently:
- BERT: Implicit (no space tokens)
- GPT-2: Explicit (Ġ prefix)
- T5: Underscores for spaces
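The BERT and GPT-2 behaviors can be compared side by side (T5's underscore convention is analogous but needs the sentencepiece package, so it is omitted here):

```python
from transformers import AutoTokenizer

bert = AutoTokenizer.from_pretrained("bert-base-uncased")
gpt2 = AutoTokenizer.from_pretrained("gpt2")

# BERT keeps no trace of the space; GPT-2 folds it into the next token as Ġ.
print(bert.tokenize("New York"))
print(gpt2.tokenize("New York"))
```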
Numbers and Math
Models vary in how they tokenize numbers:
- Some split “12.0” into [“12”, ”.”, “0”]
- Others keep it as one token
- Code-focused models are better at preserving mathematical expressions
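This is easy to check directly (gpt2 as the illustrative tokenizer; the comparison table at the end of the chapter shows how several models split this same expression):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.tokenize("12.0*50=600"))
```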
Programming Keywords
Code-optimized tokenizers have dedicated tokens for:
- Keywords: False, None, elif, else
- Operators: ==, >=, !=
- Common patterns: def, class, import
From Tokens to Embeddings
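Each token ID simply indexes one row of the model's embedding matrix. A sketch of inspecting this lookup (gpt2 as the assumed model):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

ids = tokenizer("Hello world", return_tensors="pt").input_ids
with torch.no_grad():
    vectors = model.get_input_embeddings()(ids)  # row lookup in the embedding matrix
print(vectors.shape)  # (batch, num_tokens, hidden_size)
```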
Once text is tokenized, each token ID is converted to a dense vector (embedding); these vectors are what the transformer layers actually process.
Practical Implications
Token Limits
Models have maximum context lengths measured in tokens:
- GPT-3.5: 4,096 tokens
- GPT-4: 8,192 or 32,768 tokens
- Phi-3-mini-4k: 4,096 tokens
- Claude 2: 100,000 tokens
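A sketch of checking a prompt against a context limit before sending it (the limit constant mirrors the Phi-3-mini-4k / GPT-3.5 figure above; the tokenizer choice is illustrative):

```python
from transformers import AutoTokenizer

CONTEXT_LIMIT = 4096  # e.g. the Phi-3-mini-4k / GPT-3.5 figure above

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative tokenizer
prompt = "Summarize the following document in one paragraph."
n_tokens = len(tokenizer(prompt).input_ids)
print(f"{n_tokens} tokens; fits in context: {n_tokens <= CONTEXT_LIMIT}")
```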
Cost Considerations
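API pricing is usually quoted per 1K or per 1M tokens, so cost scales with token count. A toy estimator (the rate below is a made-up placeholder, not any provider's real price):

```python
from transformers import AutoTokenizer

PRICE_PER_1K_TOKENS = 0.002  # hypothetical rate in dollars, not a real price

tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Hello world! " * 100
n_tokens = len(tokenizer(text).input_ids)
cost = n_tokens * PRICE_PER_1K_TOKENS / 1000
print(f"{n_tokens} tokens -> estimated ${cost:.5f}")
```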
Many API providers charge per token, so token counts translate directly into cost.
Visualizing Tokenization
Here’s a comparison of how different tokenizers handle the same text:
| Model | "CAPITALIZATION" | "show_tokens" | "12.0*50=600" |
|---|---|---|---|
| BERT (uncased) | capital ##ization | show _ token ##s | 12 . 0 * 50 = 600 |
| BERT (cased) | CA ##PI ##TA ##L ##I ##Z ##AT ##ION | show _ token ##s | 12 . 0 * 50 = 600 |
| GPT-2 | ĠCAP ITAL IZ ATION | show _t ok ens | 12 . 0 * 50 = 600 |
| GPT-4 | ĠCAPITAL IZATION | show _tokens | 12 . 0 * 50 = 600 |
Best Practices
Match Model and Tokenizer
Always use the tokenizer designed for your model
Test Edge Cases
Verify tokenization for code, numbers, and special characters
Monitor Token Usage
Track tokens for cost and context limit management
Consider Language
Some tokenizers are more efficient for certain languages
Next Steps
Chapter 3: Looking Inside Transformer LLMs
Explore the internal architecture of transformer models and how they process token embeddings
