Overview
This chapter takes you inside the transformer architecture to understand how Large Language Models actually work. You’ll learn about the model’s internal layers, how it processes embeddings, the attention mechanism, and key optimizations like KV caching that make text generation efficient.
This chapter requires a GPU for running the examples. In Google Colab, go to Runtime > Change runtime type > Hardware accelerator and select T4 GPU.
Learning Objectives
By the end of this chapter, you will:
Understand the transformer architecture and its components
Know how to inspect model layers and parameters
Learn how the language model head produces token probabilities
Understand key-value caching and why it matters
Be able to analyze model outputs and internal states
Setting Up
Install the required dependencies:
pip install "transformers>=4.41.2" "accelerate>=0.31.0"
Loading the Model
Let’s load Phi-3 and examine its architecture:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=False,
)

# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=50,
    do_sample=False,
)
Model Architecture Overview
Let's inspect the model's structure by printing the model object:
print(model)
Output:
Phi3ForCausalLM(
(model): Phi3Model(
(embed_tokens): Embedding(32064, 3072, padding_idx=32000)
(embed_dropout): Dropout(p=0.0, inplace=False)
(layers): ModuleList(
(0-31): 32 x Phi3DecoderLayer(
(self_attn): Phi3Attention(
(o_proj): Linear(in_features=3072, out_features=3072, bias=False)
(qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
(rotary_emb): Phi3RotaryEmbedding()
)
(mlp): Phi3MLP(
(gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
(down_proj): Linear(in_features=8192, out_features=3072, bias=False)
(activation_fn): SiLU()
)
(input_layernorm): Phi3RMSNorm()
(resid_attn_dropout): Dropout(p=0.0, inplace=False)
(resid_mlp_dropout): Dropout(p=0.0, inplace=False)
(post_attention_layernorm): Phi3RMSNorm()
)
)
(norm): Phi3RMSNorm()
)
(lm_head): Linear(in_features=3072, out_features=32064, bias=False)
)
Understanding the Architecture
Embedding Layer
Converts token IDs (32,064 vocab) into 3,072-dimensional vectors
Transformer Layers (32 layers)
Each layer contains:
Self-attention : Captures relationships between tokens
MLP (Feed-forward) : Processes each position independently
Layer normalization : Stabilizes training
Language Model Head
Projects the final hidden state (3,072 dims) back to vocabulary size (32,064) to produce logits
Key Components Explained
The Attention Mechanism
The attention mechanism allows each token to “look at” other tokens in the sequence:
Query, Key, Value (QKV) Projections
(qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
This layer projects each token’s embedding into three representations:
Query (Q) : What I’m looking for
Key (K) : What I have to offer
Value (V) : The actual information I contain
The output size is 9,216 = 3 × 3,072 (for Q, K, and V)
(o_proj): Linear(in_features=3072, out_features=3072, bias=False)
After attention scores are computed and applied to values, this layer transforms the result back to the hidden dimension.
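The whole computation can be sketched in plain Python with toy dimensions (this is an illustration of scaled dot-product attention for a single head, not the actual Phi-3 implementation; the causal mask, which restricts each query to earlier positions, is omitted for brevity):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head.

    Q, K, V: lists of per-token vectors (seq_len x d).
    Each output row is a softmax-weighted mix of the value vectors.
    """
    d = len(Q[0])
    out = []
    for q in Q:
        # Scores: how well this query matches every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d)])
    return out

# Toy example: 3 tokens, 2-dimensional head (Phi-3 uses 3,072 dims split across heads)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = attention(Q, K, V)
```

Because the weights sum to 1, each output vector is a convex combination of the value vectors: the token "mixes in" information from the positions it attends to.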
Rotary Positional Embeddings
(rotary_emb): Phi3RotaryEmbedding()
Instead of adding positional information to embeddings, RoPE (Rotary Position Embedding) encodes position directly into the attention mechanism, giving the model better position awareness.
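The core idea can be sketched in a few lines: pairs of vector components are rotated by an angle proportional to the token's position. This is a minimal illustration of the idea, not Phi-3's implementation (real implementations differ in how components are paired and are heavily vectorized):

```python
import math

def rope_rotate(vec, position, theta=10000.0):
    """Apply a rotary position embedding to one query/key vector.

    Consecutive pairs (vec[0], vec[1]), (vec[2], vec[3]), ... are each
    rotated by an angle that depends on the position and the pair index.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        freq = theta ** (-i / d)   # later pairs rotate more slowly
        angle = position * freq
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out

q = [1.0, 0.0, 1.0, 0.0]
q0 = rope_rotate(q, position=0)   # position 0: no rotation
q5 = rope_rotate(q, position=5)   # position 5: rotated
```

The key property: because rotations compose, the dot product between a rotated query and a rotated key depends only on their *relative* distance, which is exactly what attention scores need.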
The Feed-Forward Network (MLP)
Each transformer layer includes a position-wise feed-forward network:
(mlp): Phi3MLP(
  (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
  (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
  (activation_fn): SiLU()
)
Expansion The gate_up_proj layer expands from 3,072 to 16,384 dimensions; this output is split into two 8,192-dimensional halves, a gate and an up projection
Non-linearity The SiLU (Swish) activation applied to the gate half adds a non-linear transformation
Compression The down_proj layer compresses the 8,192-dimensional result back to 3,072 dimensions
Knowledge Storage These large intermediate dimensions are where much of the model's factual knowledge is thought to be stored
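The gated forward pass can be sketched in plain Python with toy sizes (an illustration of the structure, not the real weights or the vectorized implementation):

```python
import math

def silu(x):
    """SiLU (Swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(W, x):
    """Multiply matrix W (one row per output dimension) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in W]

def gated_mlp(x, gate_up_W, down_W):
    """Gated feed-forward block in the style of Phi3MLP.

    gate_up_W projects to 2 x intermediate dims; the first half passes
    through SiLU and gates the second half, then down_W projects back.
    """
    gate_up = matvec(gate_up_W, x)
    half = len(gate_up) // 2
    gate, up = gate_up[:half], gate_up[half:]
    hidden = [silu(g) * u for g, u in zip(gate, up)]
    return matvec(down_W, hidden)

# Toy sizes: hidden dim 2, intermediate dim 3 (Phi-3: 3,072 and 8,192)
gate_up_W = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6],   # gate rows
             [0.7, 0.8], [0.9, 1.0], [1.1, 1.2]]   # up rows
down_W = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
y = gated_mlp([1.0, -1.0], gate_up_W, down_W)
```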
Let’s see what the model actually processes:
prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."
output = generator(prompt)
print(output[0]['generated_text'])
Output:
Solution 1:
Subject: My Sincere Apologies for the Gardening Mishap
Dear Sarah,
I hope this message finds you well. I am writing to express my deep
Examining Model Internals
Let’s process a simple prompt and inspect the internal representations:
prompt = "The capital of France is"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

# Get the output of the model before the lm_head
model_output = model.model(input_ids)

# Get the output of the lm_head
lm_head_output = model.lm_head(model_output[0])
Understanding the Shapes
print(model_output[0].shape)
# Output: torch.Size([1, 6, 3072])
# [batch_size, sequence_length, hidden_dimension]

print(lm_head_output.shape)
# Output: torch.Size([1, 6, 32064])
# [batch_size, sequence_length, vocabulary_size]
The model processes all tokens in parallel , producing a hidden state for each position. The language model head then converts each hidden state into a probability distribution over the entire vocabulary.
From Logits to Tokens
The final step is selecting the next token from the probability distribution:
# Get the logits for the last position and pick the most likely token
token_id = lm_head_output[0, -1].argmax(-1)

# Decode the token
next_token = tokenizer.decode(token_id)
print(next_token)
Output:
Paris
Perfect! The model correctly predicts "Paris" as the next token.
Visualization of the Process
Input: "The capital of France is"
↓ (tokenization)
Token IDs: [1, 450, 7483, 310, 3444, 338]
↓ (embedding layer)
Embeddings: [batch=1, tokens=6, dim=3072]
↓ (32 transformer layers)
Hidden States: [batch=1, tokens=6, dim=3072]
↓ (language model head)
Logits: [batch=1, tokens=6, vocab=32064]
↓ (argmax on last position)
Next Token ID: 3681
↓ (decode)
Output: "Paris"
Optimizing Generation with KV Caching
Text generation is autoregressive: the model produces one token at a time. Without optimization, this would be extremely slow!
The Problem
prompt = "Write a very long email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
To generate 100 tokens without caching:
Token 1 : Process all input tokens
Token 2 : Process input + token 1
Token 3 : Process input + token 1 + token 2
…
Token 100 : Process input + 99 generated tokens
This leads to massive redundant computation!
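The redundancy is easy to quantify. A quick sketch counting how many token positions the model must run through, assuming a hypothetical 20-token prompt:

```python
def tokens_processed(prompt_len, new_tokens, use_cache):
    """Count token positions run through the model during generation."""
    if use_cache:
        # Prompt processed once, then a single new token per step
        return prompt_len + (new_tokens - 1)
    # Without a cache, every step reprocesses the whole sequence so far
    return sum(prompt_len + i for i in range(new_tokens))

no_cache = tokens_processed(20, 100, use_cache=False)   # 6,950 positions
with_cache = tokens_processed(20, 100, use_cache=True)  # 119 positions
```

The position count alone differs by ~58x here; actual wall-clock speedups are smaller because per-step overheads and memory bandwidth also matter, as the benchmark below shows.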
The Solution: KV Caching
The key insight: attention Keys and Values for previous tokens never change. We can cache them!
First Token
Compute K and V for all input tokens, cache them
Subsequent Tokens
Only compute K and V for the new token, reuse cached values
Massive Speedup
Eliminate redundant computation
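The mechanism can be sketched as a simple loop (the projections and next-token picker here are toy stand-ins for the real learned layers, purely for illustration):

```python
def project_k(token):
    """Stand-in for the real key projection (a learned linear layer)."""
    return [token * 0.1]

def project_v(token):
    """Stand-in for the real value projection."""
    return [token * 0.2]

def generate_with_cache(prompt_ids, steps, pick_next):
    """Generation loop that projects each token's K/V exactly once."""
    k_cache, v_cache = [], []
    # First step: compute and cache K and V for every prompt token
    for t in prompt_ids:
        k_cache.append(project_k(t))
        v_cache.append(project_v(t))
    out = []
    for _ in range(steps):
        # Attention would read the full k_cache / v_cache here
        nxt = pick_next(k_cache, v_cache)
        out.append(nxt)
        # Subsequent steps: project only the newly generated token
        k_cache.append(project_k(nxt))
        v_cache.append(project_v(nxt))
    return out, len(k_cache)

# Toy run: 3 prompt tokens, 4 generated tokens
out, cached = generate_with_cache([1, 2, 3], steps=4,
                                  pick_next=lambda k, v: len(k))
```

After the run, the cache holds K/V for all 7 tokens, and no token's projection was ever computed twice.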
Benchmarking the Difference
With caching enabled:
%%timeit -n 1
generation_output = model.generate(
    input_ids=input_ids,
    max_new_tokens=100,
    use_cache=True
)

# Result: 6.66 s ± 2.22 s per loop
With caching disabled:
%%timeit -n 1
generation_output = model.generate(
    input_ids=input_ids,
    max_new_tokens=100,
    use_cache=False
)

# Result: 21.9 s ± 94.6 ms per loop
KV caching provides a 3.3x speedup! This optimization is critical for making LLMs practical for real-time applications.
Memory Considerations
While KV caching speeds up generation, it requires memory:
KV Cache Size = 2 × num_layers × batch_size × max_length × hidden_dim × precision
For Phi-3 generating 1000 tokens:
= 2 × 32 × 1 × 1000 × 3072 × 2 bytes (float16)
= ~375 MB per sequence
For large models and long sequences, KV cache can consume significant VRAM. This is why context length is often limited in production systems.
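The formula above translates directly into code. A small helper (hypothetical function name, chosen for illustration) reproduces the Phi-3 figure:

```python
def kv_cache_bytes(num_layers, batch, seq_len, hidden_dim, bytes_per_value=2):
    """KV cache size: 2 (K and V) x layers x batch x length x hidden x precision."""
    return 2 * num_layers * batch * seq_len * hidden_dim * bytes_per_value

# Phi-3-mini generating 1,000 tokens in float16
phi3 = kv_cache_bytes(num_layers=32, batch=1, seq_len=1000, hidden_dim=3072)
print(f"{phi3 / 2**20:.0f} MiB")  # 375 MiB
```

Doubling the sequence length or the batch size doubles the cache, which is why both are common levers when a model runs out of VRAM.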
Model Parameters
Let’s count the parameters in Phi-3:
Embedding layer: 32,064 (vocab) × 3,072 (dim) = 98.5M parameters
Transformer layers: 32 layers × ~113M each = 3.6B parameters
LM head: 3,072 (dim) × 32,064 (vocab) = 98.5M parameters
Total: 98.5M (embedding) + 3.6B (layers) + 98.5M (lm_head) ≈ 3.8 billion parameters
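This arithmetic follows directly from the layer shapes in the module printout above (the tiny RMSNorm weight vectors are omitted, which is why the total is slightly below the model's exact count):

```python
# Dimensions read off the Phi3ForCausalLM printout
vocab, dim, inter, layers = 32064, 3072, 8192, 32

embed = vocab * dim          # embed_tokens
per_layer = (
    dim * 3 * dim            # qkv_proj: 3072 -> 9216
    + dim * dim              # o_proj:   3072 -> 3072
    + dim * 2 * inter        # gate_up_proj: 3072 -> 16384
    + inter * dim            # down_proj:    8192 -> 3072
)
lm_head = dim * vocab
total = embed + layers * per_layer + lm_head
print(f"{per_layer / 1e6:.0f}M per layer, {total / 1e9:.2f}B total")
```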
Understanding Attention Patterns
Different layers learn different patterns:
Early Layers Focus on syntax and local patterns (nearby words)
Middle Layers Capture semantic relationships and facts
Late Layers Handle high-level reasoning and task-specific patterns
Final Layer Prepares information for token prediction
Advanced Topics
Temperature and Sampling
When do_sample=True, the model doesn’t just pick the highest probability token:
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    do_sample=True,
    temperature=0.7,  # Lower = more focused, Higher = more creative
    top_p=0.9,        # Nucleus sampling
    top_k=50,         # Top-k sampling
)
Temperature scales the logits before applying softmax:
Temperature = 0.1: Very focused, nearly deterministic
Temperature = 1.0: Unscaled logits, the model's raw distribution
Temperature = 2.0: Flatter distribution, more random and creative
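The scaling itself is a one-liner; a small sketch over toy logits shows the effect (these logit values are made up for illustration):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature before applying softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
cold = softmax_with_temperature(logits, 0.1)  # near-deterministic
warm = softmax_with_temperature(logits, 1.0)  # the raw distribution
hot = softmax_with_temperature(logits, 2.0)   # closer to uniform
```

Low temperatures sharpen the distribution around the top token; high temperatures flatten it, giving lower-ranked tokens a real chance of being sampled.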
Batch Processing
The model can process multiple sequences simultaneously:
prompts = [
    "The capital of France is",
    "The capital of Japan is",
    "The capital of Brazil is",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=5)
for i, output in enumerate(outputs):
    print(f"{prompts[i]}{tokenizer.decode(output[len(inputs.input_ids[i]):])}")
Practical Applications
Understanding the internals enables advanced techniques:
Prompt Engineering Knowing attention mechanisms helps craft better prompts
Fine-tuning Target specific layers for parameter-efficient training
Inference Optimization Optimize KV cache and batch sizes
Model Compression Prune layers or reduce dimensions strategically
Performance Tips
Use Flash Attention
Install flash-attention for 2-3x speedup in attention computation
Enable torch.compile()
PyTorch 2.0+ can compile the model for faster execution
Optimize Batch Size
Larger batches improve GPU utilization but require more memory
Use Mixed Precision
FP16 or BF16 reduces memory and increases speed with minimal quality loss
Next Steps
Now that you understand how LLMs work internally, you’re ready to apply them to real-world tasks!
Chapter 4: Text Classification Learn how to use LLMs for classification tasks
Chapter 5: Text Clustering Explore unsupervised learning with embeddings
Additional Resources