Overview
This chapter takes you inside the transformer architecture to understand how Large Language Models actually work. You’ll learn about the model’s internal layers, how it processes embeddings, the attention mechanism, and key optimizations like KV caching that make text generation efficient.
This chapter requires a GPU for running the examples. In Google Colab, go to Runtime > Change runtime type > Hardware accelerator and select T4 GPU.
Learning Objectives
By the end of this chapter, you will:
Understand the transformer architecture and its components
Know how to inspect model layers and parameters
Learn how the language model head produces token probabilities
Understand key-value caching and why it matters
Be able to analyze model outputs and internal states
Setting Up
Install the required dependencies:
pip install "transformers>=4.41.2" "accelerate>=0.31.0"
Loading the Model
Let’s load Phi-3 and examine its architecture:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=False,
)

# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=50,
    do_sample=False,
)
Model Architecture Overview
Let's inspect the model's structure by printing the model object:
print(model)
Output:
Phi3ForCausalLM(
(model): Phi3Model(
(embed_tokens): Embedding(32064, 3072, padding_idx=32000)
(embed_dropout): Dropout(p=0.0, inplace=False)
(layers): ModuleList(
(0-31): 32 x Phi3DecoderLayer(
(self_attn): Phi3Attention(
(o_proj): Linear(in_features=3072, out_features=3072, bias=False)
(qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
(rotary_emb): Phi3RotaryEmbedding()
)
(mlp): Phi3MLP(
(gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
(down_proj): Linear(in_features=8192, out_features=3072, bias=False)
(activation_fn): SiLU()
)
(input_layernorm): Phi3RMSNorm()
(resid_attn_dropout): Dropout(p=0.0, inplace=False)
(resid_mlp_dropout): Dropout(p=0.0, inplace=False)
(post_attention_layernorm): Phi3RMSNorm()
)
)
(norm): Phi3RMSNorm()
)
(lm_head): Linear(in_features=3072, out_features=32064, bias=False)
)
Understanding the Architecture
Embedding Layer
Converts token IDs (32,064 vocab) into 3,072-dimensional vectors
Transformer Layers (32 layers)
Each layer contains:
Self-attention : Captures relationships between tokens
MLP (Feed-forward) : Processes each position independently
Layer normalization : Stabilizes training
Language Model Head
Projects the final hidden state (3,072 dims) back to vocabulary size (32,064) to produce logits
Key Components Explained
The Attention Mechanism
The attention mechanism allows each token to “look at” other tokens in the sequence:
Query, Key, Value (QKV) Projections
(qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
This layer projects each token’s embedding into three representations:
Query (Q) : What I’m looking for
Key (K) : What I have to offer
Value (V) : The actual information I contain
The output size is 9,216 = 3 × 3,072 (for Q, K, and V)
(o_proj): Linear(in_features=3072, out_features=3072, bias=False)
After attention scores are computed and applied to values, this layer transforms the result back to the hidden dimension.
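The whole computation can be sketched in plain Python with toy dimensions (this is an illustration of scaled dot-product attention for a single head, not the actual Phi-3 implementation; the causal mask, which restricts each query to earlier positions, is omitted for brevity):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head.

    Q, K, V: lists of per-token vectors (seq_len x d).
    Each output row is a softmax-weighted mix of the value vectors.
    """
    d = len(Q[0])
    out = []
    for q in Q:
        # Scores: how well this query matches every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d)])
    return out

# Toy example: 3 tokens, 2-dimensional head (Phi-3 uses 3,072 dims split across heads)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = attention(Q, K, V)
```

Because the weights sum to 1, each output vector is a convex combination of the value vectors: the token "mixes in" information from the positions it attends to.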
Rotary Positional Embeddings
(rotary_emb): Phi3RotaryEmbedding()
Instead of adding positional information to embeddings, RoPE (Rotary Position Embedding) encodes position directly into the attention mechanism, giving the model better position awareness.
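The core idea can be sketched in a few lines: pairs of vector components are rotated by an angle proportional to the token's position. This is a minimal illustration of the idea, not Phi-3's implementation (real implementations differ in how components are paired and are heavily vectorized):

```python
import math

def rope_rotate(vec, position, theta=10000.0):
    """Apply a rotary position embedding to one query/key vector.

    Consecutive pairs (vec[0], vec[1]), (vec[2], vec[3]), ... are each
    rotated by an angle that depends on the position and the pair index.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        freq = theta ** (-i / d)   # later pairs rotate more slowly
        angle = position * freq
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out

q = [1.0, 0.0, 1.0, 0.0]
q0 = rope_rotate(q, position=0)   # position 0: no rotation
q5 = rope_rotate(q, position=5)   # position 5: rotated
```

The key property: because rotations compose, the dot product between a rotated query and a rotated key depends only on their *relative* distance, which is exactly what attention scores need.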
The Feed-Forward Network (MLP)
Each transformer layer includes a position-wise feed-forward network:
(mlp): Phi3MLP(
  (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
  (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
  (activation_fn): SiLU()
)
Expansion The gate_up_proj layer expands from 3,072 to 16,384 dimensions; this output is split into two 8,192-dimensional halves, a gate and an up projection
Non-linearity The SiLU (Swish) activation applied to the gate half adds a non-linear transformation
Compression The down_proj layer compresses the 8,192-dimensional result back to 3,072 dimensions
Knowledge Storage These large intermediate dimensions are where much of the model's factual knowledge is thought to be stored
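The gated forward pass can be sketched in plain Python with toy sizes (an illustration of the structure, not the real weights or the vectorized implementation):

```python
import math

def silu(x):
    """SiLU (Swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(W, x):
    """Multiply matrix W (one row per output dimension) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in W]

def gated_mlp(x, gate_up_W, down_W):
    """Gated feed-forward block in the style of Phi3MLP.

    gate_up_W projects to 2 x intermediate dims; the first half passes
    through SiLU and gates the second half, then down_W projects back.
    """
    gate_up = matvec(gate_up_W, x)
    half = len(gate_up) // 2
    gate, up = gate_up[:half], gate_up[half:]
    hidden = [silu(g) * u for g, u in zip(gate, up)]
    return matvec(down_W, hidden)

# Toy sizes: hidden dim 2, intermediate dim 3 (Phi-3: 3,072 and 8,192)
gate_up_W = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6],   # gate rows
             [0.7, 0.8], [0.9, 1.0], [1.1, 1.2]]   # up rows
down_W = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
y = gated_mlp([1.0, -1.0], gate_up_W, down_W)
```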
Let’s see what the model actually processes:
prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."
output = generator(prompt)
print(output[0]['generated_text'])
Output:
Solution 1:
Subject: My Sincere Apologies for the Gardening Mishap
Dear Sarah,
I hope this message finds you well. I am writing to express my deep
Examining Model Internals
Let’s process a simple prompt and inspect the internal representations:
prompt = "The capital of France is"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

# Get the output of the model before the lm_head
model_output = model.model(input_ids)

# Get the output of the lm_head
lm_head_output = model.lm_head(model_output[0])
Understanding the Shapes
print(model_output[0].shape)
# Output: torch.Size([1, 6, 3072])
# [batch_size, sequence_length, hidden_dimension]

print(lm_head_output.shape)
# Output: torch.Size([1, 6, 32064])
# [batch_size, sequence_length, vocabulary_size]
The model processes all tokens in parallel , producing a hidden state for each position. The language model head then converts each hidden state into a probability distribution over the entire vocabulary.
From Logits to Tokens
The final step is selecting the next token from the probability distribution:
# Get the logits for the last position and pick the most likely token
token_id = lm_head_output[0, -1].argmax(-1)

# Decode the token
next_token = tokenizer.decode(token_id)
print(next_token)
Output:
Paris
Perfect! The model correctly predicts "Paris" as the next token.
Visualization of the Process
Input: "The capital of France is"
↓ (tokenization)
Token IDs: [1, 450, 7483, 310, 3444, 338]
↓ (embedding layer)
Embeddings: [batch=1, tokens=6, dim=3072]
↓ (32 transformer layers)
Hidden States: [batch=1, tokens=6, dim=3072]
↓ (language model head)
Logits: [batch=1, tokens=6, vocab=32064]
↓ (argmax on last position)
Next Token ID: 3681
↓ (decode)
Output: "Paris"
Optimizing Generation with KV Caching
Text generation is autoregressive: the model produces one token at a time. Without optimization, this would be extremely slow!
The Problem
prompt = "Write a very long email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
To generate 100 tokens without caching:
Token 1 : Process all input tokens
Token 2 : Process input + token 1
Token 3 : Process input + token 1 + token 2
…
Token 100 : Process input + 99 generated tokens
This leads to massive redundant computation!
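The redundancy is easy to quantify. A quick sketch counting how many token positions the model must run through, assuming a hypothetical 20-token prompt:

```python
def tokens_processed(prompt_len, new_tokens, use_cache):
    """Count token positions run through the model during generation."""
    if use_cache:
        # Prompt processed once, then a single new token per step
        return prompt_len + (new_tokens - 1)
    # Without a cache, every step reprocesses the whole sequence so far
    return sum(prompt_len + i for i in range(new_tokens))

no_cache = tokens_processed(20, 100, use_cache=False)   # 6,950 positions
with_cache = tokens_processed(20, 100, use_cache=True)  # 119 positions
```

The position count alone differs by ~58x here; actual wall-clock speedups are smaller because per-step overheads and memory bandwidth also matter, as the benchmark below shows.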
The Solution: KV Caching
The key insight: attention Keys and Values for previous tokens never change. We can cache them!
First Token
Compute K and V for all input tokens, cache them
Subsequent Tokens
Only compute K and V for the new token, reuse cached values
Massive Speedup
Eliminate redundant computation
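The mechanism can be sketched as a simple loop (the projections and next-token picker here are toy stand-ins for the real learned layers, purely for illustration):

```python
def project_k(token):
    """Stand-in for the real key projection (a learned linear layer)."""
    return [token * 0.1]

def project_v(token):
    """Stand-in for the real value projection."""
    return [token * 0.2]

def generate_with_cache(prompt_ids, steps, pick_next):
    """Generation loop that projects each token's K/V exactly once."""
    k_cache, v_cache = [], []
    # First step: compute and cache K and V for every prompt token
    for t in prompt_ids:
        k_cache.append(project_k(t))
        v_cache.append(project_v(t))
    out = []
    for _ in range(steps):
        # Attention would read the full k_cache / v_cache here
        nxt = pick_next(k_cache, v_cache)
        out.append(nxt)
        # Subsequent steps: project only the newly generated token
        k_cache.append(project_k(nxt))
        v_cache.append(project_v(nxt))
    return out, len(k_cache)

# Toy run: 3 prompt tokens, 4 generated tokens
out, cached = generate_with_cache([1, 2, 3], steps=4,
                                  pick_next=lambda k, v: len(k))
```

After the run, the cache holds K/V for all 7 tokens, and no token's projection was ever computed twice.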
Benchmarking the Difference
With caching enabled:
%%timeit -n 1
generation_output = model.generate(
    input_ids=input_ids,
    max_new_tokens=100,
    use_cache=True
)

# Result: 6.66 s ± 2.22 s per loop
With caching disabled:
%%timeit -n 1
generation_output = model.generate(
    input_ids=input_ids,
    max_new_tokens=100,
    use_cache=False
)

# Result: 21.9 s ± 94.6 ms per loop
KV caching provides a 3.3x speedup! This optimization is critical for making LLMs practical for real-time applications.
Memory Considerations
While KV caching speeds up generation, it requires memory:
KV Cache Size = 2 × num_layers × batch_size × max_length × hidden_dim × precision
For Phi-3 generating 1000 tokens:
= 2 × 32 × 1 × 1000 × 3072 × 2 bytes (float16)
= ~375 MB per sequence
For large models and long sequences, KV cache can consume significant VRAM. This is why context length is often limited in production systems.
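The formula above translates directly into code. A small helper (hypothetical function name, chosen for illustration) reproduces the Phi-3 figure:

```python
def kv_cache_bytes(num_layers, batch, seq_len, hidden_dim, bytes_per_value=2):
    """KV cache size: 2 (K and V) x layers x batch x length x hidden x precision."""
    return 2 * num_layers * batch * seq_len * hidden_dim * bytes_per_value

# Phi-3-mini generating 1,000 tokens in float16
phi3 = kv_cache_bytes(num_layers=32, batch=1, seq_len=1000, hidden_dim=3072)
print(f"{phi3 / 2**20:.0f} MiB")  # 375 MiB
```

Doubling the sequence length or the batch size doubles the cache, which is why both are common levers when a model runs out of VRAM.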
Model Parameters
Let’s count the parameters in Phi-3:
Embedding layer: 32,064 (vocab) × 3,072 (dim) = 98.5M parameters
Transformer layers: 32 layers × ~113M each = 3.6B parameters
LM head: 3,072 (dim) × 32,064 (vocab) = 98.5M parameters
Total: 98.5M (embedding) + 3.6B (layers) + 98.5M (lm_head) ≈ 3.8 billion parameters
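This arithmetic follows directly from the layer shapes in the module printout above (the tiny RMSNorm weight vectors are omitted, which is why the total is slightly below the model's exact count):

```python
# Dimensions read off the Phi3ForCausalLM printout
vocab, dim, inter, layers = 32064, 3072, 8192, 32

embed = vocab * dim          # embed_tokens
per_layer = (
    dim * 3 * dim            # qkv_proj: 3072 -> 9216
    + dim * dim              # o_proj:   3072 -> 3072
    + dim * 2 * inter        # gate_up_proj: 3072 -> 16384
    + inter * dim            # down_proj:    8192 -> 3072
)
lm_head = dim * vocab
total = embed + layers * per_layer + lm_head
print(f"{per_layer / 1e6:.0f}M per layer, {total / 1e9:.2f}B total")
```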
Understanding Attention Patterns
Different layers learn different patterns:
Early Layers Focus on syntax and local patterns (nearby words)
Middle Layers Capture semantic relationships and facts
Late Layers Handle high-level reasoning and task-specific patterns
Final Layer Prepares information for token prediction
Advanced Topics
Temperature and Sampling
When do_sample=True, the model doesn’t just pick the highest probability token:
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    do_sample=True,
    temperature=0.7,  # Lower = more focused, Higher = more creative
    top_p=0.9,        # Nucleus sampling
    top_k=50,         # Top-k sampling
)
Temperature scales the logits before applying softmax:
Temperature = 0.1: Very focused, nearly deterministic
Temperature = 1.0: Unscaled logits, the model's raw distribution
Temperature = 2.0: Flatter distribution, more random and creative
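The scaling itself is a one-liner; a small sketch over toy logits shows the effect (these logit values are made up for illustration):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature before applying softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
cold = softmax_with_temperature(logits, 0.1)  # near-deterministic
warm = softmax_with_temperature(logits, 1.0)  # the raw distribution
hot = softmax_with_temperature(logits, 2.0)   # closer to uniform
```

Low temperatures sharpen the distribution around the top token; high temperatures flatten it, giving lower-ranked tokens a real chance of being sampled.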
Batch Processing
The model can process multiple sequences simultaneously:
prompts = [
    "The capital of France is",
    "The capital of Japan is",
    "The capital of Brazil is",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=5)
for i, output in enumerate(outputs):
    print(f"{prompts[i]}{tokenizer.decode(output[len(inputs.input_ids[i]):])}")
Practical Applications
Understanding the internals enables advanced techniques:
Prompt Engineering Knowing attention mechanisms helps craft better prompts
Fine-tuning Target specific layers for parameter-efficient training
Inference Optimization Optimize KV cache and batch sizes
Model Compression Prune layers or reduce dimensions strategically
Performance Tips
Use Flash Attention
Install flash-attention for 2-3x speedup in attention computation
Enable torch.compile()
PyTorch 2.0+ can compile the model for faster execution
Optimize Batch Size
Larger batches improve GPU utilization but require more memory
Use Mixed Precision
FP16 or BF16 reduces memory and increases speed with minimal quality loss
Next Steps
Now that you understand how LLMs work internally, you’re ready to apply them to real-world tasks!
Chapter 4: Text Classification Learn how to use LLMs for classification tasks
Chapter 5: Text Clustering Explore unsupervised learning with embeddings
Additional Resources