
Overview

This chapter takes you inside the transformer architecture to understand how Large Language Models actually work. You’ll learn about the model’s internal layers, how it processes embeddings, the attention mechanism, and key optimizations like KV caching that make text generation efficient.
This chapter requires a GPU for running examples. In Google Colab, go to Runtime > Change runtime type > Hardware accelerator and select a T4 GPU.

Learning Objectives

By the end of this chapter, you will:
  • Understand the transformer architecture and its components
  • Know how to inspect model layers and parameters
  • Learn how the language model head produces token probabilities
  • Understand key-value caching and why it matters
  • Be able to analyze model outputs and internal states

Setting Up

Install the required dependencies:
pip install "transformers>=4.41.2" "accelerate>=0.31.0"

Loading the Model

Let’s load Phi-3 and examine its architecture:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=False,
)

# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=50,
    do_sample=False,
)

Model Architecture Overview

Let’s inspect the model’s structure:
print(model)
Output:
Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
          (rotary_emb): Phi3RotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm()
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm()
      )
    )
    (norm): Phi3RMSNorm()
  )
  (lm_head): Linear(in_features=3072, out_features=32064, bias=False)
)

Understanding the Architecture

1. Embedding Layer: converts token IDs (32,064-token vocabulary) into 3,072-dimensional vectors
2. Transformer Layers (32 of them): each layer contains:
  • Self-attention: captures relationships between tokens
  • MLP (feed-forward): processes each position independently
  • Layer normalization: stabilizes training
3. Language Model Head: projects the final hidden state (3,072 dims) back to vocabulary size (32,064) to produce logits
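To make the first step concrete, here is a toy NumPy sketch of the embedding lookup: the layer is simply a learned (vocab_size, hidden_dim) table indexed by token ID. The table below is a zero-filled stand-in for the learned weights, using Phi-3's real dimensions.

```python
import numpy as np

# The embedding layer is a (vocab_size, hidden_dim) lookup table.
vocab_size, hidden_dim = 32064, 3072
embed_table = np.zeros((vocab_size, hidden_dim), dtype=np.float32)  # stand-in for learned weights

# Token IDs for "The capital of France is" (batch=1, seq=6)
token_ids = np.array([[1, 450, 7483, 310, 3444, 338]])

# Embedding is just fancy indexing into the table
embeddings = embed_table[token_ids]
print(embeddings.shape)  # (1, 6, 3072)
```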

Key Components Explained

The Attention Mechanism

The attention mechanism allows each token to “look at” other tokens in the sequence:
(qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
This layer projects each token’s embedding into three representations:
  • Query (Q): What I’m looking for
  • Key (K): What I have to offer
  • Value (V): The actual information I contain
The output size is 9,216 = 3 × 3,072 (for Q, K, and V)
(o_proj): Linear(in_features=3072, out_features=3072, bias=False)
After attention scores are computed and applied to values, this layer transforms the result back to the hidden dimension.
(rotary_emb): Phi3RotaryEmbedding()
Instead of adding positional information to embeddings, RoPE (Rotary Position Embedding) encodes position directly into the attention mechanism, giving the model better position awareness.
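These pieces can be sketched end to end. Below is a minimal single-head causal self-attention in NumPy with toy dimensions; the real model uses 32 heads, a fused Q/K/V projection, and RoPE, all omitted here for clarity.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention with a causal mask.

    x: [seq, dim] token representations; w_q/w_k/w_v: [dim, dim] projections.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(x.shape[-1])               # [seq, seq]
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -np.inf                                # each token sees only the past
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)             # row-wise softmax
    return weights @ v                                    # weighted sum of values

rng = np.random.default_rng(0)
seq, dim = 6, 8                                           # toy sizes (Phi-3 uses dim=3072)
x = rng.standard_normal((seq, dim))
w = [rng.standard_normal((dim, dim)) for _ in range(3)]
out = causal_self_attention(x, *w)
print(out.shape)  # (6, 8)
```

Note that because of the causal mask, the first token can only attend to itself, so its output is exactly its own value vector.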

The Feed-Forward Network (MLP)

Each transformer layer includes a position-wise feed-forward network:
(mlp): Phi3MLP(
  (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
  (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
  (activation_fn): SiLU()
)

Expansion

gate_up_proj produces 16,384 values per position, but these are two fused 8,192-dimensional projections: a gate and an up projection. The effective intermediate size is therefore 8,192 (about 2.7x the hidden dimension)

Non-linearity

The SiLU (Swish) activation is applied to the gate half, which then scales the up half elementwise

Compression

down_proj compresses the 8,192-dimensional result back to 3,072 dimensions

Knowledge Storage

These large intermediate dimensions are believed to store much of the model's factual knowledge
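Judging from the printed layer shapes, the forward pass of this gated MLP can be sketched as follows. This is an illustrative NumPy reconstruction with toy dimensions, not the library's actual implementation.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))  # SiLU / Swish activation

def gated_mlp(x, w_gate_up, w_down):
    """Gated MLP matching the printed shapes: gate_up_proj fuses two
    hidden->intermediate projections into one matrix."""
    gate_up = x @ w_gate_up                   # [seq, 2 * intermediate]
    gate, up = np.split(gate_up, 2, axis=-1)  # two [seq, intermediate] halves
    return (silu(gate) * up) @ w_down         # back to [seq, hidden]

rng = np.random.default_rng(0)
hidden, inter = 6, 16                         # toy sizes (Phi-3: hidden=3072, inter=8192)
x = rng.standard_normal((4, hidden))
w_gate_up = rng.standard_normal((hidden, 2 * inter))
w_down = rng.standard_normal((inter, hidden))
print(gated_mlp(x, w_gate_up, w_down).shape)  # (4, 6)
```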

The Inputs and Outputs of the Model

Let’s see what the model actually processes:
prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."

output = generator(prompt)
print(output[0]['generated_text'])
Output:

Solution 1:

Subject: My Sincere Apologies for the Gardening Mishap


Dear Sarah,


I hope this message finds you well. I am writing to express my deep

Examining Model Internals

Let’s process a simple prompt and inspect the internal representations:
prompt = "The capital of France is"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

# Get the output of the model before the lm_head
model_output = model.model(input_ids)

# Get the output of the lm_head
lm_head_output = model.lm_head(model_output[0])

Understanding the Shapes

print(model_output[0].shape)
# Output: torch.Size([1, 6, 3072])
# [batch_size, sequence_length, hidden_dimension]

print(lm_head_output.shape)  
# Output: torch.Size([1, 6, 32064])
# [batch_size, sequence_length, vocabulary_size]
The model processes all tokens in parallel, producing a hidden state for each position. The language model head then converts each hidden state into a probability distribution over the entire vocabulary.
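The language model head itself is just one bias-free matrix multiply. A toy NumPy sketch with scaled-down dimensions (the real values are hidden=3072, vocab=32064):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq, hidden, vocab = 1, 6, 16, 100           # toy sizes
hidden_states = rng.standard_normal((batch, seq, hidden))
w_lm_head = rng.standard_normal((hidden, vocab))    # bias=False, as in Phi-3

logits = hidden_states @ w_lm_head                  # one logit per vocabulary entry
print(logits.shape)  # (1, 6, 100)

# Greedy decoding: pick the highest-scoring token at the last position
next_token_id = logits[0, -1].argmax()
print(next_token_id)
```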

From Logits to Tokens

The final step is selecting the next token from the probability distribution:
# Pick the highest-probability token ID at the last position (greedy decoding)
token_id = lm_head_output[0, -1].argmax(-1)

# Decode the token
next_token = tokenizer.decode(token_id)
print(next_token)
Output:
Paris
Perfect! The model correctly predicts “Paris” as the next token.

Visualization of the Process

Input: "The capital of France is"
         ↓ (tokenization)
Token IDs: [1, 450, 7483, 310, 3444, 338]
         ↓ (embedding layer)
Embeddings: [batch=1, tokens=6, dim=3072]
         ↓ (32 transformer layers)
Hidden States: [batch=1, tokens=6, dim=3072]
         ↓ (language model head)
Logits: [batch=1, tokens=6, vocab=32064]
         ↓ (argmax on last position)
Next Token ID: 3681
         ↓ (decode)
Output: "Paris"

Optimizing Generation with KV Caching

Text generation requires generating one token at a time. Without optimization, this would be extremely slow!

The Problem

prompt = "Write a very long email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
To generate 100 tokens without caching:
  • Token 1: Process all input tokens
  • Token 2: Process input + token 1
  • Token 3: Process input + token 1 + token 2
  • Token 100: Process input + 99 generated tokens
This leads to massive redundant computation!
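A quick back-of-the-envelope calculation makes the redundancy concrete (assuming a hypothetical 24-token prompt):

```python
# Tokens processed to generate N new tokens from a P-token prompt.
P, N = 24, 100

# Without caching: every step reprocesses the whole growing sequence.
without_cache = sum(P + i for i in range(N))

# With KV caching: the prompt is processed once, then one token per step.
with_cache = P + (N - 1)

print(without_cache, with_cache)  # 7350 vs 123
```

The gap grows quadratically with sequence length, which is why caching matters more and more for long generations.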

The Solution: KV Caching

The key insight: attention Keys and Values for previous tokens never change. We can cache them!
1. First token: compute K and V for all input tokens and cache them
2. Subsequent tokens: compute K and V only for the new token, and reuse the cached values
3. Massive speedup: redundant computation is eliminated
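The steps above can be sketched as a toy decode loop in NumPy. Projections are replaced with the identity for brevity, so this only illustrates the cache bookkeeping, not a real attention layer.

```python
import numpy as np

def attend(q, K, V):
    """Attention for a single new query against all cached keys/values."""
    scores = (q @ K.T) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
dim = 8
K_cache, V_cache = np.empty((0, dim)), np.empty((0, dim))

# Decode 5 steps: each step computes K/V only for the NEW token,
# appends them to the cache, and attends over everything cached so far.
for step in range(5):
    x = rng.standard_normal(dim)   # new token's representation
    q, k, v = x, x, x              # identity projections for brevity
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    out = attend(q, K_cache, V_cache)

print(K_cache.shape)  # (5, 8)
```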

Benchmarking the Difference

With caching enabled:
%%timeit -n 1
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=100,
  use_cache=True
)
# Result: 6.66 s ± 2.22 s per loop
With caching disabled:
%%timeit -n 1
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=100,
  use_cache=False
)
# Result: 21.9 s ± 94.6 ms per loop
KV caching provides a 3.3x speedup! This optimization is critical for making LLMs practical for real-time applications.

Memory Considerations

While KV caching speeds up generation, it requires memory:
KV Cache Size = 2 × num_layers × batch_size × max_length × hidden_dim × precision

For Phi-3 generating 1000 tokens:
= 2 × 32 × 1 × 1000 × 3072 × 2 bytes (float16)
= ~375 MB per sequence
For large models and long sequences, KV cache can consume significant VRAM. This is why context length is often limited in production systems.
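The formula above is easy to wrap in a small helper for experimenting with other configurations:

```python
def kv_cache_bytes(num_layers, batch, seq_len, hidden_dim, bytes_per_elem=2):
    # 2x for Keys and Values, stored at every layer for every position.
    return 2 * num_layers * batch * seq_len * hidden_dim * bytes_per_elem

# Phi-3 generating 1000 tokens in float16:
mb = kv_cache_bytes(num_layers=32, batch=1, seq_len=1000, hidden_dim=3072) / 1024**2
print(f"{mb:.0f} MB")  # 375 MB
```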

Model Parameters

Let’s count the parameters in Phi-3:
Embedding layer: 32,064 (vocab) × 3,072 (dim) ≈ 98.5M parameters

Per transformer layer:
  • Attention QKV: 3,072 × 9,216 = 28.3M
  • Attention output: 3,072 × 3,072 = 9.4M
  • MLP gate/up: 3,072 × 16,384 = 50.3M
  • MLP down: 8,192 × 3,072 = 25.2M
Total per layer: ~113M parameters

All 32 layers: 32 × 113M ≈ 3.6B parameters

Language model head: 3,072 (dim) × 32,064 (vocab) ≈ 98.5M parameters

Total: 98.5M (embedding) + 3.6B (layers) + 98.5M (lm_head)
≈ 3.8 billion parameters
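The arithmetic above can be reproduced in a few lines (the small RMSNorm weight vectors are ignored):

```python
vocab, dim, inter, layers = 32064, 3072, 8192, 32

embedding = vocab * dim
per_layer = (dim * 3 * dim       # qkv_proj (Q, K, V fused)
             + dim * dim         # o_proj
             + dim * 2 * inter   # gate_up_proj (gate and up fused)
             + inter * dim)      # down_proj
lm_head = dim * vocab

total = embedding + layers * per_layer + lm_head
print(f"{total / 1e9:.2f}B parameters")  # 3.82B parameters
```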

Understanding Attention Patterns

Different layers learn different patterns:

Early Layers

Focus on syntax and local patterns (nearby words)

Middle Layers

Capture semantic relationships and facts

Late Layers

Handle high-level reasoning and task-specific patterns

Final Layer

Prepares information for token prediction

Advanced Topics

Temperature and Sampling

When do_sample=True, the model doesn’t just pick the highest probability token:
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    do_sample=True,
    temperature=0.7,  # Lower = more focused, Higher = more creative
    top_p=0.9,        # Nucleus sampling
    top_k=50,         # Top-k sampling
)
Temperature scales the logits before applying softmax:
  • Temperature = 0.1: sharply peaked, near-deterministic
  • Temperature = 1.0: the unscaled softmax distribution
  • Temperature = 2.0: much flatter, more random and creative
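You can see the effect of temperature directly by applying it to a few toy logits:

```python
import numpy as np

def softmax_with_temperature(logits, t):
    # Divide logits by temperature before the usual softmax.
    scaled = np.asarray(logits, dtype=float) / t
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

logits = [2.0, 1.0, 0.5]
for t in (0.1, 1.0, 2.0):
    print(t, softmax_with_temperature(logits, t).round(3))
# Low temperature concentrates nearly all probability on the top logit;
# high temperature flattens the distribution toward uniform.
```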

Batch Processing

The model can process multiple sequences simultaneously:
prompts = [
    "The capital of France is",
    "The capital of Japan is",
    "The capital of Brazil is"
]

# Decoder-only models should be padded on the left for batched generation
tokenizer.padding_side = "left"

inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=5)

for i, output in enumerate(outputs):
    # Slice off the (padded) prompt so only the generated text is printed
    print(f"{prompts[i]} {tokenizer.decode(output[len(inputs.input_ids[i]):], skip_special_tokens=True)}")

Practical Applications

Understanding the internals enables advanced techniques:

Prompt Engineering

Knowing attention mechanisms helps craft better prompts

Fine-tuning

Target specific layers for parameter-efficient training

Inference Optimization

Optimize KV cache and batch sizes

Model Compression

Prune layers or reduce dimensions strategically

Performance Tips

1. Use Flash Attention: install flash-attention for a 2-3x speedup in attention computation
2. Enable torch.compile(): PyTorch 2.0+ can compile the model for faster execution
3. Optimize batch size: larger batches improve GPU utilization but require more memory
4. Use mixed precision: FP16 or BF16 reduces memory and increases speed with minimal quality loss

Next Steps

Now that you understand how LLMs work internally, you’re ready to apply them to real-world tasks!

Chapter 4: Text Classification

Learn how to use LLMs for classification tasks

Chapter 5: Text Clustering

Explore unsupervised learning with embeddings
