Overview
This chapter introduces you to the world of Large Language Models (LLMs) through hands-on exploration. You’ll learn how to load and run a modern LLM, understand the basic workflow of text generation, and get your first taste of working with the Hugging Face Transformers library.
We recommend using a GPU for running the examples in this chapter. If you’re using Google Colab, go to Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4.
Learning Objectives
By the end of this chapter, you will:
- Understand the basic architecture and workflow of LLMs
- Know how to load pre-trained models using Hugging Face Transformers
- Be able to generate text using pipelines
- Recognize the difference between models and tokenizers
Getting Started with Phi-3
We’ll use Microsoft’s Phi-3 model, a compact yet powerful language model that can run efficiently on consumer hardware.
Setting Up Your Environment
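The setup might look like the following sketch. It assumes `transformers`, `torch`, and `accelerate` are installed (e.g. via `pip install transformers torch accelerate`); the function name is illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "microsoft/Phi-3-mini-4k-instruct"

def load_model_and_tokenizer(model_id: str = MODEL_ID):
    # Downloads the pre-trained weights on first use (~7.6GB in float16)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="cuda",   # change to "cpu" if no GPU is available
        torch_dtype="auto",  # selects float16 on supported GPUs
    )
    # The tokenizer must match the model, so load it from the same checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return model, tokenizer
```

Loading both pieces from the same `model_id` guarantees the tokenizer matches the model’s vocabulary.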
Understanding the Components
What is a Model?
The model is the neural network that has been trained on vast amounts of text data. It contains billions of parameters that encode patterns and knowledge about language. When you load a model with AutoModelForCausalLM, you’re downloading these pre-trained weights.
Why 'CausalLM'?
The “Causal” in AutoModelForCausalLM refers to causal language modeling, where the model predicts the next token based only on previous tokens (left-to-right generation). This is in contrast to masked language models like BERT, which can see context in both directions.
What is a Tokenizer?
The tokenizer converts text into numbers (tokens) that the model can process, and converts the model’s numerical output back into readable text. Different models use different tokenization strategies, so it’s important to use the matching tokenizer for your model.
Generating Your First Text
Now let’s generate some text! We’ll ask the model to create a funny joke:
Key Parameters Explained
When creating the pipeline, we specified several important parameters:
- return_full_text: Set to False to return only the generated text, not the prompt
- max_new_tokens: Limits the number of new tokens to generate (controls output length)
- do_sample: Set to False for deterministic (greedy) generation, True for random sampling
- device_map: Specifies where to load the model ("cuda" for GPU, "cpu" for CPU)
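Putting these parameters together, creating the pipeline and asking for a joke might look like this sketch (the function names, prompt wording, and token limit are illustrative; it assumes a model and tokenizer have already been loaded):

```python
from transformers import pipeline

# Generation settings matching the parameters described above
GENERATION_KWARGS = {
    "return_full_text": False,  # drop the prompt from the output
    "max_new_tokens": 500,      # cap the length of the completion
    "do_sample": False,         # greedy decoding: same output every run
}

def make_generator(model, tokenizer):
    # Bundle model + tokenizer into a ready-to-use text-generation pipeline
    return pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        **GENERATION_KWARGS,
    )

def tell_a_joke(generator):
    # Chat-style prompt: a list of role/content messages
    messages = [{"role": "user", "content": "Create a funny joke about chickens."}]
    return generator(messages)[0]["generated_text"]
```

With do_sample=False, running tell_a_joke twice returns the identical joke, which is useful when you want reproducible output.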
Understanding the Workflow
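Under the hood, the pipeline repeats a simple tokenize/predict/append loop. The following greedy-decoding sketch makes that loop explicit (the function is illustrative and assumes a model and tokenizer are already loaded):

```python
import torch

def generate_greedy(model, tokenizer, prompt: str, max_new_tokens: int = 20) -> str:
    # 1. Tokenize: convert the prompt text into token IDs
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    for _ in range(max_new_tokens):
        # 2. Forward pass: the model scores every token in the vocabulary
        logits = model(input_ids).logits
        # 3. Greedy choice: pick the highest-scoring next token
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        # 4. Stop early if the model emits its end-of-sequence token
        if next_id.item() == tokenizer.eos_token_id:
            break
        # 5. Append the new token and repeat
        input_ids = torch.cat([input_ids, next_id], dim=-1)
    # 6. Decode: convert token IDs back into readable text
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```

The real pipeline adds optimizations such as key-value caching, but the shape of the loop is the same.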
The text generation process follows these steps:
1. The tokenizer converts your prompt into token IDs
2. The model predicts the next token from the tokens seen so far, one token at a time
3. Generation stops when max_new_tokens is reached or an end-of-sequence token is produced
4. The tokenizer decodes the generated token IDs back into text
Common Use Cases
LLMs like Phi-3 can be used for a wide variety of tasks:
- Content Generation: Writing articles, stories, or creative content
- Question Answering: Providing informative responses to questions
- Code Generation: Writing and explaining code snippets
- Text Transformation: Summarizing, translating, or reformatting text
- Conversational AI: Building chatbots and virtual assistants
Hardware Considerations
Model Size: Phi-3-mini-4k-instruct is approximately 3.8B parameters, requiring around 7.6GB of VRAM when loaded in float16 precision.
Recommended Hardware:
- GPU: NVIDIA T4 or better (16GB+ VRAM recommended)
- RAM: 16GB+ system memory
- Storage: 10GB+ for model files
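The 7.6GB figure above comes from simple arithmetic: parameter count times bytes per parameter. A small helper (the function name is our own) makes the estimate easy to redo for other models or precisions:

```python
def vram_estimate_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Rough VRAM needed just to hold the weights.

    Ignores activation memory and the KV cache, so treat it as a lower bound.
    bytes_per_param: 2 for float16, 4 for float32, 1 for 8-bit quantization.
    """
    return n_params * bytes_per_param / 1e9

# Phi-3-mini: ~3.8B parameters in float16 (2 bytes each)
print(vram_estimate_gb(3.8e9))  # → 7.6
```

The same model in float32 would need about 15.2GB, which is why float16 loading is the default choice on consumer GPUs.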
Troubleshooting
Out of Memory Errors
If you encounter CUDA out of memory errors, try:
- Using a smaller model
- Reducing max_new_tokens
- Loading the model in 8-bit or 4-bit precision using load_in_8bit=True
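The current Transformers API expresses the 8-bit option through a quantization config object rather than a bare keyword argument. A sketch (the function name is illustrative):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def load_phi3_8bit(model_id: str = "microsoft/Phi-3-mini-4k-instruct"):
    # 8-bit weights need roughly half the VRAM of float16 (~3.8GB vs ~7.6GB)
    quant_config = BitsAndBytesConfig(load_in_8bit=True)
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",  # let accelerate place layers automatically
    )
```

This requires the bitsandbytes package (`pip install bitsandbytes`) and a CUDA GPU; generation quality is usually close to float16, at a small speed cost.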
Slow Generation
To speed up generation:
- Ensure you’re using a GPU (device_map="cuda")
- Use Flash Attention if available
- Consider using torch.compile() for newer PyTorch versions
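The torch.compile() option is a one-line change. A sketch (the wrapper function is our own naming):

```python
import torch

def speed_up(model):
    # torch.compile (PyTorch 2.0+) traces the forward pass and fuses kernels;
    # the first call pays a compilation cost, later calls run faster.
    return torch.compile(model)
```

Compilation helps most when you generate many sequences with the same model, since the one-time compile cost is amortized across calls.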
Model Download Issues
If the model fails to download:
- Check your internet connection
- Verify you have enough disk space
- Try using a Hugging Face access token for rate-limited downloads
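Checking disk space before a multi-gigabyte download can be done from the standard library alone (the function names and the 10GB threshold are our own, matching the storage recommendation above):

```python
import shutil

def free_gb(path: str = ".") -> float:
    """Free disk space at `path`, in gigabytes."""
    return shutil.disk_usage(path).free / 1e9

def enough_space_for_phi3(path: str = ".", required_gb: float = 10.0) -> bool:
    # Phi-3-mini needs roughly 8GB for its weights; 10GB leaves headroom
    return free_gb(path) >= required_gb
```

Note that Hugging Face caches models under ~/.cache/huggingface by default, so check the drive that holds your home directory.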
Next Steps
Now that you understand the basics of loading and running an LLM, you’re ready to dive deeper into how these models work internally.
Chapter 2: Tokens and Token Embeddings
Learn how text is converted into numerical representations that LLMs can process
