
Overview

This chapter introduces you to the world of Large Language Models (LLMs) through hands-on exploration. You’ll learn how to load and run a modern LLM, understand the basic workflow of text generation, and get your first taste of working with the Hugging Face Transformers library.
We recommend using a GPU for running the examples in this chapter. If you’re using Google Colab, go to Runtime > Change runtime type > Hardware accelerator and select a T4 GPU.

Learning Objectives

By the end of this chapter, you will:
  • Understand the basic architecture and workflow of LLMs
  • Know how to load pre-trained models using Hugging Face Transformers
  • Be able to generate text using pipelines
  • Recognize the difference between models and tokenizers

Getting Started with Phi-3

We’ll use Microsoft’s Phi-3 model, a compact yet powerful language model that can run efficiently on consumer hardware.

Setting Up Your Environment

Step 1: Install Dependencies

First, install the required packages:
pip install transformers==4.41.2 accelerate==0.31.0

Step 2: Load the Model and Tokenizer

Load both the model and tokenizer from Hugging Face:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=False,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

Step 3: Create a Pipeline

Wrap the model in a pipeline for easier text generation:
from transformers import pipeline

# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=500,
    do_sample=False
)

Understanding the Components

What is a Model?

The model is the neural network that has been trained on vast amounts of text data. It contains billions of parameters that encode patterns and knowledge about language. When you load a model with AutoModelForCausalLM, you’re downloading these pre-trained weights.
The “Causal” in AutoModelForCausalLM refers to causal language modeling, where the model predicts the next token based only on previous tokens (left-to-right generation). This is in contrast to masked language models like BERT, which can see context in both directions.
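To make the left-to-right idea concrete, here is a toy, pure-Python sketch of causal next-token prediction using bigram counts. This is an illustration of the concept only, not how Phi-3 works internally; a real LLM replaces the frequency table with a neural network trained on billions of tokens.

```python
from collections import Counter, defaultdict

# Toy corpus; a real LLM learns from billions of tokens, not one sentence.
corpus = "the chicken crossed the road because the chicken was brave".split()

# Count bigrams: for each token, how often each next token follows it.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(token):
    """Greedy causal prediction: pick the most frequent follower."""
    return bigrams[token].most_common(1)[0][0]

print(predict_next("the"))  # "chicken" follows "the" most often here
```

Like a causal LLM, this predictor only looks backward at what came before; unlike BERT-style masked models, it never sees tokens to the right of the position it is predicting.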

What is a Tokenizer?

The tokenizer converts text into numbers (tokens) that the model can process, and converts the model’s numerical output back into readable text. Different models use different tokenization strategies, so it’s important to use the matching tokenizer for your model.
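The idea can be shown with a toy word-level tokenizer in plain Python. Real tokenizers (including Phi-3’s) use subword units such as byte-pair encoding and much larger vocabularies, but the encode/decode round trip is the same:

```python
# A toy word-level tokenizer; real tokenizers use subword units (e.g., BPE).
vocab = {"why": 0, "did": 1, "the": 2, "chicken": 3, "cross": 4, "road": 5}
inv_vocab = {i: w for w, i in vocab.items()}

def encode(text):
    """Text -> token IDs (what the model consumes)."""
    return [vocab[w] for w in text.lower().split()]

def decode(ids):
    """Token IDs -> text (what you read back)."""
    return " ".join(inv_vocab[i] for i in ids)

ids = encode("why did the chicken cross the road")
print(ids)          # [0, 1, 2, 3, 4, 2, 5]
print(decode(ids))  # round-trips back to the original text
```

Because the ID assignments are arbitrary per vocabulary, decoding with a mismatched tokenizer would produce garbage; this is why the tokenizer must always match the model.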

Generating Your First Text

Now let’s generate some text! We’ll ask the model to create a funny joke:
# The prompt (user input / query)
messages = [
    {"role": "user", "content": "Create a funny joke about chickens."}
]

# Generate output
output = generator(messages)
print(output[0]["generated_text"])
Output:
Why did the chicken join the band? Because it had the drumsticks!
The messages format with roles (“user”, “assistant”) is a common pattern for instruction-tuned models. It helps the model understand the conversational context.
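Under the hood, the pipeline flattens the message list into a single string using the model’s chat template (via the tokenizer’s apply_chat_template method). The sketch below uses a made-up template format purely to show the flattening idea; the real template is model-specific:

```python
# A hypothetical chat template for illustration only; the real one comes
# from tokenizer.apply_chat_template and differs per model.
def apply_toy_template(messages):
    parts = [f"<|{m['role']}|>\n{m['content']}<|end|>" for m in messages]
    # End with the assistant tag so the model continues as the assistant.
    return "\n".join(parts) + "\n<|assistant|>"

messages = [
    {"role": "user", "content": "Create a funny joke about chickens."}
]
print(apply_toy_template(messages))
```

The trailing assistant marker is what cues an instruction-tuned model to generate a reply rather than continue the user’s text.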

Key Parameters Explained

When loading the model and creating the pipeline, we specified several important parameters:

  • return_full_text: Set to False to return only the generated text, not the prompt
  • max_new_tokens: Limits the number of new tokens to generate (controls output length)
  • do_sample: Set to False for deterministic (greedy) generation, True for random sampling
  • device_map: Specifies where to load the model ("cuda" for GPU, "cpu" for CPU); this one is passed to from_pretrained when loading the model
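The difference between greedy decoding and sampling can be illustrated with a toy next-token distribution in plain Python (no model involved; the tokens and probabilities are invented for the example):

```python
import random

# Invented probability distribution over candidate next tokens.
next_token_probs = {"drumsticks": 0.6, "feathers": 0.3, "eggs": 0.1}

# do_sample=False: greedy decoding always picks the most likely token.
greedy = max(next_token_probs, key=next_token_probs.get)

# do_sample=True: draw a token in proportion to its probability.
sampled = random.choices(
    list(next_token_probs), weights=list(next_token_probs.values())
)[0]

print(greedy)   # always "drumsticks"
print(sampled)  # varies from run to run
```

Greedy decoding makes outputs reproducible, which is useful for tutorials and testing; sampling trades that determinism for more varied, creative output.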

Understanding the Workflow

The text generation process follows these steps:
  1. Tokenization: Your input text is converted into token IDs using the tokenizer
  2. Model Processing: The model processes these tokens and predicts the next token
  3. Generation Loop: This process repeats, with each new token being added to the input
  4. Decoding: The final token IDs are converted back to readable text
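The steps above can be sketched in plain Python with toy stand-ins for the tokenizer and model (the vocabulary, transition table, and stop token are all invented for the example; a real model produces a probability distribution over its whole vocabulary at each step):

```python
# Toy stand-ins for the real tokenizer and model.
vocab = ["<eos>", "why", "did", "the", "chicken", "cross", "road"]
encode = lambda text: [vocab.index(w) for w in text.split()]
decode = lambda ids: " ".join(vocab[i] for i in ids)

def toy_model(token_ids):
    """Stand-in 'model': deterministically maps the last token to the next."""
    transitions = {1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 0}
    return transitions.get(token_ids[-1], 0)

# Step 1: tokenization
ids = encode("why did")
# Steps 2-3: the generation loop; each new token is appended to the input
for _ in range(10):  # cap like max_new_tokens
    next_id = toy_model(ids)
    if next_id == 0:  # stop at the end-of-sequence token
        break
    ids.append(next_id)
# Step 4: decoding
print(decode(ids))
```

Note that the whole growing sequence is fed back in at every step; this is why generating long outputs takes longer per response than short ones.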

Common Use Cases

LLMs like Phi-3 can be used for a wide variety of tasks:
  • Content Generation: Writing articles, stories, or creative content
  • Question Answering: Providing informative responses to questions
  • Code Generation: Writing and explaining code snippets
  • Text Transformation: Summarizing, translating, or reformatting text
  • Conversational AI: Building chatbots and virtual assistants

Hardware Considerations

Model Size: Phi-3-mini-4k-instruct has approximately 3.8B parameters, requiring around 7.6GB of VRAM when loaded in float16 precision.

Recommended Hardware:
  • GPU: NVIDIA T4 or better (16GB+ VRAM recommended)
  • RAM: 16GB+ system memory
  • Storage: 10GB+ for model files

Troubleshooting

If you encounter CUDA out of memory errors, try:
  • Using a smaller model
  • Reducing max_new_tokens
  • Loading the model in 8-bit or 4-bit precision via a quantization config (requires the bitsandbytes package)
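As a sketch of the quantized-loading option, the snippet below passes a BitsAndBytesConfig to from_pretrained. It assumes the bitsandbytes package is installed (pip install bitsandbytes) and a CUDA GPU is available; it is shown as a configuration fragment, not run here:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Requires the bitsandbytes package and a CUDA-capable GPU.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="auto",
    quantization_config=quant_config,
)
```

8-bit loading roughly halves the VRAM footprint compared with float16, at a small cost in output quality and speed.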
To speed up generation:
  • Ensure you’re using a GPU (device_map="cuda")
  • Use Flash Attention if available
  • Consider using torch.compile() for newer PyTorch versions
If the model fails to download:
  • Check your internet connection
  • Verify you have enough disk space
  • Try using a Hugging Face access token for rate-limited downloads

Next Steps

Now that you understand the basics of loading and running an LLM, you’re ready to dive deeper into how these models work internally.

Chapter 2: Tokens and Token Embeddings

Learn how text is converted into numerical representations that LLMs can process
