Overview
This chapter introduces you to the world of Large Language Models (LLMs) through hands-on exploration. You’ll learn how to load and run a modern LLM, understand the basic workflow of text generation, and get your first taste of working with the Hugging Face Transformers library.
We recommend using a GPU for running the examples in this chapter. If you’re using Google Colab, go to Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4.
Learning Objectives
By the end of this chapter, you will:
- Understand the basic architecture and workflow of LLMs
- Know how to load pre-trained models using Hugging Face Transformers
- Be able to generate text using pipelines
- Recognize the difference between models and tokenizers
Getting Started with Phi-3
We’ll use Microsoft’s Phi-3 model, a compact yet powerful language model that can run efficiently on consumer hardware.
Setting Up Your Environment
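The setup might look like the following sketch. It assumes `transformers`, `torch`, and `accelerate` are installed (e.g. via `pip install transformers torch accelerate`); the function name is illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "microsoft/Phi-3-mini-4k-instruct"

def load_model_and_tokenizer(model_id: str = MODEL_ID):
    # Downloads the pre-trained weights on first use (~7.6GB in float16)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="cuda",   # change to "cpu" if no GPU is available
        torch_dtype="auto",  # selects float16 on supported GPUs
    )
    # The tokenizer must match the model, so load it from the same checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return model, tokenizer
```

Loading both pieces from the same `model_id` guarantees the tokenizer matches the model’s vocabulary.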
Understanding the Components
What is a Model?
The model is the neural network that has been trained on vast amounts of text data. It contains billions of parameters that encode patterns and knowledge about language. When you load a model with AutoModelForCausalLM, you’re downloading these pre-trained weights.
Why 'CausalLM'?
The “Causal” in AutoModelForCausalLM refers to causal language modeling, where the model predicts the next token based only on previous tokens (left-to-right generation). This is in contrast to masked language models like BERT, which can see context in both directions.
What is a Tokenizer?
The tokenizer converts text into numbers (tokens) that the model can process, and converts the model’s numerical output back into readable text. Different models use different tokenization strategies, so it’s important to use the matching tokenizer for your model.
Generating Your First Text
Now let’s generate some text! We’ll ask the model to create a funny joke:
Key Parameters Explained
When creating the pipeline, we specified several important parameters:
- return_full_text: Set to False to return only the generated text, not the prompt
- max_new_tokens: Limits the number of new tokens to generate (controls output length)
- do_sample: Set to False for deterministic (greedy) generation, True for random sampling
- device_map: Specifies where to load the model ("cuda" for GPU, "cpu" for CPU)
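Putting these parameters together, creating the pipeline and asking for a joke might look like this sketch (the function names, prompt wording, and token limit are illustrative; it assumes a model and tokenizer have already been loaded):

```python
from transformers import pipeline

# Generation settings matching the parameters described above
GENERATION_KWARGS = {
    "return_full_text": False,  # drop the prompt from the output
    "max_new_tokens": 500,      # cap the length of the completion
    "do_sample": False,         # greedy decoding: same output every run
}

def make_generator(model, tokenizer):
    # Bundle model + tokenizer into a ready-to-use text-generation pipeline
    return pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        **GENERATION_KWARGS,
    )

def tell_a_joke(generator):
    # Chat-style prompt: a list of role/content messages
    messages = [{"role": "user", "content": "Create a funny joke about chickens."}]
    return generator(messages)[0]["generated_text"]
```

With do_sample=False, running tell_a_joke twice returns the identical joke, which is useful when you want reproducible output.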
Understanding the Workflow
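Under the hood, the pipeline repeats a simple tokenize/predict/append loop. The following greedy-decoding sketch makes that loop explicit (the function is illustrative and assumes a model and tokenizer are already loaded):

```python
import torch

def generate_greedy(model, tokenizer, prompt: str, max_new_tokens: int = 20) -> str:
    # 1. Tokenize: convert the prompt text into token IDs
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    for _ in range(max_new_tokens):
        # 2. Forward pass: the model scores every token in the vocabulary
        logits = model(input_ids).logits
        # 3. Greedy choice: pick the highest-scoring next token
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        # 4. Stop early if the model emits its end-of-sequence token
        if next_id.item() == tokenizer.eos_token_id:
            break
        # 5. Append the new token and repeat
        input_ids = torch.cat([input_ids, next_id], dim=-1)
    # 6. Decode: convert token IDs back into readable text
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```

The real pipeline adds optimizations such as key-value caching, but the shape of the loop is the same.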
The text generation process follows these steps:
1. The tokenizer converts your prompt into token IDs
2. The model predicts the next token from the tokens seen so far, one token at a time
3. Generation stops when max_new_tokens is reached or an end-of-sequence token is produced
4. The tokenizer decodes the generated token IDs back into text
Common Use Cases
LLMs like Phi-3 can be used for a wide variety of tasks:
- Content Generation: Writing articles, stories, or creative content
- Question Answering: Providing informative responses to questions
- Code Generation: Writing and explaining code snippets
- Text Transformation: Summarizing, translating, or reformatting text
- Conversational AI: Building chatbots and virtual assistants
Hardware Considerations
Model Size: Phi-3-mini-4k-instruct is approximately 3.8B parameters, requiring around 7.6GB of VRAM when loaded in float16 precision.
Recommended Hardware:
- GPU: NVIDIA T4 or better (16GB+ VRAM recommended)
- RAM: 16GB+ system memory
- Storage: 10GB+ for model files
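The 7.6GB figure above comes from simple arithmetic: parameter count times bytes per parameter. A small helper (the function name is our own) makes the estimate easy to redo for other models or precisions:

```python
def vram_estimate_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Rough VRAM needed just to hold the weights.

    Ignores activation memory and the KV cache, so treat it as a lower bound.
    bytes_per_param: 2 for float16, 4 for float32, 1 for 8-bit quantization.
    """
    return n_params * bytes_per_param / 1e9

# Phi-3-mini: ~3.8B parameters in float16 (2 bytes each)
print(vram_estimate_gb(3.8e9))  # → 7.6
```

The same model in float32 would need about 15.2GB, which is why float16 loading is the default choice on consumer GPUs.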
Troubleshooting
Out of Memory Errors
If you encounter CUDA out of memory errors, try:
- Using a smaller model
- Reducing max_new_tokens
- Loading the model in 8-bit or 4-bit precision using load_in_8bit=True
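The current Transformers API expresses the 8-bit option through a quantization config object rather than a bare keyword argument. A sketch (the function name is illustrative):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def load_phi3_8bit(model_id: str = "microsoft/Phi-3-mini-4k-instruct"):
    # 8-bit weights need roughly half the VRAM of float16 (~3.8GB vs ~7.6GB)
    quant_config = BitsAndBytesConfig(load_in_8bit=True)
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",  # let accelerate place layers automatically
    )
```

This requires the bitsandbytes package (`pip install bitsandbytes`) and a CUDA GPU; generation quality is usually close to float16, at a small speed cost.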
Slow Generation
To speed up generation:
- Ensure you’re using a GPU (device_map="cuda")
- Use Flash Attention if available
- Consider using torch.compile() for newer PyTorch versions
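The torch.compile() option is a one-line change. A sketch (the wrapper function is our own naming):

```python
import torch

def speed_up(model):
    # torch.compile (PyTorch 2.0+) traces the forward pass and fuses kernels;
    # the first call pays a compilation cost, later calls run faster.
    return torch.compile(model)
```

Compilation helps most when you generate many sequences with the same model, since the one-time compile cost is amortized across calls.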
Model Download Issues
If the model fails to download:
- Check your internet connection
- Verify you have enough disk space
- Try using a Hugging Face access token for rate-limited downloads
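Checking disk space before a multi-gigabyte download can be done from the standard library alone (the function names and the 10GB threshold are our own, matching the storage recommendation above):

```python
import shutil

def free_gb(path: str = ".") -> float:
    """Free disk space at `path`, in gigabytes."""
    return shutil.disk_usage(path).free / 1e9

def enough_space_for_phi3(path: str = ".", required_gb: float = 10.0) -> bool:
    # Phi-3-mini needs roughly 8GB for its weights; 10GB leaves headroom
    return free_gb(path) >= required_gb
```

Note that Hugging Face caches models under ~/.cache/huggingface by default, so check the drive that holds your home directory.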
Next Steps
Now that you understand the basics of loading and running an LLM, you’re ready to dive deeper into how these models work internally.
Chapter 2: Tokens and Token Embeddings
Learn how text is converted into numerical representations that LLMs can process
