What you’ll learn
By the end of this tutorial, you’ll know how to:
- Understand when to use GRPO vs supervised fine-tuning
- Set up your environment for GRPO training
- Prepare datasets for reinforcement learning
- Design reward functions for verifiable tasks
- Train models with GRPO for improved reasoning
- Test and evaluate your fine-tuned model
- Export models for deployment
Prerequisites
- GPU: This tutorial requires a GPU with at least 16GB memory. You can run it for free on Google Colab using an NVIDIA T4 GPU
- Python: Python 3.8 or higher
- Basic knowledge: Understanding of reinforcement learning concepts is helpful
What is GRPO?
Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm that:
- Trains models by rewarding correct outputs and penalizing incorrect ones
- Uses programmatic verification to evaluate model responses
- Improves performance on tasks with clear success criteria
- Is more sample-efficient than traditional RL approaches
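The “group relative” part can be illustrated with a minimal sketch (this is an assumption-laden simplification, not the full GRPO objective, which also involves a clipped policy ratio and a KL penalty): for each prompt, several completions are sampled, and each completion’s advantage is its reward relative to the group mean, typically normalized by the group’s standard deviation.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Illustrative sketch of group-relative advantages.

    Each completion sampled for the same prompt gets an advantage equal
    to its reward minus the group mean, scaled by the group standard
    deviation. Completions that beat their siblings get positive
    advantages; worse-than-average ones get negative advantages.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Four completions for one prompt: two correct (reward 1.0), two incorrect (0.0)
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)
```

Because advantages are computed within each group, GRPO needs no separate value model, which is one reason it is comparatively sample- and memory-efficient.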
When to use GRPO
GRPO is ideal for verifiable tasks where you can programmatically evaluate correctness:
- Mathematical reasoning: Check if the final answer matches the expected result
- Code generation: Run unit tests to verify code correctness
- Structured output: Validate JSON/SQL against schemas
- Question answering: Compare against ground truth answers
Avoid GRPO for tasks where correctness cannot be verified programmatically:
- Open-ended creative tasks without clear success metrics
- Subjective tasks like style transfer or creative writing
- Simple instruction following (use SFT instead)
Tutorial overview
The tutorial covers the following steps:
- Installation: Set up Unsloth, vLLM, and required dependencies
- Pre-training phase: Format adaptation through SFT
- Data preparation: Load and format the Open R1 Math dataset
- Reward function: Design a reward function for mathematical correctness
- GRPO training: Configure and run reinforcement learning training
- Inference: Test your model’s reasoning capabilities
- Export: Save your model for deployment
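The data-preparation step boils down to turning each raw dataset row into a chat-style prompt plus the ground-truth answer the reward function will check. A minimal sketch is shown below; the field names `problem` and `answer` and the `<reasoning>`/`<answer>` tag format are assumptions for illustration, not the notebook’s exact schema.

```python
# A system prompt that fixes the output format the SFT phase will teach.
SYSTEM_PROMPT = (
    "Respond in the following format:\n"
    "<reasoning>...</reasoning>\n"
    "<answer>...</answer>"
)

def format_example(example: dict) -> dict:
    """Convert one raw dataset row into a chat prompt plus the expected
    answer. Field names 'problem' and 'answer' are assumed; adjust them
    to match the actual dataset columns.
    """
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["problem"]},
        ],
        "answer": example["answer"],
    }

row = {"problem": "What is 7 * 6?", "answer": "42"}
formatted = format_example(row)
print(formatted["prompt"][1]["content"])
```

Keeping the ground-truth answer alongside the prompt lets the reward function score each completion without re-loading the dataset.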
Key concepts
Two-phase training approach
The tutorial uses a two-phase approach:
Phase 1 - Format adaptation (SFT):
- Pre-trains the model on the output format
- Helps GRPO focus on correctness rather than formatting
- Speeds up overall training time
Phase 2 - Reasoning improvement (GRPO):
- Uses reward signals to improve reasoning
- Learns from both successful and failed attempts
- Optimizes for task accuracy
Reward function design
The reward function evaluates model outputs. A good reward function should:
- Provide clear signals (correct vs incorrect)
- Be deterministic and consistent
- Execute quickly for training efficiency
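These three properties can be satisfied with a very small function. The sketch below assumes the `<answer>...</answer>` tag convention and binary 1.0/0.0 rewards; both are illustrative choices, not the notebook’s exact code.

```python
import re

def correctness_reward(completion: str, expected: str) -> float:
    """Binary reward for mathematical correctness.

    Extracts the text inside <answer>...</answer> tags (the format the
    SFT phase teaches) and compares it to the ground truth. The check is
    deterministic, runs in microseconds, and yields an unambiguous
    1.0 / 0.0 signal.
    """
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0  # unparseable output earns no reward
    return 1.0 if match.group(1).strip() == expected.strip() else 0.0

print(correctness_reward("<reasoning>7*6=42</reasoning><answer>42</answer>", "42"))  # 1.0
print(correctness_reward("<answer>41</answer>", "42"))  # 0.0
```

In practice, tutorials often combine several such functions (e.g. a small reward for matching the format plus a larger one for the correct answer) so the model gets partial credit while it is still learning the output structure.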
vLLM integration
The tutorial uses vLLM for efficient inference during training:
- Generates multiple completions in parallel
- Reduces training time through batched generation
- Enables efficient policy rollouts
Deployment options
After fine-tuning, you can deploy your model to:
- Mobile: Android and iOS apps using the LEAP SDK
- Desktop: Mac (MLX), Windows/Linux (llama.cpp, Ollama, LM Studio)
- Cloud: vLLM, Modal, Baseten, Fal for production deployments
- Edge: On-device inference for low-latency applications
Run the tutorial
You can run this tutorial in two ways:
- Google Colab (recommended): Click the “Open in Colab” badge at the top
- Local environment: Clone the LFM Cookbook repository and run the notebook locally
Access the notebook
The complete notebook is available at:
- GitHub: grpo_with_unsloth.ipynb
- Colab: Click the badge above to open directly in Google Colab
Expected results
After GRPO training, you can expect:
- Improved accuracy on mathematical reasoning tasks (typically 10-20% improvement)
- Better reasoning chains with clearer step-by-step solutions
- Reduced hallucinations on verifiable answers
- More consistent outputs following the trained format
Next steps
After completing this tutorial, you can:
- Try GRPO for other verifiable tasks
- Apply the same approach to code generation or structured output tasks
- Experiment with different reward function designs
- Deploy your model using the inference guides