What you’ll learn
By the end of this tutorial, you’ll know how to:
- Identify tasks suitable for GRPO training
- Design verification functions for different task types
- Set up TRL for GRPO training
- Configure training parameters for optimal results
- Evaluate model performance on verifiable tasks
- Export and deploy your fine-tuned model
Prerequisites
- GPU: This tutorial requires a GPU with at least 16GB memory. You can run it for free on Google Colab using an NVIDIA T4 GPU
- Python: Python 3.8 or higher
- Basic knowledge: Understanding of reinforcement learning concepts is helpful
What are verifiable tasks?
Verifiable tasks are problems where you can programmatically check whether an output is correct. Examples include:
Mathematical problem solving
- Extract the final numeric answer
- Compare with ground truth
- Binary verification (correct/incorrect)
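As a concrete illustration, a minimal verifier for numeric answers might extract the last number in the model's output and compare it to the ground truth (the function name and tolerance below are assumptions for this sketch, not part of the tutorial's code):

```python
import re

def verify_math_answer(completion: str, ground_truth: float, tol: float = 1e-6) -> bool:
    """Extract the final number in the completion and compare to ground truth."""
    # Match signed integers or decimals anywhere in the output.
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not matches:
        return False  # no numeric answer found
    # Treat the last number as the model's final answer.
    return abs(float(matches[-1]) - ground_truth) < tol
```

A reward function can then map `True` to 1.0 and `False` to 0.0, giving the binary signal GRPO needs.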
Code generation
- Run unit tests against generated code
- Check for compilation errors
- Verify output matches expected results
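One common pattern for checking generated code is to execute it in a fresh namespace and then run test assertions against it. The sketch below (function names are assumptions, and a real pipeline should sandbox execution rather than use bare `exec`) returns `True` only if the code compiles, runs, and passes every assertion:

```python
def verify_code(generated_code: str, test_code: str) -> bool:
    """Run generated code, then test assertions, in an isolated namespace."""
    namespace = {}
    try:
        exec(generated_code, namespace)  # define the candidate function(s)
        exec(test_code, namespace)       # assertions raise AssertionError on failure
        return True
    except Exception:
        return False  # compile error, runtime error, or failed test
```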
Structured output tasks
- Validate JSON against schema
- Check SQL syntax and semantics
- Verify XML/HTML structure
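For JSON outputs, even a lightweight check without a full JSON Schema library can verify that the text parses and contains the required fields with the expected types. The `required` mapping format here is an assumption chosen for illustration:

```python
import json

def verify_json(output: str, required: dict) -> bool:
    """Check that output parses as a JSON object with required typed fields."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    # required maps field name -> expected Python type, e.g. {"name": str}
    return all(isinstance(data.get(key), expected) for key, expected in required.items())
```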
Question answering
- Compare with ground truth answers
- Check for exact match or semantic similarity
- Verify factual correctness
Why use GRPO for these tasks?
GRPO offers several advantages for verifiable tasks:
- Direct optimization: Optimizes directly for task success, not just likelihood
- Sample efficiency: Learns from both successes and failures
- Better generalization: Encourages diverse solution strategies
- Reduced overfitting: Focuses on correctness rather than memorization
Tutorial overview
The tutorial covers the following steps:
- Installation: Set up TRL, PEFT, and required libraries
- Task selection: Choose a verifiable task (mathematical reasoning)
- Data preparation: Format data with verification labels
- Verification function: Implement programmatic checking
- Training setup: Configure GRPO with TRL
- Training: Run reinforcement learning training
- Evaluation: Test on held-out verification set
- Export: Save your model for deployment
Key concepts
GRPO algorithm
GRPO works by:
- Generating multiple candidate outputs
- Evaluating each with the verification function
- Computing rewards based on correctness
- Updating the policy to increase reward
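The reward-to-update step above can be sketched as computing group-relative advantages: each completion's reward is normalized against the mean and standard deviation of its group of generations. This is an illustrative sketch of that step, not TRL's internal implementation:

```python
import statistics

def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one prompt's group of generations."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    # Completions better than the group average get a positive advantage,
    # worse ones a negative advantage; eps guards against zero variance.
    return [(r - mean) / (std + eps) for r in rewards]
```

The policy update then increases the likelihood of completions with positive advantage and decreases it for those with negative advantage.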
Reward design principles
Good verification functions should be:
- Deterministic: Same input always gives same result
- Fast: Executes quickly for training efficiency
- Accurate: Correctly identifies success and failure
- Unbiased: Doesn’t favor specific solution patterns
Training configuration
Key hyperparameters for GRPO:
- Number of generations: How many outputs to generate per prompt (typically 4-8)
- Learning rate: Lower than SFT (typically 1e-6 to 1e-5)
- Batch size: Smaller due to multiple generations
- Reward scaling: Normalize rewards for stable training
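In TRL, these hyperparameters map onto `GRPOConfig`. The sketch below shows plausible values in line with the ranges above; the parameter names follow TRL's `GRPOConfig` at the time of writing, so verify them against your installed version:

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="grpo-math",
    num_generations=8,              # candidate outputs per prompt (typically 4-8)
    learning_rate=1e-6,             # lower than typical SFT learning rates
    per_device_train_batch_size=8,  # kept small: each prompt expands into num_generations completions
    max_completion_length=256,      # cap generation length for training efficiency
)
```

TRL normalizes rewards within each group internally, so reward scaling mostly reduces to keeping your verification function's outputs in a consistent range (e.g. 0.0/1.0).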
Libraries used
The tutorial uses:
- TRL: Hugging Face library for reinforcement learning
- PEFT: Parameter-efficient fine-tuning with LoRA
- bitsandbytes: Quantization for memory efficiency
- liger-kernel: Optimized training kernels
- trackio: Experiment tracking (optional)
Deployment options
After fine-tuning, you can deploy your model to:
- Mobile: Android and iOS apps using the LEAP SDK
- Desktop: Mac (MLX), Windows/Linux (llama.cpp, Ollama, LM Studio)
- Cloud: vLLM, Modal, Baseten, Fal for production deployments
- Edge: On-device inference for low-latency applications
Run the tutorial
You can run this tutorial in two ways:
- Google Colab (recommended): Click the “Open in Colab” badge at the top
- Local environment: Clone the LFM Cookbook repository and run the notebook locally
Access the notebook
The complete notebook is available at:
- GitHub: grpo_for_verifiable_tasks.ipynb
- Colab: Click the badge above to open directly in Google Colab
Adapting to your task
To apply GRPO to your own verifiable task:
- Define verification: Write a function that checks correctness
- Prepare data: Include verification labels in your dataset
- Adjust rewards: Scale rewards appropriately for your task
- Tune hyperparameters: Experiment with generation count and learning rate
- Evaluate thoroughly: Test on diverse examples
Next steps
After completing this tutorial, you can:
- Try GRPO with Unsloth for optimized training
- Apply GRPO to code generation or structured output tasks
- Experiment with different verification functions
- Deploy your model using the inference guides