Open In Colab

This tutorial shows you how to use Group Relative Policy Optimization (GRPO) for reinforcement learning-based fine-tuning with Unsloth. We'll train the LFM2.5-1.2B-Instruct model on mathematical reasoning tasks using the Open R1 Math dataset.

What you’ll learn

By the end of this tutorial, you'll be able to:
  • Decide when to use GRPO vs. supervised fine-tuning
  • Set up your environment for GRPO training
  • Prepare datasets for reinforcement learning
  • Design reward functions for verifiable tasks
  • Train models with GRPO for improved reasoning
  • Test and evaluate your fine-tuned model
  • Export models for deployment

Prerequisites

  • GPU: This tutorial requires a GPU with at least 16GB of memory. You can run it for free on Google Colab using an NVIDIA T4 GPU.
  • Python: Python 3.8 or higher
  • Basic knowledge: Understanding of reinforcement learning concepts is helpful

What is GRPO?

Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm that:
  • Trains models by rewarding correct outputs and penalizing incorrect ones
  • Uses programmatic verification to evaluate model responses
  • Improves performance on tasks with clear success criteria
  • Estimates advantages from groups of sampled completions rather than from a separate value (critic) model, making it more memory- and compute-efficient than PPO-style RL
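The "group relative" part can be sketched in a few lines: for each prompt, the trainer samples a group of completions, scores each one with the reward function, and turns those rewards into advantages by normalizing within the group. The helper below is a simplified illustration of that idea, not the actual Unsloth/TRL implementation:

```python
# Simplified sketch of GRPO's group-relative advantage (hypothetical helper,
# not the library internals): each completion's advantage is its reward
# relative to the other completions sampled for the same prompt.
def group_relative_advantages(rewards, eps=1e-6):
    """Normalize a group of rewards to zero mean and (roughly) unit variance."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# One correct completion (reward 1.0) among three incorrect ones: the correct
# one gets a positive advantage, the others negative, and they sum to zero.
print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))
```

Because advantages are computed within each group, a prompt where every sampled completion fails (or every one succeeds) contributes no gradient signal, which is why reward functions with clear pass/fail behavior work well here.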

When to use GRPO

GRPO is ideal for verifiable tasks where you can programmatically evaluate correctness:
  • Mathematical reasoning: Check if the final answer matches the expected result
  • Code generation: Run unit tests to verify code correctness
  • Structured output: Validate JSON/SQL against schemas
  • Question answering: Compare against ground truth answers
Don’t use GRPO for:
  • Open-ended creative tasks without clear success metrics
  • Subjective tasks like style transfer or creative writing
  • Simple instruction following (use SFT instead)
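For the structured-output case above, a verifier can be as simple as parsing the completion and checking its shape. The function below is a hypothetical example (the required key names are made up for illustration) of what a binary programmatic check looks like:

```python
import json

# Hypothetical verifier for the "structured output" case: reward the
# completion only if it parses as a JSON object containing required keys.
def json_reward(completion, required_keys=("name", "age")):
    try:
        obj = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0  # not valid JSON
    if not isinstance(obj, dict):
        return 0.0  # valid JSON, but not an object
    return 1.0 if all(k in obj for k in required_keys) else 0.0

print(json_reward('{"name": "Ada", "age": 36}'))  # 1.0
print(json_reward('not json'))                    # 0.0
```

The same pattern, parse then check, extends to SQL validation or unit-test execution for code generation.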

Tutorial overview

The tutorial covers the following steps:
  1. Installation: Set up Unsloth, vLLM, and required dependencies
  2. Pre-training phase: Format adaptation through SFT
  3. Data preparation: Load and format the Open R1 Math dataset
  4. Reward function: Design a reward function for mathematical correctness
  5. GRPO training: Configure and run reinforcement learning training
  6. Inference: Test your model’s reasoning capabilities
  7. Export: Save your model for deployment

Key concepts

Two-phase training approach

The tutorial uses a two-phase approach:

Phase 1 - Format adaptation (SFT):
  • Pre-trains the model on the output format
  • Helps GRPO focus on correctness rather than formatting
  • Speeds up overall training time

Phase 2 - GRPO training:
  • Uses reward signals to improve reasoning
  • Learns from both successful and failed attempts
  • Optimizes for task accuracy
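As a rough sketch of phase 1, each dataset row gets mapped into the chat format the model will later be rewarded for following. The field names ("problem", "solution"), the system prompt, and the \boxed{...} convention below are illustrative assumptions, not the dataset's guaranteed schema:

```python
# Hypothetical phase-1 formatting: convert a raw problem/solution pair into
# chat messages for SFT, so GRPO can later focus on correctness rather than
# on teaching the output format from scratch.
SYSTEM_PROMPT = (
    "Solve the problem step by step, then give the final answer as \\boxed{answer}."
)

def to_chat_example(row):
    """Map a dataset row to the messages format expected by SFT trainers."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": row["problem"]},
            {"role": "assistant", "content": row["solution"]},
        ]
    }

example = to_chat_example(
    {"problem": "What is 2 + 2?", "solution": "2 + 2 = \\boxed{4}"}
)
print(example["messages"][1]["content"])  # What is 2 + 2?
```

In practice you would apply a function like this over the whole dataset (e.g. with `dataset.map`) before running the SFT phase.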

Reward function design

The reward function evaluates model outputs:
def reward_function(response, expected_answer):
    """Binary reward: 1.0 if the model's final answer matches, else 0.0."""
    # Extract the final answer from the model's response
    # (extract_answer parses the completion, e.g. a \boxed{...} expression)
    predicted = extract_answer(response)
    # Compare with the expected answer, ignoring surrounding whitespace
    if predicted is not None and predicted.strip() == str(expected_answer).strip():
        return 1.0  # correct answer
    return 0.0  # incorrect or missing answer
Good reward functions should:
  • Provide clear signals (correct vs incorrect)
  • Be deterministic and consistent
  • Execute quickly for training efficiency
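As a concrete example of these properties, here is a minimal, hypothetical implementation of the `extract_answer` helper a reward function like the one above could rely on: it is deterministic, fast, and returns `None` when no final answer is found. It grabs the last \boxed{...} expression in the completion and does not handle nested braces:

```python
import re

# Hypothetical extract_answer for math completions: return the contents of
# the last \boxed{...} in the response, or None if there is none. This sketch
# ignores nested braces, which real LaTeX answers can contain.
def extract_answer(response):
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1] if matches else None

print(extract_answer("So 2 + 2 = \\boxed{4}"))  # 4
print(extract_answer("No boxed answer here"))   # None
```

Returning `None` instead of an empty string makes "no answer given" easy to distinguish from "wrong answer" if you later want to shape rewards differently for the two cases.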

vLLM integration

The tutorial uses vLLM for efficient inference during training:
  • Generates multiple completions in parallel
  • Reduces training time through batched generation
  • Enables efficient policy rollouts

Deployment options

After fine-tuning, you can deploy your model to:
  • Mobile: Android and iOS apps using the LEAP SDK
  • Desktop: Mac (MLX), Windows/Linux (llama.cpp, Ollama, LM Studio)
  • Cloud: vLLM, Modal, Baseten, Fal for production deployments
  • Edge: On-device inference for low-latency applications
See the deployment documentation for detailed guides.

Run the tutorial

You can run this tutorial in two ways:
  1. Google Colab (recommended): Click the “Open in Colab” badge at the top
  2. Local environment: Clone the LFM Cookbook repository and run the notebook locally

Access the notebook

The complete notebook is available at:

Expected results

After GRPO training, you can expect:
  • Improved accuracy on mathematical reasoning tasks (typically 10-20% improvement)
  • Better reasoning chains with clearer step-by-step solutions
  • Reduced hallucinations on verifiable answers
  • More consistent outputs following the trained format

Next steps

After completing this tutorial, you can:

Getting help

Need assistance? Join the Liquid AI Discord community.
