What you’ll learn
By the end of this tutorial, you’ll know how to:
- Understand when to use GRPO vs supervised fine-tuning
- Set up your environment for GRPO training
- Prepare datasets for reinforcement learning
- Design reward functions for verifiable tasks
- Train models with GRPO for improved reasoning
- Test and evaluate your fine-tuned model
- Export models for deployment
Prerequisites
- GPU: This tutorial requires a GPU with at least 16GB memory. You can run it for free on Google Colab using an NVIDIA T4 GPU
- Python: Python 3.8 or higher
- Basic knowledge: Understanding of reinforcement learning concepts is helpful
What is GRPO?
Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm that:
- Trains models by rewarding correct outputs and penalizing incorrect ones
- Uses programmatic verification to evaluate model responses
- Improves performance on tasks with clear success criteria
- Is more sample-efficient than traditional RL approaches
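The “group relative” part can be illustrated with a minimal sketch (this is an assumption-laden simplification, not the full GRPO objective, which also involves a clipped policy ratio and a KL penalty): for each prompt, several completions are sampled, and each completion’s advantage is its reward relative to the group mean, typically normalized by the group’s standard deviation.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Illustrative sketch of group-relative advantages.

    Each completion sampled for the same prompt gets an advantage equal
    to its reward minus the group mean, scaled by the group standard
    deviation. Completions that beat their siblings get positive
    advantages; worse-than-average ones get negative advantages.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Four completions for one prompt: two correct (reward 1.0), two incorrect (0.0)
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)
```

Because advantages are computed within each group, GRPO needs no separate value model, which is one reason it is comparatively sample- and memory-efficient.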
When to use GRPO
GRPO is ideal for verifiable tasks where you can programmatically evaluate correctness:
- Mathematical reasoning: Check if the final answer matches the expected result
- Code generation: Run unit tests to verify code correctness
- Structured output: Validate JSON/SQL against schemas
- Question answering: Compare against ground truth answers
Avoid GRPO for tasks where correctness cannot be verified programmatically:
- Open-ended creative tasks without clear success metrics
- Subjective tasks like style transfer or creative writing
- Simple instruction following (use SFT instead)
Tutorial overview
The tutorial covers the following steps:
- Installation: Set up Unsloth, vLLM, and required dependencies
- Pre-training phase: Format adaptation through SFT
- Data preparation: Load and format the Open R1 Math dataset
- Reward function: Design a reward function for mathematical correctness
- GRPO training: Configure and run reinforcement learning training
- Inference: Test your model’s reasoning capabilities
- Export: Save your model for deployment
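The data-preparation step boils down to turning each raw dataset row into a chat-style prompt plus the ground-truth answer the reward function will check. A minimal sketch is shown below; the field names `problem` and `answer` and the `<reasoning>`/`<answer>` tag format are assumptions for illustration, not the notebook’s exact schema.

```python
# A system prompt that fixes the output format the SFT phase will teach.
SYSTEM_PROMPT = (
    "Respond in the following format:\n"
    "<reasoning>...</reasoning>\n"
    "<answer>...</answer>"
)

def format_example(example: dict) -> dict:
    """Convert one raw dataset row into a chat prompt plus the expected
    answer. Field names 'problem' and 'answer' are assumed; adjust them
    to match the actual dataset columns.
    """
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["problem"]},
        ],
        "answer": example["answer"],
    }

row = {"problem": "What is 7 * 6?", "answer": "42"}
formatted = format_example(row)
print(formatted["prompt"][1]["content"])
```

Keeping the ground-truth answer alongside the prompt lets the reward function score each completion without re-loading the dataset.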
Key concepts
Two-phase training approach
The tutorial uses a two-phase approach:
Phase 1 - Format adaptation (SFT):
- Pre-trains the model on the output format
- Helps GRPO focus on correctness rather than formatting
- Speeds up overall training time
Phase 2 - Reasoning improvement (GRPO):
- Uses reward signals to improve reasoning
- Learns from both successful and failed attempts
- Optimizes for task accuracy
Reward function design
The reward function evaluates model outputs. A good reward function should:
- Provide clear signals (correct vs incorrect)
- Be deterministic and consistent
- Execute quickly for training efficiency
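These three properties can be satisfied with a very small function. The sketch below assumes the `<answer>...</answer>` tag convention and binary 1.0/0.0 rewards; both are illustrative choices, not the notebook’s exact code.

```python
import re

def correctness_reward(completion: str, expected: str) -> float:
    """Binary reward for mathematical correctness.

    Extracts the text inside <answer>...</answer> tags (the format the
    SFT phase teaches) and compares it to the ground truth. The check is
    deterministic, runs in microseconds, and yields an unambiguous
    1.0 / 0.0 signal.
    """
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0  # unparseable output earns no reward
    return 1.0 if match.group(1).strip() == expected.strip() else 0.0

print(correctness_reward("<reasoning>7*6=42</reasoning><answer>42</answer>", "42"))  # 1.0
print(correctness_reward("<answer>41</answer>", "42"))  # 0.0
```

In practice, tutorials often combine several such functions (e.g. a small reward for matching the format plus a larger one for the correct answer) so the model gets partial credit while it is still learning the output structure.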
vLLM integration
The tutorial uses vLLM for efficient inference during training:
- Generates multiple completions in parallel
- Reduces training time through batched generation
- Enables efficient policy rollouts
Deployment options
After fine-tuning, you can deploy your model to:
- Mobile: Android and iOS apps using the LEAP SDK
- Desktop: Mac (MLX), Windows/Linux (llama.cpp, Ollama, LM Studio)
- Cloud: vLLM, Modal, Baseten, Fal for production deployments
- Edge: On-device inference for low-latency applications
Run the tutorial
You can run this tutorial in two ways:
- Google Colab (recommended): Click the “Open in Colab” badge at the top
- Local environment: Clone the LFM Cookbook repository and run the notebook locally
Access the notebook
The complete notebook is available at:
- GitHub: grpo_with_unsloth.ipynb
- Colab: Click the badge above to open directly in Google Colab
Expected results
After GRPO training, you can expect:
- Improved accuracy on mathematical reasoning tasks (typically 10-20% improvement)
- Better reasoning chains with clearer step-by-step solutions
- Reduced hallucinations on verifiable answers
- More consistent outputs following the trained format
Next steps
After completing this tutorial, you can:
- Try GRPO for other verifiable tasks
- Apply the same approach to code generation or structured output tasks
- Experiment with different reward function designs
- Deploy your model using the inference guides