What you’ll learn
By the end of this tutorial, you’ll know how to:
- Identify tasks suitable for GRPO training
- Design verification functions for different task types
- Set up TRL for GRPO training
- Configure training parameters for optimal results
- Evaluate model performance on verifiable tasks
- Export and deploy your fine-tuned model
Prerequisites
- GPU: This tutorial requires a GPU with at least 16GB memory. You can run it for free on Google Colab using an NVIDIA T4 GPU
- Python: Python 3.8 or higher
- Basic knowledge: Understanding of reinforcement learning concepts is helpful
What are verifiable tasks?
Verifiable tasks are problems where you can programmatically check whether an output is correct. Examples include:
Mathematical problem solving
- Extract the final numeric answer
- Compare with ground truth
- Binary verification (correct/incorrect)
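As a concrete illustration, a minimal verifier for numeric answers might extract the last number in the model's output and compare it to the ground truth (the function name and tolerance below are assumptions for this sketch, not part of the tutorial's code):

```python
import re

def verify_math_answer(completion: str, ground_truth: float, tol: float = 1e-6) -> bool:
    """Extract the final number in the completion and compare to ground truth."""
    # Match signed integers or decimals anywhere in the output.
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not matches:
        return False  # no numeric answer found
    # Treat the last number as the model's final answer.
    return abs(float(matches[-1]) - ground_truth) < tol
```

A reward function can then map `True` to 1.0 and `False` to 0.0, giving the binary signal GRPO needs.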
Code generation
- Run unit tests against generated code
- Check for compilation errors
- Verify output matches expected results
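One common pattern for checking generated code is to execute it in a fresh namespace and then run test assertions against it. The sketch below (function names are assumptions, and a real pipeline should sandbox execution rather than use bare `exec`) returns `True` only if the code compiles, runs, and passes every assertion:

```python
def verify_code(generated_code: str, test_code: str) -> bool:
    """Run generated code, then test assertions, in an isolated namespace."""
    namespace = {}
    try:
        exec(generated_code, namespace)  # define the candidate function(s)
        exec(test_code, namespace)       # assertions raise AssertionError on failure
        return True
    except Exception:
        return False  # compile error, runtime error, or failed test
```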
Structured output tasks
- Validate JSON against schema
- Check SQL syntax and semantics
- Verify XML/HTML structure
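For JSON outputs, even a lightweight check without a full JSON Schema library can verify that the text parses and contains the required fields with the expected types. The `required` mapping format here is an assumption chosen for illustration:

```python
import json

def verify_json(output: str, required: dict) -> bool:
    """Check that output parses as a JSON object with required typed fields."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    # required maps field name -> expected Python type, e.g. {"name": str}
    return all(isinstance(data.get(key), expected) for key, expected in required.items())
```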
Question answering
- Compare with ground truth answers
- Check for exact match or semantic similarity
- Verify factual correctness
Why use GRPO for these tasks?
GRPO offers several advantages for verifiable tasks:
- Direct optimization: Optimizes directly for task success, not just likelihood
- Sample efficiency: Learns from both successes and failures
- Better generalization: Encourages diverse solution strategies
- Reduced overfitting: Focuses on correctness rather than memorization
Tutorial overview
The tutorial covers the following steps:
- Installation: Set up TRL, PEFT, and required libraries
- Task selection: Choose a verifiable task (mathematical reasoning)
- Data preparation: Format data with verification labels
- Verification function: Implement programmatic checking
- Training setup: Configure GRPO with TRL
- Training: Run reinforcement learning training
- Evaluation: Test on held-out verification set
- Export: Save your model for deployment
Key concepts
GRPO algorithm
GRPO works by:
- Generating multiple candidate outputs
- Evaluating each with the verification function
- Computing rewards based on correctness
- Updating the policy to increase reward
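The reward-to-update step above can be sketched as computing group-relative advantages: each completion's reward is normalized against the mean and standard deviation of its group of generations. This is an illustrative sketch of that step, not TRL's internal implementation:

```python
import statistics

def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one prompt's group of generations."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    # Completions better than the group average get a positive advantage,
    # worse ones a negative advantage; eps guards against zero variance.
    return [(r - mean) / (std + eps) for r in rewards]
```

The policy update then increases the likelihood of completions with positive advantage and decreases it for those with negative advantage.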
Reward design principles
Good verification functions should be:
- Deterministic: Same input always gives same result
- Fast: Executes quickly for training efficiency
- Accurate: Correctly identifies success and failure
- Unbiased: Doesn’t favor specific solution patterns
Training configuration
Key hyperparameters for GRPO:
- Number of generations: How many outputs to generate per prompt (typically 4-8)
- Learning rate: Lower than SFT (typically 1e-6 to 1e-5)
- Batch size: Smaller due to multiple generations
- Reward scaling: Normalize rewards for stable training
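In TRL, these hyperparameters map onto `GRPOConfig`. The sketch below shows plausible values in line with the ranges above; the parameter names follow TRL's `GRPOConfig` at the time of writing, so verify them against your installed version:

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="grpo-math",
    num_generations=8,              # candidate outputs per prompt (typically 4-8)
    learning_rate=1e-6,             # lower than typical SFT learning rates
    per_device_train_batch_size=8,  # kept small: each prompt expands into num_generations completions
    max_completion_length=256,      # cap generation length for training efficiency
)
```

TRL normalizes rewards within each group internally, so reward scaling mostly reduces to keeping your verification function's outputs in a consistent range (e.g. 0.0/1.0).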
Libraries used
The tutorial uses:
- TRL: Hugging Face library for reinforcement learning
- PEFT: Parameter-efficient fine-tuning with LoRA
- bitsandbytes: Quantization for memory efficiency
- liger-kernel: Optimized training kernels
- trackio: Experiment tracking (optional)
Deployment options
After fine-tuning, you can deploy your model to:
- Mobile: Android and iOS apps using the LEAP SDK
- Desktop: Mac (MLX), Windows/Linux (llama.cpp, Ollama, LM Studio)
- Cloud: vLLM, Modal, Baseten, Fal for production deployments
- Edge: On-device inference for low-latency applications
Run the tutorial
You can run this tutorial in two ways:
- Google Colab (recommended): Click the “Open in Colab” badge at the top
- Local environment: Clone the LFM Cookbook repository and run the notebook locally
Access the notebook
The complete notebook is available at:
- GitHub: grpo_for_verifiable_tasks.ipynb
- Colab: Click the badge above to open directly in Google Colab
Adapting to your task
To apply GRPO to your own verifiable task:
- Define verification: Write a function that checks correctness
- Prepare data: Include verification labels in your dataset
- Adjust rewards: Scale rewards appropriately for your task
- Tune hyperparameters: Experiment with generation count and learning rate
- Evaluate thoroughly: Test on diverse examples
Next steps
After completing this tutorial, you can:
- Try GRPO with Unsloth for optimized training
- Apply GRPO to code generation or structured output tasks
- Experiment with different verification functions
- Deploy your model using the inference guides