Group Relative Policy Optimization (GRPO) trains models to generate step-by-step mathematical reasoning by scoring groups of completions relative to each other — no explicit value network needed. This pipeline fine-tunes models on GSM8K, a dataset of 7,473 grade-school math word problems, using one correctness reward and four format rewards that together enforce structured, verifiable solutions. A two-stage variant is also available: first priming a Qwen3-4B-Base model on OpenR1-Math-220k to teach the output format, then applying GRPO on GSM8K.
## Supported models
Five model configurations are provided out of the box. Switch between them by updating `model_id` in `config.yaml`, as shown in the example after the table.
| Model | HuggingFace ID |
|---|---|
| Llama-3.2-3B (default) | unsloth/Llama-3.2-3B-Instruct |
| Phi-4 | unsloth/phi-4 |
| Mistral-7B | unsloth/mistral-7b-instruct-v0.3-bnb-4bit |
| Llama-3.1-8B | unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit |
| Gemma3-1B | unsloth/gemma-3-1b-it |
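For instance, switching the pipeline to Phi-4 is a one-line edit. Only the `model_id` key is documented on this page; any other keys in the file stay untouched:

```yaml
# config.yaml
model_id: unsloth/phi-4  # default: unsloth/Llama-3.2-3B-Instruct
```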
## Reward functions
The five reward functions are passed to `GRPOTrainer` as a list. For each prompt, GRPO scores every sampled completion and compares its total reward to the group mean to form a relative advantage signal; a sketch of the reward-function interface follows the table.
| Category | Reward function | Description | Score range |
|---|---|---|---|
| Correctness | `AnswerCorrectnessReward` | Extracts the numeric value from `<answer>` tags and compares it to the ground-truth answer | −1.0 to 3.0 |
| Format | `ReasoningTagsReward` | Validates presence and proper nesting of `<reasoning>` and `<answer>` tags | −2.0 to 2.0 |
| Format | `StepFormatReward` | Rewards numbered or bulleted step-by-step structure (minimum 3 steps) | 0.0 to 1.0 |
| Format | `MultilineComplianceReward` | Rewards multi-line responses with sufficient depth (minimum 5 lines) | 0.0 to 1.0 |
| Format | `ResponseStructureReward` | Validates that both reasoning and answer blocks are non-empty | 0.0 to 1.0 |
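To make the interface concrete, here is a function-style sketch of what an `AnswerCorrectnessReward`-equivalent could look like under TRL's reward-function contract. The parsing rules and the partial-credit value are assumptions; only the −1.0 to 3.0 range comes from the table above:

```python
import re

def answer_correctness_reward(prompts, completions, answer, **kwargs):
    """Sketch of an AnswerCorrectnessReward-style function (internals assumed).

    TRL's GRPOTrainer calls each reward function with the prompts, the
    generated completions, and any extra dataset columns (here `answer`,
    assumed to hold the numeric ground truth after preprocessing), and
    expects one float per completion.
    """
    scores = []
    for completion, truth in zip(completions, answer):
        # Assumes string (non-chat) completions; chat format would instead
        # pass a list of message dicts per completion.
        match = re.search(r"<answer>\s*([-+]?[\d.,]+)\s*</answer>", completion)
        if match is None:
            scores.append(-1.0)  # bottom of the documented range
            continue
        try:
            predicted = float(match.group(1).replace(",", ""))
        except ValueError:
            scores.append(-1.0)
            continue
        # 3.0 for a correct answer; the wrong-answer value is an assumption.
        scores.append(3.0 if predicted == float(truth) else -0.5)
    return scores
```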
## Running the pipeline
### Single-stage: GRPO on GSM8K
This is the default configuration. It runs with `unsloth/Llama-3.2-3B-Instruct` using `config.yaml`.
### Two-stage pipeline: SFT then GRPO
The two-stage variant is designed for `Qwen3-4B-Base`, which lacks an instruction-following format out of the box. Stage 1 teaches the model to produce `<reasoning>`/`<answer>` output; Stage 2 then reinforces correctness via GRPO.
#### Stage 1: SFT on OpenR1-Math-220k
Fine-tunes `Qwen3-4B-Base` on `open-r1/OpenR1-Math-220k`, filtering to examples with numeric answers that fit within a 1024-token budget. This primes the model for the structured `<reasoning>`/`<answer>` output format. The trained checkpoint is saved to the `output_dir` defined in the SFT config and is then used as `model_id` in the GRPO stage.
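A sketch of how that filtering could be expressed with `datasets`. The column names and the exact numeric check are assumptions; the dataset ID, the numeric-answer filter, and the 1024-token budget come from this page:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Base")
dataset = load_dataset("open-r1/OpenR1-Math-220k", split="train")

def fits_budget(example):
    # Keep only problems whose final answer parses as a number...
    answer = example["answer"].strip().replace(",", "")  # column name assumed
    is_numeric = answer.lstrip("+-").replace(".", "", 1).isdigit()
    # ...and whose worked solution fits the 1024-token budget.
    n_tokens = len(tokenizer(example["solution"])["input_ids"])  # column name assumed
    return is_numeric and n_tokens <= 1024

dataset = dataset.filter(fits_budget)
```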
## Configuration reference

Key parameters from `grpo/gsm8k/config.yaml` (an illustrative excerpt follows the table):
| Parameter | Description |
|---|---|
| `num_generations` | Number of completions generated per prompt for relative advantage estimation |
| `max_prompt_length` | Maximum token length for the input prompt (truncated if exceeded) |
| `max_completion_length` | Maximum token length for generated completions |
| `max_grad_norm` | Gradient clipping threshold; 0.1 is recommended for GRPO stability |
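Putting these together, an excerpt of `grpo/gsm8k/config.yaml` might look like the following. The values for `num_generations` and the two length caps are placeholders, not the repo's defaults; `max_grad_norm: 0.1` and the `output_dir` default come from this page:

```yaml
model_id: unsloth/Llama-3.2-3B-Instruct
num_generations: 8           # completions sampled per prompt (placeholder)
max_prompt_length: 256       # prompt tokens beyond this are truncated (placeholder)
max_completion_length: 512   # cap on generated tokens (placeholder)
max_grad_norm: 0.1           # recommended for GRPO stability
output_dir: ./outputs/math_reasoning/grpo/gsm8k/
```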
## Training script structure
The `train.py` script loads the model via Unsloth's `FastLanguageModel` (4-bit QLoRA), builds a `GRPOTrainer` with all five reward functions, and runs training.
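A condensed sketch of that flow. The reward classes are the ones documented above, but their import path, the LoRA hyperparameters, and the dataset prompt mapping are assumptions, not the repo's exact code:

```python
import yaml
from datasets import load_dataset
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

from rewards import (  # import path assumed
    AnswerCorrectnessReward,
    ReasoningTagsReward,
    StepFormatReward,
    MultilineComplianceReward,
    ResponseStructureReward,
)

with open("grpo/gsm8k/config.yaml") as f:
    cfg = yaml.safe_load(f)

# Load the base model in 4-bit and attach LoRA adapters (QLoRA).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=cfg["model_id"],
    max_seq_length=cfg["max_prompt_length"] + cfg["max_completion_length"],
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(model, r=16, lora_alpha=16)  # values assumed

# GSM8K questions become prompts; the real script likely applies a chat
# template with formatting instructions here.
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {"prompt": x["question"]})

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        AnswerCorrectnessReward(),
        ReasoningTagsReward(),
        StepFormatReward(),
        MultilineComplianceReward(),
        ResponseStructureReward(),
    ],
    args=GRPOConfig(
        output_dir=cfg["output_dir"],
        num_generations=cfg["num_generations"],
        max_prompt_length=cfg["max_prompt_length"],
        max_completion_length=cfg["max_completion_length"],
        max_grad_norm=cfg["max_grad_norm"],  # 0.1 recommended for stability
    ),
    train_dataset=dataset,
)

trainer.train()
model.save_pretrained(cfg["output_dir"])
tokenizer.save_pretrained(cfg["output_dir"])
```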
## Output
Trained adapter weights and the tokenizer are saved to `./outputs/math_reasoning/grpo/gsm8k/` by default. Change `output_dir` in `config.yaml` to save elsewhere.
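To sanity-check the result, the saved adapter can be reloaded the same way the base model was. The path is the documented default; the sample question is GSM8K-style, and generation settings are illustrative:

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./outputs/math_reasoning/grpo/gsm8k/",  # documented default output_dir
    max_seq_length=1024,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # switch Unsloth into fast generation mode

prompt = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```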