Reward functions are the primary training signal in GRPO pipelines. This repo defines reward functions across three modules — math_reasoning, multi_hop_question_answering, and medical_question_answering — each composed of correctness and format functions. All reward functions are BaseReward subclasses: callable instances that return one scalar score per generated completion. They are passed directly to GRPOTrainer(reward_funcs=[...]) and called once per batch during training.
BaseReward interface
BaseReward is the abstract base class for all reward functions, defined in src/llm_finetuning/core/reward.py.
prompts: Batch of prompt histories. Each item is a list of chat message dicts leading up to the assistant turn.
completions: Batch of generated assistant message lists aligned with prompts. Each item is a list containing one dict: {"role": "assistant", "content": "..."}. Access the response text via completion[0]["content"].
kwargs: Extra per-row dataset columns forwarded by GRPOTrainer. For example, if the dataset has an answer column, it is available as kwargs["answer"] — a list of ground-truth values, one per completion.
Returns: list[float] — one reward score per completion, ordered to match completions. GRPO normalises scores within each generation group to compute relative advantages before updating the policy.
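As a minimal sketch of this interface, here is a hypothetical LengthReward that follows the described __call__ shape (real subclasses inherit from BaseReward in src/llm_finetuning/core/reward.py; this toy class does not):

```python
class LengthReward:
    """Toy reward: 1.0 if the assistant response is non-empty, else 0.0."""

    def __call__(self, prompts, completions, **kwargs):
        scores = []
        for completion in completions:
            # Each completion is a one-element list of assistant message dicts.
            text = completion[0]["content"]
            scores.append(1.0 if text.strip() else 0.0)
        return scores  # one score per completion, aligned with `completions`
```

An instance of this class could be passed directly in the reward_funcs list.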
Reward functions by module
- Math Reasoning (5 functions)
- Multi-Hop QA (8 functions)
- Medical QA (8 functions)
Five reward functions are composed for GSM8K training in math_reasoning/grpo/gsm8k/. The dominant signal is AnswerCorrectnessReward; the four format functions guide structure.
Correctness
AnswerCorrectnessReward
File: math_reasoning/reward_functions/correctness/answer_correctness.py
Score range: −1.0 to 3.0
Extracts the predicted answer from <answer>...</answer> tags in the completion, falling back to a #### {number} pattern. Compares the extracted value against kwargs["answer"] (the ground-truth numeric string from GSM8K).
| Condition | Score |
|---|---|
| Exact string match | 3.0 |
| Numeric value within 10% of ground truth | 0.5 |
| Numeric value within 20% of ground truth | 0.25 |
| Wrong answer (parseable but out of range) | −1.0 |
| No parseable answer found | 0.0 |
| ValueError or ZeroDivisionError during numeric comparison | −0.5 |
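The scoring table above can be sketched as follows. This is an illustrative reconstruction, not the actual answer_correctness.py implementation, which may differ in its extraction regexes and error handling:

```python
import re

def extract_answer(text):
    """Pull the predicted answer from <answer> tags, falling back to '#### N'."""
    m = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    if m:
        return m.group(1)
    m = re.search(r"####\s*([-\d.,]+)", text)
    return m.group(1) if m else None

def score_answer(completion_text, ground_truth):
    predicted = extract_answer(completion_text)
    if predicted is None:
        return 0.0                      # no parseable answer found
    if predicted.strip() == ground_truth.strip():
        return 3.0                      # exact string match
    try:
        rel_err = abs(float(predicted) - float(ground_truth)) / abs(float(ground_truth))
        if rel_err <= 0.10:
            return 0.5                  # within 10% of ground truth
        if rel_err <= 0.20:
            return 0.25                 # within 20% of ground truth
        return -1.0                     # parseable but wrong
    except (ValueError, ZeroDivisionError):
        return -0.5                     # numeric comparison failed
```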
Format
ReasoningTagsReward
File: math_reasoning/reward_functions/format/reasoning_tags.py
Score range: −2.0 to 2.0
Counts occurrences of each of the four required tags: <reasoning>, </reasoning>, <answer>, </answer>. Expects exactly one of each.
- +0.5 per tag whose count equals 1
- −0.5 per tag whose count does not equal 1
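The per-tag scoring above amounts to the following sketch, assuming tags are counted as plain substrings (the real reasoning_tags.py may count them differently):

```python
REQUIRED_TAGS = ["<reasoning>", "</reasoning>", "<answer>", "</answer>"]

def reasoning_tags_score(text):
    score = 0.0
    for tag in REQUIRED_TAGS:
        # +0.5 when the tag appears exactly once, -0.5 otherwise
        score += 0.5 if text.count(tag) == 1 else -0.5
    return score
```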
StepFormatReward
File: math_reasoning/reward_functions/format/step_format.py
Score range: 0.0 to 1.0
Counts lines matching any step pattern: Step N, N., or bullet characters (-, *, •).
| Steps found | Score |
|---|---|
| ≥ 3 | 1.0 |
| < 3 | count / 3 |
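A minimal sketch of this counting logic, assuming one pattern match per line (the exact regex in step_format.py may differ):

```python
import re

# Match "Step N", "N.", or a leading bullet character at the start of a line.
STEP_PATTERN = re.compile(r"^\s*(Step\s+\d+|\d+\.|[-*\u2022])", re.IGNORECASE)

def step_format_score(text):
    steps = sum(1 for line in text.splitlines() if STEP_PATTERN.match(line))
    # Full credit at 3+ steps, linear partial credit below that.
    return 1.0 if steps >= 3 else steps / 3
```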
MultilineComplianceReward
File: math_reasoning/reward_functions/format/multiline_compliance.py
Score range: 0.0 to 1.0
Counts non-empty lines in the completion.
| Non-empty lines | Score |
|---|---|
| ≥ 5 | 1.0 |
| < 5 | count / 5 |
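Sketched in the same style as the step counter above (illustrative only; multiline_compliance.py may define "non-empty" differently):

```python
def multiline_compliance_score(text):
    non_empty = [line for line in text.splitlines() if line.strip()]
    # Full credit at 5+ non-empty lines, linear partial credit below that.
    return 1.0 if len(non_empty) >= 5 else len(non_empty) / 5
```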
ResponseStructureReward
File: math_reasoning/reward_functions/format/response_structure.py
Score range: 0.0 to 1.0
Regex-matches complete <reasoning>...</reasoning> and <answer>...</answer> blocks (content required between tags). Partial credit is awarded if only one block is present.
| Condition | Score |
|---|---|
| Both <reasoning> and <answer> blocks present | 1.0 |
| Only <reasoning> block | 0.5 |
| Only <answer> block | 0.5 |
| Neither block present | 0.0 |
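The block-matching logic above can be sketched with DOTALL regexes that require non-empty content between the tags (response_structure.py may use different patterns):

```python
import re

def response_structure_score(text):
    has_reasoning = bool(re.search(r"<reasoning>.+?</reasoning>", text, re.DOTALL))
    has_answer = bool(re.search(r"<answer>.+?</answer>", text, re.DOTALL))
    if has_reasoning and has_answer:
        return 1.0
    if has_reasoning or has_answer:
        return 0.5  # partial credit for one complete block
    return 0.0
```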
Composing reward functions in GRPOTrainer
Pass a list of instantiated reward functions to GRPOTrainer. GRPO calls each function once per batch and sums the resulting scores per completion before normalising within each generation group.
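For intuition, the per-completion summation can be sketched as below. combine_rewards is a hypothetical helper that mimics what the trainer does internally, and the lambdas stand in for instantiated BaseReward objects; in practice you simply pass the list to GRPOTrainer(reward_funcs=[...]):

```python
def combine_rewards(reward_funcs, prompts, completions, **kwargs):
    """Call every reward function once on the batch and sum scores per completion."""
    totals = [0.0] * len(completions)
    for func in reward_funcs:
        scores = func(prompts, completions, **kwargs)
        totals = [t + s for t, s in zip(totals, scores)]
    return totals
```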
Unit testing guidance
- Create small prompts/completions fixtures that mirror the TRL structure.
- Assert the returned list has length equal to len(completions).
- Test edge cases: empty content, missing tags, boundary values for numeric scores.
- For LLM-judge rewards (DeepEval, Evidently), mock the external API call in unit tests to avoid requiring OPENAI_API_KEY in CI.
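A pytest-style sketch of the first two points, assuming the __call__ interface described earlier. DummyReward and make_batch are hypothetical names standing in for a real reward class and your own fixture helper:

```python
class DummyReward:
    """Stand-in for a BaseReward subclass under test."""

    def __call__(self, prompts, completions, **kwargs):
        return [1.0 if c[0]["content"].strip() else 0.0 for c in completions]

def make_batch(texts):
    """Build TRL-style prompts/completions fixtures from raw response texts."""
    prompts = [[{"role": "user", "content": "Q?"}] for _ in texts]
    completions = [[{"role": "assistant", "content": t}] for t in texts]
    return prompts, completions

def test_reward_returns_one_score_per_completion():
    prompts, completions = make_batch(["<answer>42</answer>", ""])
    scores = DummyReward()(prompts, completions, answer=["42", "42"])
    assert len(scores) == len(completions)
    assert scores == [1.0, 0.0]  # empty completion scores 0.0
```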