
Reward functions are the primary training signal in GRPO pipelines. This repo defines reward functions across three modules — math_reasoning, multi_hop_question_answering, and medical_question_answering — each composed of correctness and format functions. All reward functions are BaseReward subclasses: callable instances that return one scalar score per generated completion. They are passed directly to GRPOTrainer(reward_funcs=[...]) and called once per batch during training.

BaseReward interface

BaseReward is the abstract base class for all reward functions, defined in src/llm_finetuning/core/reward.py.
class BaseReward(ABC):
    def __init__(self, config: RewardConfig | None = None) -> None: ...

    @abstractmethod
    def __call__(
        self,
        prompts: list,
        completions: list,
        **kwargs: Any,
    ) -> list[float]: ...
prompts (list, required)
    Batch of prompt histories. Each item is a list of chat message dicts leading up to the assistant turn.

completions (list, required)
    Batch of generated assistant message lists aligned with prompts. Each item is a list containing one dict: {"role": "assistant", "content": "..."}. Access the response text via completion[0]["content"].

**kwargs (Any)
    Extra per-row dataset columns forwarded by GRPOTrainer. For example, if the dataset has an answer column, it is available as kwargs["answer"]: a list of ground-truth values, one per completion.
Return type: list[float] — one reward score per completion, ordered to match completions. GRPO normalises scores within each generation group to compute relative advantages before updating the policy.
To use a reward function in contexts that require a picklable callable, call .as_fn() on the instance. This wraps __call__ in a plain function with the same __name__.
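A minimal sketch of a custom subclass, assuming the import path implied by the file location above (the CompletionLengthReward name and its length heuristic are hypothetical):

from typing import Any

from llm_finetuning.core.reward import BaseReward


class CompletionLengthReward(BaseReward):
    """Hypothetical reward: favour longer completions, capped at 1.0."""

    def __call__(self, prompts: list, completions: list, **kwargs: Any) -> list[float]:
        scores: list[float] = []
        for completion in completions:
            text = completion[0]["content"]  # one assistant message dict per item
            scores.append(min(len(text) / 500.0, 1.0))
        return scores

An instance can be passed directly to GRPOTrainer(reward_funcs=[...]), or wrapped with .as_fn() where a plain picklable function is required.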

Reward functions by module

Five reward functions are composed for GSM8K training in math_reasoning/grpo/gsm8k/. The dominant signal is AnswerCorrectnessReward; the four format functions guide structure.

Correctness

AnswerCorrectnessReward

File: math_reasoning/reward_functions/correctness/answer_correctness.py
Score range: −1.0 to 3.0
Extracts the predicted answer from <answer>...</answer> tags in the completion, falling back to a #### {number} pattern. Compares the extracted value against kwargs["answer"] (the ground-truth numeric string from GSM8K).
Condition                                                     Score
Exact string match                                             3.0
Numeric value within 10% of ground truth                       0.5
Numeric value within 20% of ground truth                       0.25
Wrong answer (parseable but out of range)                     −1.0
No parseable answer found                                      0.0
ValueError or ZeroDivisionError during numeric comparison     −0.5
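The scoring logic can be approximated by the following sketch (helper names and regexes are illustrative, not the exact implementation):

import re

def extract_answer(text: str) -> str | None:
    # Prefer <answer>...</answer>; fall back to the GSM8K-style "#### <number>" marker.
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if match is None:
        match = re.search(r"####\s*([-\d.,]+)", text)
    return match.group(1).strip() if match else None

def score(predicted: str | None, truth: str) -> float:
    if predicted is None:
        return 0.0                      # no parseable answer
    if predicted == truth:
        return 3.0                      # exact string match
    try:
        rel_err = abs(float(predicted) - float(truth)) / abs(float(truth))
    except (ValueError, ZeroDivisionError):
        return -0.5                     # numeric comparison failed
    if rel_err <= 0.10:
        return 0.5
    if rel_err <= 0.20:
        return 0.25
    return -1.0                         # parseable but out of range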

Format

ReasoningTagsReward

File: math_reasoning/reward_functions/format/reasoning_tags.py
Score range: −2.0 to 2.0
Counts occurrences of each of the four required tags: <reasoning>, </reasoning>, <answer>, </answer>. Expects exactly one of each.
  • +0.5 per tag whose count equals 1
  • −0.5 per tag whose count does not equal 1
Maximum score (all four tags present exactly once): 2.0. Minimum (all four wrong): −2.0.
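Equivalent logic, sketched (the constant name is illustrative):

REQUIRED_TAGS = ("<reasoning>", "</reasoning>", "<answer>", "</answer>")

def tag_score(text: str) -> float:
    # +0.5 for each tag appearing exactly once, -0.5 otherwise; range [-2.0, 2.0].
    return sum(0.5 if text.count(tag) == 1 else -0.5 for tag in REQUIRED_TAGS)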

StepFormatReward

File: math_reasoning/reward_functions/format/step_format.py
Score range: 0.0 to 1.0
Counts lines matching any step pattern: Step N, N., or a bullet character (-, *, •).
Steps found    Score
≥ 3            1.0
< 3            count / 3
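A sketch of the counting logic (the exact regex is an assumption based on the description above):

import re

STEP_PATTERN = re.compile(r"^\s*(?:Step\s+\d+|\d+\.|[-*•])", re.MULTILINE)

def step_score(text: str) -> float:
    count = len(STEP_PATTERN.findall(text))
    return min(count / 3, 1.0)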

MultilineComplianceReward

File: math_reasoning/reward_functions/format/multiline_compliance.py
Score range: 0.0 to 1.0
Counts non-empty lines in the completion.
Non-empty lines    Score
≥ 5                1.0
< 5                count / 5
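Sketched, this amounts to (illustrative, not the exact implementation):

def multiline_score(text: str) -> float:
    non_empty = sum(1 for line in text.splitlines() if line.strip())
    return min(non_empty / 5, 1.0)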

ResponseStructureReward

File: math_reasoning/reward_functions/format/response_structure.py
Score range: 0.0 to 1.0
Regex-matches complete <reasoning>...</reasoning> and <answer>...</answer> blocks (content required between tags). Partial credit is awarded if only one block is present.
Condition                                       Score
Both <reasoning> and <answer> blocks present    1.0
Only <reasoning> block                          0.5
Only <answer> block                             0.5
Neither block present                           0.0
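A sketch of the matching logic (regexes are illustrative; content is required between the tags):

import re

REASONING_RE = re.compile(r"<reasoning>.+?</reasoning>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>.+?</answer>", re.DOTALL)

def structure_score(text: str) -> float:
    # 0.5 per complete block, so both blocks give 1.0 and neither gives 0.0.
    score = 0.0
    if REASONING_RE.search(text):
        score += 0.5
    if ANSWER_RE.search(text):
        score += 0.5
    return score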

Composing reward functions in GRPOTrainer

Pass a list of instantiated reward functions to GRPOTrainer. GRPO calls each function once per batch and sums the resulting scores per completion before normalising within each generation group.
from trl import GRPOTrainer

from llm_finetuning.math_reasoning.reward_functions.correctness.answer_correctness import (
    AnswerCorrectnessReward,
)
from llm_finetuning.math_reasoning.reward_functions.format.reasoning_tags import (
    ReasoningTagsReward,
)
from llm_finetuning.math_reasoning.reward_functions.format.step_format import (
    StepFormatReward,
)
from llm_finetuning.math_reasoning.reward_functions.format.multiline_compliance import (
    MultilineComplianceReward,
)
from llm_finetuning.math_reasoning.reward_functions.format.response_structure import (
    ResponseStructureReward,
)

trainer = GRPOTrainer(
    reward_funcs=[
        AnswerCorrectnessReward(),
        ReasoningTagsReward(),
        StepFormatReward(),
        MultilineComplianceReward(),
        ResponseStructureReward(),
    ],
    ...
)

Unit testing guidance

  • Create small prompts/completions fixtures that mirror the TRL structure:
    completions = [[{"role": "assistant", "content": "<reasoning>Step 1...</reasoning><answer>42</answer>"}]]
    prompts = [[{"role": "user", "content": "What is 6 × 7?"}]]
    
  • Assert the returned list has length equal to len(completions).
  • Test edge cases: empty content, missing tags, boundary values for numeric scores.
  • For LLM-judge rewards (DeepEval, Evidently), mock the external API call in unit tests to avoid requiring OPENAI_API_KEY in CI.
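Putting these together, a minimal pytest-style test for ReasoningTagsReward might look like this (the expected score of 2.0 follows the tag-count rules described above):

from llm_finetuning.math_reasoning.reward_functions.format.reasoning_tags import (
    ReasoningTagsReward,
)


def test_reasoning_tags_reward_well_formed():
    prompts = [[{"role": "user", "content": "What is 6 × 7?"}]]
    completions = [[{
        "role": "assistant",
        "content": "<reasoning>Step 1: 6 × 7 = 42</reasoning><answer>42</answer>",
    }]]

    scores = ReasoningTagsReward()(prompts=prompts, completions=completions)

    assert len(scores) == len(completions)
    assert scores[0] == 2.0  # all four tags present exactly once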
