Implementing Reward Functions for verl RL Training

Reward functions are the bridge between model outputs and the RL training signal. verl supports two reward approaches — rule-based (verifiable) and model-based — and provides a clean RewardManager abstraction that routes each sample to the right scoring function based on its data_source field. You can use the pre-built reward functions for common math benchmarks, or implement your own scoring logic in a standalone Python file and register it through config.

Two Reward Approaches

Rule-Based (Verifiable)
Model-Based

The response is scored programmatically against a ground-truth string extracted from the dataset. This approach is fast (no GPU required), deterministic, and works well for tasks with objectively correct answers such as math, code execution, and factual question answering.Examples: exact-match on a numeric answer, regex extraction followed by string comparison, unit test execution for code.

A separately trained reward model evaluates each response and produces a scalar score. This is required for tasks where correctness is subjective or hard to verify programmatically, such as open-ended dialogue helpfulness or creative writing.verl supports loading a reward model via reward.reward_model.* configuration and running it through a dedicated RewardWorkerGroup.

Both approaches can be combined: the RewardManager sums or blends scores from multiple sources before returning the final per-token reward tensor used in policy optimization.

RewardManager

The RewardManager is the component that sits between the trainer and the scoring logic. It is instantiated in verl/trainer/main_ppo.py and called at each RL step to score a batch of generated responses. When called, RewardManager.__call__(data: DataProto) performs the following:

If the DataProto already contains reward model scores from a RewardWorkerGroup, those are returned directly.
Otherwise, for each sample in the batch, it:
- Extracts the data_source and ground_truth from data.non_tensor_batch.
- Decodes the response token IDs to a string using the tokenizer.
- Calls compute_score_fn(data_source, solution_str, ground_truth, extra_info).
- Writes the scalar score into the last valid token position of the reward tensor.
Returns a token-level reward tensor of shape (batch_size, response_length).

Inputs Available to the Reward Function

The DataProto passed through the reward pipeline contains the following fields:

Field	Location	Description
`input_ids`	`batch`	Tokenized prompt + response (after applying chat template)
`attention_mask`	`batch`	Attention mask for `input_ids`
`responses`	`batch`	Response token IDs only
`prompts`	`batch`	Prompt token IDs only
`ground_truth`	`non_tensor_batch`	Answer string from `reward_model.ground_truth` in the Parquet file
`data_source`	`non_tensor_batch`	Dataset identifier string from the Parquet field

Custom Reward Function Signature

A custom reward function is a plain Python function with the following signature:

def compute_score(
    data_source: str,
    solution_str: str,
    ground_truth: str,
    extra_info: dict = None,
) -> float:
    """Return a scalar reward score for a single response.

    Args:
        data_source:   The dataset identifier (matches the `data_source` field in your Parquet file).
        solution_str:  The decoded response string (everything after the prompt).
        ground_truth:  The ground truth answer extracted from `reward_model.ground_truth`.
        extra_info:    Additional metadata from the `extra_info` field of the Parquet file.

    Returns:
        A scalar float reward. Typically in [0.0, 1.0] or [-1.0, 1.0].
    """
    ...

The function receives one sample at a time and must return a single scalar float. There is no requirement to handle batching — the RewardManager loops over the batch.

Registering via Config

Point verl at your reward function file using custom_reward_function in the trainer config:

reward:
  custom_reward_function:
    path: /path/to/my_reward.py
    name: compute_score

If you name your function compute_score and only run a single experiment, you can leave name unset — it defaults to compute_score. To run multiple experiments with different functions defined in the same file, set name explicitly for each run:

# Experiment A
reward:
  custom_reward_function:
    path: /path/to/my_reward.py
    name: strict_scorer

# Experiment B
reward:
  custom_reward_function:
    path: /path/to/my_reward.py
    name: lenient_scorer

Pre-Built Reward Functions

verl ships ready-to-use reward functions for the most common math reasoning benchmarks in verl/utils/reward_score/.

GSM8K

The GSM8K reward function (verl/utils/reward_score/gsm8k.py) enforces a strict output format — the model must emit its final answer after #### — and compares it against the extracted ground truth:

def compute_score(
    solution_str: str,
    ground_truth: str,
    method: str = "strict",
    format_score: float = 0.0,
    score: float = 1.0,
) -> float:

Scoring logic:

Condition	Score
Answer extracted and matches `ground_truth` exactly	`1.0`
Answer in correct `#### <number>` format but wrong value	`format_score` (default `0.0`)
No `#### <number>` pattern found	`0`

The method="strict" mode tests both format and correctness. The method="flexible" mode accepts any trailing number as the answer, which is more lenient about formatting.

MATH

The MATH reward function (verl/utils/reward_score/math_reward.py) follows the scoring implementation from the lm-evaluation-harness repository, which handles LaTeX normalization and symbolic equivalence checking for competition mathematics.

Writing a Custom Reward Function

Example: Binary Exact Match

import re

def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Binary reward: 1.0 if the answer matches, 0.0 otherwise."""
    match = re.search(r"Answer:\s*(.*)", solution_str)
    if match and match.group(1).strip() == ground_truth.strip():
        return 1.0
    return 0.0

Example: Length-Normalized Reward

def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Reward proportional to the fraction of the response used (demo only)."""
    return len(solution_str) / 100

Example: Multi-Dataset Dispatch

If your training set mixes multiple datasets, a single compute_score function can dispatch to different scoring logic based on data_source:

import re

def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    if data_source == "openai/gsm8k":
        return _score_gsm8k(solution_str, ground_truth)
    elif data_source == "lighteval/MATH":
        return _score_math(solution_str, ground_truth)
    else:
        raise ValueError(f"Unknown data_source: {data_source}")

def _score_gsm8k(solution_str, ground_truth):
    solutions = re.findall(r"#### (\-?[0-9\.\,]+)", solution_str[-300:])
    if not solutions:
        return 0.0
    answer = solutions[-1].replace(",", "").replace("$", "")
    return 1.0 if answer == ground_truth else 0.0

def _score_math(solution_str, ground_truth):
    # ... symbolic equivalence check
    return 0.0

Model-Based Rewards

To use a trained reward model instead of (or in addition to) a rule-based function, configure the reward.reward_model block:

reward:
  reward_model:
    enable: true
    enable_resource_pool: true   # allocate a dedicated GPU pool for the reward model
    n_gpus_per_node: 4
    nnodes: 1
    model_path: /path/to/reward_model
    rollout:
      name: vllm
      dtype: bfloat16
      gpu_memory_utilization: 0.5
      tensor_model_parallel_size: 2

When enable: true, verl spins up a RewardWorkerGroup that loads the reward model and runs inference on each response batch. The scalar scores are written into the DataProto and the RewardManager returns them directly without calling a compute_score function.

Reward Scaling and Stability

The reward signal is used directly in the PPO advantage computation. Raw scores that are very large (e.g., token counts in the hundreds) will produce large advantages and destabilize policy gradient updates. Normalize your reward to a bounded range such as [0.0, 1.0] or [-1.0, 1.0] before returning it from compute_score.

verl adds a KL penalty between the current actor policy and the reference policy when computing the final reward used for advantage estimation. The per-token KL term is subtracted from (or added to) your scalar reward before PPO update. If your reward function already encodes dense per-step feedback, consider reducing actor_rollout_ref.kl_coeff to avoid double-penalizing the policy for deviation from the reference.

Debugging Reward Functions

The NaiveRewardManager prints decoded responses to the console for the first num_examine batches. This is controlled by:

reward:
  num_workers: 8   # parallel CPU workers for reward computation

You can also test your compute_score function in isolation before plugging it into training:

from my_reward import compute_score

score = compute_score(
    data_source="openai/gsm8k",
    solution_str="Let's think step by step. The answer is 42. #### 42",
    ground_truth="42",
)
print(score)  # 1.0

Get Started

Core Concepts

Algorithms

Workers & Engines

Advanced Usage

Configuration & Reference

Implementing Reward Functions for verl RL Training

Two Reward Approaches

RewardManager

Inputs Available to the Reward Function

Custom Reward Function Signature

Registering via Config

Pre-Built Reward Functions

GSM8K

MATH

Writing a Custom Reward Function

Example: Binary Exact Match

Example: Length-Normalized Reward

Example: Multi-Dataset Dispatch

Model-Based Rewards

Reward Scaling and Stability

Debugging Reward Functions

Build docs developers (and LLMs) love

Get Started

Core Concepts

Algorithms

Workers & Engines

Advanced Usage

Configuration & Reference

Documentation Index

​Two Reward Approaches

​RewardManager

​Inputs Available to the Reward Function

​Custom Reward Function Signature

​Registering via Config

​Pre-Built Reward Functions

​GSM8K

​MATH

​Writing a Custom Reward Function

​Example: Binary Exact Match

​Example: Length-Normalized Reward

​Example: Multi-Dataset Dispatch

​Model-Based Rewards

​Reward Scaling and Stability

​Debugging Reward Functions

Build docs developers (and LLMs) love

Two Reward Approaches

RewardManager

Inputs Available to the Reward Function

Custom Reward Function Signature

Registering via Config

Pre-Built Reward Functions

GSM8K

MATH

Writing a Custom Reward Function

Example: Binary Exact Match

Example: Length-Normalized Reward

Example: Multi-Dataset Dispatch

Model-Based Rewards

Reward Scaling and Stability

Debugging Reward Functions