Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/verl-project/verl/llms.txt

Use this file to discover all available pages before exploring further.

Reward functions are the bridge between model outputs and the RL training signal. verl supports two reward approaches — rule-based (verifiable) and model-based — and provides a clean RewardManager abstraction that routes each sample to the right scoring function based on its data_source field. You can use the pre-built reward functions for common math benchmarks, or implement your own scoring logic in a standalone Python file and register it through config.

Two Reward Approaches

The response is scored programmatically against a ground-truth string extracted from the dataset. This approach is fast (no GPU required), deterministic, and works well for tasks with objectively correct answers such as math, code execution, and factual question answering.Examples: exact-match on a numeric answer, regex extraction followed by string comparison, unit test execution for code.
Both approaches can be combined: the RewardManager sums or blends scores from multiple sources before returning the final per-token reward tensor used in policy optimization.

RewardManager

The RewardManager is the component that sits between the trainer and the scoring logic. It is instantiated in verl/trainer/main_ppo.py and called at each RL step to score a batch of generated responses. When called, RewardManager.__call__(data: DataProto) performs the following:
  1. If the DataProto already contains reward model scores from a RewardWorkerGroup, those are returned directly.
  2. Otherwise, for each sample in the batch, it:
    • Extracts the data_source and ground_truth from data.non_tensor_batch.
    • Decodes the response token IDs to a string using the tokenizer.
    • Calls compute_score_fn(data_source, solution_str, ground_truth, extra_info).
    • Writes the scalar score into the last valid token position of the reward tensor.
  3. Returns a token-level reward tensor of shape (batch_size, response_length).

Inputs Available to the Reward Function

The DataProto passed through the reward pipeline contains the following fields:
FieldLocationDescription
input_idsbatchTokenized prompt + response (after applying chat template)
attention_maskbatchAttention mask for input_ids
responsesbatchResponse token IDs only
promptsbatchPrompt token IDs only
ground_truthnon_tensor_batchAnswer string from reward_model.ground_truth in the Parquet file
data_sourcenon_tensor_batchDataset identifier string from the Parquet field

Custom Reward Function Signature

A custom reward function is a plain Python function with the following signature:
def compute_score(
    data_source: str,
    solution_str: str,
    ground_truth: str,
    extra_info: dict = None,
) -> float:
    """Return a scalar reward score for a single response.

    Args:
        data_source:   The dataset identifier (matches the `data_source` field in your Parquet file).
        solution_str:  The decoded response string (everything after the prompt).
        ground_truth:  The ground truth answer extracted from `reward_model.ground_truth`.
        extra_info:    Additional metadata from the `extra_info` field of the Parquet file.

    Returns:
        A scalar float reward. Typically in [0.0, 1.0] or [-1.0, 1.0].
    """
    ...
The function receives one sample at a time and must return a single scalar float. There is no requirement to handle batching — the RewardManager loops over the batch.

Registering via Config

Point verl at your reward function file using custom_reward_function in the trainer config:
reward:
  custom_reward_function:
    path: /path/to/my_reward.py
    name: compute_score
If you name your function compute_score and only run a single experiment, you can leave name unset — it defaults to compute_score. To run multiple experiments with different functions defined in the same file, set name explicitly for each run:
# Experiment A
reward:
  custom_reward_function:
    path: /path/to/my_reward.py
    name: strict_scorer

# Experiment B
reward:
  custom_reward_function:
    path: /path/to/my_reward.py
    name: lenient_scorer

Pre-Built Reward Functions

verl ships ready-to-use reward functions for the most common math reasoning benchmarks in verl/utils/reward_score/.

GSM8K

The GSM8K reward function (verl/utils/reward_score/gsm8k.py) enforces a strict output format — the model must emit its final answer after #### — and compares it against the extracted ground truth:
def compute_score(
    solution_str: str,
    ground_truth: str,
    method: str = "strict",
    format_score: float = 0.0,
    score: float = 1.0,
) -> float:
Scoring logic:
ConditionScore
Answer extracted and matches ground_truth exactly1.0
Answer in correct #### <number> format but wrong valueformat_score (default 0.0)
No #### <number> pattern found0
The method="strict" mode tests both format and correctness. The method="flexible" mode accepts any trailing number as the answer, which is more lenient about formatting.

MATH

The MATH reward function (verl/utils/reward_score/math_reward.py) follows the scoring implementation from the lm-evaluation-harness repository, which handles LaTeX normalization and symbolic equivalence checking for competition mathematics.

Writing a Custom Reward Function

Example: Binary Exact Match

import re

def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Binary reward: 1.0 if the answer matches, 0.0 otherwise."""
    match = re.search(r"Answer:\s*(.*)", solution_str)
    if match and match.group(1).strip() == ground_truth.strip():
        return 1.0
    return 0.0

Example: Length-Normalized Reward

def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Reward proportional to the fraction of the response used (demo only)."""
    return len(solution_str) / 100

Example: Multi-Dataset Dispatch

If your training set mixes multiple datasets, a single compute_score function can dispatch to different scoring logic based on data_source:
import re

def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    if data_source == "openai/gsm8k":
        return _score_gsm8k(solution_str, ground_truth)
    elif data_source == "lighteval/MATH":
        return _score_math(solution_str, ground_truth)
    else:
        raise ValueError(f"Unknown data_source: {data_source}")

def _score_gsm8k(solution_str, ground_truth):
    solutions = re.findall(r"#### (\-?[0-9\.\,]+)", solution_str[-300:])
    if not solutions:
        return 0.0
    answer = solutions[-1].replace(",", "").replace("$", "")
    return 1.0 if answer == ground_truth else 0.0

def _score_math(solution_str, ground_truth):
    # ... symbolic equivalence check
    return 0.0

Model-Based Rewards

To use a trained reward model instead of (or in addition to) a rule-based function, configure the reward.reward_model block:
reward:
  reward_model:
    enable: true
    enable_resource_pool: true   # allocate a dedicated GPU pool for the reward model
    n_gpus_per_node: 4
    nnodes: 1
    model_path: /path/to/reward_model
    rollout:
      name: vllm
      dtype: bfloat16
      gpu_memory_utilization: 0.5
      tensor_model_parallel_size: 2
When enable: true, verl spins up a RewardWorkerGroup that loads the reward model and runs inference on each response batch. The scalar scores are written into the DataProto and the RewardManager returns them directly without calling a compute_score function.

Reward Scaling and Stability

The reward signal is used directly in the PPO advantage computation. Raw scores that are very large (e.g., token counts in the hundreds) will produce large advantages and destabilize policy gradient updates. Normalize your reward to a bounded range such as [0.0, 1.0] or [-1.0, 1.0] before returning it from compute_score.
verl adds a KL penalty between the current actor policy and the reference policy when computing the final reward used for advantage estimation. The per-token KL term is subtracted from (or added to) your scalar reward before PPO update. If your reward function already encodes dense per-step feedback, consider reducing actor_rollout_ref.kl_coeff to avoid double-penalizing the policy for deviation from the reference.

Debugging Reward Functions

The NaiveRewardManager prints decoded responses to the console for the first num_examine batches. This is controlled by:
reward:
  num_workers: 8   # parallel CPU workers for reward computation
You can also test your compute_score function in isolation before plugging it into training:
from my_reward import compute_score

score = compute_score(
    data_source="openai/gsm8k",
    solution_str="Let's think step by step. The answer is 42. #### 42",
    ground_truth="42",
)
print(score)  # 1.0

Build docs developers (and LLMs) love