Reward functions are the bridge between model outputs and the RL training signal. verl supports two reward approaches, rule-based (verifiable) and model-based, and provides a clean RewardManager abstraction that routes each sample to the right scoring function based on its data_source field. You can use the pre-built reward functions for common math benchmarks, or implement your own scoring logic in a standalone Python file and register it through config.
Two Reward Approaches
- Rule-Based (Verifiable)
- Model-Based
The response is scored programmatically against a ground-truth string extracted from the dataset. This approach is fast (no GPU required), deterministic, and works well for tasks with objectively correct answers such as math, code execution, and factual question answering.

Examples: exact-match on a numeric answer, regex extraction followed by string comparison, unit test execution for code.
When scores come from more than one source, the RewardManager sums or blends them before returning the final per-token reward tensor used in policy optimization.
RewardManager
The RewardManager is the component that sits between the trainer and the scoring logic. It is instantiated in verl/trainer/main_ppo.py and called at each RL step to score a batch of generated responses.
When called, RewardManager.__call__(data: DataProto) performs the following:
- If the DataProto already contains reward model scores from a RewardWorkerGroup, those are returned directly.
- Otherwise, for each sample in the batch, it:
  - Extracts the data_source and ground_truth from data.non_tensor_batch.
  - Decodes the response token IDs to a string using the tokenizer.
  - Calls compute_score_fn(data_source, solution_str, ground_truth, extra_info).
  - Writes the scalar score into the last valid token position of the reward tensor.
- Returns a token-level reward tensor of shape (batch_size, response_length).
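The per-sample steps above can be sketched in plain Python. This is a simplified illustration, not verl's implementation: it uses nested lists in place of a tensor, takes already-decoded strings instead of calling a tokenizer, and the helper name `score_batch` and its `valid_lengths` argument (the number of non-padding response tokens per sample) are hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional

ScoreFn = Callable[[str, str, str, Optional[Dict[str, Any]]], float]


def score_batch(
    data_sources: List[str],
    solution_strs: List[str],
    ground_truths: List[str],
    valid_lengths: List[int],
    response_length: int,
    compute_score_fn: ScoreFn,
) -> List[List[float]]:
    """Return a (batch_size, response_length) grid of token-level rewards.

    Every position is 0.0 except the last valid token of each response,
    which carries that sample's scalar score.
    """
    rewards = [[0.0] * response_length for _ in solution_strs]
    samples = zip(data_sources, solution_strs, ground_truths, valid_lengths)
    for i, (source, solution, truth, n_valid) in enumerate(samples):
        score = compute_score_fn(source, solution, truth, None)
        if n_valid > 0:
            # Place the scalar score on the last non-padding token.
            rewards[i][n_valid - 1] = float(score)
    return rewards
```

Placing the score on a single position keeps the reward sparse: the rest of the row stays at zero, so only the end of the response carries a signal.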
Inputs Available to the Reward Function
The DataProto passed through the reward pipeline contains the following fields:
| Field | Location | Description |
|---|---|---|
| input_ids | batch | Tokenized prompt + response (after applying chat template) |
| attention_mask | batch | Attention mask for input_ids |
| responses | batch | Response token IDs only |
| prompts | batch | Prompt token IDs only |
| ground_truth | non_tensor_batch | Answer string from reward_model.ground_truth in the Parquet file |
| data_source | non_tensor_batch | Dataset identifier string from the Parquet field |
Custom Reward Function Signature
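A custom reward function is a plain Python function that accepts data_source, solution_str, ground_truth, and extra_info (the same arguments passed to compute_score_fn above) and returns a float. There is no requirement to handle batching; the RewardManager loops over the batch.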
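As a minimal sketch of that shape, the function below takes the four arguments named in the compute_score_fn call and returns a float. The type hints, the `None` default for `extra_info`, and the substring check in the body are assumptions made for illustration, not part of verl.

```python
from typing import Any, Dict, Optional


def compute_score(
    data_source: str,
    solution_str: str,
    ground_truth: str,
    extra_info: Optional[Dict[str, Any]] = None,
) -> float:
    """Score one decoded response against its ground-truth answer."""
    # Placeholder logic (an assumption): full credit when the ground-truth
    # string appears anywhere in the response, zero otherwise.
    truth = ground_truth.strip()
    return 1.0 if truth and truth in solution_str else 0.0
```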
Registering via Config
Point verl at your reward function file using custom_reward_function in the trainer config.
If your function is named compute_score and you only run a single experiment, you can leave name unset; it defaults to compute_score. To run multiple experiments with different functions defined in the same file, set name explicitly for each run.
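To make the file-plus-name lookup concrete, here is a rough sketch of how a loader could resolve a function from a standalone Python file, with the name defaulting to compute_score. It is not verl's loader: `load_reward_fn` and its `path` and `name` parameters are hypothetical, and only the standard-library `importlib` calls are real.

```python
import importlib.util
from typing import Callable


def load_reward_fn(path: str, name: str = "compute_score") -> Callable[..., float]:
    """Import the Python file at `path` and return the function called `name`."""
    spec = importlib.util.spec_from_file_location("custom_reward_module", path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load a module from {path!r}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code

    fn = getattr(module, name, None)
    if not callable(fn):
        raise AttributeError(f"{path!r} does not define a callable named {name!r}")
    return fn
```

Keeping several scoring functions in one file and selecting one by name means each experiment changes a single setting instead of editing code.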
Pre-Built Reward Functions
verl ships ready-to-use reward functions for the most common math reasoning benchmarks in verl/utils/reward_score/.
GSM8K
The GSM8K reward function (verl/utils/reward_score/gsm8k.py) enforces a strict output format — the model must emit its final answer after #### — and compares it against the extracted ground truth:
| Condition | Score |
|---|---|
| Answer extracted and matches ground_truth exactly | 1.0 |
| Answer in correct #### <number> format but wrong value | format_score (default 0.0) |
| No #### <number> pattern found | 0 |
The method="strict" mode tests both format and correctness. The method="flexible" mode accepts any trailing number as the answer, which is more lenient about formatting.
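The scoring table and the two modes can be approximated as follows. This is a sketch written from the description above, not a copy of gsm8k.py: the function names `extract_answer` and `gsm8k_style_score` and the exact regular expression are assumptions.

```python
import re
from typing import Optional

# A signed integer or decimal, optionally with thousands separators.
_NUMBER = r"-?\d[\d,]*(?:\.\d+)?"


def extract_answer(solution_str: str, method: str = "strict") -> Optional[str]:
    """Pull the final numeric answer out of a response, or return None."""
    if method == "strict":
        # Require the explicit "#### <number>" marker.
        match = re.search(r"#### (" + _NUMBER + r")", solution_str)
        answer = match.group(1) if match else None
    elif method == "flexible":
        # Accept the last number that appears anywhere in the response.
        numbers = re.findall(_NUMBER, solution_str)
        answer = numbers[-1] if numbers else None
    else:
        raise ValueError(f"unknown method: {method!r}")
    # Drop thousands separators so "1,000" compares equal to "1000".
    return answer.replace(",", "") if answer is not None else None


def gsm8k_style_score(
    solution_str: str,
    ground_truth: str,
    method: str = "strict",
    format_score: float = 0.0,
) -> float:
    """Apply the three-row scoring table to one response."""
    answer = extract_answer(solution_str, method)
    if answer is None:
        return 0.0  # no answer found in the expected form
    return 1.0 if answer == ground_truth else format_score
```

Setting `format_score` slightly above zero rewards responses that follow the output format even when the value is wrong, which can help early in training when few answers are correct.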
MATH
The MATH reward function (verl/utils/reward_score/math_reward.py) follows the scoring implementation from the lm-evaluation-harness repository, which handles LaTeX normalization and symbolic equivalence checking for competition mathematics.
Writing a Custom Reward Function
Example: Binary Exact Match
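A minimal sketch of a binary exact-match reward, assuming the whole response should equal the ground truth once case and whitespace differences are ignored. The `_normalize` helper is hypothetical.

```python
from typing import Any, Dict, Optional


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def compute_score(
    data_source: str,
    solution_str: str,
    ground_truth: str,
    extra_info: Optional[Dict[str, Any]] = None,
) -> float:
    """Return 1.0 on a normalized exact match and 0.0 otherwise."""
    return 1.0 if _normalize(solution_str) == _normalize(ground_truth) else 0.0
```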
Example: Length-Normalized Reward
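One possible reading of a length-normalized reward is sketched below: a correct answer earns full credit minus a small penalty that grows with response length, and the result stays inside [0.0, 1.0]. The constants `MAX_WORDS` and `LENGTH_WEIGHT`, the whitespace word count, and the substring correctness check are all assumptions chosen for illustration.

```python
from typing import Any, Dict, Optional

MAX_WORDS = 200      # hypothetical length at which the penalty saturates
LENGTH_WEIGHT = 0.2  # hypothetical maximum fraction of the reward lost to length


def compute_score(
    data_source: str,
    solution_str: str,
    ground_truth: str,
    extra_info: Optional[Dict[str, Any]] = None,
) -> float:
    """Reward correct answers, discounted by how long the response is."""
    truth = ground_truth.strip()
    correct = 1.0 if truth and truth in solution_str else 0.0

    # Fraction of the length budget used, capped at 1.0 so the score is bounded.
    length_ratio = min(len(solution_str.split()) / MAX_WORDS, 1.0)

    # Correct answers land in [1 - LENGTH_WEIGHT, 1.0]; wrong answers get 0.0.
    return correct * (1.0 - LENGTH_WEIGHT * length_ratio)
```

Capping the ratio means a very long response cannot push the reward below a fixed floor, which keeps the signal in a predictable range.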
Example: Multi-Dataset Dispatch
If your training set mixes multiple datasets, a single compute_score function can dispatch to different scoring logic based on data_source.
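A dispatch table is one way to organize this. In the sketch below the dataset identifiers `my_math_dataset` and `my_qa_dataset` and both scorer helpers are hypothetical placeholders; substitute the data_source values stored in your own Parquet files.

```python
import re
from typing import Any, Callable, Dict, Optional


def _score_math(solution_str: str, ground_truth: str) -> float:
    """Compare the last number in the response with the ground truth."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", solution_str)
    return 1.0 if numbers and numbers[-1] == ground_truth.strip() else 0.0


def _score_qa(solution_str: str, ground_truth: str) -> float:
    """Case-insensitive exact match on the stripped strings."""
    return 1.0 if solution_str.strip().lower() == ground_truth.strip().lower() else 0.0


# Maps each data_source value to its scoring function.
_SCORERS: Dict[str, Callable[[str, str], float]] = {
    "my_math_dataset": _score_math,
    "my_qa_dataset": _score_qa,
}


def compute_score(
    data_source: str,
    solution_str: str,
    ground_truth: str,
    extra_info: Optional[Dict[str, Any]] = None,
) -> float:
    """Route one sample to the scorer registered for its dataset."""
    scorer = _SCORERS.get(data_source)
    if scorer is None:
        # Fail loudly so an unregistered dataset is not silently scored as zero.
        raise NotImplementedError(f"No scorer registered for data_source={data_source!r}")
    return scorer(solution_str, ground_truth)
```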
Model-Based Rewards
To use a trained reward model instead of (or in addition to) a rule-based function, configure the reward.reward_model block.
When enable: true is set, verl spins up a RewardWorkerGroup that loads the reward model and runs inference on each response batch. The scalar scores are written into the DataProto, and the RewardManager returns them directly without calling a compute_score function.
Reward Scaling and Stability
The reward signal is used directly in the PPO advantage computation. Raw scores that are very large (e.g., token counts in the hundreds) will produce large advantages and destabilize policy gradient updates. Normalize your reward to a bounded range such as [0.0, 1.0] or [-1.0, 1.0] before returning it from compute_score.
Debugging Reward Functions
The NaiveRewardManager prints decoded responses to the console for the first num_examine batches, so the num_examine setting controls how much output you see.
Test your compute_score function in isolation before plugging it into training.
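A small table-driven check is usually enough for this. The sketch below includes a stand-in `compute_score` so it runs on its own; in practice you would import the function from your reward file instead. The `CASES` list and the `run_cases` helper are hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


def compute_score(
    data_source: str,
    solution_str: str,
    ground_truth: str,
    extra_info: Optional[Dict[str, Any]] = None,
) -> float:
    # Stand-in scorer; replace with an import of your own function.
    return 1.0 if solution_str.strip() == ground_truth.strip() else 0.0


# (data_source, solution_str, ground_truth, expected_score)
CASES: List[Tuple[str, str, str, float]] = [
    ("my_dataset", "42", "42", 1.0),
    ("my_dataset", " 42\n", "42", 1.0),
    ("my_dataset", "41", "42", 0.0),
    ("my_dataset", "", "42", 0.0),
]


def run_cases(fn: Callable[..., Any], cases: List[Tuple[str, str, str, float]]) -> List[tuple]:
    """Return the cases whose result is not a float equal to the expected score."""
    failures = []
    for data_source, solution_str, ground_truth, expected in cases:
        got = fn(data_source, solution_str, ground_truth, None)
        if not isinstance(got, float) or abs(got - expected) > 1e-9:
            failures.append((solution_str, ground_truth, expected, got))
    return failures


if __name__ == "__main__":
    failed = run_cases(compute_score, CASES)
    for solution_str, ground_truth, expected, got in failed:
        print(f"FAIL: {solution_str!r} vs {ground_truth!r}: expected {expected}, got {got!r}")
    print(f"{len(CASES) - len(failed)}/{len(CASES)} cases passed")
```

Including edge cases such as an empty response and stray whitespace catches the failures most likely to show up once real model outputs arrive.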