Evaluate learned reward basis on RewardBench 2

Once a LoRe reward basis has been trained on PRISM, it is natural to ask whether the learned directions generalize beyond the training distribution. eval_rb2.py provides this check by evaluating the saved basis vectors on allenai/reward-bench-2, a standard benchmark covering factuality, instruction following, math, safety, focus, and ties. If PRISM accuracy is high but RewardBench 2 accuracy is low, the basis has likely overfit to PRISM’s particular user and dialogue distribution.

eval_rb2.py is only available in the PRISM/ directory. RedditTLDR and PersonalLLM do not include a RewardBench 2 evaluator.

Prerequisites

Train the PRISM basis and save V checkpoints to disk. run_regularized() in utils.py does this automatically for K≥2:

filename = f"PRISM_V_lore_K_{K}_alpha_{alpha}.pt"
torch.save(V_joint, filename)

Pass the path to one of these .pt files as --rm_head.
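Before launching a full evaluation, it can help to confirm that a checkpoint has the shape the evaluator expects. The following is a minimal sketch and is not part of eval_rb2.py: the helper name describe_head is hypothetical, it assumes the file holds a bare tensor as written by torch.save(V_joint, filename), and the demo saves a random stand-in matrix rather than a trained basis.

```python
import os
import tempfile

import torch


def describe_head(path: str) -> tuple:
    """Load a tensor checkpoint and return its shape as a tuple.

    Hypothetical helper for illustration; assumes a bare-tensor checkpoint.
    """
    obj = torch.load(path, map_location="cpu")
    if not isinstance(obj, torch.Tensor):
        raise TypeError(f"expected a tensor checkpoint, got {type(obj).__name__}")
    return tuple(obj.shape)


# Demo with a random stand-in matrix: H=16 hidden dims, 5 columns.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "PRISM_V_lore_K_5_alpha_10000.0.pt")
    torch.save(torch.randn(16, 5), demo_path)
    demo_shape = describe_head(demo_path)
    print(demo_shape)
```

For a real checkpoint, the first dimension should match the hidden size of the model passed via --model.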

Running the evaluation

Basic evaluation with a saved basis

cd LoRe/PRISM
python eval_rb2.py --rm_head "PRISM_V_lore_K_5_alpha_10000.0.pt"

This runs in lmhead mode: the script loads Skywork-Reward-Llama-3.1-8B-v0.2 as a causal LM, takes the final-layer hidden state at the last non-padding token of each response, and scores it with the provided head matrix.

Evaluate multiple basis vectors at once

If your .pt file contains a matrix of shape (H, B) — for example, all K basis columns stacked — the script evaluates all B heads in a single pass and reports per-head accuracy:

python eval_rb2.py --rm_head "PRISM_V_lore_K_10_alpha_10000.0.pt" \
    --top_k 5 --save_csv results_K10.csv

Run the default pre-trained head (no LoRe)

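Omit --rm_head to use sequence-classification mode with the unmodified reward model head. In the default auto mode this is inferred from the absence of --rm_head; passing --mode seqclf makes the choice explicit. This gives the baseline accuracy before any LoRe training: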

python eval_rb2.py --mode seqclf

CLI arguments

python eval_rb2.py [OPTIONS]

--model          HF model id or local path (default: Skywork/Skywork-Reward-Llama-3.1-8B-v0.2)
--mode           auto | seqclf | lmhead  (auto infers from --rm_head presence)
--rm_head        Path to saved V weights (.npy or .pt/.pth/.bin); shape (H,), (H,1), or (H,B)
--head_key       Key to use if --rm_head is a dict checkpoint
--head_bias      Optional bias: scalar or path to .npy/.pt of shape () or (B,)
--split          Dataset split to evaluate (default: test)
--subset         Evaluate a single subset: Factuality | Precise IF | Math | Safety | Focus | Ties
--limit          Limit number of prompts (smoke test)
--batch_size     Batch size (default: 8)
--max_length     Max token length (default: 2048)
--save_csv       Write per-head summary CSV when using multi-head
--top_k          Number of top heads to print (default: 10)
--verbose_failures  Print failing examples

How scoring works

For each example in RewardBench 2, the script applies the tokenizer’s chat template to every chosen and rejected response:

def apply_template(tokenizer, prompt: str, response: str) -> str:
    messages = [
        {"role": "user",      "content": prompt},
        {"role": "assistant", "content": response},
    ]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )

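In lmhead mode, the final-layer hidden state of the last non-padding token is projected through the head matrix V of shape (H, B). The token index is computed as attention_mask.sum(dim=1) - 1, which is correct when batches are right-padded: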

@torch.inference_mode()
def score_texts_lm_multihead(base_model, tokenizer, texts, device, max_length,
                              head_matrix, bias=None):
    enc    = collate(tokenizer, texts, max_length).to(device)
    out    = base_model(**enc, output_hidden_states=True)
    h_last = out.hidden_states[-1]            # (N, T, H)
    last_idx = enc["attention_mask"].sum(dim=1) - 1
    batch    = torch.arange(h_last.size(0), device=device)
    last_tok = h_last[batch, last_idx]        # (N, H)

    W      = head_matrix.to(device=device, dtype=last_tok.dtype)   # (H, B)
    scores = last_tok @ W                                           # (N, B)
    return scores.detach().float().cpu().numpy()
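To make the indexing concrete, here is a small toy sketch of the same gather-and-project step on hand-built tensors. It uses no model; the tensor values, the stand-in head matrix W, and the right-padded attention mask are all made up for illustration.

```python
import torch

# Toy batch: N=2 sequences, T=4 positions, H=3 hidden dims, right-padded.
h_last = torch.arange(2 * 4 * 3, dtype=torch.float32).reshape(2, 4, 3)
attention_mask = torch.tensor([
    [1, 1, 1, 0],  # 3 real tokens, so the last real index is 2
    [1, 1, 1, 1],  # 4 real tokens, so the last real index is 3
])

# Same indexing as in the excerpt above.
last_idx = attention_mask.sum(dim=1) - 1
batch = torch.arange(h_last.size(0))
last_tok = h_last[batch, last_idx]  # (N, H)

# Stand-in head matrix with B=2 heads: head 0 reads dim 0, head 1 reads dim 1.
W = torch.tensor([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 0.0],
])
scores = last_tok @ W  # (N, B)
print(scores)
```

Each row of scores holds one response's score under every head, which is why a single forward pass is enough to evaluate all B heads.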

A prompt is considered correct when every chosen response scores strictly higher than every rejected response:

min_c    = np.min(scores_correct, axis=0)   # (B,)
max_i    = np.max(scores_incorrect, axis=0) # (B,)
acc_strict = (min_c > max_i).astype(np.float32)
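As a worked illustration of the strict criterion, the sketch below applies the same three lines to made-up scores for a single prompt with two heads. The numbers are invented; only the min/max comparison comes from the excerpt above.

```python
import numpy as np

# Rows are responses, columns are heads (B=2). Values are invented.
scores_correct = np.array([
    [2.0, 0.5],
    [1.5, 0.9],
])
scores_incorrect = np.array([
    [1.0, 0.7],
    [0.2, 0.1],
    [1.2, 0.3],
])

min_c = np.min(scores_correct, axis=0)    # worst chosen score per head
max_i = np.max(scores_incorrect, axis=0)  # best rejected score per head
acc_strict = (min_c > max_i).astype(np.float32)

# Head 0 separates the two groups (1.5 > 1.2); head 1 does not (0.5 < 0.7).
print(acc_strict)
```

Because the comparison is strict, a head that scores a chosen and a rejected response identically is counted as wrong for that prompt.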

RewardBench 2 subsets

SUBSETS_V2 = ["Factuality", "Precise IF", "Math", "Safety", "Focus", "Ties"]

The Ties subset uses a weighted approximation in addition to strict accuracy. The margin variable is defined elsewhere in the script and is not shown in this excerpt:

spread_c           = np.max(scores_correct, axis=0) - np.min(scores_correct, axis=0)
ties_bonus         = (margin > spread_c).astype(np.float32)
ties_weighted_approx = 0.5 * acc_strict + 0.5 * ties_bonus
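The sketch below runs the Ties computation on invented scores for one prompt and two heads. The excerpt does not show how margin is defined, so this example assumes margin = min_c - max_i (the gap between the worst chosen and the best rejected response); that definition is an assumption and may differ from the script.

```python
import numpy as np

# Rows are responses, columns are heads (B=2). Values are invented.
scores_correct = np.array([
    [2.0, 1.0],
    [1.8, 0.2],
])
scores_incorrect = np.array([
    [1.0, 0.0],
    [0.5, -0.5],
])

min_c = np.min(scores_correct, axis=0)
max_i = np.max(scores_incorrect, axis=0)
acc_strict = (min_c > max_i).astype(np.float32)

# ASSUMPTION: margin is the chosen-vs-rejected gap; not confirmed by the source.
margin = min_c - max_i

spread_c = np.max(scores_correct, axis=0) - np.min(scores_correct, axis=0)
ties_bonus = (margin > spread_c).astype(np.float32)
ties_weighted_approx = 0.5 * acc_strict + 0.5 * ties_bonus

# Both heads pass the strict test, but only head 0 keeps its chosen responses
# closer to each other than to the best rejected response.
print(ties_weighted_approx)
```

Under this assumption, the bonus rewards heads that treat equally valid answers as near-ties while still separating them from wrong answers.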

Example output

================= RewardBench 2 Results =================
Factuality   | mean acc:   72.40% | best head #3 = 74.10%  (n=250)
Precise IF   | mean acc:   68.20% | best head #1 = 70.50%  (n=250)
Math         | mean acc:   61.80% | best head #0 = 63.20%  (n=250)
Safety       | mean acc:   80.10% | best head #2 = 82.30%  (n=250)
Focus        | mean acc:   75.60% | best head #3 = 77.40%  (n=250)
Ties         | mean strict: 55.30% | best head #1 = 57.10% | mean weighted: 58.40%  (n=100)
---------------------------------------------------------
#01 head    3 | overall(weighted) 72.10% | 5-cat 72.80% | ties strict 56.40% approx 59.20%

If PRISM accuracy is high but the RewardBench 2 overall score is substantially lower, the learned basis has likely overfit to PRISM preferences. Consider increasing alpha (the cosine regularization strength) when retraining, or reducing the number of training iterations.

Loading the head matrix

eval_rb2.py accepts .npy, .pt, .pth, and .bin files. For a plain torch.save(V_joint, path) checkpoint:

import os

import numpy as np
import torch

def load_head_matrix(path: str, key=None) -> np.ndarray:
    ext = os.path.splitext(path)[1].lower()
    if ext in (".pt", ".bin", ".pth"):
        obj = torch.load(path, map_location="cpu")
        if isinstance(obj, torch.Tensor):
            arr = obj.detach().cpu().numpy()
        # ... dict handling (via key) and the .npy branch omitted for brevity
    # Reject anything that is not a vector or a matrix.
    if arr.ndim not in (1, 2):
        raise ValueError(f"Head array must be 1D or 2D, got shape {arr.shape}")
    return arr.astype(np.float32, copy=False)

Shapes (H,), (H, 1), and (H, B) are all supported; a single vector of shape (H,) is automatically reshaped to (H, 1), so it is treated as one head.
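One way to picture the shape handling is the sketch below, which normalizes each supported shape to a 2D matrix. The helper name as_head_matrix is hypothetical and this is not the script's own code; it only illustrates the documented behavior.

```python
import numpy as np


def as_head_matrix(arr: np.ndarray) -> np.ndarray:
    """Return a float32 (H, B) matrix from a (H,), (H, 1), or (H, B) array.

    Hypothetical helper for illustration only.
    """
    if arr.ndim == 1:
        arr = arr[:, None]  # (H,) becomes (H, 1): a single head
    if arr.ndim != 2:
        raise ValueError(f"Head array must be 1D or 2D, got shape {arr.shape}")
    return arr.astype(np.float32, copy=False)


shape_vec = as_head_matrix(np.zeros(8)).shape
shape_col = as_head_matrix(np.zeros((8, 1))).shape
shape_mat = as_head_matrix(np.zeros((8, 3))).shape
print(shape_vec, shape_col, shape_mat)
```

With every input in (H, B) form, the scoring code can use one matrix multiplication regardless of how many heads the file contains.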
