Once a LoRe reward basis has been trained on PRISM, it is natural to ask whether the learned directions generalize beyond the training distribution. eval_rb2.py provides this check by evaluating the saved basis vectors on allenai/reward-bench-2, a standard benchmark covering factuality, instruction following, math, safety, focus, and ties. If PRISM accuracy is high but RewardBench 2 accuracy is low, the basis has likely overfit to PRISM’s particular user and dialogue distribution.
eval_rb2.py is only available in the PRISM/ directory. RedditTLDR and
PersonalLLM do not include a RewardBench 2 evaluator.
Prerequisites
Train the PRISM basis and save V checkpoints to disk. run_regularized() in utils.py does this automatically for K≥2:
filename = f"PRISM_V_lore_K_{K}_alpha_{alpha}.pt"
torch.save(V_joint, filename)
Pass the path to one of these .pt files as --rm_head.
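Before launching a full evaluation, it can help to confirm that the saved file holds a tensor of the expected shape. The following is a minimal sketch and not part of eval_rb2.py: describe_head_file is a hypothetical helper, and the small random tensor stands in for a real V_joint so the example runs without a trained basis.

```python
import os
import tempfile

import torch


def describe_head_file(path: str) -> tuple:
    """Load a tensor checkpoint on CPU and return its shape as a tuple."""
    obj = torch.load(path, map_location="cpu")
    if not isinstance(obj, torch.Tensor):
        raise TypeError(f"Expected a tensor, got {type(obj).__name__}")
    return tuple(obj.shape)


# Stand-in checkpoint (assumption): H=16 and K=5 are illustrative values only;
# a real file would use the reward model's hidden size for H.
H, K, alpha = 16, 5, 10000.0
demo_path = os.path.join(tempfile.mkdtemp(), f"PRISM_V_lore_K_{K}_alpha_{alpha}.pt")
torch.save(torch.randn(H, K), demo_path)

shape = describe_head_file(demo_path)
print(shape)
```

If the first dimension does not match the hidden size of the model passed via --model, the projection in lmhead mode cannot work, so this is a cheap check to run first.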
Running the evaluation
Basic evaluation with a saved basis
cd LoRe/PRISM
python eval_rb2.py --rm_head "PRISM_V_lore_K_5_alpha_10000.0.pt"
This runs in lmhead mode: the script loads Skywork-Reward-Llama-3.1-8B-v0.2
as a causal LM, extracts the last hidden state, and scores each response
using the provided head matrix.
Evaluate multiple basis vectors at once
If your .pt file contains a matrix of shape (H, B), for example all
K basis columns stacked, the script evaluates all B heads in a single
pass and reports per-head accuracy:
python eval_rb2.py --rm_head "PRISM_V_lore_K_10_alpha_10000.0.pt" \
    --top_k 5 --save_csv results_K10.csv
Run the default pre-trained head (no LoRe)
Omit --rm_head to use sequence-classification mode with the unmodified
reward model head. This gives the baseline accuracy before any LoRe
training:
python eval_rb2.py --mode seqclf
CLI arguments
python eval_rb2.py [OPTIONS]
  --model             HF model id or local path (default: Skywork/Skywork-Reward-Llama-3.1-8B-v0.2)
  --mode              auto | seqclf | lmhead (auto infers from --rm_head presence)
  --rm_head           Path to saved V weights (.npy or .pt/.pth/.bin); shape (H,), (H,1), or (H,B)
  --head_key          Key to use if --rm_head is a dict checkpoint
  --head_bias         Optional bias: scalar or path to .npy/.pt of shape () or (B,)
  --split             Dataset split to evaluate (default: test)
  --subset            Evaluate a single subset: Factuality | Precise IF | Math | Safety | Focus | Ties
  --limit             Limit number of prompts (smoke test)
  --batch_size        Batch size (default: 8)
  --max_length        Max token length (default: 2048)
  --save_csv          Write per-head summary CSV when using multi-head
  --top_k             Number of top heads to print (default: 10)
  --verbose_failures  Print failing examples
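The auto setting for --mode can be read as a simple rule: use lmhead when a head file is supplied, otherwise fall back to seqclf. The sketch below illustrates that rule as described on this page; resolve_mode is a hypothetical name, and the script's actual implementation may differ.

```python
from typing import Optional

VALID_MODES = ("auto", "seqclf", "lmhead")


def resolve_mode(mode: str, rm_head: Optional[str]) -> str:
    """Pick the scoring mode, inferring it from rm_head when mode is 'auto'."""
    if mode not in VALID_MODES:
        raise ValueError(f"Unknown mode {mode!r}; expected one of {VALID_MODES}")
    if mode != "auto":
        return mode  # an explicit choice always wins
    # Assumption: a non-empty --rm_head path implies lmhead mode.
    return "lmhead" if rm_head else "seqclf"


print(resolve_mode("auto", "PRISM_V_lore_K_5_alpha_10000.0.pt"))
print(resolve_mode("auto", None))
```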
How scoring works
For each example in RewardBench 2, the script applies the tokenizer’s chat
template to every chosen and rejected response:
def apply_template(tokenizer, prompt: str, response: str) -> str:
    messages = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )
In lmhead mode, the hidden state of the last non-padding token is projected
through the head matrix V, which has shape (H, B):
@torch.inference_mode()
def score_texts_lm_multihead(base_model, tokenizer, texts, device, max_length,
                             head_matrix, bias=None):
    enc = collate(tokenizer, texts, max_length).to(device)
    out = base_model(**enc, output_hidden_states=True)
    h_last = out.hidden_states[-1]                           # (N, T, H)
    last_idx = enc["attention_mask"].sum(dim=1) - 1
    batch = torch.arange(h_last.size(0), device=device)
    last_tok = h_last[batch, last_idx]                       # (N, H)
    W = head_matrix.to(device=device, dtype=last_tok.dtype)  # (H, B)
    scores = last_tok @ W                                    # (N, B)
    return scores.detach().float().cpu().numpy()
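The indexing step is easier to follow on a toy tensor. The sketch below reproduces only the last-token selection and the projection, using made-up hidden states and an identity-like head matrix chosen for illustration. Note that computing the last index as attention_mask.sum(dim=1) - 1 is only correct when sequences are right-padded; with left padding it would select the wrong position.

```python
import torch

# Toy batch: N=2 sequences, T=4 positions, hidden size H=3, B=2 heads.
N, T, H, B = 2, 4, 3, 2
h_last = torch.arange(N * T * H, dtype=torch.float32).reshape(N, T, H)

# Right-padded attention mask: sequence 0 has 2 real tokens, sequence 1 has 4.
attention_mask = torch.tensor([[1, 1, 0, 0],
                               [1, 1, 1, 1]])

# Index of the last real token per sequence (valid for right padding only).
last_idx = attention_mask.sum(dim=1) - 1
batch = torch.arange(N)
last_tok = h_last[batch, last_idx]   # (N, H)

# Illustrative head matrix: the first B columns of the identity, so each
# head simply reads out one hidden dimension.
W = torch.eye(H)[:, :B]              # (H, B)
scores = last_tok @ W                # (N, B)

print(last_idx.tolist())
print(scores.tolist())
```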
A prompt is considered correct when every chosen response scores strictly
higher than every rejected response:
min_c = np.min(scores_correct, axis=0) # (B,)
max_i = np.max(scores_incorrect, axis=0) # (B,)
acc_strict = (min_c > max_i).astype(np.float32)
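A small worked example makes the strict criterion concrete. The scores below are invented for illustration: one prompt with two chosen and three rejected responses, scored by three heads. Rows are responses and columns are heads, matching the axis=0 reductions above.

```python
import numpy as np

# Invented scores for one prompt; columns correspond to B=3 heads.
scores_correct = np.array([[2.0, 1.0, 0.5],
                           [1.5, 0.2, 0.9]])    # 2 chosen responses
scores_incorrect = np.array([[1.0, 0.3, 0.5],
                             [0.4, 0.1, 0.2],
                             [1.2, 0.0, 0.4]])  # 3 rejected responses

min_c = np.min(scores_correct, axis=0)    # worst chosen score per head
max_i = np.max(scores_incorrect, axis=0)  # best rejected score per head
acc_strict = (min_c > max_i).astype(np.float32)

print(acc_strict)
```

Head 0 passes because its worst chosen score (1.5) beats its best rejected score (1.2). Head 1 fails because one chosen response scores 0.2, below a rejected 0.3. Head 2 fails because 0.5 equals 0.5 and the comparison is strict.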
RewardBench 2 subsets
SUBSETS_V2 = ["Factuality", "Precise IF", "Math", "Safety", "Focus", "Ties"]
The Ties subset uses a weighted approximation in addition to strict accuracy:
spread_c = np.max(scores_correct, axis=0) - np.min(scores_correct, axis=0)
ties_bonus = (margin > spread_c).astype(np.float32)
ties_weighted_approx = 0.5 * acc_strict + 0.5 * ties_bonus
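The excerpt above does not show how margin is computed. The sketch below assumes margin = min_c - max_i, the gap between the worst chosen and the best rejected score; this is an assumption made for illustration and may not match the script. Under that assumption, a head earns the ties bonus when the chosen responses sit closer to each other than to the best rejected response.

```python
import numpy as np

# Invented scores for one Ties prompt; columns correspond to B=2 heads.
scores_correct = np.array([[2.0, 2.0],
                           [1.8, 1.0]])
scores_incorrect = np.array([[1.0, 0.5],
                             [0.5, 0.2]])

min_c = np.min(scores_correct, axis=0)
max_i = np.max(scores_incorrect, axis=0)
acc_strict = (min_c > max_i).astype(np.float32)

# ASSUMPTION: margin is the gap between worst chosen and best rejected.
margin = min_c - max_i
spread_c = np.max(scores_correct, axis=0) - np.min(scores_correct, axis=0)
ties_bonus = (margin > spread_c).astype(np.float32)
ties_weighted_approx = 0.5 * acc_strict + 0.5 * ties_bonus

print(ties_weighted_approx)
```

Both heads pass the strict test here, but only head 0 keeps its chosen responses tightly grouped (spread about 0.2 against a margin of about 0.8), so head 1 receives half credit.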
Example output
================= RewardBench 2 Results =================
Factuality | mean acc: 72.40% | best head #3 = 74.10% (n=250)
Precise IF | mean acc: 68.20% | best head #1 = 70.50% (n=250)
Math | mean acc: 61.80% | best head #0 = 63.20% (n=250)
Safety | mean acc: 80.10% | best head #2 = 82.30% (n=250)
Focus | mean acc: 75.60% | best head #3 = 77.40% (n=250)
Ties | mean strict: 55.30% | best head #1 = 57.10% | mean weighted: 58.40% (n=100)
---------------------------------------------------------
#01 head 3 | overall(weighted) 72.10% | 5-cat 72.80% | ties strict 56.40% approx 59.20%
If PRISM evaluation shows high accuracy but RewardBench 2 overall is
substantially lower, the learned basis has likely overfit to PRISM preferences.
Consider increasing alpha (the cosine regularization strength) when
retraining, or reducing the number of training iterations.
Loading the head matrix
eval_rb2.py accepts .npy, .pt, .pth, and .bin files. For a plain
torch.save(V_joint, path) checkpoint:
def load_head_matrix(path: str, key=None) -> np.ndarray:
    ext = os.path.splitext(path)[1].lower()
    if ext in (".pt", ".bin", ".pth"):
        obj = torch.load(path, map_location="cpu")
        if isinstance(obj, torch.Tensor):
            arr = obj.detach().cpu().numpy()
        # ... dict handling omitted for brevity
    if arr.ndim not in (1, 2):
        raise ValueError(f"Head array must be 1D or 2D, got shape {arr.shape}")
    return arr.astype(np.float32, copy=False)
Shapes (H,), (H, 1), and (H, B) are all supported; single-vector files
of shape (H,) are automatically reshaped to (H, 1).
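The shape handling can be sketched as a small normalization step. as_head_matrix is a hypothetical helper written for illustration, not a function from eval_rb2.py; it shows one way a 1D vector can be promoted to a single-column matrix so the rest of the pipeline only has to deal with 2D input.

```python
import numpy as np


def as_head_matrix(arr) -> np.ndarray:
    """Return a float32 array of shape (H, B), promoting (H,) to (H, 1)."""
    arr = np.asarray(arr, dtype=np.float32)
    if arr.ndim == 1:
        arr = arr[:, None]  # add a trailing head axis: (H,) -> (H, 1)
    elif arr.ndim != 2:
        raise ValueError(f"Head array must be 1D or 2D, got shape {arr.shape}")
    return arr


single = as_head_matrix(np.ones(8))        # one basis vector
column = as_head_matrix(np.ones((8, 1)))   # already a single column
stacked = as_head_matrix(np.ones((8, 5)))  # five heads stacked as columns

print(single.shape, column.shape, stacked.shape)
```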