Evaluation and accuracy functions

LoRe measures reward model quality by checking how often a personalized reward correctly ranks the chosen response above the rejected one. evaluate_model() does this for a single user, eval_multiple() aggregates across a user population, and sample_shots() creates sub-sampled few-shot datasets for controlled sample-efficiency experiments.

evaluate_model()

Computes the fraction of preference pairs for which a user’s personalized reward model correctly ranks chosen over rejected.

from utils import evaluate_model

acc = evaluate_model(X, V, w)
# acc: float in [0, 1]

The accuracy is defined as:

(X @ V @ w > 0).mean()

where X contains feature differences (chosen embedding minus rejected embedding), so the reward correctly ranks a pair when the dot product is positive.
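The definition above can be sketched in a few lines of PyTorch. This is a minimal illustration of the formula, not the library's source: the function name pairwise_accuracy and the toy tensors are assumptions made for the example. Note the explicit .float() cast, which is needed because PyTorch does not compute mean() on boolean tensors.

```python
import torch

def pairwise_accuracy(X, V, w):
    """Fraction of rows of X whose reward difference X @ V @ w is positive.

    Hypothetical stand-in for illustration; not the library's evaluate_model().
    """
    X = torch.as_tensor(X, dtype=torch.float32)
    with torch.no_grad():
        scores = X @ V @ w  # shape [M]: reward(chosen) - reward(rejected)
        return (scores > 0).float().mean().item()

# Toy data: F = 2 features, K = 2 basis directions, M = 4 preference pairs.
V = torch.eye(2)                 # identity basis, so the score is simply X @ w
w = torch.tensor([0.5, 0.5])     # uniform weights over the two directions
X = torch.tensor([
    [1.0, 0.0],    # score  0.5 -> ranked correctly
    [0.0, 2.0],    # score  1.0 -> ranked correctly
    [-1.0, 0.0],   # score -0.5 -> ranked incorrectly
    [3.0, -1.0],   # score  1.0 -> ranked correctly
])
acc = pairwise_accuracy(X, V, w)  # 3 of 4 pairs ranked correctly
```

A pair whose score is exactly zero counts as incorrect under the strict inequality.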

Parameters

X

Tensor or array-like

required

Feature difference matrix for one user, shape [M, F]. Each row is embedding(chosen) - embedding(rejected) for a single preference pair. Internally converted to torch.float32 if not already a tensor.

V

torch.Tensor

required

Reward basis matrix, shape [F, K]. Shared across users after basis training.

w

torch.Tensor

required

Per-user weight vector over basis directions, shape [K]. Should be softmax-normalized (output of PersonalizeBatch.train() or LoRe.train()).

Return value

fraction_positive

float

Fraction of preference pairs ranked correctly, in [0, 1]. Random chance yields 0.5; a perfect model yields 1.0.

Example

# After run() or solve_regularized():
for i in range(N):
    acc = evaluate_model(
        X=test_features_sparse[i],
        V=V_joint.detach(),
        w=W_joint[i],
    )
    print(f"User {i}: {acc:.4f}")

evaluate_model() accepts X as either a NumPy array or a PyTorch tensor. It wraps the input in torch.tensor(..., dtype=torch.float32) unconditionally, which creates a copy even if X is already a float32 tensor. The copy is made on every call regardless of input type, so budget for that overhead in tight evaluation loops.
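The copy behaviour described above can be observed directly in PyTorch. The sketch below contrasts torch.tensor(), which always copies, with torch.as_tensor(), which returns the input itself when dtype and device already match. It only demonstrates the two constructors; it does not change how evaluate_model() behaves, and the variable names are illustrative.

```python
import warnings
import torch

x = torch.randn(1000, 16)  # already float32 on the CPU

with warnings.catch_warnings():
    # torch.tensor() on an existing tensor emits a UserWarning; silence it here.
    warnings.simplefilter("ignore")
    copied = torch.tensor(x, dtype=torch.float32)

shared = torch.as_tensor(x, dtype=torch.float32)

# torch.tensor() allocated new storage; torch.as_tensor() reused the original.
copy_is_new_storage = copied.data_ptr() != x.data_ptr()
as_tensor_shares = shared.data_ptr() == x.data_ptr()
```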

eval_multiple()

Evaluates accuracy for a list of users, each potentially with a different V and w. Prints population mean and standard deviation, then returns all per-user accuracies.

from utils import eval_multiple

# Seen users, shared V
accuracies = eval_multiple(
    W_list=W_joint,                                     # list of N tensors [K]
    V_list=[V_joint.detach() for _ in range(N)],        # list of N tensors [F, K]
    test_features=test_features_sparse,                 # list of N tensors [M_i, F]
)
# Prints:
#   Average accuracy: 0.7234
#   Standard deviation of accuracy: 0.0841

Parameters

W_list

list[Tensor]

required

Per-user weight vectors, one tensor per user, each shape [K]. Must be the same length as test_features.

V_list

list[Tensor]

required

Reward basis matrices, one per user, each shape [F, K]. In most cases all entries are the same shared V_joint; the list form supports heterogeneous bases if needed.

test_features

list[Tensor]

required

Per-user feature difference tensors, shape [M_i, F] per user. Must have the same length as W_list and V_list.

Return value

accuracies

list[float]

Per-user accuracy values, same length as test_features. Each value is evaluate_model(test_features[i], V_list[i], W_list[i]).

Example

import numpy as np

# Unseen users after few-shot adaptation
W_few_shot = learn_multiple_few_shot(
    train_features_unseen, V_joint.detach(),
    num_iterations=500, learning_rate=0.1,
)
accuracies = eval_multiple(
    W_list=W_few_shot,
    V_list=[V_joint.detach() for _ in range(N_unseen)],
    test_features=test_features_sparse_unseen,
)
mean_acc = np.mean(accuracies)
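The behaviour documented for eval_multiple() (one accuracy per user, plus a printed mean and standard deviation) can be sketched as follows. This is a hypothetical re-implementation for illustration, not the library's code: the name eval_multiple_sketch, the explicit length check, and the choice of population standard deviation are all assumptions.

```python
import statistics
import torch

def eval_multiple_sketch(W_list, V_list, test_features):
    """Per-user pairwise accuracy; hypothetical illustration only."""
    if not (len(W_list) == len(V_list) == len(test_features)):
        raise ValueError("W_list, V_list and test_features must have equal length")
    accuracies = []
    for w, V, X in zip(W_list, V_list, test_features):
        scores = torch.as_tensor(X, dtype=torch.float32) @ V @ w
        accuracies.append((scores > 0).float().mean().item())
    # Assumption: population standard deviation (ddof = 0).
    print(f"Average accuracy: {statistics.fmean(accuracies):.4f}")
    print(f"Standard deviation of accuracy: {statistics.pstdev(accuracies):.4f}")
    return accuracies

# Two toy users sharing one basis (F = 2, K = 2).
V = torch.eye(2)
W_list = [torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0])]
test_features = [
    torch.tensor([[1.0, -1.0], [-1.0, 1.0]]),             # scores  1, -1
    torch.tensor([[0.0, 1.0], [1.0, 2.0], [-1.0, 3.0]]),  # scores  1, 2, 3
]
accs = eval_multiple_sketch(W_list, [V, V], test_features)
```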

sample_shots()

Randomly sub-samples a fixed number of preference pairs from each user’s tensor. Used to construct few-shot datasets for controlled sample-efficiency experiments.

from utils import sample_shots

few_shot_features = sample_shots(
    train_features_unseen=train_features_unseen,
    shots=10,
)
# few_shot_features: list of N_unseen tensors, each shape [10, F]

Parameters

train_features_unseen

list[Tensor]

required

Full per-user preference tensors. List of N_unseen tensors, each shape [M_i, F]. Each user must have at least shots rows; no bounds checking is performed.

shots

int

required

Number of preference pairs to sample per user. Sampling is without replacement via a random permutation of row indices.

Return value

sampled_features

list[Tensor]

List of N_unseen tensors, each shape [shots, F], containing randomly selected rows from the corresponding input tensor.
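Sampling without replacement via a random permutation of row indices can be sketched as below. This is an illustrative stand-in rather than the library's implementation: the name sample_shots_sketch and the optional generator argument (added here only to make the example reproducible) are assumptions.

```python
import torch

def sample_shots_sketch(features_per_user, shots, generator=None):
    """Pick `shots` distinct rows per user; hypothetical illustration only."""
    sampled = []
    for feats in features_per_user:
        # A permutation of all row indices, truncated to the first `shots`.
        idx = torch.randperm(feats.size(0), generator=generator)[:shots]
        sampled.append(feats[idx])
    return sampled

g = torch.Generator().manual_seed(0)
users = [
    torch.arange(12.0).reshape(6, 2),   # user 0: 6 pairs, F = 2
    torch.arange(20.0).reshape(10, 2),  # user 1: 10 pairs, F = 2
]
few = sample_shots_sketch(users, shots=4, generator=g)
shapes = [tuple(t.shape) for t in few]
# Rows are drawn without replacement, so no row appears twice.
distinct_rows_user0 = len({tuple(r.tolist()) for r in few[0]})
```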

Example

import numpy as np

# Vary shots in an evaluation loop
for shots in [1, 5, 10, 20, 50]:
    few_shot_data = sample_shots(train_features_unseen, shots)
    W = learn_multiple_few_shot(
        few_shot_data, V_joint.detach(),
        num_iterations=500, learning_rate=0.1,
    )
    accs = eval_multiple(
        W, [V_joint.detach()] * N_unseen, test_features_sparse_unseen
    )
    print(f"shots={shots}: {np.mean(accs):.4f} ± {np.std(accs):.4f}")

If shots exceeds M_i for any user, torch.randperm(tensor.size(0))[:shots] will silently return fewer rows than requested rather than raising an error. Validate your data lengths before calling this function in a loop.
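One way to guard against the silent truncation described above is a small pre-check run before sampling. The helper below is a hypothetical sketch (check_min_rows is not part of the library); it also reproduces the truncation on toy data so the failure mode is visible.

```python
import torch

def check_min_rows(features_per_user, shots):
    """Raise ValueError listing users with fewer than `shots` rows.

    Hypothetical helper for illustration; not part of the library.
    """
    short = [
        (i, t.size(0))
        for i, t in enumerate(features_per_user)
        if t.size(0) < shots
    ]
    if short:
        raise ValueError(
            f"users with fewer than {shots} pairs (index, rows): {short}"
        )

users = [torch.zeros(8, 3), torch.zeros(3, 3)]

# Without a check, slicing the permutation yields only 3 rows for user 1.
truncated = users[1][torch.randperm(users[1].size(0))[:5]]
truncated_rows = truncated.size(0)

# With the check, the problem surfaces before any sampling happens.
try:
    check_min_rows(users, shots=5)
    caught = None
except ValueError as err:
    caught = str(err)
```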
