One of LoRe’s central design goals is efficient adaptation to users the model has never seen. Once the shared basis V is trained, a new user only needs to provide a few preference pairs for LoRe to learn their personal mixture weights w: K parameters instead of a full reward model.
Two-phase approach
Learn the shared basis V
Train V and W jointly on all seen users’ preference data using solve_regularized_simplex. This captures the K most important directions of reward variation across the population. V is saved after training and reused for all future users.
Why fixing V works
The key insight is that if V’s K columns span the meaningful axes of human reward variation, any new user’s preferences can be expressed as a mixture over those axes. The new user does not need to teach the model new reward directions, only their specific combination. This is analogous to how a new user of a recommendation system does not need to create new item categories, only rate a few items to reveal their taste profile.
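One way to write this down, assuming V is a d × K matrix applied to d-dimensional embedding differences and w_i lies on the probability simplex. The symbols Δ, d, and σ (the logistic function) are notation introduced here for illustration and are not taken from the LoRe code:

$$
P(\text{chosen} \succ \text{rejected} \mid \text{user } i) = \sigma\!\left(w_i^\top V^\top \Delta\right),
\qquad \Delta = e_{\text{chosen}} - e_{\text{rejected}},
\qquad w_{i,k} \ge 0,\ \sum_{k=1}^{K} w_{i,k} = 1
$$

Under this form, once V is fixed the only unknowns for a new user are the K entries of w_i.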
PersonalizeBatch: fixing V and optimizing w
PersonalizeBatch holds a ParameterList of per-user weight vectors and uses a single shared Adam optimizer.
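A minimal sketch of that structure, assuming each user's weights are stored as unconstrained logits and mapped onto the simplex with a softmax. The class name PersonalizeBatchSketch, the logits attribute, and the default learning rate are illustrative and are not taken from the LoRe source:

```python
import torch
import torch.nn as nn


class PersonalizeBatchSketch(nn.Module):
    """Hypothetical stand-in for PersonalizeBatch: one K-dim vector per user."""

    def __init__(self, num_users: int, K: int, lr: float = 0.05):
        super().__init__()
        # One unconstrained parameter vector per user (assumed parameterization).
        self.logits = nn.ParameterList(
            [nn.Parameter(torch.zeros(K)) for _ in range(num_users)]
        )
        # A single Adam optimizer updates every user's vector together.
        self.optimizer = torch.optim.Adam(self.parameters(), lr=lr)

    def weights(self):
        # Softmax maps each vector to non-negative entries that sum to 1.
        return [torch.softmax(p, dim=0) for p in self.logits]
```

Note that the basis V does not appear among the module's parameters, so the optimizer has nothing to update except the per-user vectors.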
Forward pass
V is passed in as a fixed tensor argument, not as a module parameter, so it is never differentiated.
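The effect can be sketched with a standalone loss function. This assumes a Bradley-Terry style loss on chosen − rejected feature differences and a softmax over per-user logits; few_shot_loss and the tensor shapes are hypothetical, not the LoRe forward method itself:

```python
import torch
import torch.nn.functional as F


def few_shot_loss(logits, features, V):
    """Mean Bradley-Terry loss for one user; V is a plain tensor argument."""
    w = torch.softmax(logits, dim=0)      # (K,) simplex weights
    scores = features @ V @ w             # (n,) reward margin per pair
    return F.softplus(-scores).mean()     # equals -log(sigmoid(scores))


torch.manual_seed(0)
V = torch.randn(8, 3)                          # frozen basis, requires_grad is False
logits = torch.zeros(3, requires_grad=True)    # the only trainable quantity
features = torch.randn(5, 8)                   # 5 preference pairs, 8-dim embeddings

loss = few_shot_loss(logits, features, V)
loss.backward()
# Gradients reach the user's logits, while V accumulates none.
print(logits.grad is not None, V.grad is None)
```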
Training loop
train returns a list of simplex-constrained weight vectors — one per user — ready for evaluation with eval_multiple.
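A rough sketch of such a loop, under the same assumptions as above (softmax parameterization, Bradley-Terry loss). The function name train_weights and the steps and lr defaults are illustrative choices, not values from the LoRe repository:

```python
import torch
import torch.nn.functional as F


def train_weights(train_features, V, steps=200, lr=0.1):
    """Fit one simplex-constrained weight vector per user with V held fixed."""
    K = V.shape[1]
    logits = [torch.zeros(K, requires_grad=True) for _ in train_features]
    optimizer = torch.optim.Adam(logits, lr=lr)   # one optimizer for all users
    for _ in range(steps):
        optimizer.zero_grad()
        # Sum the per-user losses so a single backward pass updates everyone.
        loss = sum(
            F.softplus(-(feats @ V @ torch.softmax(p, dim=0))).mean()
            for p, feats in zip(logits, train_features)
        )
        loss.backward()
        optimizer.step()
    # Detach so the returned weights are plain tensors ready for evaluation.
    return [torch.softmax(p, dim=0).detach() for p in logits]
```

Because each user's loss depends only on that user's own vector, summing the losses is equivalent to optimizing the users independently while sharing one optimizer object.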
learn_multiple_few_shot: the entry point
learn_multiple_few_shot wraps PersonalizeBatch and handles all users simultaneously.
train_features is a list of tensors, one per user, where each tensor contains that user’s few-shot preference differences (chosen − rejected embeddings). V is the frozen basis from phase 1.
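The input layout described above can be sketched as follows. build_train_features is a hypothetical helper, not part of LoRe, and the embedding dimension, rank, and pair counts are arbitrary placeholders:

```python
import torch


def build_train_features(chosen, rejected):
    """Turn per-user (n_i, d) chosen/rejected embeddings into difference tensors."""
    return [c - r for c, r in zip(chosen, rejected)]


torch.manual_seed(0)
d, K = 16, 4
# Two unseen users with 3 and 5 preference pairs respectively.
chosen = [torch.randn(n, d) for n in (3, 5)]
rejected = [torch.randn(n, d) for n in (3, 5)]

train_features = build_train_features(chosen, rejected)
V = torch.randn(d, K)   # placeholder for the frozen basis saved after phase 1
```

Users may contribute different numbers of pairs, which is why the features are kept as a list of tensors rather than one stacked tensor.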
sample_shots: controlling how many examples each user provides
Before calling learn_multiple_few_shot, the sample_shots utility randomly subsamples each user’s training data to a fixed number of preference pairs.
Each call to sample_shots produces a different random subsample, which is why the vary_fewshot.py scripts run multiple independent trials and report mean and standard deviation of accuracy.
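A sketch of this kind of per-user subsampling, assuming sampling without replacement and capping at the number of pairs a user actually has. The name sample_shots_sketch and its generator argument are illustrative; the real sample_shots signature may differ:

```python
import torch


def sample_shots_sketch(train_features, num_shots, generator=None):
    """Randomly keep at most num_shots preference pairs per user."""
    sampled = []
    for feats in train_features:
        n = feats.shape[0]
        k = min(num_shots, n)   # a user may have fewer pairs than requested
        idx = torch.randperm(n, generator=generator)[:k]
        sampled.append(feats[idx])
    return sampled
```

Passing a seeded torch.Generator makes a given trial reproducible, while different seeds give the independent subsamples that the trial averaging relies on.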
Evaluating accuracy vs. number of shots
The run_few_shot_vary_shots function sweeps over both rank K and shot count, running multiple trials at each combination.
When K <= 1, adaptation is skipped entirely: all users are assigned a constant all-ones weight, which corresponds to the shared single-reward baseline. This is the natural lower bound for few-shot performance.
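The shape of such a sweep can be sketched in plain Python. Both function names below are hypothetical, run_trial is a caller-supplied callable (for example: subsample, adapt, evaluate), and the use of the sample standard deviation is an assumption rather than something confirmed by the source:

```python
from statistics import mean, stdev


def sweep_rank_and_shots(ranks, shot_counts, num_trials, run_trial):
    """Return {(K, shots): (mean_acc, std_acc)} over independent trials.

    run_trial(K, shots, seed) must return an accuracy as a float.
    """
    results = {}
    for K in ranks:
        for shots in shot_counts:
            accs = [run_trial(K, shots, seed) for seed in range(num_trials)]
            # Sample standard deviation; undefined for a single trial, so use 0.
            spread = stdev(accs) if len(accs) > 1 else 0.0
            results[(K, shots)] = (mean(accs), spread)
    return results


def baseline_weights(num_users, K):
    """All-ones fallback for K <= 1: every user gets the weight [1.0]."""
    if K > 1:
        raise ValueError("K >= 2 should go through per-user adaptation")
    return [[1.0] for _ in range(num_users)]
```

Inside run_trial, a caller would branch on K: use baseline_weights when K <= 1 and run few-shot adaptation otherwise, so both paths feed the same evaluation code.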
K=0 and K=1 baselines
| Baseline | Behavior |
|---|---|
| K=0 | V is fixed to the pretrained V_sft; w = [1.0] for every user. The model is the reference reward model with no personalization. |
| K=1 | A single shared reward direction is learned; w = [1.0] for every user. Equivalent to Bradley-Terry with no user variation. |
| K≥2 | Full LoRe: V has K columns; w_i is adapted per user via PersonalizeBatch. |
Both K=0 and K=1 use the all-ones weight fallback in the run and run_regularized pipelines, making them directly comparable against LoRe variants without any code changes to the evaluation loop.
Expected behavior
Accuracy on unseen users improves along two axes:
- More shots: more preference examples give a better estimate of w_i, reducing adaptation error.
- Higher K: a richer basis V can represent more diverse users accurately, so a well-adapted w_i achieves higher test accuracy.
The vary_fewshot.py script in RedditTLDR/ produces the shot-vs-accuracy curves used in the paper’s figures. PRISM evaluates few-shot accuracy as part of train_basis.py rather than a separate script.