Traditional reward models produce a single scalar score shared across all users, which loses the diversity of human preferences. LoRe replaces that scalar with a low-rank factorization (a matrix V of shared reward directions and a per-user weight vector w_i), enabling personalized reward signals while keeping the parameter count small.
The core factorization
A conventional Bradley-Terry reward model assigns a scalar reward to a response embedding x:
$$r(x) = v^\top x$$
Here v is a single direction in feature space, learned from aggregated preference data and shared by everyone. Any user-level variation is discarded.
LoRe generalizes this to a rank-K factorization:
$$r_i(x) = w_i^\top V^\top x$$
- x is the response feature vector (shape [features])
- V is the shared basis matrix (shape [features × K])
- w_i is user i’s mixture weight vector (shape [K])
- K is the rank, i.e. the number of independent reward directions
V captures the K most important axes of reward variation across the user population. Each user’s weight vector w_i then selects a mixture over those axes, personalizing their reward without training a separate model.
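As a minimal sketch of the factorization described above, the following dependency-free Python computes a personalized reward for two made-up users. The helper name `personalized_reward` and all numbers are illustrative assumptions and are not taken from the LoRe codebase.

```python
def personalized_reward(x, V, w):
    """Return x^T V w for one response and one user.

    x: list of F features; V: F rows with K entries each; w: K mixture weights.
    """
    K = len(w)
    # Project the response onto each of the K shared reward directions.
    per_direction = [sum(x[f] * V[f][k] for f in range(len(x))) for k in range(K)]
    # Mix the K direction scores with the user's weights.
    return sum(w[k] * per_direction[k] for k in range(K))


# Toy setup: 3 features, K = 2 shared directions (illustrative values only).
V = [[1.0, 0.0],
     [0.0, 1.0],
     [0.5, -0.5]]
x = [2.0, 1.0, 2.0]

reward_user_a = personalized_reward(x, V, [1.0, 0.0])    # uses direction 0 only
reward_user_b = personalized_reward(x, V, [0.25, 0.75])  # leans on direction 1
```

Both users share the same V; only the short weight vector differs, which is why adding a user costs K parameters rather than a full model.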
Rank as a spectrum of expressiveness
| K value | Interpretation |
|---|---|
| K=0 | Reference model: V is frozen to the pretrained V_sft; all weights are 1. No personalization. |
| K=1 | Equivalent to a standard Bradley-Terry single reward: one direction, all users share it. |
| K≥2 | Full LoRe: multiple diverse directions; each user has a distinct mixture. |
K=0 and K=1 serve as baselines in experiments. When K <= 1, the run and run_regularized functions assign all users the same all-ones weight vector rather than optimizing anything.
Softmax constraint on W
User weights are stored as raw logits and passed through a softmax before use. This places each w_i on the probability simplex: weights are non-negative and sum to 1. The simplex constraint makes w_i interpretable as a mixture distribution over reward basis directions.
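To make the simplex constraint concrete, here is a small standalone sketch of a numerically stable softmax applied to one user's raw logits. The function and variable names are illustrative; the LoRe code presumably applies the equivalent tensor operation instead.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw logits."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Raw logits are unconstrained; the softmax output is a mixture over K = 3 directions.
raw_logits = [0.0, 0.0, math.log(2.0)]
w_i = softmax(raw_logits)
```

Because the optimizer only ever sees the unconstrained logits, no projection step is needed to keep the weights valid.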
Loss function
LoRe uses the Bradley-Terry pairwise loss: given a preference pair where x is the feature difference between the chosen and rejected responses, the model should predict a positive score. The loss is the negative log-sigmoid (negative log-likelihood) of that score:
$$\mathcal{L}_{\text{BT}} = -\log \sigma\left(w_i^\top V^\top x\right)$$
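The behaviour of the negative log-sigmoid can be checked with a short standalone sketch. The helper `bt_pair_loss` is a hypothetical name; it uses the identity -log(sigmoid(s)) = log(1 + exp(-s)) and branches on the sign of the score to stay numerically stable.

```python
import math


def bt_pair_loss(score):
    """Negative log-sigmoid of a preference score, computed stably.

    -log(sigmoid(s)) == log(1 + exp(-s)), i.e. softplus(-s).
    """
    if score >= 0:
        return math.log1p(math.exp(-score))
    # For negative scores, rewrite to avoid exp() of a large positive number.
    return -score + math.log1p(math.exp(score))


loss_undecided = bt_pair_loss(0.0)   # score 0: the model has no preference
loss_correct = bt_pair_loss(5.0)     # chosen response scored well above rejected
loss_wrong = bt_pair_loss(-5.0)      # model prefers the rejected response
```

A score of 0 gives log 2; the loss shrinks toward 0 as the score grows and rises roughly linearly when the model ranks the pair the wrong way round.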
Regularization toward the pretrained model
A key risk in learning V from scratch is that it drifts far from V_sft, the final layer of the pretrained supervised fine-tuning (SFT) reward model. LoRe adds a cosine-similarity regularization term that keeps the learned basis directions aligned with V_sft.
The penalty 1 - cos_sim is 0 when a learned column of V is perfectly aligned with the corresponding pretrained column, and grows as the direction rotates away. The scalar alpha controls the regularization strength.
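The column-wise penalty can be sketched in plain Python as follows. The function name `cosine_penalty` is hypothetical, and averaging the per-column terms is an assumption of this sketch; the text above only states that each column contributes 1 - cos_sim scaled by alpha.

```python
import math


def cosine_penalty(V, V_sft, alpha):
    """alpha * mean over columns k of (1 - cos_sim(V[:, k], V_sft[:, k])).

    Averaging over columns is an assumption made for this sketch.
    Columns are assumed to be non-zero so the norms are positive.
    """
    n_features, K = len(V), len(V[0])
    total = 0.0
    for k in range(K):
        col = [V[f][k] for f in range(n_features)]
        ref = [V_sft[f][k] for f in range(n_features)]
        dot = sum(a * b for a, b in zip(col, ref))
        norm = math.sqrt(sum(a * a for a in col)) * math.sqrt(sum(b * b for b in ref))
        total += 1.0 - dot / norm
    return alpha * total / K


V_sft = [[1.0, 0.0],
         [0.0, 1.0]]
# Rescaled columns point the same way as V_sft, so the penalty vanishes.
aligned = cosine_penalty([[2.0, 0.0], [0.0, 3.0]], V_sft, alpha=0.5)
# Column 0 is rotated by 90 degrees (penalty 1), column 1 is unchanged (penalty 0).
rotated = cosine_penalty([[0.0, 0.0], [1.0, 1.0]], V_sft, alpha=0.5)
```

Note that the penalty ignores column length: only the direction of each learned column relative to its pretrained counterpart matters.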
The LoRe_regularized forward pass
The full forward computation is implemented in LoRe_regularized._forward_from_packed.
X_cat is a concatenation of all users’ preference differences; y carries the user index for each row so that the correct column of Vw is selected via gather.
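The gather-based selection described above might look like the following PyTorch sketch. This is not the body of `_forward_from_packed`: the function name, the tensor layouts (W stored as a [K, users] logit matrix) and the choice to return only the mean data term without the regularizer are all assumptions made for illustration.

```python
import torch


def forward_from_packed(X_cat, y, V, W_logits):
    """Score every preference row with its own user's mixture (hypothetical sketch).

    X_cat: [N, F] feature differences; y: [N] user indices (int64);
    V: [F, K] shared basis; W_logits: [K, U] raw per-user logits.
    """
    W = torch.softmax(W_logits, dim=0)      # each column lies on the simplex
    Vw = V @ W                              # [F, U]: one reward direction per user
    all_scores = X_cat @ Vw                 # [N, U]: every row scored for every user
    # Keep, for row n, only the score in column y[n].
    scores = all_scores.gather(1, y.unsqueeze(1)).squeeze(1)
    # softplus(-s) equals -log(sigmoid(s)), the Bradley-Terry pair loss.
    return torch.nn.functional.softplus(-scores).mean()


torch.manual_seed(0)
X_cat = torch.randn(6, 4)                   # 6 preference pairs, 4 features
y = torch.tensor([0, 0, 1, 1, 2, 2])        # two pairs per user
V = torch.randn(4, 3)                       # K = 3 directions
W_logits = torch.zeros(3, 3)                # uniform mixtures for 3 users
loss = forward_from_packed(X_cat, y, V, W_logits)
```

Scoring every row against every user and then gathering is wasteful for many users but keeps the computation a pair of dense matrix products, which is why a packed layout of this kind is convenient.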
Alternating minimization training
LoRe trains W and V with two separate Adam optimizers, updating them in alternating steps within each iteration.
When updating W, the regularization term is set to zero (alpha_curr=0.0) because there is no reason to penalize W for V’s alignment with V_sft. When updating V, the full regularized loss is used.
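One iteration of the alternating scheme could be sketched in PyTorch as below. The helper names (`bt_loss`, `cos_penalty`, `alternating_step`), the tensor layouts, the learning rates and the mean over columns in the penalty are assumptions for illustration; only the overall structure (two Adam optimizers, a W step without regularization, a V step with it) comes from the description above.

```python
import torch
import torch.nn.functional as F


def bt_loss(X_cat, y, V, W_logits):
    """Mean Bradley-Terry loss; each row is scored with its user's mixture."""
    W = torch.softmax(W_logits, dim=0)                       # [K, U]
    scores = (X_cat @ V @ W).gather(1, y.unsqueeze(1)).squeeze(1)
    return F.softplus(-scores).mean()


def cos_penalty(V, V_sft):
    """Mean over columns of 1 - cos_sim (the mean is an assumption)."""
    return (1.0 - F.cosine_similarity(V, V_sft, dim=0)).mean()


def alternating_step(X_cat, y, V, W_logits, V_sft, opt_W, opt_V, alpha_curr):
    # W step: data term only, V held fixed, no regularization.
    opt_W.zero_grad()
    bt_loss(X_cat, y, V.detach(), W_logits).backward()
    opt_W.step()
    # V step: full regularized loss, W held fixed.
    opt_V.zero_grad()
    loss_V = bt_loss(X_cat, y, V, W_logits.detach()) + alpha_curr * cos_penalty(V, V_sft)
    loss_V.backward()
    opt_V.step()
    return loss_V.item()


torch.manual_seed(0)
V_sft = torch.randn(4, 2)                                    # toy pretrained basis
V = V_sft.clone().requires_grad_(True)
W_logits = torch.zeros(2, 3, requires_grad=True)             # K = 2, 3 users
X_cat, y = torch.randn(12, 4), torch.randint(0, 3, (12,))
opt_W = torch.optim.Adam([W_logits], lr=1e-2)
opt_V = torch.optim.Adam([V], lr=1e-2)
losses = [alternating_step(X_cat, y, V, W_logits, V_sft, opt_W, opt_V, alpha_curr=0.1)
          for _ in range(50)]
```

Detaching the parameter that is held fixed keeps each optimizer's gradients confined to its own block of parameters.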
The alpha warmup schedule
Regularization is not applied immediately. _alpha_at_step implements a linear warmup between 20% and 80% of total iterations.
This lets V move freely early in training (exploring the loss landscape) before the regularization gradually pulls it toward V_sft.
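A schedule with this shape can be sketched as a standalone function. This is not the source of `_alpha_at_step`: the signature, the parameter names and the exact boundary handling (zero up to and including the 20% mark, full strength from the 80% mark) are assumptions.

```python
def alpha_at_step(step, total_steps, alpha_max, start_frac=0.2, end_frac=0.8):
    """Linear warmup of the regularization strength (hypothetical sketch).

    Returns 0 before start_frac of training, alpha_max after end_frac,
    and a linear ramp in between.
    """
    start = start_frac * total_steps
    end = end_frac * total_steps
    if step <= start:
        return 0.0
    if step >= end:
        return alpha_max
    return alpha_max * (step - start) / (end - start)


# With 100 iterations the ramp runs from step 20 to step 80.
schedule = [alpha_at_step(s, 100, alpha_max=2.0) for s in (0, 20, 50, 80, 100)]
```

For 100 iterations and a maximum strength of 2.0, the midpoint of the ramp (step 50) yields half the final strength.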
The solve_regularized_simplex entry point
solve_regularized_simplex is the top-level function that instantiates LoRe_regularized and returns the trained weights.
LoRe_regularized.train also prunes basis directions whose maximum softmax weight across all users falls below 1e-2, keeping only directions that at least one user meaningfully uses.
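The pruning rule can be illustrated with a short standalone sketch. The function name `directions_to_keep` and the toy weights are hypothetical, and the sketch only selects indices; whatever the library does to V and W after selection is not shown here.

```python
def directions_to_keep(W, threshold=1e-2):
    """Return indices of basis directions that at least one user relies on.

    W: list of per-user softmax weight vectors, each of length K.
    A direction is pruned when its maximum weight over all users is below threshold.
    """
    K = len(W[0])
    return [k for k in range(K) if max(w[k] for w in W) >= threshold]


# Two users, K = 3: no user gives direction 2 a weight of at least 0.01.
W = [[0.700, 0.295, 0.005],
     [0.400, 0.598, 0.002]]
kept = directions_to_keep(W)
```

Taking the maximum rather than the mean means a direction survives even if only a single user depends on it.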