One of LoRe’s central design goals is efficient adaptation to users the model has never seen. Once the shared basis V is trained, a new user only needs to provide a few preference pairs for LoRe to learn their personal mixture weights w: K parameters instead of a full reward model.

Two-phase approach

1. Learn the shared basis V

Train V and W jointly on all seen users’ preference data using solve_regularized_simplex. This captures the K most important directions of reward variation across the population. V is saved after training and reused for all future users.

2. Adapt new users with few-shot examples

Fix V completely. For each new user, collect a small set of preference pairs (the “shots”), then optimize only the K-dimensional weight vector w using PersonalizeBatch. Because V is fixed, this optimization has K free parameters per user and converges quickly.

Why fixing V works

The key insight is that if V’s K columns span the meaningful axes of human reward variation, any new user’s preferences can be expressed as a mixture over those axes. The new user does not need to teach the model new reward directions — only their specific combination. This is analogous to how a new user of a recommendation system does not need to create new item categories, only rate a few items to reveal their taste profile.
There is a direct tradeoff between K (rank) and adaptation speed. Higher K gives V more expressive capacity to represent diverse users, but each new user needs more preference examples to reliably estimate their K-dimensional w. In practice, K should be set to match the diversity of your user population, not maximized blindly.
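
The mixture idea can be illustrated with a minimal sketch, written in plain Python rather than PyTorch so it runs without dependencies. The 3×2 basis V and the logits below are made-up values, and `softmax` and `user_direction` are hypothetical helper names, not part of LoRe. The sketch only shows that a user's reward direction is a convex combination of V's columns, so a new user is described by K numbers.

```python
import math

def softmax(w):
    """Numerically stable softmax over a list of floats."""
    m = max(w)
    exps = [math.exp(v - m) for v in w]
    total = sum(exps)
    return [e / total for e in exps]

def user_direction(V, w):
    """Return V @ softmax(w), with V given as a list of rows (features x K)."""
    p = softmax(w)
    return [sum(v_k * p_k for v_k, p_k in zip(row, p)) for row in V]

# Hypothetical basis: 3 features, K = 2 columns.
V = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
]

# Equal logits give an equal mixture of the two columns.
balanced = user_direction(V, [0.0, 0.0])

# A large gap between logits puts almost all mass on the second column.
skewed = user_direction(V, [0.0, 10.0])

print(balanced)
print(skewed)
```

Whatever the feature dimension, the only per-user quantity in this sketch is the length-K list of logits.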

PersonalizeBatch: fixing V and optimizing w

PersonalizeBatch holds a ParameterList of per-user weight vectors and uses a single shared Adam optimizer:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class PersonalizeBatch(nn.Module):
    def __init__(self, num_classes, num_features, num_basis_vectors,
                 num_iterations, learning_rate):
        super(PersonalizeBatch, self).__init__()
        self.num_classes = num_classes
        self.num_basis_vectors = num_basis_vectors
        self.num_iterations = num_iterations
        self.learning_rate = learning_rate

        # One weight vector per user; V is NOT a parameter here
        self.w = nn.ParameterList([
            nn.Parameter(torch.randn(num_basis_vectors))
            for _ in range(num_classes)
        ])
        self.optimizer = optim.Adam(self.parameters(), lr=learning_rate)

Forward pass

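V is passed in as a tensor argument rather than registered as a module parameter, so the optimizer, which is built from self.parameters(), never updates it. The caller in run_few_shot_vary_shots also passes V_joint.detach(), which keeps V out of the autograd graph: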
def forward(self, X, V):
    nll = 0
    for i, x in enumerate(X):
        V_w = V @ F.softmax(self.w[i], dim=0)  # [features]: user-specific direction
        logits = x @ V_w / 100.0
        log_likelihood = torch.log(torch.sigmoid(logits))
        nll += (-log_likelihood.sum()) / len(x)
    return nll
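
As a worked illustration of the per-user term in this loss, the following plain-Python sketch computes the mean negative log-likelihood for one user. It is an assumption-laden stand-in, not the library code: `log_sigmoid`, `softmax`, and `user_nll` are hypothetical helpers, the data are invented, and the log-sigmoid is written in a numerically stable form (in PyTorch, `F.logsigmoid` plays the same role) instead of `torch.log(torch.sigmoid(...))`.

```python
import math

def log_sigmoid(z):
    """Stable log(sigmoid(z)), valid for large positive or negative z."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def softmax(w):
    """Numerically stable softmax over a list of floats."""
    m = max(w)
    exps = [math.exp(v - m) for v in w]
    total = sum(exps)
    return [e / total for e in exps]

def user_nll(x_rows, V, w, scale=100.0):
    """Mean negative log-likelihood of one user's preference differences.

    x_rows: list of difference vectors (chosen - rejected), one per pair.
    V:      basis as a list of rows (features x K).
    w:      unconstrained logits of length K.
    scale:  divisor applied to the logits (100.0 mirrors the forward pass above).
    """
    p = softmax(w)
    direction = [sum(v * q for v, q in zip(row, p)) for row in V]
    total = 0.0
    for x in x_rows:
        logit = sum(a * b for a, b in zip(x, direction)) / scale
        total += -log_sigmoid(logit)
    return total / len(x_rows)

# Invented data: 3 features, K = 2, a user whose pairs favour the second column.
V = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
x_rows = [[-1.0, 2.0, 0.0], [-0.5, 1.5, 0.3], [-2.0, 1.0, -0.1]]

print(user_nll(x_rows, V, [0.0, 0.0]))  # equal mixture
print(user_nll(x_rows, V, [0.0, 5.0]))  # mass on the second column gives a lower loss
```

Because the logits are divided by 100, the loss stays close to log 2 for inputs of this size, which is one reason the examples in this page pair it with an adaptive optimizer.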

Training loop

def train(self, X, V):
    for j in range(self.num_iterations):
        self.optimizer.zero_grad()
        loss = self.forward(X, V)
        loss.backward()
        self.optimizer.step()
    return [F.softmax(self.w[i], dim=0).detach() for i in range(len(X))]
train returns a list of softmax-normalized weight vectors, one per user. Each vector lies on the probability simplex and is detached from the autograd graph, ready for evaluation with eval_multiple.
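
To make the optimization concrete, here is a dependency-free sketch of the same idea for a single user: V stays fixed and only the logits w are updated. Everything in it is an assumption for illustration. `grad_w` and `adapt_weights` are hypothetical names, the gradient is derived by hand instead of using autograd, Adam is re-implemented in a few lines as a stand-in for `optim.Adam`, and w starts at zero rather than from `torch.randn` so the run is deterministic.

```python
import math

def softmax(w):
    m = max(w)
    exps = [math.exp(v - m) for v in w]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def grad_w(x_rows, V, w, scale=100.0):
    """Gradient of the mean NLL with respect to the logits w (V is constant)."""
    K = len(w)
    p = softmax(w)
    g_p = [0.0] * K  # gradient with respect to the simplex weights p
    for x in x_rows:
        # proj[k] = x . V[:, k]
        proj = [sum(x[i] * V[i][k] for i in range(len(x))) for k in range(K)]
        z = sum(a * q for a, q in zip(proj, p)) / scale
        coeff = -(1.0 - sigmoid(z)) / (len(x_rows) * scale)
        for k in range(K):
            g_p[k] += coeff * proj[k]
    # Chain rule through softmax: dL/dw_m = p_m * (g_m - sum_k p_k * g_k)
    avg = sum(q * g for q, g in zip(p, g_p))
    return [q * (g - avg) for q, g in zip(p, g_p)]

def adapt_weights(x_rows, V, num_iterations=300, learning_rate=0.1):
    """Run Adam on w only and return the simplex weights softmax(w)."""
    K = len(V[0])
    w = [0.0] * K
    m = [0.0] * K
    v = [0.0] * K
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    for t in range(1, num_iterations + 1):
        g = grad_w(x_rows, V, w)
        for k in range(K):
            m[k] = beta1 * m[k] + (1.0 - beta1) * g[k]
            v[k] = beta2 * v[k] + (1.0 - beta2) * g[k] * g[k]
            m_hat = m[k] / (1.0 - beta1 ** t)
            v_hat = v[k] / (1.0 - beta2 ** t)
            w[k] -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return softmax(w)

# Invented data: the user's preference differences align with the second column.
V = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
x_rows = [[-1.0, 2.0, 0.0], [-0.5, 1.5, 0.3], [-2.0, 1.0, -0.1]]

p = adapt_weights(x_rows, V)
print(p)  # most of the mass ends up on the second basis column
```

Only K numbers change during this loop, which is why a handful of preference pairs can be enough to move w in the right direction.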

learn_multiple_few_shot: the entry point

learn_multiple_few_shot wraps PersonalizeBatch and handles all users simultaneously:
def learn_multiple_few_shot(train_features, V, num_iterations=1000, learning_rate=0.01):
    N = len(train_features)
    num_features = train_features[0][0].shape[0]
    fitw = PersonalizeBatch(
        N, num_features, V.shape[1], num_iterations, learning_rate
    ).to(device)
    W = fitw.train(train_features, V)
    return W
train_features is a list of tensors, one per user, where each tensor contains that user’s few-shot preference differences (chosen − rejected embeddings). V is the frozen basis from phase 1.
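
The expected input layout can be sketched as follows, again in plain Python with nested lists standing in for tensors. `preference_differences` is a hypothetical helper and the embeddings are invented. With PyTorch tensors the same difference would simply be `chosen - rejected`.

```python
def preference_differences(chosen, rejected):
    """Return chosen - rejected, one difference vector per preference pair."""
    if len(chosen) != len(rejected):
        raise ValueError("chosen and rejected must contain the same number of embeddings")
    return [
        [c - r for c, r in zip(chosen_vec, rejected_vec)]
        for chosen_vec, rejected_vec in zip(chosen, rejected)
    ]

# Hypothetical 2-dimensional embeddings for two users with different shot counts.
user0 = preference_differences(
    chosen=[[1.0, 0.0], [2.0, 1.0]],
    rejected=[[0.5, 1.0], [1.0, 1.0]],
)
user1 = preference_differences(
    chosen=[[0.0, 3.0]],
    rejected=[[1.0, 1.0]],
)

train_features = [user0, user1]
num_users = len(train_features)           # plays the role of N
num_features = len(train_features[0][0])  # plays the role of train_features[0][0].shape[0]
print(num_users, num_features)
```

Users may contribute different numbers of pairs, which is why the container is a list of per-user collections rather than one stacked array.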

sample_shots: controlling how many examples each user provides

Before calling learn_multiple_few_shot, the sample_shots utility randomly subsamples each user’s training data to a fixed number of preference pairs:
def sample_shots(train_features_unseen, shots):
    """
    Sample 'shots' number of tensors from each tensor in train_features_unseen.
    Args:
        train_features_unseen (list): A list of tensors.
        shots (int): The number of samples to take from each tensor.
    Returns:
        list: A list of sampled tensors.
    """
    sampled_features = [
        tensor[torch.randperm(tensor.size(0))[:shots]]
        for tensor in train_features_unseen
    ]
    return sampled_features
Each call to sample_shots produces a different random subsample, which is why the vary_fewshot.py scripts run multiple independent trials and report mean and standard deviation of accuracy.
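
The sample-then-aggregate pattern can be sketched without PyTorch as shown here. `sample_rows` and `summarize_trials` are hypothetical helpers and the accuracy values are placeholders, not results. Note that slicing a permutation, as sample_shots does, returns every row when a user has fewer than `shots` pairs; the sketch behaves the same way.

```python
import random
import statistics

def sample_rows(rows, shots, rng):
    """Return up to `shots` rows chosen uniformly without replacement."""
    order = list(range(len(rows)))
    rng.shuffle(order)
    return [rows[i] for i in order[:shots]]

def summarize_trials(accuracies):
    """Return (mean, sample standard deviation) over independent trials."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)

rng = random.Random(0)  # a fixed seed makes the subsamples reproducible
rows = [[float(i)] for i in range(10)]

print(len(sample_rows(rows, 3, rng)))      # 3 rows requested, 3 returned
print(len(sample_rows(rows[:2], 5, rng)))  # only 2 rows available, 2 returned

# Placeholder per-trial accuracies, for illustration only.
mean_acc, std_acc = summarize_trials([0.6, 0.7, 0.8])
print(mean_acc, std_acc)
```

In the PyTorch version, calling `torch.manual_seed` before sampling serves the same purpose as the seeded `random.Random` instance here.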

Evaluating accuracy vs. number of shots

The run_few_shot_vary_shots function sweeps over both rank K and shot count, running multiple trials at each combination:
for shots in num_shots:
    for _ in range(trials):
        train_features_unseen_shots = sample_shots(train_features_unseen, shots)

        if K <= 1:
            W_few_shot = [torch.tensor([1.0]).to(device) for _ in range(N_unseen)]
        else:
            W_few_shot = learn_multiple_few_shot(
                train_features_unseen_shots, V_joint.detach(),
                num_iterations=500, learning_rate=0.1
            )
When K <= 1, adaptation is skipped entirely: every user is assigned the constant single-element weight w = [1.0], which corresponds to the shared single-reward baseline. Because it uses no per-user information, it is the natural reference point for few-shot performance.
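
To show what the K <= 1 fallback means at evaluation time, here is a small sketch. eval_multiple is not shown on this page, so the sketch assumes accuracy is the fraction of held-out preference differences (chosen − rejected) that receive a positive score. `pairwise_accuracy` is a hypothetical helper and the numbers are invented.

```python
def pairwise_accuracy(x_rows, V, weights):
    """Fraction of preference differences scored as positive by V @ weights."""
    direction = [sum(v * w for v, w in zip(row, weights)) for row in V]
    correct = sum(
        1 for x in x_rows
        if sum(a * b for a, b in zip(x, direction)) > 0
    )
    return correct / len(x_rows)

# K = 1: V has a single column, so w = [1.0] selects that column unchanged.
V_single = [[1.0], [-1.0]]
w_shared = [1.0]

# Invented held-out differences for one user.
x_test = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [0.5, 0.2]]

print(pairwise_accuracy(x_test, V_single, w_shared))
```

With a single column there is nothing to mix, so every user is scored by the same direction and no shots are consumed.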

K=0 and K=1 baselines

Baseline | Behavior
K=0 | V is fixed to the pretrained V_sft; w = [1.0] for every user. The model is the reference reward model with no personalization.
K=1 | A single shared reward direction is learned; w = [1.0] for every user. Equivalent to Bradley-Terry with no user variation.
K≥2 | Full LoRe: V has K columns; w_i is adapted per user via PersonalizeBatch.
Both K=0 and K=1 use the all-ones weight fallback in the run and run_regularized pipelines, making them directly comparable against LoRe variants without any code changes to the evaluation loop.

Expected behavior

Accuracy on unseen users improves along two axes:
  • More shots: more preference examples give a better estimate of w_i, reducing adaptation error.
  • Higher K: a richer basis V can represent more diverse users accurately, so a well-adapted w_i achieves higher test accuracy.
The vary_fewshot.py script in RedditTLDR/ produces the shot-vs-accuracy curves used in the paper’s figures. PRISM evaluates few-shot accuracy as part of train_basis.py rather than a separate script.
