PersonalLLM is a HuggingFace dataset that provides 8 LLM-generated responses per prompt, each pre-scored by 10 distinct open-source reward models. LoRe uses it as a fully synthetic benchmark: rather than relying on collected human annotations, it samples a population of virtual users from a Dirichlet distribution over the 10 reward model weights, then simulates each user’s pairwise preferences from those scores. This makes PersonalLLM ideal for controlled experiments on personalized reward learning at scale.

Dataset source

The dataset is loaded directly from HuggingFace:
import pandas as pd
from datasets import load_dataset

# Downloads the dataset from the HuggingFace Hub, or reuses the local cache.
dataset = load_dataset("namkoong-lab/PersonalLLM")
data      = pd.DataFrame(dataset["train"])
data_test = pd.DataFrame(dataset["test"])
Each row contains a prompt field and eight response columns (response_1 through response_8). Every response is pre-scored by the 10 reward models listed below.

The 10 reward models

model_names = [
    "gemma_2b",
    "gemma_7b",
    "mistral_raft",
    "llama3_sfairx",
    "oasst_deberta_v3",
    "beaver_7b",
    "oasst_pythia_7b",
    "oasst_pythia_1b",
    "mistral_ray",
    "mistral_weqweasdas",
]
Score columns follow the pattern response_{i}_{model_name}, for example response_3_gemma_7b.
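To make the naming pattern concrete, the sketch below enumerates every score column for the 8 responses and 10 reward models listed above. The helper name score_columns is hypothetical and is not part of the LoRe code.

```python
MODEL_NAMES = [
    "gemma_2b", "gemma_7b", "mistral_raft", "llama3_sfairx", "oasst_deberta_v3",
    "beaver_7b", "oasst_pythia_7b", "oasst_pythia_1b", "mistral_ray", "mistral_weqweasdas",
]
NUM_RESPONSES = 8


def score_columns(model_names=MODEL_NAMES, num_responses=NUM_RESPONSES):
    """Return score column names, ordered by response first, then by reward model."""
    return [
        f"response_{i}_{name}"
        for i in range(1, num_responses + 1)  # responses are numbered from 1
        for name in model_names
    ]


columns = score_columns()
print(len(columns))  # 8 responses x 10 models = 80
print(columns[21])   # response_3_gemma_7b
```

Ordering the list by response and then by model means position 10 * (i - 1) + j holds the score of response i under reward model j.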

Reward tensor layout

All scores are assembled into a three-dimensional NumPy array:
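import numpy as np

num_prompts = len(data)

# shape: (num_prompts, 8, 10)
reward_tensor = np.empty((num_prompts, 8, len(model_names)), dtype=object)

# Assumes data keeps its default 0..num_prompts-1 index, so index doubles as the row position.
for index, row in data.iterrows():
    prompt_array = np.empty((8, len(model_names)), dtype=object)
    for i in range(1, 9):  # responses are numbered from 1
        for j, model_name in enumerate(model_names):
            column_name = f"response_{i}_{model_name}"
            # A missing score column is stored as None.
            prompt_array[i - 1, j] = row[column_name] if column_name in row else None
    reward_tensor[index] = prompt_array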
The axes are [prompt_index, response_index, reward_model_index].
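The same layout can also be produced with a single reshape when the score columns are ordered by response and then by reward model. The sketch below is an illustrative alternative, not the code LoRe ships: it uses a shortened toy model list, made-up scores, and plain dicts standing in for DataFrame rows.

```python
import numpy as np

model_names = ["gemma_2b", "gemma_7b"]  # toy list; the real one has 10 entries
num_responses = 3                        # toy value; PersonalLLM has 8

# Response-major column order, so each flat row reshapes to (responses, models).
columns = [f"response_{i}_{m}" for i in range(1, num_responses + 1) for m in model_names]

# Two toy prompts with made-up scores.
rows = [
    {c: float(k) for k, c in enumerate(columns)},
    {c: float(10 + k) for k, c in enumerate(columns)},
]

flat = np.array([[row[c] for c in columns] for row in rows], dtype=np.float64)
reward_tensor = flat.reshape(len(rows), num_responses, len(model_names))

# Axes are [prompt_index, response_index, reward_model_index].
print(reward_tensor.shape)     # (2, 3, 2)
print(reward_tensor[1, 2, 0])  # score of response_3 under gemma_2b for the second prompt
```

A float dtype, unlike dtype=object, lets later dot products and argmax calls run on NumPy's numeric code paths, at the cost of needing np.nan rather than None for any missing score.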

Embeddings: train.safetensors and test.safetensors

prepare.py runs each prompt–response pair through Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 and extracts the final-layer hidden state of the last token:
with torch.no_grad():
    output = rm(conv_tokenized)
    embeddings.append(output.last_hidden_state[0][-1].cpu())
The stacked embeddings are saved with the safetensors library:
from safetensors.torch import save_file

save_file({"embeddings": embeddings}, "./train.safetensors")
train_basis.py reloads them at training time:
from safetensors.torch import load_file

embeddings = load_file("train.safetensors")["embeddings"]
prepare.py must be run once for the training split and once for the test split before train_basis.py can execute. The resulting .safetensors files are cached on disk and do not need to be regenerated.
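Because train_basis.py depends on both cached files, a small pre-flight check can catch a missing split before training starts. The sketch below assumes the two file names mentioned above and the current directory as the cache location; missing_embedding_files is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

REQUIRED_FILES = ("train.safetensors", "test.safetensors")


def missing_embedding_files(directory="."):
    """Return the names of required embedding caches that are absent from `directory`."""
    base = Path(directory)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_embedding_files()
    if missing:
        print("Run prepare.py first; missing:", ", ".join(missing))
    else:
        print("Both embedding caches found.")
```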

Synthetic user population

generate_popupulation

Users are Dirichlet samples over the 10 reward model axes. The concentration parameter alpha controls how peaked or diffuse preferences are:
def generate_popupulation(alpha, N):
    return np.random.dirichlet(alpha, N)

alpha_val = 0.1
alpha = alpha_val * np.ones(len(model_names))  # 0.1 * ones(10)
W = generate_popupulation(alpha, N)            # shape: (N, 10)
A small alpha (0.1) produces sparse, polarized users — each user effectively cares about only one or two reward models.
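The effect of alpha can be checked empirically. The sketch below compares the average largest weight per user for alpha = 0.1 against a much larger, arbitrarily chosen alpha = 10.0; the seed, the population size of 1000, and the helper mean_top_weight are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator so the comparison is repeatable
num_models, num_users = 10, 1000


def mean_top_weight(alpha_val):
    """Average of each simulated user's largest reward-model weight."""
    W = rng.dirichlet(alpha_val * np.ones(num_models), size=num_users)
    return W.max(axis=1).mean()


sparse_top = mean_top_weight(0.1)    # concentrated: one model tends to dominate
diffuse_top = mean_top_weight(10.0)  # near-uniform: weights hover around 1/10
print(f"alpha=0.1  mean top weight: {sparse_top:.2f}")
print(f"alpha=10.0 mean top weight: {diffuse_top:.2f}")
```

With the small alpha, the top weight typically takes more than half of a user's total mass, while the large alpha keeps every weight close to 1/10.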

simulate_population

Given a user weight vector w, the preferred response for each prompt is the one with the highest dot-product score, and the rejected response is the one with the lowest:
def simulate_user(reward_tensor, features, w):
    num_prompts = len(reward_tensor)
    feature_diff = []
    for i in range(num_prompts):
        scores = np.dot(reward_tensor[i], w)
        largest_score_index  = np.argmax(scores)
        smallest_score_index = np.argmin(scores)
        feature_diff.append(
            features[i][largest_score_index] - features[i][smallest_score_index]
        )
    return torch.stack(feature_diff, dim=0)

def simulate_population(reward_tensor, features, W):
    all_feature_diff = [simulate_user(reward_tensor, features, w) for w in W]
    return torch.stack(all_feature_diff, dim=0)
The output is a difference-of-embedding vector for each (user, prompt) pair that represents the preference signal used during training.
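The per-prompt loop in simulate_user can also be written as one batched operation. The sketch below is a NumPy-only illustration (the original works with torch tensors) and assumes features is a dense array of shape (prompts, responses, embedding_dim); the toy sizes and random inputs are made up.

```python
import numpy as np


def simulate_user_vectorized(reward_tensor, features, w):
    """Best-minus-worst embedding difference per prompt for one user.

    reward_tensor: (P, R, M) float array of scores
    features:      (P, R, d) array of response embeddings
    w:             (M,) user weights over reward models
    """
    scores = reward_tensor @ w       # (P, R) personalized score per response
    best = scores.argmax(axis=1)     # (P,) index of the preferred response
    worst = scores.argmin(axis=1)    # (P,) index of the rejected response
    prompts = np.arange(reward_tensor.shape[0])
    return features[prompts, best] - features[prompts, worst]  # (P, d)


rng = np.random.default_rng(0)
P, R, M, d = 4, 8, 10, 5             # toy sizes; d is far larger in practice
reward_tensor = rng.normal(size=(P, R, M))
features = rng.normal(size=(P, R, d))
w = rng.dirichlet(0.1 * np.ones(M))

diffs = simulate_user_vectorized(reward_tensor, features, w)
print(diffs.shape)  # (4, 5)
```

Each row of diffs is the embedding of the user's top-scoring response minus the embedding of the bottom-scoring one, which is the same quantity the loop version appends per prompt.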

create_sparse_tensor

Real users only see a fraction of all prompts. This function randomly samples that fraction:
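def create_sparse_tensor(dense_tensor, sample_percentage):
    # dense_tensor: (N users, M prompts, d embedding dims)
    N, M, d = dense_tensor.shape
    num_samples_per_row = int(sample_percentage * M)  # truncates toward zero
    sparse_rows = []
    for i in range(N):
        # Each user gets an independent random subset of prompts.
        indices = np.random.choice(M, num_samples_per_row, replace=False)
        values  = dense_tensor[i, indices]
        # values is already a torch tensor, so cast and move it instead of re-wrapping it.
        # device is assumed to be defined at module level.
        sparse_rows.append(values.to(dtype=torch.float32, device=device))
    return sparse_rows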
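The sampling step on its own can be sketched as follows. This NumPy-only illustration returns the chosen prompt indices rather than the gathered embeddings; sample_prompt_indices, the seed, and the toy sizes are assumptions made for the example.

```python
import numpy as np


def sample_prompt_indices(num_users, num_prompts, sample_percentage, seed=0):
    """Draw an independent, duplicate-free subset of prompt indices for every user."""
    rng = np.random.default_rng(seed)
    per_user = int(sample_percentage * num_prompts)  # same truncation as create_sparse_tensor
    return [rng.choice(num_prompts, size=per_user, replace=False) for _ in range(num_users)]


subsets = sample_prompt_indices(num_users=3, num_prompts=20, sample_percentage=0.5)
for user, idx in enumerate(subsets):
    print(user, sorted(idx.tolist()))
```

Sampling with replace=False guarantees that no user sees the same prompt twice, and drawing separately per user means different users generally observe different prompts.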

User split configuration

# Seen training users  (0.5% of prompts observed per user)
N = 1000
all_feature_diff = simulate_population(reward_tensor, features, W)
train_features   = create_sparse_tensor(all_feature_diff, 0.005)

# Seen users evaluated on unseen prompts (100% of test prompts)
all_feature_diff_test   = simulate_population(reward_tensor_test, test_features, W)
test_features_sparse    = create_sparse_tensor(all_feature_diff_test, 1.0)

# Unseen users — few-shot calibration
N_unseen = 500
W_unseen = generate_popupulation(alpha, N_unseen)
all_feature_diff_unseen = simulate_population(reward_tensor, features, W_unseen)
train_features_unseen   = create_sparse_tensor(all_feature_diff_unseen, 0.001)

# Unseen users evaluated on unseen prompts
all_feature_diff_test_unseen   = simulate_population(reward_tensor_test, test_features, W_unseen)
test_features_sparse_unseen    = create_sparse_tensor(all_feature_diff_test_unseen, 1.0)
The 0.5% training density (0.005) and 0.1% few-shot density (0.001) deliberately simulate sparse real-world feedback — most users rate only a tiny fraction of prompts.
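To get a feel for what these densities mean in absolute terms, the sketch below applies the same int() truncation used by create_sparse_tensor to a few hypothetical prompt counts. The counts are illustrative only and are not the actual PersonalLLM split sizes.

```python
def prompts_per_user(num_prompts, sample_percentage):
    """Number of prompts each user observes, using int() truncation."""
    return int(sample_percentage * num_prompts)


# Hypothetical prompt counts, chosen only to show the scale of the sampling budget.
for num_prompts in (100, 1000, 10000):
    seen = prompts_per_user(num_prompts, 0.005)
    few_shot = prompts_per_user(num_prompts, 0.001)
    print(f"{num_prompts:>6} prompts -> {seen} per seen user, {few_shot} per unseen user")
```

Because of the truncation, a small enough prompt pool yields a budget of zero, in which case every user would end up with an empty set of observations.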

Evaluation grid

K_list     = [0, 1, 2, 3, 4, 5]
alpha_list = [0]
K=0 uses the pre-trained reward head as a fixed reference. K=1 is the single-vector Bradley-Terry baseline. K≥2 are full LoRe runs with increasing basis rank.
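One way to picture the role of K is as the number of reward directions a user can mix. The sketch below is an assumption-laden illustration rather than the LoRe implementation: it supposes a basis matrix V_basis of shape (d, K) and per-user weights w_user of shape (K,), scores each preferred-minus-rejected embedding difference with their product, and counts how often the margin is positive. All names, shapes, and data here are hypothetical.

```python
import numpy as np


def preference_accuracy(feature_diff, V_basis, w_user):
    """Fraction of pairs where the low-rank reward ranks the preferred response higher.

    feature_diff: (n, d) preferred-minus-rejected embedding differences
    V_basis:      (d, K) hypothetical basis matrix with K reward directions
    w_user:       (K,) hypothetical per-user weights over those directions
    """
    margins = feature_diff @ V_basis @ w_user  # (n,) reward gap for each pair
    return float((margins > 0).mean())


rng = np.random.default_rng(0)
n, d, K = 200, 16, 3                           # toy sizes, not the real configuration
V_basis = rng.normal(size=(d, K))
w_user = rng.dirichlet(np.ones(K))
direction = V_basis @ w_user                   # the user's effective reward direction

# Build synthetic pairs whose preferred side scores higher under `direction`.
raw = rng.normal(size=(n, d))
feature_diff = raw * np.sign(raw @ direction)[:, None]

print(preference_accuracy(feature_diff, V_basis, w_user))
```

Under this assumed parameterization, K = 1 leaves a single shared direction for every user, which lines up with the single-vector baseline described above, while larger K gives each user more directions to combine.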

Running the full pipeline

1. Install dependencies

pip install -r requirements.txt

2. Compute and cache embeddings

cd LoRe/PersonalLLM
python prepare.py

This generates train.safetensors and (when run on the test split) test.safetensors in the current directory. Requires a GPU with ~16 GB VRAM for Skywork-Reward-Llama-3.1-8B-v0.2 with flash_attention_2.

3. Train the reward basis

python train_basis.py

Trains V and W jointly for each K in K_list and prints per-rank accuracy across all four evaluation settings.
