PersonalLLM is a HuggingFace dataset that provides 8 LLM-generated responses per prompt, each pre-scored by 10 distinct open-source reward models. LoRe uses it as a fully synthetic benchmark: rather than relying on collected human annotations, it samples a population of virtual users from a Dirichlet distribution over the 10 reward model weights, then simulates each user’s pairwise preferences from those scores. This makes PersonalLLM ideal for controlled experiments on personalized reward learning at scale.
Dataset source
The dataset is loaded directly from HuggingFace:
import pandas as pd
from datasets import load_dataset

dataset = load_dataset("namkoong-lab/PersonalLLM")
data = pd.DataFrame(dataset["train"])
data_test = pd.DataFrame(dataset["test"])
Each row contains a prompt field and eight response columns (response_1 … response_8). Every response is pre-scored by the 10 reward models listed below.
The 10 reward models
model_names = [
"gemma_2b",
"gemma_7b",
"mistral_raft",
"llama3_sfairx",
"oasst_deberta_v3",
"beaver_7b",
"oasst_pythia_7b",
"oasst_pythia_1b",
"mistral_ray",
"mistral_weqweasdas",
]
Score columns follow the pattern response_{i}_{model_name}, for example response_3_gemma_7b.
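As a small illustration of this naming scheme, the sketch below enumerates every score column for one prompt row. The helper name `score_columns` is hypothetical and not part of the LoRe codebase; only the `response_{i}_{model_name}` pattern and the model names come from the text above.

```python
# Reward model names as listed above.
model_names = [
    "gemma_2b", "gemma_7b", "mistral_raft", "llama3_sfairx",
    "oasst_deberta_v3", "beaver_7b", "oasst_pythia_7b",
    "oasst_pythia_1b", "mistral_ray", "mistral_weqweasdas",
]

def score_columns(model_names, num_responses=8):
    """Hypothetical helper: list score columns in (response, model) order."""
    return [
        f"response_{i}_{name}"
        for i in range(1, num_responses + 1)
        for name in model_names
    ]

columns = score_columns(model_names)
print(len(columns))  # 80 columns: 8 responses x 10 reward models
print(columns[21])   # response_3_gemma_7b
```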
Reward tensor layout
All scores are assembled into a three-dimensional NumPy array:
import numpy as np

num_prompts = len(data)
# shape: (num_prompts, 8, 10)
reward_tensor = np.empty((num_prompts, 8, len(model_names)), dtype=object)
for index, row in data.iterrows():
    prompt_array = np.empty((8, len(model_names)), dtype=object)
    for i in range(1, 9):  # responses are numbered 1..8
        for j, model_name in enumerate(model_names):
            column_name = f"response_{i}_{model_name}"
            # A missing score column is stored as None.
            prompt_array[i - 1, j] = row[column_name] if column_name in row else None
    reward_tensor[index] = prompt_array
The axes are [prompt_index, response_index, reward_model_index].
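The same layout can also be produced without nested loops. The following is a sketch under two assumptions not stated in the source: every score column is present, and all values are numeric. It uses a tiny synthetic DataFrame with hypothetical model names `rm_a` and `rm_b` so that it runs without downloading the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (assumptions): 2 reward models, 3 responses, 4 prompts.
toy_models = ["rm_a", "rm_b"]
num_responses = 3
columns = [
    f"response_{i}_{m}"
    for i in range(1, num_responses + 1)
    for m in toy_models
]
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(4, len(columns))), columns=columns)

# Columns are ordered response-major, so a single reshape recovers the
# [prompt_index, response_index, reward_model_index] axes.
reward_tensor = (
    toy[columns]
    .to_numpy(dtype=np.float64)
    .reshape(len(toy), num_responses, len(toy_models))
)

print(reward_tensor.shape)  # (4, 3, 2)
# Score of response 2 under rm_b for prompt 0:
assert reward_tensor[0, 1, 1] == toy.loc[0, "response_2_rm_b"]
```

A float array like this supports `np.dot` directly, whereas the `dtype=object` array built by the loop stores Python objects and may be slower for large prompt counts.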
Embeddings: train.safetensors and test.safetensors
prepare.py runs each prompt–response pair through Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 and extracts the last hidden-state vector of the final token:
with torch.no_grad():
    output = rm(conv_tokenized)
    embeddings.append(output.last_hidden_state[0][-1].cpu())
The stacked embeddings are saved with the safetensors library:
from safetensors.torch import save_file
save_file({"embeddings": embeddings}, "./train.safetensors")
train_basis.py reloads them at training time:
from safetensors.torch import load_file
embeddings = load_file("train.safetensors")["embeddings"]
prepare.py must be run once for the training split and once for the test
split before train_basis.py can execute. The resulting .safetensors files
are cached on disk and do not need to be regenerated.
Synthetic user population
Users are Dirichlet samples over the 10 reward model axes. The concentration
parameter alpha controls how peaked or diffuse preferences are:
def generate_popupulation(alpha, N):
    return np.random.dirichlet(alpha, N)

alpha_val = 0.1
alpha = alpha_val * np.ones(len(model_names))  # 0.1 * ones(10)
W = generate_popupulation(alpha, N)  # shape: (N, 10)
A small alpha (0.1) produces sparse, polarized users — each user effectively
cares about only one or two reward models.
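To make the effect of alpha concrete, the sketch below compares the average largest weight per user for a small and a large concentration value. The helper `mean_top_weight`, the comparison value 10.0, and the seed are illustrative assumptions, not settings used by LoRe.

```python
import numpy as np

rng = np.random.default_rng(0)
num_models = 10
num_users = 1000  # illustrative population size

def mean_top_weight(alpha_val):
    """Average of each user's largest weight for a symmetric Dirichlet."""
    W = rng.dirichlet(alpha_val * np.ones(num_models), size=num_users)
    assert np.allclose(W.sum(axis=1), 1.0)  # each user lies on the simplex
    return W.max(axis=1).mean()

peaked = mean_top_weight(0.1)    # sparse users: one model dominates
diffuse = mean_top_weight(10.0)  # near-uniform users
print(f"alpha=0.1: {peaked:.2f}, alpha=10: {diffuse:.2f}")
```

With alpha at 0.1 the dominant reward model typically carries most of a user's weight, while with alpha at 10 the weights stay close to the uniform value of 0.1.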
simulate_population
Given a user weight vector w, the preferred response for each prompt is the
one with the highest dot-product score, and the rejected response is the one
with the lowest:
def simulate_user(reward_tensor, features, w):
    num_prompts = len(reward_tensor)
    feature_diff = []
    for i in range(num_prompts):
        scores = np.dot(reward_tensor[i], w)
        largest_score_index = np.argmax(scores)
        smallest_score_index = np.argmin(scores)
        feature_diff.append(
            features[i][largest_score_index] - features[i][smallest_score_index]
        )
    return torch.stack(feature_diff, dim=0)

def simulate_population(reward_tensor, features, W):
    all_feature_diff = [simulate_user(reward_tensor, features, w) for w in W]
    return torch.stack(all_feature_diff, dim=0)
The output is a difference-of-embedding vector for each (user, prompt) pair
that represents the preference signal used during training.
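A worked toy example may help here. The sketch below is a NumPy-only variant of `simulate_user` (the name `simulate_user_np` and all numbers are made up for illustration) showing how two users with opposite weights pick opposite chosen/rejected pairs for the same prompt.

```python
import numpy as np

# One prompt, three responses, two reward models (toy values, not real scores).
reward_tensor = np.array([[[1.0, 0.0],
                           [0.0, 1.0],
                           [0.5, 0.5]]])
# Toy 2-d "embeddings" for the same three responses.
features = np.array([[[1.0, 1.0],
                      [2.0, 0.0],
                      [0.0, 3.0]]])

def simulate_user_np(reward_tensor, features, w):
    """NumPy sketch of simulate_user: best-minus-worst embedding per prompt."""
    scores = reward_tensor @ w            # (num_prompts, num_responses)
    best = scores.argmax(axis=1)
    worst = scores.argmin(axis=1)
    rows = np.arange(len(reward_tensor))
    return features[rows, best] - features[rows, worst]

user_a = np.array([0.9, 0.1])  # mostly follows reward model 0
user_b = np.array([0.1, 0.9])  # mostly follows reward model 1

print(simulate_user_np(reward_tensor, features, user_a))  # [[-1.  1.]]
print(simulate_user_np(reward_tensor, features, user_b))  # [[ 1. -1.]]
```

User A scores the responses 0.9, 0.1, and 0.5, so response 0 is chosen and response 1 rejected; user B reverses that ordering, which flips the sign of the difference vector.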
create_sparse_tensor
Real users only see a fraction of all prompts. This function randomly samples
that fraction:
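def create_sparse_tensor(dense_tensor, sample_percentage):
    # dense_tensor has shape (N users, M prompts, d embedding dims)
    N, M, d = dense_tensor.shape
    num_samples_per_row = int(sample_percentage * M)
    sparse_rows = []
    for i in range(N):
        # Each user gets an independent subset of prompts, without repeats.
        indices = np.random.choice(M, num_samples_per_row, replace=False)
        values = dense_tensor[i, indices]
        # `device` must already be defined by the surrounding script.
        sparse_rows.append(torch.tensor(values, dtype=torch.float32).to(device))
    return sparse_rows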
User split configuration
# Seen training users (0.5% of prompts observed per user)
N = 1000
all_feature_diff = simulate_population(reward_tensor, features, W)
train_features = create_sparse_tensor(all_feature_diff, 0.005)
# Seen users evaluated on unseen prompts (100% of test prompts)
all_feature_diff_test = simulate_population(reward_tensor_test, test_features, W)
test_features_sparse = create_sparse_tensor(all_feature_diff_test, 1.0)
# Unseen users — few-shot calibration
N_unseen = 500
W_unseen = generate_popupulation(alpha, N_unseen)
all_feature_diff_unseen = simulate_population(reward_tensor, features, W_unseen)
train_features_unseen = create_sparse_tensor(all_feature_diff_unseen, 0.001)
# Unseen users evaluated on unseen prompts
all_feature_diff_test_unseen = simulate_population(reward_tensor_test, test_features, W_unseen)
test_features_sparse_unseen = create_sparse_tensor(all_feature_diff_test_unseen, 1.0)
The 0.5% training density (0.005) and 0.1% few-shot density (0.001)
deliberately simulate sparse real-world feedback — most users rate only a
tiny fraction of prompts.
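For a sense of scale, the sketch below applies the same `int(sample_percentage * M)` truncation to a hypothetical prompt count. The value `M = 10_000` and the helper name `samples_per_user` are assumptions for illustration; the real split sizes are not given here.

```python
import numpy as np

def samples_per_user(num_prompts, sample_percentage):
    """Same truncation as create_sparse_tensor: int(fraction * M)."""
    return int(sample_percentage * num_prompts)

M = 10_000  # hypothetical prompt count, not the real dataset size

print(samples_per_user(M, 0.005))  # 50 comparisons per seen user
print(samples_per_user(M, 0.001))  # 10 comparisons per unseen user

# Each user draws its own prompt subset, without replacement.
rng = np.random.default_rng(0)
subset = rng.choice(M, size=samples_per_user(M, 0.001), replace=False)
assert len(set(subset.tolist())) == 10  # no prompt is repeated
```

Because `int()` truncates toward zero, a split with fewer than 1,000 prompts would yield zero few-shot samples at a density of 0.001.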
Evaluation grid
K_list = [0, 1, 2, 3, 4, 5]
alpha_list = [0]
K=0 uses the pre-trained reward head as a fixed reference. K=1 is the
single-vector Bradley-Terry baseline. K≥2 are full LoRe runs with increasing
basis rank.
Running the full pipeline
Install dependencies
pip install -r requirements.txt
Compute and cache embeddings
cd LoRe/PersonalLLM
python prepare.py
This generates train.safetensors and (when run on the test split)
test.safetensors in the current directory. Requires a GPU with ~16 GB
VRAM for Skywork-Reward-Llama-3.1-8B-v0.2 with flash_attention_2.
Train the reward basis
Trains V and W jointly for each K in K_list and prints per-rank
accuracy across all four evaluation settings.
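The source does not show how the reported accuracy is computed. One plausible reading, sketched below purely as an assumption, is pairwise accuracy on the difference vectors: a pair counts as correct when the chosen response scores higher than the rejected one. The function name `pairwise_accuracy`, the shapes of `V` and `w`, and the toy numbers are all hypothetical and may differ from train_basis.py.

```python
import numpy as np

def pairwise_accuracy(feature_diff, V, w):
    """Fraction of pairs where the chosen response scores higher.

    feature_diff: (num_pairs, d) chosen-minus-rejected embeddings.
    V: (d, K) shared basis; w: (K,) one user's mixing weights.
    Both shapes are assumptions made for this sketch.
    """
    margins = feature_diff @ (V @ w)
    return float((margins > 0).mean())

# Toy data: d=3, K=2, four preference pairs.
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
w = np.array([0.5, 0.5])
feature_diff = np.array([[ 1.0, 0.0, 0.0],
                         [ 0.0, 2.0, 5.0],
                         [-1.0, 0.5, 0.0],
                         [ 1.0, 1.0, 0.0]])

print(pairwise_accuracy(feature_diff, V, w))  # 0.75
```

Here the user's reward direction is `V @ w = [0.5, 0.5, 0]`, giving margins of 0.5, 1.0, -0.25, and 1.0, so three of the four pairs are ranked correctly.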