Reddit TLDR brings real annotator diversity into LoRe. The dataset is built from OpenAI’sDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/facebookresearch/LoRe/llms.txt
Use this file to discover all available pages before exploring further.
summarize_from_feedback benchmark, where crowd workers rated pairs of Reddit post summaries. Because each worker has a distinct annotation style, grouping by worker_id yields a natural multi-user setup: every worker is one “user”, and their preference pairs form that user’s training signal. LoRe learns a shared low-rank reward basis that captures what summarization quality means across this heterogeneous population.
Source dataset
The raw data is loaded directly from HuggingFace using thecomparisons split:
comparisons split contains:
worker— the annotator’s unique IDinfo.post— the Reddit post textsummaries— a list of two candidate summarieschoice— index (0 or 1) of the summary the worker preferred
Worker grouping
prepare.py iterates over every row and groups preference pairs by worker_id. Each worker becomes one user in LoRe’s multi-user setup:
choice index directly selects the preferred summary; 1 - row['choice'] selects the rejected one. Workers are then sorted by the number of annotations they contributed.
Embedding extraction
Embeddings are generated usingSkywork/Skywork-Reward-Llama-3.1-8B-v0.2 loaded with flash_attention_2 and bfloat16 precision. Both the winning and losing conversations are formatted as chat templates before encoding:
embedding_winning - embedding_losing, computed later in train_basis.py:
The last hidden state of the last token (
last_hidden_state[0][-1]) is the reward-relevant representation for this model architecture. Do not use pooled outputs or mean-pooling.Output files
prepare.py saves a pickle file mapping each worker_id to a list of annotated entries with embeddings:
prepare.py twice — once for dataset['train'] (saving tldr_embeddings_train.pkl) and once for dataset['validation'] (saving tldr_embeddings_val.pkl).
Training setup
train_basis.py splits the common workers between train and validation sets 50/50 (by worker_id), caps training pairs per user, and calls run():
K=0uses the base Skywork reward head directly (reference model).K=1is equivalent to a single Bradley-Terry model.K=2..6are the low-rank LoRe models with increasing basis size.
Run commands
Prepare the dataset (one-time)
openai/summarize_from_feedback, extracts embeddings for every worker’s preference pairs, and writes tldr_embeddings_train.pkl.Train the reward model basis
K = [0, 1, 2, 3, 4, 5, 6] and reports accuracy on seen and unseen users.