Reddit TLDR: summarization preference dataset

Reddit TLDR brings real annotator diversity into LoRe. The dataset is built from OpenAI’s summarize_from_feedback benchmark, where crowd workers rated pairs of Reddit post summaries. Because each worker has a distinct annotation style, grouping by worker_id yields a natural multi-user setup: every worker is one “user”, and their preference pairs form that user’s training signal. LoRe learns a shared low-rank reward basis that captures what summarization quality means across this heterogeneous population.

Source dataset

The raw data is loaded directly from HuggingFace using the comparisons configuration, which provides train and validation splits:

import pandas as pd
from datasets import load_dataset

dataset = load_dataset("openai/summarize_from_feedback", 'comparisons')
df = pd.DataFrame(dataset['train'])

Each row of the comparisons data contains:

worker — the annotator’s unique ID
info.post — the Reddit post text
summaries — a list of two candidate summaries
choice — index (0 or 1) of the summary the worker preferred

Worker grouping

prepare.py iterates over every row and groups preference pairs by the row’s worker field (held in the variable worker_id). Each worker becomes one user in LoRe’s multi-user setup:

worker_results = {}

for index, row in df.iterrows():
    worker_id = row['worker']

    text = row['info']['post']
    summaries = row['summaries']
    winning_summary = summaries[row['choice']]['text']
    losing_summary = summaries[1 - row['choice']]['text']

    if worker_id not in worker_results:
        worker_results[worker_id] = []
    worker_results[worker_id].append({
        'text': text,
        'winning_summary': winning_summary,
        'losing_summary': losing_summary
    })

The choice index directly selects the preferred summary; 1 - row['choice'] selects the rejected one. Workers are then sorted by the number of annotations they contributed.
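
The sorting step is not shown in the source, so the following is only a minimal sketch of one way to do it. It assumes a descending order (most annotations first) and uses a toy worker_results dict with hypothetical worker IDs in place of the real one.

```python
# Toy stand-in for the worker_results dict built above (hypothetical IDs and data).
worker_results = {
    "w_a": [{"text": "p1"}, {"text": "p2"}],
    "w_b": [{"text": "p3"}],
    "w_c": [{"text": "p4"}, {"text": "p5"}, {"text": "p6"}],
}

# Order workers from most to fewest annotations (descending order is an assumption).
sorted_workers = sorted(
    worker_results.items(),
    key=lambda item: len(item[1]),
    reverse=True,
)

for worker_id, pairs in sorted_workers:
    print(worker_id, len(pairs))
```

Sorting this way makes it easy to see which workers have enough pairs to be useful as individual users.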

Embedding extraction

Embeddings are generated using Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 loaded with flash_attention_2 and bfloat16 precision. Both the winning and losing conversations are formatted as chat templates before encoding:

import torch
from transformers import AutoModel, AutoTokenizer

device = "cuda:0"
model_name = "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2"
rm = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map=device,
    attn_implementation="flash_attention_2",
    num_labels=1,
)
rm_tokenizer = AutoTokenizer.from_pretrained(model_name)

for worker_id, data in worker_results.items():
    for entry in data:
        conv_winning = [
            {"role": "user", "content": entry['text']},
            {"role": "assistant", "content": entry['winning_summary']}
        ]
        conv_losing = [
            {"role": "user", "content": entry['text']},
            {"role": "assistant", "content": entry['losing_summary']}
        ]
        inputs_winning = rm_tokenizer.apply_chat_template(conv_winning, return_tensors="pt").to(device)
        inputs_losing  = rm_tokenizer.apply_chat_template(conv_losing,  return_tensors="pt").to(device)

        with torch.no_grad():
            embedding_winning = rm(inputs_winning).last_hidden_state[0][-1].cpu()
            embedding_losing  = rm(inputs_losing).last_hidden_state[0][-1].cpu()

Both embeddings are stored with their entry (see the structure under Output files). The feature used during training is the difference embedding_winning - embedding_losing, computed later in train_basis.py:

x = data[i]['embeddings']['winning'][0] - data[i]['embeddings']['losing'][0]

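The last hidden state of the last token (last_hidden_state[0][-1]) is the reward-relevant representation for this model architecture: in a causal model it is the only position that attends to the whole conversation, and it is the position a sequence-classification reward head reads. Do not use pooled outputs or mean-pooling.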
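
To see why the difference vector is a sufficient feature, here is a small sketch assuming a linear reward head, so that the reward is a dot product with the embedding. The head weights and the 3-dimensional embeddings are made-up toy values, not taken from the model.

```python
def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

# Hypothetical linear reward head and toy embeddings (real ones are model-sized tensors).
head = [0.5, -1.0, 2.0]
emb_winning = [1.0, 0.0, 1.0]
emb_losing = [0.0, 1.0, 0.5]

# Difference feature, as computed in train_basis.py.
x = [w - l for w, l in zip(emb_winning, emb_losing)]

# For a linear head, scoring the difference equals the difference of the scores.
margin_from_difference = dot(head, x)
margin_from_scores = dot(head, emb_winning) - dot(head, emb_losing)
print(margin_from_difference, margin_from_scores)
```

Under this assumption only the margin between the two summaries matters for a pairwise preference, so the two embeddings never need to be scored separately during training.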

Output files

prepare.py saves a pickle file mapping each worker_id to a list of annotated entries with embeddings:

with open('tldr_embeddings_train.pkl', 'wb') as f:
    pickle.dump(results, f)

The resulting structure is:

{
  worker_id: [
    {
      'text': str,
      'winning_summary': str,
      'losing_summary': str,
      'embeddings': {
        'winning': [Tensor],   # shape: [hidden_dim]
        'losing':  [Tensor]
      }
    },
    ...
  ],
  ...
}
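
As a quick sanity check on this layout, the sketch below writes a toy file with the same nesting and reads it back. The file name, the worker ID, and the use of plain lists instead of tensors are all assumptions made so the example runs without a GPU or torch.

```python
import os
import pickle
import tempfile

# Toy data with the same nesting as the real file; lists stand in for tensors.
toy = {
    "worker_1": [
        {
            "text": "post",
            "winning_summary": "good",
            "losing_summary": "bad",
            "embeddings": {"winning": [[0.2, 0.9]], "losing": [[0.1, 0.4]]},
        },
    ],
}

# Hypothetical file name in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "toy_embeddings.pkl")
with open(path, "wb") as f:
    pickle.dump(toy, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)

# Number of preference pairs per worker, and the embedding size of the first entry.
pairs_per_worker = {wid: len(entries) for wid, entries in loaded.items()}
first = loaded["worker_1"][0]
hidden_dim = len(first["embeddings"]["winning"][0])
print(pairs_per_worker, hidden_dim)
```

The real files contain torch tensors, so torch must be importable when they are unpickled. Only unpickle files you produced yourself, since pickle.load can execute arbitrary code.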

Run prepare.py twice: once for dataset['train'] (saving tldr_embeddings_train.pkl) and once for dataset['validation'] (saving tldr_embeddings_val.pkl).

Training setup

train_basis.py takes the workers that appear in both the train and validation files, splits them 50/50 by worker_id into seen and unseen users, caps the number of pairs per user, and calls run():

# Seen users: up to 150 training pairs each
T = min(len(data), 150)

# Unseen users: up to 50 few-shot pairs each
T_unseen = min(len(data), 50)

K_list = [0, 1, 2, 3, 4, 5, 6]
alpha_list = [0]

K=0 uses the base Skywork reward head directly (reference model).
K=1 is equivalent to a single Bradley-Terry model.
K=2..6 are the low-rank LoRe models with increasing basis size.
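
The role of K can be illustrated with a rough sketch. It assumes each user's reward is a weighted mix of K linear basis rewards applied to the difference feature, followed by a Bradley-Terry (sigmoid) link. The function names, the basis vectors, and the user weights are hypothetical; the actual parameterization in train_basis.py may differ.

```python
import math

def lore_margin(basis, user_weights, x):
    """Mix K basis scores with per-user weights.

    basis: K vectors of length d, user_weights: K numbers, x: difference feature.
    """
    basis_scores = [sum(b_j * x_j for b_j, x_j in zip(b, x)) for b in basis]
    return sum(w * s for w, s in zip(user_weights, basis_scores))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [0.4, -0.2]                    # toy difference feature (winning - losing)
basis = [[1.0, 0.0], [0.0, 1.0]]   # K=2 hypothetical basis rewards
user_a = [1.0, 0.0]                # relies only on the first basis reward
user_b = [0.0, 1.0]                # relies only on the second basis reward

p_a = sigmoid(lore_margin(basis, user_a, x))
p_b = sigmoid(lore_margin(basis, user_b, x))

# With K=1 every user shares one basis vector: a single Bradley-Terry model.
single_margin = lore_margin([[2.0, 1.0]], [1.0], x)
print(p_a, p_b, single_margin)
```

In this toy case the same pair gets a probability above 0.5 for user_a and below 0.5 for user_b, which is the kind of disagreement a single shared reward cannot represent.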

The 50/50 seen/unseen worker split is determined by taking the intersection of workers present in both tldr_embeddings_train.pkl and tldr_embeddings_val.pkl, shuffling with random.seed(0), and splitting at the midpoint.
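
A minimal sketch of that split follows, using toy worker IDs. Sorting the intersection before shuffling is an assumption added here to make the order reproducible; the variable names are also illustrative and not taken from train_basis.py.

```python
import random

# Hypothetical worker IDs found in each pickle file.
train_workers = {"w1", "w2", "w3", "w4", "w5"}
val_workers = {"w2", "w3", "w4", "w5", "w6"}

# Keep only workers present in both files; sort so the shuffle is reproducible.
common = sorted(train_workers & val_workers)

random.seed(0)
random.shuffle(common)

# Split at the midpoint into seen and unseen users.
midpoint = len(common) // 2
seen_users = common[:midpoint]
unseen_users = common[midpoint:]
print(seen_users, unseen_users)
```

Workers that appear in only one of the two files (w1 and w6 here) are left out of both groups.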

Run commands

Install dependencies

pip install -r requirements.txt

Prepare the dataset (one-time)

cd LoRe/RedditTLDR
python prepare.py

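This downloads openai/summarize_from_feedback, extracts embeddings for every worker’s preference pairs, and writes tldr_embeddings_train.pkl. Repeat the run on dataset['validation'] to produce tldr_embeddings_val.pkl, which train_basis.py also needs.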

Train the reward model basis

python train_basis.py

Trains LoRe across ranks K = [0, 1, 2, 3, 4, 5, 6] and reports accuracy on seen and unseen users.

Evaluate few-shot personalization

python vary_fewshot.py

Sweeps over different numbers of few-shot examples for unseen users and reports generalization curves.

Embedding generation in prepare.py requires a GPU with sufficient VRAM to load Skywork-Reward-Llama-3.1-8B-v0.2 in bfloat16 (roughly 16 GB for the weights alone). This is a one-time cost: once tldr_embeddings_train.pkl and tldr_embeddings_val.pkl exist, training and evaluation runs do not reload the large model.
