Benchmark datasets for preference learning

LoRe ships with three benchmark datasets that span summarization, multi-turn dialogue, and open-ended generation. Each dataset exposes a different flavor of the multi-user reward learning problem: real crowd-worker annotations, held-out naturalistic dialogue users, and large synthetic user populations. Which dataset to pick depends on whether you need realistic annotation noise, conversation context, or controlled scalability experiments. All three share the same embedding backbone, Skywork/Skywork-Reward-Llama-3.1-8B-v0.2: the last hidden state of the final token is extracted as a fixed-size embedding, and the resulting preference features feed into LoRe’s low-rank reward basis.

Dataset comparison

Dataset	Domain	Users	Data Source	Unique Aspect
Reddit TLDR	Summarization quality	Crowd workers	`openai/summarize_from_feedback` on HuggingFace	Diverse real annotator preferences across workers
PRISM	Multi-turn dialogue	Naturalistic dialogue users	`HannahRoseKirk/prism-alignment` on HuggingFace	Seen/unseen user split derived directly from dataset metadata
PersonalLLM	Open-ended LLM responses	Simulated via Dirichlet mixture	`namkoong-lab/PersonalLLM` on HuggingFace	Synthetic user population generation at configurable scale

Shared embedding approach

Every dataset relies on Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 to convert raw text into preference embeddings. The reward model is loaded with torch_dtype=torch.bfloat16 and the flash_attention_2 attention implementation (eager for PRISM), and the last hidden state of the last token is used as the embedding of each conversation. The preference feature for a comparison is then the difference between the winning and losing response embeddings. Under a Bradley-Terry model with a reward that is linear in the embedding, the preference probability depends only on this difference, so the representation can be used directly for reward learning.
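The steps described above (last-token pooling, embedding difference, Bradley-Terry probability) can be sketched without loading the reward model. This is a minimal illustration on toy vectors, not LoRe's actual code: the function names, the three-dimensional "hidden states", and the weight vector are all hypothetical, and the linear reward is an assumption made for the example.

```python
import math


def last_token_embedding(hidden_states, attention_mask):
    """Return the hidden state of the last non-padding token.

    hidden_states: list of per-token vectors (seq_len x dim).
    attention_mask: list of 0/1 flags, 1 marks a real token.
    """
    last_index = max(i for i, m in enumerate(attention_mask) if m == 1)
    return hidden_states[last_index]


def preference_feature(winner_emb, loser_emb):
    """Difference between the winning and losing response embeddings."""
    return [w - l for w, l in zip(winner_emb, loser_emb)]


def bradley_terry_prob(weights, feature):
    """P(winner preferred) for a reward that is linear in the embedding."""
    score = sum(w * f for w, f in zip(weights, feature))
    return 1.0 / (1.0 + math.exp(-score))


# Toy stand-ins for the reward model's last-layer hidden states.
hidden_win = [[0.1, 0.2, 0.0], [0.9, 0.4, -0.3], [0.0, 0.0, 0.0]]
mask_win = [1, 1, 0]  # the third position is padding
hidden_lose = [[0.2, 0.1, 0.5], [0.3, 0.6, 0.1]]
mask_lose = [1, 1]

e_win = last_token_embedding(hidden_win, mask_win)
e_lose = last_token_embedding(hidden_lose, mask_lose)
feature = preference_feature(e_win, e_lose)

weights = [1.0, 0.5, -1.0]  # hypothetical linear reward parameters
p = bradley_terry_prob(weights, feature)
print(f"P(winner preferred) = {p:.4f}")
```

Swapping the winner and loser negates the feature, so the two probabilities sum to one. That symmetry is why storing only the difference vector loses nothing for this kind of model.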

Embedding extraction requires a CUDA GPU. It is a one-time preprocessing step: run prepare.py once and reuse the cached output for all subsequent training runs.
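The compute-once, reuse-afterwards pattern can be sketched as follows. This shows the general idea only: the cache format and file name that prepare.py actually uses are not described here, so `load_or_compute`, `features.json`, and the stand-in `fake_embed` function are hypothetical.

```python
import json
import os
import tempfile


def load_or_compute(cache_path, compute_fn):
    """Return cached features if present, otherwise compute and store them."""
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    result = compute_fn()
    with open(cache_path, "w") as f:
        json.dump(result, f)
    return result


calls = []


def fake_embed():
    """Stand-in for the expensive GPU embedding pass."""
    calls.append(1)
    return [[0.6, -0.2], [0.1, 0.3]]


cache_dir = tempfile.mkdtemp()
cache_path = os.path.join(cache_dir, "features.json")  # hypothetical file name

first = load_or_compute(cache_path, fake_embed)   # computes and writes the cache
second = load_or_compute(cache_path, fake_embed)  # reads the cache, no recompute
print(len(calls))
```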

When to use each dataset

Reddit TLDR — best for benchmarking on real human annotation diversity. Workers vary substantially in what constitutes a good summary, giving LoRe a realistic multi-user signal with natural noise.
PRISM — best for multi-turn dialogue settings where conversation history matters. The seen/unseen split is defined by the original dataset metadata, making evaluation clean and reproducible. Use eval_rb2.py alongside training to catch overfitting.
PersonalLLM — best for scalability experiments. The Dirichlet-mixture user simulation lets you generate arbitrarily large synthetic populations (e.g. N=1000 train, N=500 unseen) and control preference diversity via the alpha concentration parameter.
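To make the Dirichlet-mixture idea concrete, the sketch below samples synthetic users as weight vectors over a fixed set of reward models and uses the weighted reward to decide a preference. It assumes a symmetric Dirichlet and a user reward equal to the weighted sum of reward-model scores. The function names are hypothetical and this is not the PersonalLLM or LoRe implementation.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one point from a symmetric Dirichlet(alpha) over k components."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_users(n_users, n_reward_models, alpha, seed=0):
    """Each synthetic user is a mixture-weight vector over the reward models."""
    rng = random.Random(seed)
    return [sample_dirichlet(alpha, n_reward_models, rng) for _ in range(n_users)]


def user_prefers_first(user_weights, scores_a, scores_b):
    """True if the user's mixed reward for response A exceeds that for B."""
    reward_a = sum(w * s for w, s in zip(user_weights, scores_a))
    reward_b = sum(w * s for w, s in zip(user_weights, scores_b))
    return reward_a > reward_b


# Population sizes follow the example in the text; alpha=0.5 is an arbitrary choice.
train_users = simulate_users(n_users=1000, n_reward_models=10, alpha=0.5, seed=0)
unseen_users = simulate_users(n_users=500, n_reward_models=10, alpha=0.5, seed=1)
print(len(train_users), len(unseen_users))
```

A small alpha concentrates each user's weight on a few reward models, which produces a more diverse population. A large alpha pushes every user toward the uniform mixture, so preferences become more alike.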

Reddit TLDR

Real crowd-worker summarization preferences from OpenAI’s feedback dataset, grouped by worker ID.

PRISM

Multi-turn dialogue preferences with seen/unseen user splits derived from PRISM alignment metadata.

PersonalLLM

Synthetic user populations sampled as Dirichlet mixtures over 10 reward models at configurable scale.
