LoRe ships with three benchmark datasets that span summarization, multi-turn dialogue, and open-ended generation. Each dataset exposes a different flavor of the multi-user reward learning problem: real crowd-worker annotations, held-out naturalistic dialogue users, and large synthetic user populations. Picking the right dataset depends on whether you need realistic annotation noise, conversation context, or controlled scalability experiments. All three datasets share the same embedding backbone, Skywork/Skywork-Reward-Llama-3.1-8B-v0.2: the last hidden state of the final token is extracted to produce fixed-size preference features that feed into LoRe’s low-rank reward basis.
Dataset comparison
| Dataset | Domain | Users | Data Source | Unique Aspect |
|---|---|---|---|---|
| Reddit TLDR | Summarization quality | Crowd workers | openai/summarize_from_feedback on HuggingFace | Diverse real annotator preferences across workers |
| PRISM | Multi-turn dialogue | PRISM dataset users | PRISM dataset (HannahRoseKirk/prism-alignment) | Seen/unseen user split derived directly from data metadata |
| PersonalLLM | Open-ended LLM responses | Simulated via Dirichlet mixture | namkoong-lab/PersonalLLM on HuggingFace | Synthetic user population generation at configurable scale |
Shared embedding approach
Every dataset relies on Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 to convert raw text into preference embeddings. The reward model is loaded with torch_dtype=torch.bfloat16 and flash_attention_2 attention (eager for PRISM), and the last hidden state of the last token is used as the feature vector for each conversation. Preference features are then computed as the difference between the winning and losing response embeddings, making the representation directly suitable for Bradley-Terry-style reward learning.
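As a minimal sketch of the feature construction described above, the following pure-Python example pools the hidden state of the last non-padded token, forms the winner-minus-loser difference, and scores it with a Bradley-Terry negative log-likelihood under a linear reward. The function names (last_token_embedding, preference_feature, bradley_terry_nll), the toy hidden states, and the weight vector are illustrative assumptions, not LoRe's actual code; the real pipeline obtains hidden states from the Skywork reward model on a GPU.

```python
import math

def last_token_embedding(hidden_states, attention_mask):
    """Return the hidden state of the last non-padded token.

    hidden_states: list of per-token vectors (seq_len x dim)
    attention_mask: list of 0/1 flags, 1 for real tokens
    """
    last_index = max(i for i, m in enumerate(attention_mask) if m == 1)
    return list(hidden_states[last_index])

def preference_feature(winner_emb, loser_emb):
    """Difference between winning and losing response embeddings."""
    return [w - l for w, l in zip(winner_emb, loser_emb)]

def bradley_terry_nll(weights, feature):
    """Negative log-likelihood that the winner is preferred.

    P(winner > loser) = sigmoid(weights . (e_winner - e_loser))
    """
    score = sum(wi * fi for wi, fi in zip(weights, feature))
    # Numerically stable log(1 + exp(-score))
    if score >= 0:
        return math.log1p(math.exp(-score))
    return -score + math.log1p(math.exp(score))

# Toy data (hypothetical): 3 tokens, 2-dim hidden states, last position is padding
winner_hidden = [[0.1, 0.2], [0.5, 1.0], [9.9, 9.9]]
loser_hidden = [[0.3, 0.1], [0.2, 0.4], [9.9, 9.9]]
mask = [1, 1, 0]

feature = preference_feature(
    last_token_embedding(winner_hidden, mask),
    last_token_embedding(loser_hidden, mask),
)
loss = bradley_terry_nll([1.0, 1.0], feature)
```

Because the feature is already a difference of embeddings, a linear reward only needs one dot product per preference pair, which is what makes this representation convenient for Bradley-Terry training.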
All embedding extraction requires a CUDA GPU. Embedding generation is a one-time preprocessing step: run prepare.py once and reuse the cached output for all subsequent training runs.
When to use each dataset
- Reddit TLDR: best for benchmarking on real human annotation diversity. Workers vary substantially in what constitutes a good summary, giving LoRe a realistic multi-user signal with natural noise.
- PRISM: best for multi-turn dialogue settings where conversation history matters. The seen/unseen split is defined by the original dataset metadata, making evaluation clean and reproducible. Use eval_rb2.py alongside training to catch overfitting.
- PersonalLLM: best for scalability experiments. The Dirichlet-mixture user simulation lets you generate arbitrarily large synthetic populations (e.g. N=1000 train, N=500 unseen) and control preference diversity via the alpha concentration parameter.
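The Dirichlet-mixture simulation mentioned for PersonalLLM can be sketched as follows. Each synthetic user is a weight vector over the base reward models, drawn from a symmetric Dirichlet distribution, and a user's reward for a response is the weighted sum of the base scores. The helper names (sample_dirichlet, simulate_users, user_reward), the seeds, and the value alpha=0.5 are assumptions chosen for illustration; the actual sampling code in LoRe or PersonalLLM may differ.

```python
import random

def sample_dirichlet(alpha, dim, rng):
    """Draw one symmetric Dirichlet(alpha) sample via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]

def simulate_users(num_users, num_reward_models, alpha, seed=0):
    """Return one mixture-weight vector per synthetic user."""
    rng = random.Random(seed)
    return [sample_dirichlet(alpha, num_reward_models, rng) for _ in range(num_users)]

def user_reward(weights, base_rewards):
    """Mix base reward-model scores for one response using a user's weights."""
    return sum(w * r for w, r in zip(weights, base_rewards))

# Population sizes follow the example in the text; alpha is an arbitrary choice.
train_users = simulate_users(num_users=1000, num_reward_models=10, alpha=0.5, seed=0)
unseen_users = simulate_users(num_users=500, num_reward_models=10, alpha=0.5, seed=1)
```

For a symmetric Dirichlet, a small alpha concentrates each user's weight on a few reward models, which yields a more heterogeneous population, while a large alpha pushes every user toward a near-uniform mixture.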
Reddit TLDR
Real crowd-worker summarization preferences from OpenAI’s feedback dataset, grouped by worker ID.
PRISM
Multi-turn dialogue preferences with seen/unseen user splits derived from PRISM alignment metadata.
PersonalLLM
Synthetic user populations sampled as Dirichlet mixtures over 10 reward models at configurable scale.