Prepare datasets for LoRe training

Before training, each dataset must be preprocessed into embedding files that train_basis.py can load directly. The prepare.py script in each dataset subdirectory handles this step: it downloads or reads raw preference data, tokenizes every prompt–response pair using Skywork/Skywork-Reward-Llama-3.1-8B-v0.2, extracts the final hidden-state vector from the last token position, and writes the resulting embeddings to disk. This only needs to run once — all downstream training scripts read from the cached output files.

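All three prepare.py scripts load Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 onto a GPU. Expect roughly 16 GB of VRAM for the weights alone, plus some headroom for activations. The model is loaded with torch_dtype=torch.bfloat16 and attn_implementation="flash_attention_2", which requires the flash-attn package to be installed.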
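The 16 GB figure matches a back-of-the-envelope estimate for the weights alone. The sketch below assumes roughly 8 billion parameters (a round number read off the "8B" in the model name, not an exact count) and 2 bytes per parameter for bfloat16. The helper name `weight_memory_gb` is hypothetical and not part of the repository.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed for model weights, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# Assumption: ~8e9 parameters, taken from the "8B" in the model name.
approx_params = 8e9

# bfloat16 stores each parameter in 2 bytes.
print(f"{weight_memory_gb(approx_params):.1f} GB for weights in bfloat16")
```

Activations and the CUDA context come on top of this, so actual usage will be somewhat higher.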

How the model is loaded

Every prepare.py uses the same loading pattern:

import torch
from transformers import AutoModel, AutoTokenizer

device     = "cuda:0"
model_name = "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2"

rm = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map=device,
    attn_implementation="flash_attention_2",
    num_labels=1,
)
rm_tokenizer = AutoTokenizer.from_pretrained(model_name)

The last hidden state of the last token is extracted without gradients:

with torch.no_grad():
    output = rm(conv_tokenized)
    embeddings.append(output.last_hidden_state[0][-1].cpu())

Dataset-specific instructions

The instructions below cover three datasets in order: RedditTLDR, PRISM, and PersonalLLM.

RedditTLDR

Input data

The RedditTLDR prepare.py reads the openai/summarize_from_feedback dataset (or a local equivalent) and pairs each Reddit post with its human-preferred and human-rejected summaries. Each side of a preference pair is structured as a conversation:

conv = [
    {"role": "user",      "content": post_text},
    {"role": "assistant", "content": summary_text},
]
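As a minimal sketch of how one preference pair could be turned into two such conversations, the helpers below build one conversation for the preferred summary and one for the rejected summary. The function names `build_conv` and `build_pair` and the "winning"/"losing" keys are illustrative assumptions, not necessarily what prepare.py uses internally.

```python
from typing import Dict, List


def build_conv(post_text: str, summary_text: str) -> List[Dict[str, str]]:
    """Wrap a post and one summary in the two-turn chat format."""
    return [
        {"role": "user", "content": post_text},
        {"role": "assistant", "content": summary_text},
    ]


def build_pair(post_text: str, winning: str, losing: str) -> Dict[str, list]:
    """Build both conversations for one preference pair (hypothetical helper)."""
    return {
        "winning": build_conv(post_text, winning),
        "losing": build_conv(post_text, losing),
    }


pair = build_pair("Long Reddit post ...", "Good summary", "Weak summary")
print(pair["winning"][1]["content"])
```

Both conversations share the same user turn, so any difference between the two embeddings comes from the summary.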

What `prepare.py` does

For every winning and losing summary in the training and validation splits, the script tokenizes the conversation with apply_chat_template and extracts the last hidden-state vector. Results are grouped by worker ID (annotator) and pickled:

tldr_embeddings_train.pkl
tldr_embeddings_val.pkl

Each pickle is a dictionary keyed by worker_id. Each value is a list of dicts with an "embeddings" key containing "winning" and "losing" embedding arrays.
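To make that layout concrete, here is a small sketch that builds a toy dictionary with the same shape, round-trips it through pickle in memory, and counts the pairs per annotator. Short lists of floats stand in for the real embedding arrays, the worker IDs are made up, and the exact nesting (`item["embeddings"]["winning"]`) is one reading of the description above rather than something verified against the files.

```python
import io
import pickle

# Toy stand-in for the contents of tldr_embeddings_train.pkl.
embeddings_by_worker = {
    "worker_a": [
        {"embeddings": {"winning": [0.1, 0.2], "losing": [0.3, 0.4]}},
        {"embeddings": {"winning": [0.5, 0.6], "losing": [0.7, 0.8]}},
    ],
    "worker_b": [
        {"embeddings": {"winning": [0.9, 1.0], "losing": [1.1, 1.2]}},
    ],
}

# Round-trip through pickle in memory instead of touching the real file.
buffer = io.BytesIO()
pickle.dump(embeddings_by_worker, buffer)
buffer.seek(0)
loaded = pickle.load(buffer)

# Count how many preference pairs each annotator contributed.
pairs_per_worker = {worker: len(items) for worker, items in loaded.items()}
print(pairs_per_worker)
```

Grouping by worker ID keeps each annotator's pairs together, so per-user subsets can be selected without re-scanning the whole dataset.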

Run command

cd LoRe/RedditTLDR
python prepare.py

PRISM

Input data

PRISM is a multi-turn dialogue dataset. Its preparation is a two-step process: first prepare.py formats the raw PRISM data files into chat format, then generate-prism-embeddings.py runs the reward model over every conversation pair.

What the scripts do

prepare.py reads from the local PRISM data files and structures each dialogue as a list of chat turns. generate-prism-embeddings.py then iterates over chosen and rejected conversation completions, computes embeddings, and writes them to:

data/prism/train_embeddings.pkl
data/prism/test_embeddings.pkl

train_basis.py loads these with torch.load:

train_embeddings = torch.load("data/prism/train_embeddings.pkl")
test_embeddings  = torch.load("data/prism/test_embeddings.pkl")

Each entry includes an extra_info dict with user_id, seen (bool), and split fields used to partition users into seen/unseen groups during training. Chosen and rejected embeddings are stored under extra_info["chosen_conv_embedding"] and extra_info["rejected_conv_embedding"].
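The seen/unseen partition described above can be sketched as follows. The sketch assumes each loaded file is an iterable of dict entries, uses short lists as stand-ins for embeddings, and invents the user IDs and `split` values. `split_by_seen` is a hypothetical helper, not a function from train_basis.py.

```python
from collections import defaultdict


def split_by_seen(entries):
    """Group (chosen, rejected) embedding pairs by user, split on the seen flag."""
    seen, unseen = defaultdict(list), defaultdict(list)
    for entry in entries:
        info = entry["extra_info"]
        target = seen if info["seen"] else unseen
        target[info["user_id"]].append(
            (info["chosen_conv_embedding"], info["rejected_conv_embedding"])
        )
    return dict(seen), dict(unseen)


# Toy entries; the "split" values here are placeholders.
toy_entries = [
    {"extra_info": {"user_id": "u1", "seen": True, "split": "train",
                    "chosen_conv_embedding": [0.1], "rejected_conv_embedding": [0.2]}},
    {"extra_info": {"user_id": "u1", "seen": True, "split": "train",
                    "chosen_conv_embedding": [0.3], "rejected_conv_embedding": [0.4]}},
    {"extra_info": {"user_id": "u2", "seen": False, "split": "test",
                    "chosen_conv_embedding": [0.5], "rejected_conv_embedding": [0.6]}},
]

seen_users, unseen_users = split_by_seen(toy_entries)
print(sorted(seen_users), sorted(unseen_users))
```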

Run command

cd LoRe/PRISM
python prepare.py
python generate-prism-embeddings.py

Note: for PRISM, train_basis.py sets attn_implementation to "eager" rather than "flash_attention_2". Verify that your setup supports the attention backend in use before running.

PersonalLLM

Input data

PersonalLLM is loaded directly from the Hugging Face Hub, so no local data files are needed:

import pandas as pd
from datasets import load_dataset

dataset   = load_dataset("namkoong-lab/PersonalLLM")
data      = pd.DataFrame(dataset["train"])
data_test = pd.DataFrame(dataset["test"])

Each row contains a prompt and eight response columns (response_1 through response_8).

What `prepare.py` does

It builds a conversation list for every (prompt, response) pair, runs each through the reward model, and stacks the extracted embeddings into a single tensor:

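embeddings = []

# `dataset` here is presumably the list of per-prompt conversation lists,
# not the raw Hugging Face dataset object loaded earlier.
for example in dataset:
    for conv in example:
        conv_tokenized = rm_tokenizer.apply_chat_template(
            conv, tokenize=True, return_tensors="pt"
        ).to(device)
        with torch.no_grad():
            output = rm(conv_tokenized)
            # Hidden state of the final token in the only sequence of the batch.
            embeddings.append(output.last_hidden_state[0][-1].cpu())

# One row per (prompt, response) pair.
embeddings = torch.stack(embeddings, dim=0)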
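Each element of `example` in the loop above is a conversation list. A minimal sketch of how one dataset row might be expanded into its eight conversations is shown below. The column names `prompt` and `response_1` through `response_8` come from the dataset description. The user/assistant roles mirror the RedditTLDR format and are an assumption here, and `row_to_convs` is a hypothetical helper.

```python
NUM_RESPONSES = 8  # response_1 through response_8


def row_to_convs(row):
    """Expand one row into a list of two-turn conversations, one per response."""
    convs = []
    for k in range(1, NUM_RESPONSES + 1):
        convs.append([
            {"role": "user", "content": row["prompt"]},
            {"role": "assistant", "content": row[f"response_{k}"]},
        ])
    return convs


# Toy row with placeholder text in place of real prompts and responses.
toy_row = {"prompt": "What is LoRe?"}
toy_row.update({f"response_{k}": f"answer {k}" for k in range(1, NUM_RESPONSES + 1)})

example = row_to_convs(toy_row)
print(len(example), example[0][1]["content"])
```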

The output is saved via safetensors:

from safetensors.torch import save_file

save_file({"embeddings": embeddings}, "./train.safetensors")

Run once for the training split; the test split is handled analogously and produces test.safetensors.

Output files

| File | Contents |
| --- | --- |
| `train.safetensors` | Embeddings for all training prompt–response pairs |
| `test.safetensors` | Embeddings for all test prompt–response pairs |

Run command

cd LoRe/PersonalLLM
python prepare.py

prepare.py only needs to run once per dataset. The output files are read directly by train_basis.py and vary_fewshot.py on every subsequent run. Re-running prepare.py will overwrite the cached files.
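Because re-running overwrites the cache, it can be worth checking for existing outputs first. The sketch below is one possible guard, not part of the repository. `missing_outputs` is a hypothetical helper, and the assumption that the files sit in the current working directory should be adjusted to wherever your prepare.py actually writes them.

```python
from pathlib import Path


def missing_outputs(expected_files, base_dir="."):
    """Return the expected output files that do not exist under base_dir."""
    base = Path(base_dir)
    return [name for name in expected_files if not (base / name).is_file()]


# File names taken from the PersonalLLM section; their location is an assumption.
expected = ["train.safetensors", "test.safetensors"]

missing = missing_outputs(expected)
if missing:
    print("Run prepare.py; missing:", missing)
else:
    print("Cached embeddings found; prepare.py can be skipped.")
```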
